Hi, I'm Joshua




Word Counts are a mess

First Published: 2024-05-20, Last Updated: 2024-05-20

Why are word counts different between different software

Background

Recently I had to submit a 'correct' word count along with a university assignment. This proved very difficult as each software I used gave entirely different word counts from each other, and there was no way I was going to try manually count a roughly 600 word document.

So I decided to do a test and check why each software has different word counts.

How I tested.

I first started with a warm-up: a simple DOCX document with a 100 word story.
This acted as a litmus test to check if there weren't any significant issues with the software.

Then came the benchmark: a DOCX document with every feature I could think of which has a word count of 219 (counted manually).
This DOCX document was then converted into a DOC document (via Word Desktop) to see if there were any significant differences.

Limitations and Expectations:

Any word count generated from the benchmark will be wrong.
This is by design as the document contains many tricks to try and trip up software.
The main point is to find the differences in how each software measures word counts through a stress test.

That being said, let me list and explain some certain unknowns
(Things that shouldn't count as a word, but as long as the software is consistent is counted as correct within reason).

They include

  • Bullet points (& Numbered lists): Should the bullet or the number count as a word? > I think they shouldn't
  • Headers and footers: Should they be counted once or for each page? > Technically should count for each page, but as a student, I would prefer if they don't
  • Do words/separated/by/slashes/count as separate words or one big word?
  • Do words.separated.by.dots.count as separate words or one big word? > These two inclusions may seem pedantic, but they are important as they signify if URLs or Acroynms count as a single word or multiple.

Results

What were the actual results:

.DOCX

  • Word (Desktop App)(Office 365 Enterprise): 170
  • Word (Website): 126
  • Libre: 187
  • Google Docs: 135
  • Apple Pages: 154

.DOC

  • Word (Desktop App)(Office 365 Enterprise): 166
  • Word (Website): Not Supported at all
  • Libre: 187
  • Google Docs: 131
  • Apple Pages: 154

Let's talk about these figures:

Word (Desktop + Online)

Notable Behaviours

  1. Word (Desktop+Online) don't count words from headers and footers
  2. Word (Desktop+Online) since Word 2003 has counted bullets as a separate word (Source)
  3. Word Online does not count words in equations, captions, WordArt, and textboxes.

Notes

Right off the bat, there was a significant descrepancy between Word (Desktop) and Word (Online).
I had expected it already, as this was one of the problems for my uni assignment.
A quick Google search corroborates my testing on Word Online as it does not count words in text boxes, headers, footers and SmartArt (Source).

During the testing, I found a strange quirk in Word Online:
Inexplicably, Word Online doesn't counts bullets from bulletpoints, but it does count numbers from numbered lists.
This is quite likely to be a bug, but it is quite peculiar, as given this behaviour, I am not sure if algorithm was intended to count list markers or not.

Libre Office

Notable Behaviours

  1. Almost all inclusive > Almost everything that I added in the document that should count towards an word count is counted
  2. The sole exception I could find were equations

Notes

From my testing, Libre Office counts almost everything in the document, which is quite impressive, it even counted the textbox in the Header. I only found 3 flaws in its counting.

  • Words in equations were not counted.
  • Citation marks were counted.
  • The page number is counted.

To clarify what I meant by Citation marks, it is the number next to a text (e.g. study1).
This behaviour along with counting page numbers is unique among all the software tested and in my personal opinion is incorrect as I consider those punctuation.

Finally, Although Libre Office counts the most out of all software, it has a rather annoying bug where the word count is not counted when undoing an operation/ , the word count after an undo is only updated after some other thing is updated. This was a rather major annoyance in testing Libre.

Google Docs

Notable Behaviours

  • Bullets/Numbers, Headers/Footers, Foot/Endnotes are not counted
  • Slash and Dot separated words are counted separately
  • Words in drawings are not counted.

Notes

Google Docs is rather unique in that it seems the most removed from any other software tested
i.e. Google Docs seems to do its own thing and how it handles compatibility with .DOCX is different from everything else.

Some of the ways it handles .DOCX differently are

  • Textboxes and WordArt are imported as a 'Drawing', a construct unique to Google Docs afaik.
  • Captions are not imported.
  • Tables nested in Textboxes are not imported and are missing.

Now the quirks in its wordcounts:

  • Similar to the Word suite, it doesn't count header/footers.
  • Uniquely, it doesn't count words in foot/Endnotes

And the kicker:
It counts links and Acroynms as separate words,
i.e. this link: https://www.youtube.com/watch?v=dQw4w9WgXcQ which counts as 1 word in Word counts as 7 words in Google Docs.
This is highly disruptive as this would massively mess with anyone writing documents in Word, who is also working with another person on Google Docs and is especially annoying as all the software I tested counts the words in Bibliographies, so any documents with links cited would suddenly show up with an inflated wordcount.

Apple Pages

Notable Behaviours

  • Bullets/Numbers, Headers/Footers are not counted
  • Equations are not counted
  • A lot of jank

Notes

In comparison to Google Docs, Apple Pages seems much tamer in how it imports elements from .DOCX documents.
The only issue I had with imports were tables nested in textboxes, which imported the table as a text representation.
This is rather minor, given that Pages gives an warning in advance.

What is rather egregious however, is how it inconsistently applies its counting.

  • For some reason, it somehow counted text in textboxes nested inside the header, but it didn't count any other words in the header.
  • The title in a table in a table of contents is counted, but the table directory isn't.
  • Links and Acroynms are counted separately (similar to Google Docs)
  • Uniquely, Emoji are counted as words, and not as a char as with the other software
    • What this means is that 🥡🦪 🐈‍⬛ is counted as 3 words with Apple pages and 2 words in other software.
  • However, Apple Pages doesn't count mathematical symbols at all, but it does count Greek letters as symbols

Conclusion

Now that this exercise is finished, I don't know what to feel about the results.
It is great that the differences in word count is finally quantifiable and tested, but ultimately it doesn't solve anything as everyone uses vastly different software.

The irony, is that this still won't be of any significant help for my assignment, as it uses Canvas's speedgrader, which I have been unable to prod a wordcount number out of.

The universal solution is to have wordcounts with a margin of ± some number x,
and to NEVER require a precise wordcount to be written inside the document.

Questions

If you have questions, feel free to send them to jchu634@keshuac.com

Appendices

AP 1.1: Footnote: .DOC

I had intended to write about .DOC as I had expected there to be a significant discrepancy between .DOC and .DOCX, but that has not materialised in my testing.
The only difference is in the conversion, equations are transformed into an image.
So all the wordcounts are offset by the 4 words in the equations.
The sole application with no change to word count (Libre Office) only has no difference because it never supported equations in word counts.

The only quirk I found with .DOC compatibility was with Google Docs,
Unlike with .DOCX, it managed to import the picture caption, but it imported it incorrectly as shown below
(And yes, it was distorted in the document too). Distorted Picture Caption

AP 1.2: Footnote: .ODT

I forgot this format existed until I wrote this blogpost,
I may get around to testing it in the future.

AP 1.3: Benchmark Files

All of files used in the testing are freely and publicly available at https://github.com/jchu634/WordCountTesting
Feedback is very much welcome!

AP 1.4: Results Table:

AP 1.4.1: .DOCX

Header/FooterTextbox in HeaderHeadingsTextListNumbered listTable Of ContentsTablesCitation MarkEquationsLinksSlashSeperatedWordsAcroynmsBibliographiesCaptionsFootnotesEndNotesWordArtTextboxCommentsPage NumbersSub/SuperScript
Word (Desktop)✔️✔️✔️✔️✔️✔️✔️One WordOne WordOne Word✔️✔️✔️✔️✔️✔️One Word
Word (Online)✔️✔️✔️✔️✔️One WordOne WordOne Word✔️✔️✔️One Word
Libre✔️✔️✔️✔️✔️✔️✔️✔️✔️One WordOne WordOne Word✔️✔️✔️✔️✔️✔️✔️One Word
Google Docs✔️✔️✔️ (Bullets Don't Count)✔️ (Bullets Don't Count)✔️✔️✔️SeparateSeparateSeparate✔️❌ (Not Imported)❌ (imported as drawing)❌ (imported as drawing)One Word
Apple Pages✔️✔️✔️✔️ (Bullets Don't Count)✔️ (Bullets Don't Count)✔️/❌ (Title counts, but not the table)✔️SeparateSeparateSeparate✔️✔️✔️✔️✔️✔️One Word

AP 1.4.2: .DOC

Header/FooterTextbox in HeaderHeadingsTextListNumbered listTable Of ContentsTablesCitation MarkEquationsLinksSlashSeperatedWordsAcroynmsBibliographiesCaptionsFootnotesEndNotesWordArtTextboxCommentsPage NumbersSubScript
Word (Desktop)✔️✔️✔️✔️✔️✔️N.A.One WordOne WordOne Word✔️✔️✔️✔️✔️✔️
Word (Online)N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.
Libre✔️✔️✔️✔️✔️✔️✔️✔️✔️N.A.One WordOne WordOne Word✔️✔️✔️✔️✔️✔️
Google Docs✔️✔️✔️ (Bullets Don't Count)✔️ (Bullets Don't Count)✔️✔️N.A.SeparateSeparateSeparate✔️❌ (Imported, But with a quirk)❌ (imported as drawing)❌ (imported as drawing)
Apple Pages✔️✔️✔️✔️ (Bullets Don't Count)✔️ (Bullets Don't Count)✔️/❌ (Title counts, but not the table)✔️SeparateSeparateSeparate✔️✔️✔️✔️✔️✔️One Word