Hi, I'm Joshua




Word Counts are a mess

First Published: 2024-05-20, Last Updated: 2024-11-10

Why are word counts different between different software?

Background

Word counts are a mess.
Once, for a Uni assignment, I had to submit an accurate word count along with the assignment. This proved impossible, as each piece of document software gave a different word count and there was no way I was going to manually count a ~600 word document.
Afterwards, I was curious as to why each software gave a different word count, so I decided to do some testing.

This blogpost is a collation of my findings.

Testing Methodology:

My testing was not quite scientific, but I tried to at least be consistent.
I standardised on the .DOCX and .DOC document formats, as they are quite popular due to Microsoft Office's dominance and are well supported by most software.
I then chose a selection of popular office document software (Word Desktop, Word Online, Libre Office, Google Docs, Apple Pages) and tested each one.

For each software, I first started with a sanity check: a simple DOCX document with a 100 word story.
This acted as a litmus test to check that there weren't any significant issues with the software.

Then came the benchmark: a DOCX document containing every feature I could think of, with a word count of 219 (counted manually).
This DOCX document was then converted into a DOC document (via Word Desktop) to see if there were any significant differences.

Limitations and Expectations:

This benchmark is not a real-world document, so any word count generated from it will be wrong.
This is by design, as I added as many tricks as I could to trip up the software and find edge cases.
The main point of this exercise is to use a stress test to find the differences in how each software measures word counts, not to find the actual word count of the document, as that is quite subjective.

That being said, let me list and explain certain unknowns
(things that arguably shouldn't count as a word, but where either behaviour is treated as correct, within reason, as long as the software is consistent).

They include:

  • Bullet points (& numbered lists): Should the bullet or the number count as a word? > Personally, I think they shouldn't.
  • Headers and footers: Should they be counted once or once for each page? > I think they technically should count for each page, but as a student, I would prefer if they didn't.
  • Do words/separated/by/slashes/count as separate words or one big word?
  • Do words.separated.by.dots.count as separate words or one big word? > These two inclusions may seem pedantic, but they are important as they determine whether URLs or acronyms count as a single word or multiple.

Results

Now here we get to the meat of the blogpost, the results of the testing.

.DOCX

  • Word (Desktop App)(Office 365 Enterprise): 170
  • Word (Website): 126
  • Libre: 187
  • Google Docs: 135
  • Apple Pages: 154

.DOC

  • Word (Desktop App)(Office 365 Enterprise): 166
  • Word (Website): Not Supported at all
  • Libre: 187
  • Google Docs: 131
  • Apple Pages: 154
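
To put the .DOCX numbers above in perspective, here is a small sketch that compares each count against the manual count of 219 and works out the spread between the software. The figures are copied from the lists above; the variable names are just for illustration.

```python
# Manual count of the benchmark document, and the .DOCX results listed above.
MANUAL_COUNT = 219
docx_counts = {
    "Word (Desktop)": 170,
    "Word (Online)": 126,
    "Libre": 187,
    "Google Docs": 135,
    "Apple Pages": 154,
}

# How far below the manual count each software lands.
shortfalls = {name: MANUAL_COUNT - count for name, count in docx_counts.items()}

# Gap between the most and least inclusive software.
spread = max(docx_counts.values()) - min(docx_counts.values())

for name, count in sorted(docx_counts.items(), key=lambda item: item[1]):
    print(f"{name}: {count} ({shortfalls[name]} below the manual count)")
print(f"Spread between software: {spread} words")
```

Even the most inclusive count (Libre, 187) is 32 words short of the manual count, and the gap between the highest and lowest software is 61 words.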

Observations

Now that we have the results, I have to say that I am quite surprised by them.
There is a massive range between the word counts (126-187) and, surprisingly, there is almost no difference between .DOC and .DOCX.
The only differences, in Google Docs and Word Desktop (4 words each), are due to the equations being lost from the .DOC during the conversion process.
Because of this, I will only be discussing the .DOCX results. Let's have some fun and look at the quirks of each software:

Word (Desktop + Online)

Discussion

A rather interesting quirk is that Word (Desktop) and Word (Online) have different word counts.
This quirk was the main motivation for this blogpost, as it caused much grief in my assignment. In my testing and research, I found that Word Online doesn't count words in text boxes, headers, footers and SmartArt.
I found this rather strange, as I felt the main point of Word Online was to compete with Google Docs on collaborative features. For students, who are a major target for the Office suite, this inconsistency is quite annoying: a group may be working on the same document in both Word Online and Word Desktop, and the word count would differ between the two.
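
One way to see how two programs can disagree on the same file: a .DOCX is a zip archive of XML parts, and by convention the body lives in word/document.xml while headers, footers and notes live in their own parts (word/header1.xml, word/footnotes.xml and so on). The sketch below is a rough illustration, not how Word actually counts: it uses only the Python standard library, counts whitespace-separated words per part, and runs on a made-up in-memory archive rather than one of the benchmark files. The function and helper names are my own.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts inside a .docx archive.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Parts that may hold countable text; which ones to include is the debatable bit.
PART = re.compile(r"word/(document|header\d*|footer\d*|footnotes|endnotes)\.xml")


def count_words_by_part(docx):
    """Return {part name: whitespace-separated word count} for a .docx file."""
    counts = {}
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            if not PART.fullmatch(name):
                continue
            root = ET.fromstring(archive.read(name))
            total = 0
            # Caveat: text boxes nest paragraphs inside paragraphs, so this
            # naive loop would double count their text.
            for para in root.iter(W + "p"):
                # Join runs first so a word split across runs is counted once.
                text = "".join(t.text or "" for t in para.iter(W + "t"))
                total += len(text.split())
            counts[name] = total
    return counts


def _part(tag, text):
    # Hypothetical helper: builds a tiny one-paragraph XML part for the demo.
    ns = W[1:-1]
    return (f'<w:{tag} xmlns:w="{ns}"><w:p><w:r><w:t>{text}</w:t></w:r></w:p>'
            f'</w:{tag}>')


# Demo on a made-up, in-memory archive (not one of the benchmark files).
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("word/document.xml", _part("document", "three body words"))
    archive.writestr("word/header1.xml", _part("hdr", "My Header"))

counts = count_words_by_part(demo)
body_only = counts["word/document.xml"]
everything = sum(counts.values())
print(body_only, everything)
```

A counter that only looks at the body reports 3 words here, while one that also includes the header reports 5, which is the same kind of disagreement seen between Word Online and the more inclusive software.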

As an aside, during testing I found a strange quirk in Word Online:
Inexplicably, Word Online doesn't count bullets from bullet points, but it does count numbers from numbered lists.
This is almost certainly a bug, but I have no idea whether the intended behaviour was to count the bullets too or to not count list markers at all.

Notable Behaviours

  1. Word (Desktop+Online) doesn't count words from headers and footers
  2. Word (Desktop+Online) has counted bullets as separate words since Word 2003 (Source)
  3. Word Online does not count words in equations, captions, WordArt, and textboxes

Libre Office

Discussion

Libre Office is rather impressive, as it caught almost every trick I threw at it. (Even the textbox in the header!)
I could only find 3 flaws in its wordcount, and whether some of them are flaws at all is subjective.

  1. Words in equations were not counted.
  2. Citation marks were counted.
  3. The page number is counted.

The first flaw I feel is valid: even if equations themselves are not counted, words in equations should be, or else I could just write an entire essay in an equation and bypass any wordcount requirements.

The second flaw, although more subjective, I feel is definitely incorrect, as citation marks should be considered punctuation and should not be counted.
At the very least, this behaviour is unique among all the software tested.
(Citation marks are the numbers next to text which indicate citations, e.g. study¹)

Finally, I can't quite articulate an explanation for the third flaw, but it is "just wrong".

Note: Although Libre Office has the most inclusive wordcount algorithm, I have to say it was the most annoying for me to use and test, as it does not update the word count after undoing anything.

Notable Behaviours

  1. Almost all inclusive > Almost everything that I added to the document that should count towards a word count is counted
  2. Equations are not counted
  3. Sometimes has unique ideas on what should be counted as a word

Google Docs

Discussion

Before I get to wordcounts, I have to note that Google Docs has its own unique approach to .DOCX compatibility.
Google Docs, I assume, was never meant to compete directly with Word, but rather to offer collaboration as its main selling point.
Because of that, it often has its own implementations of many DOCX features.

Some of the ways it handles .DOCX are:

  • Textboxes and WordArt are imported as a 'Drawing', a construct unique to Google Docs afaik.
  • Captions are not imported.
  • Tables nested in Textboxes are not imported and are missing.

Now the quirks in its wordcounts:

  • Similar to the Word suite, it doesn't count headers/footers.
  • Uniquely, it doesn't count words in footnotes/endnotes.

But here's the real kicker:
It counts the parts of links and acronyms as separate words,
i.e. this link: https://www.youtube.com/watch?v=dQw4w9WgXcQ which counts as 1 word in Word counts as 7 words in Google Docs.
This is especially egregious as it inflates the word count of any document with links, including links in a bibliography or references section.
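
To illustrate the difference, here is a minimal sketch of two counting rules. I don't know Google Docs' actual algorithm; splitting on every non-alphanumeric character is simply an assumed rule that happens to reproduce the 7 for this link, while splitting on whitespace gives the 1 that Word reports. Both function names are made up for this example.

```python
import re


def count_whitespace_words(text):
    """Count words as runs of non-whitespace characters."""
    return len(text.split())


def count_alphanumeric_runs(text):
    """Count words as runs of ASCII letters and digits (assumed rule)."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


link = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
slashed = "words/separated/by/slashes/count"

print(count_whitespace_words(link), count_alphanumeric_runs(link))        # 1 7
print(count_whitespace_words(slashed), count_alphanumeric_runs(slashed))  # 1 5
```

Under the second rule the link breaks into https, www, youtube, com, watch, v and dQw4w9WgXcQ, so a references section full of URLs adds several "words" per link.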

Notable Behaviours

  • Bullets/Numbers, Headers/Footers, Foot/Endnotes are not counted
  • Slash and Dot separated words are counted separately
  • Words in drawings are not counted.
  • .DOC compatibility:
    • Google Docs was the only software with a .DOC compatibility quirk: it managed to import the picture caption, but the caption was imported distorted and incorrectly, as shown below
      (Image: Distorted Picture Caption)

Apple Pages

Discussion

In comparison to Google Docs, Apple Pages seems much tamer in how it imports elements from .DOCX documents.
The only issue I had with imports was tables nested in textboxes, which were imported as a text representation of the table.
This is rather minor, as Pages gives a warning in advance.

That's where the good news ends, as Apple Pages is inconsistent in its word count.

  • It somehow counted text in textboxes nested inside the header, but it didn't count any other words in the header.
  • In a table of contents, only the title is counted; the table directory isn't.
  • Links and acronyms are counted separately (same as Google Docs)
  • Apple Pages doesn't count mathematical symbols at all, but it does count Greek letters as symbols

Notably, Apple Pages is the only software which counts each emoji as a word, rather than as a character.

  • What this means is that 🥡🦪, 🐈‍⬛ is counted as 3 words with Apple pages and 2 words in other software.
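
The two counts can be reproduced with a rough sketch. This is not how Apple Pages is implemented; it assumes an "emoji as word" rule where each emoji, including sequences glued together with a zero-width joiner, counts once. The code point ranges are a crude approximation of what counts as an emoji and would miscount things like skin-tone modifiers.

```python
import re

ZWJ = "\u200d"  # zero-width joiner, glues emoji such as the black cat together

# Rough approximation of emoji code points; not the full Unicode definition.
EMOJI = "[\U0001F300-\U0001FAFF\u2600-\u27BF\u2B00-\u2BFF]"
EMOJI_SEQUENCE = re.compile(f"{EMOJI}(?:{ZWJ}{EMOJI})*")

# takeout box, oyster, comma, space, black cat (cat + ZWJ + black square)
sample = "\U0001F961\U0001F9AA, \U0001F408\u200d\u2B1B"


def count_whitespace_words(text):
    """Count runs of non-whitespace characters, as most software appears to."""
    return len(text.split())


def count_emoji_as_words(text):
    """Count each emoji (or ZWJ-joined emoji sequence) as its own word."""
    return len(EMOJI_SEQUENCE.findall(text))


print(count_whitespace_words(sample))  # 2
print(count_emoji_as_words(sample))    # 3
```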

Notable Behaviours

  • Bullets/Numbers, Headers/Footers are not counted
  • Equations are not counted
  • Links and acronyms are counted separately
  • Emoji are counted as words
  • Strange Table of Contents behaviour

Conclusion

Now that this exercise is finished, I don't quite know what to feel about the results.
I don't know which software is the most correct, as ultimately many of the decisions on what is or isn't a word are debatable.

Ironically, I discovered that this still hasn't fixed the issue with my original Uni assignment, as they use Canvas's SpeedGrader, which does not expose its wordcount to students.

Ultimately, the best solution is to build a margin of error into wordcount requirements
and to NEVER require a precise wordcount to be written inside the document.

Questions

If you have questions, feel free to send them to jchu634@keshuac.com

Appendices

AP 1.1: Benchmark Files

All of the files used in the testing are freely and publicly available at https://github.com/jchu634/WordCountTesting
Feedback is very much welcome!

AP 1.2: Results Table:

AP 1.2.1: .DOCX

Columns: Header/Footer, Textbox in Header, Headings, Text, List, Numbered list, Table Of Contents, Tables, Citation Mark, Equations, Links, Slash/Separated/Words, Acronyms, Bibliographies, Captions, Footnotes, EndNotes, WordArt, Textbox, Comments, Page Numbers, Sub/SuperScript
Word (Desktop)✔️✔️✔️✔️✔️✔️✔️One WordOne WordOne Word✔️✔️✔️✔️✔️✔️One Word
Word (Online)✔️✔️✔️✔️✔️One WordOne WordOne Word✔️✔️✔️One Word
Libre✔️✔️✔️✔️✔️✔️✔️✔️✔️One WordOne WordOne Word✔️✔️✔️✔️✔️✔️✔️One Word
Google Docs✔️✔️⭕ (No Bullets)⭕ (No Bullets)✔️✔️✔️SeparateSeparateSeparate✔️❌ (Not Imported)❌ (imported as drawing)❌ (imported as drawing)One Word
Apple Pages✔️✔️✔️⭕ (No Bullets)⭕ (No Bullets)✔️/❌ (Only Title counts)✔️SeparateSeparateSeparate✔️✔️✔️✔️✔️✔️One Word

AP 1.2.2: .DOC

Columns: Header/Footer, Textbox in Header, Headings, Text, List, Numbered list, Table Of Contents, Tables, Citation Mark, Equations, Links, Slash/Separated/Words, Acronyms, Bibliographies, Captions, Footnotes, EndNotes, WordArt, Textbox, Comments, Page Numbers, SubScript
Word (Desktop)✔️✔️✔️✔️✔️✔️N.A.One WordOne WordOne Word✔️✔️✔️✔️✔️✔️
Word (Online)N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.N.A.
Libre✔️✔️✔️✔️✔️✔️✔️✔️✔️N.A.One WordOne WordOne Word✔️✔️✔️✔️✔️✔️
Google Docs✔️✔️⭕ (No Bullets)✔️ (Bullets Don't Count)✔️✔️N.A.SeparateSeparateSeparate✔️❌ (Imported, But with a quirk)❌ (imported as drawing)❌ (imported as drawing)
Apple Pages✔️✔️✔️⭕ (No Bullets)⭕ (No Bullets)✔️/❌ (Only Title counts)✔️SeparateSeparateSeparate✔️✔️✔️✔️✔️✔️One Word