Word Counts are a mess
First Published: 2024-05-20, Last Updated: 2024-05-20
Why are word counts different between different software
Recently I had to submit a 'correct' word count along with a university assignment. This proved very difficult, as every piece of software I used gave a different word count, and there was no way I was going to try to manually count a roughly 600-word document.
So I decided to run a test and check why each piece of software reports a different word count.
I started with a warm-up: a simple DOCX document containing a 100-word story.
This acted as a litmus test to check that there weren't any significant issues with the software.
Then came the benchmark: a DOCX document containing every feature I could think of, with a manually counted word count of 219.
This DOCX document was then converted into a DOC document (via Word Desktop) to see if there were any significant differences.
Any word count generated from the benchmark will be wrong.
This is by design as the document contains many tricks to try and trip up software.
The main point is to find the differences in how each software measures word counts through a stress test.
That being said, let me list and explain a few unknowns
(things that arguably shouldn't count as a word, but which I treat as correct within reason as long as the software is consistent).
They include
What were the actual results:
Let's talk about these figures:
Right off the bat, there was a significant discrepancy between Word (Desktop) and Word (Online).
I had expected it already, as this was one of the problems for my uni assignment.
A quick Google search corroborates my testing: Word Online does not count words in text boxes, headers, footers and SmartArt (Source).
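To see where the words in a DOCX actually live, here is a minimal sketch that counts whitespace-delimited words separately for each XML part inside the file (body, headers, footers, footnotes, endnotes, comments). This is purely my own illustration and not how Word, or any other software tested here, counts words. The function names `words_per_part` and `_collect_text` are made up for this example, and it assumes a standard DOCX layout where text sits in `w:t` elements.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Standard WordprocessingML and markup-compatibility namespaces.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
MC = "{http://schemas.openxmlformats.org/markup-compatibility/2006}"

# Parts of the package that can hold countable text.
PART_RE = re.compile(
    r"word/(document|header\d*|footer\d*|footnotes|endnotes|comments)\.xml"
)


def _collect_text(elem, out):
    """Append the text under elem to out, adding spaces at paragraph edges."""
    if elem.tag == MC + "Fallback":
        return  # legacy copy of text-box content; skip so it isn't counted twice
    if elem.tag == W + "p":
        out.append(" ")
    elif elem.tag == W + "t":
        out.append(elem.text or "")
    elif elem.tag in (W + "tab", W + "br"):
        out.append(" ")
    for child in elem:
        _collect_text(child, out)
    if elem.tag == W + "p":
        out.append(" ")


def words_per_part(docx):
    """Return {part name: word count} for a DOCX path or file-like object."""
    counts = {}
    with zipfile.ZipFile(docx) as zf:
        for name in zf.namelist():
            if PART_RE.fullmatch(name):
                out = []
                _collect_text(ET.fromstring(zf.read(name)), out)
                counts[name] = len("".join(out).split())
    return counts
```

Calling `words_per_part("benchmark.docx")` (a hypothetical file name) returns one count per part, so a header-only discrepancy shows up as a non-zero count under a `word/header*.xml` key. Text boxes live inside `word/document.xml`, so they are not separated out by this sketch.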
During the testing, I found a strange quirk in Word Online:
Inexplicably, Word Online doesn't count the bullets from bullet points, but it does count the numbers from numbered lists.
This is quite likely a bug, but it is a peculiar one: given this behaviour, I am not sure whether the algorithm was intended to count list markers or not.
From my testing, LibreOffice counts almost everything in the document, which is quite impressive. It even counted the textbox in the header. I only found 3 flaws in its counting.
To clarify what I mean by citation marks: they are the small numbers next to a piece of text (e.g. study¹).
This behaviour, along with counting page numbers, is unique among all the software tested and, in my personal opinion, is incorrect, as I consider those punctuation.
Finally, although LibreOffice counts the most out of all the software, it has a rather annoying bug where the word count is not refreshed when undoing an operation; the count after an undo is only updated once something else in the document changes. This was a rather major annoyance when testing LibreOffice.
Google Docs is rather unique in that it seems the most removed from any other software tested,
i.e. Google Docs seems to do its own thing, and how it handles compatibility with .DOCX is different from everything else.
Some of the ways it handles .DOCX differently are
Now the quirks in its wordcounts:
And the kicker:
It counts links and acronyms as separate words,
e.g. this link: https://www.youtube.com/watch?v=dQw4w9WgXcQ, which counts as 1 word in Word, counts as 7 words in Google Docs.
This is highly disruptive, as it massively messes with anyone writing a document in Word while collaborating with someone on Google Docs. It is especially annoying because all the software I tested counts the words in bibliographies, so any document that cites links will suddenly show an inflated word count.
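To make the 1 versus 7 difference concrete, here is a small sketch comparing two tokenisation rules. I don't know what algorithm Google Docs or Word actually uses; splitting on runs of letters and digits is just one rule that happens to reproduce the 7 for this link, and the function names are my own.

```python
import re


def count_whitespace(text):
    """Count words as chunks separated by whitespace."""
    return len(text.split())


def count_alnum_runs(text):
    """Count words as runs of ASCII letters/digits, so punctuation splits words."""
    return len(re.findall(r"[A-Za-z0-9]+", text))


url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
print(count_whitespace(url))   # 1
print(count_alnum_runs(url))   # 7

# The same rule also splits slash-separated words and dotted acronyms.
print(count_whitespace("and/or"), count_alnum_runs("and/or"))  # 1 2
print(count_whitespace("U.S.A."), count_alnum_runs("U.S.A."))  # 1 3
```

Under the second rule, every cited link in a bibliography adds several extra words, which is the kind of inflation described above.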
In comparison to Google Docs, Apple Pages seems much tamer in how it imports elements from .DOCX documents.
The only issue I had with imports was with tables nested in textboxes, which were imported as a text representation of the table.
This is rather minor, given that Pages gives a warning in advance.
What is rather egregious, however, is how inconsistently it applies its counting.
Now that this exercise is finished, I don't know what to feel about the results.
It is great that the differences in word counts are finally quantified and tested,
but ultimately it doesn't solve anything, as everyone uses vastly different software.
The irony is that this still won't be of any significant help for my assignment, as it is marked in Canvas's SpeedGrader, which I have been unable to prod a word count out of.
The universal solution is to have wordcounts with a margin of ± some number x,
and to NEVER require a precise wordcount to be written inside the document.
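As a sketch of what a ± x margin looks like in practice, the check below accepts any reported count within the margin of the target. The function name and the tool counts are hypothetical and for illustration only; they are not results from my testing.

```python
def check_counts(counts, target, margin):
    """Return {tool: bool}, True when the tool's count lies within target ± margin."""
    return {tool: abs(n - target) <= margin for tool, n in counts.items()}


# Hypothetical numbers for illustration only, not measured results.
example = {"Tool A": 588, "Tool B": 612, "Tool C": 671}
print(check_counts(example, target=600, margin=60))
# {'Tool A': True, 'Tool B': True, 'Tool C': False}
```

With a margin like this, it no longer matters which software the student or the marker happens to use, as long as the disagreement between tools is smaller than x.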
If you have questions, feel free to send them to jchu634@keshuac.com
I had intended to write about .DOC as I had expected there to be a significant discrepancy between .DOC and .DOCX, but that has not materialised in my testing.
The only difference comes from the conversion: equations are transformed into images,
so all the word counts are offset by the 4 words in the equations.
LibreOffice is the sole application whose word count did not change, and only because it never counted equations in the first place.
The only quirk I found with .DOC compatibility was with Google Docs.
Unlike with .DOCX, it managed to import the picture caption, but it imported it incorrectly, as shown below
(And yes, it was distorted in the document too).
I forgot this format existed until I wrote this blog post.
I may get around to testing it in the future.
All of the files used in the testing are freely and publicly available at https://github.com/jchu634/WordCountTesting
Feedback is very much welcome!
Software | Header/Footer | Textbox in Header | Headings | Text | List | Numbered list | Table Of Contents | Tables | Citation Mark | Equations | Links | Slash-Separated Words | Acronyms | Bibliographies | Captions | Footnotes | EndNotes | WordArt | Textbox | Comments | Page Numbers | Sub/SuperScript |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Word (Desktop) | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | One Word | One Word | One Word | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | One Word |
Word (Online) | ❌ | ❌ | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | One Word | One Word | One Word | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | One Word |
Libre | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | One Word | One Word | One Word | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | One Word |
Google Docs | ❌ | ❌ | ✔️ | ✔️ | ✔️ (Bullets Don't Count) | ✔️ (Bullets Don't Count) | ✔️ | ✔️ | ❌ | ✔️ | Separate | Separate | Separate | ✔️ | ❌ (Not Imported) | ❌ | ❌ | ❌ (imported as drawing) | ❌ (imported as drawing) | ❌ | ❌ | One Word |
Apple Pages | ❌ | ✔️ | ✔️ | ✔️ | ✔️ (Bullets Don't Count) | ✔️ (Bullets Don't Count) | ✔️/❌ (Title counts, but not the table) | ✔️ | ❌ | ❌ | Separate | Separate | Separate | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | One Word |
Software | Header/Footer | Textbox in Header | Headings | Text | List | Numbered list | Table Of Contents | Tables | Citation Mark | Equations | Links | Slash-Separated Words | Acronyms | Bibliographies | Captions | Footnotes | EndNotes | WordArt | Textbox | Comments | Page Numbers | SubScript |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Word (Desktop) | ❌ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | N.A. | One Word | One Word | One Word | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ❌ |
Word (Online) | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. | N.A. |
Libre | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | N.A. | One Word | One Word | One Word | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❌ | |
Google Docs | ❌ | ❌ | ✔️ | ✔️ | ✔️ (Bullets Don't Count) | ✔️ (Bullets Don't Count) | ✔️ | ✔️ | ❌ | N.A. | Separate | Separate | Separate | ✔️ | ❌ (Imported, But with a quirk) | ❌ | ❌ | ❌ (imported as drawing) | ❌ (imported as drawing) | ❌ | ❌ | ❌ |
Apple Pages | ❌ | ✔️ | ✔️ | ✔️ | ✔️ (Bullets Don't Count) | ✔️ (Bullets Don't Count) | ✔️/❌ (Title counts, but not the table) | ✔️ | ❌ | ❌ | Separate | Separate | Separate | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | One Word |