Every once in a while, I see the following claim made:
There are more exceptions to ‘I before E except after C’ than there are words that follow the rule, so it doesn’t count as a useful rule of English.
I first read this in a fact book in high school, and — holding a grudge from an especially low grade I received on a middle school spelling test about that very rule — I immediately believed the criticism.
But now that I’m a mature adult, I’ve gained some distance from that completely unfair and really dumb 6th grade spelling test (that really shouldn’t have counted ’cause I was sick the day before but whatever). And I find myself wondering how ineffective this rule actually is.
[Note: There is a longer version of this rule, but it’s less often quoted, and so I’m going to ignore it]
This raises the question: how do we measure the accuracy of such a rule?
The common criticism (expressed in the quote) is that there are more words which break the rule than follow it. Lets name this metric for reasons which, if not already, will soon become obvious.
Then if negative, we can definitively say that this is a bad rule for English grammar.
After half a second’s thought, however, we can see why this metric is called : not all words are used at equal frequencies. For all we know, most of the words which break this rule are obscure Latin names of chemicals that nobody ever really comes across, but are technically English. (this isn’t the case, but you get the point)
What we really care about is how accurate this rule is in practice. That is, for each instance we come across to deploy the rule in written language, how often does it give you the correct answer.
Lets assume we know how to spell everything in English, but when we come across an “ei” or “ie” in a word, we have no specific information about the arrangement. We can only rely on the rule “I before E except after C”. How accurate is our writing?
Simple math gets us most of the way there
The total number of rule applications is simply the number of instances of either “ie” or “ei” when we write. We’ll denote this as follows:
The total number of correct applications of the rule is the number of times where words in language follow the rule (obviously). See the image below for a highly technical representation of this formula.
We can also write this as follows:
Putting this all together, we have
At this point, we have a rule, but we still haven’t addressed what data set to apply this metric to. Ideally, we would be able to search through some corpus of correctly spelled words, weighted by their use frequency.
I’m not a computational linguist, so this point only a few options occurred to me here.
The first was to use Google NGram. This would have been great, but I’m looking for substrings within words, and NGram searches for prevalence of whole words.
There are lovely corpora of words with frequency data attached (http://norvig.com/ngrams/) which are completely free, but these contain misspellings, which is renders it virtually useless by itself.
Finally, there are the beautiful top tier word frequency data sets (http://www.corpusdata.org/) with perfectly parsed data from the entirety of Wikipedia. This, however, costs money, so I’m ruling it out.
[Note: I could have downloaded all of Wikipedia and turned it into a text file on my own, but this seemed hard, and I have better things to do with my life]
My solution was to combine the free options: use a dictionary package, remove the misspelled words from the free frequency data, and perform the calculation on that (here’s a link to the data for the curious: http://norvig.com/ngrams/count_1w.txt).
Finally, after all this, I found an accuracy of ….
I before E except after C: 76.4%
Now this doesn’t seem like an entirely terrible rule after all.
This is not to say that this is a particularly desirable score, but it’s markedly better than the “over half of cases wrong” accusation that is thrown around so often.
Something interesting (which has been noticed by others) is that the rule “I before E” with no reference to C at all has a higher accuracy:
I before E: 80.99%
Lets see if we can do better. Is there a letter that can take C’s place to improve accuracy?
C comes in near the bottom, in 21st place. That’s not good. You can literally choose a letter at random to inset into “I before E except after _”, and have a ~77% chance at a better spelling rule.
On the positive side, H seems to be the clear winner here, at 86.8% accurate. But a lot of this advantage is simply due to the existence of the word ‘their’. If I incorporate this fact, I can produce the most accurate variation I’ve seen yet.
Summing up, I would like to propose a permanent replacement to ‘I before E except after C’ (which yielded a measly 76.4% accuracy) :
I before E, also ‘their’: 87.9%