I Before E, also ‘Their’

Every once in a while, I see the following claim made:

There are more exceptions to ‘I before E except after C’ than there are words that follow the rule, so it doesn’t count as a useful rule of English.

I first read this in a fact book in high school, and — holding a grudge from an especially low grade I received on a middle school spelling test about that very rule — I immediately believed the criticism.

But now that I’m a mature adult, I’ve gained some distance from that completely unfair and really dumb 6th grade spelling test (that really shouldn’t have counted ’cause I was sick the day before but whatever).  And I find myself wondering how ineffective this rule actually is.

[Note: There is a longer version of this rule, but it’s less often quoted, and so I’m going to ignore it]

This raises the question: how do we measure the accuracy of such a rule?

The common criticism (expressed in the quote) is that there are more words which break the rule than follow it. Let's name this metric $M_{stupid}$ for reasons that will soon become obvious.

$M_{stupid} \equiv (\text{Words Following Rule}) - (\text{Exceptions})$

Then if $M_{stupid}$ is negative, we can definitively say that this is a bad rule for English spelling.

After half a second’s thought, however, we can see why this metric is called $M_{stupid}$: not all words are used at equal frequencies. For all we know, most of the words which break this rule are obscure Latin names of chemicals that nobody ever really comes across, but are technically English. (This isn’t the case, but you get the point.)
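To make the metric concrete, here's a minimal sketch of $M_{stupid}$ computed over a raw, unweighted word list. The tiny word list in the usage note is purely illustrative, not real corpus data:

```python
import re

def instances(word):
    """Yield (preceded_by_c, pair) for each non-overlapping 'ie'/'ei' in word."""
    for m in re.finditer(r"ie|ei", word):
        yield (m.start() > 0 and word[m.start() - 1] == "c", m.group())

def follows_rule(word):
    """True if every 'ie'/'ei' instance obeys 'I before E except after C'."""
    return all((pair == "ei") == after_c for after_c, pair in instances(word))

def m_stupid(words):
    """Words that follow the rule minus words that break it."""
    relevant = [w for w in words if "ie" in w or "ei" in w]
    followers = sum(follows_rule(w) for w in relevant)
    return followers - (len(relevant) - followers)
```

For example, `m_stupid(["believe", "receive", "science", "weird", "their"])` gives `-1`: two followers, three exceptions.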

What we really care about is how accurate this rule is in practice. That is, each time we deploy the rule in written language, how often does it give us the correct answer?

Let's assume we know how to spell everything in English, but when we come across an “ei” or “ie” in a word, we have no specific information about the arrangement. We can only rely on the rule “I before E except after C”. How accurate is our writing?

Simple math gets us most of the way there:

$Accuracy = \frac{\text{Number of Correct Rule Applications}}{\text{Total Number of Rule Applications}}$

The total number of rule applications is simply the number of instances of either “ie” or “ei” when we write. We’ll denote this as follows:

$Total = (ie + ei)$

The total number of correct applications of the rule is the number of times where words in language follow the rule (obviously). See the image below for a highly technical representation of this formula.

We can also write this as follows:

$Correct = ((ie - cie) + cei)$

Putting this all together, we have

$Accuracy = \frac{(ie - cie) + cei}{ie + ei}$
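The formula above translates directly into code. In this sketch the arguments are the (frequency-weighted) substring counts, with the convention that every “cie” instance is also counted in “ie”, and every “cei” in “ei”:

```python
def accuracy(ie, ei, cie, cei):
    """Instance-level accuracy of 'I before E except after C'.

    correct = ie's not preceded by c, plus cei's; total = all ie/ei instances.
    Note: cie is a subset of ie, and cei a subset of ei.
    """
    correct = (ie - cie) + cei
    return correct / (ie + ei)
```

With made-up counts of `ie=8, ei=2, cie=1, cei=1`, this gives `(8 - 1 + 1) / 10 = 0.8`.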

At this point, we have a rule, but we still haven’t addressed what data set to apply this metric to. Ideally, we would be able to search through some corpus of correctly spelled words, weighted by their use frequency.

I’m not a computational linguist, so at this point only a few options occurred to me.

The first was to use Google NGram. This would have been great, but I’m looking for substrings within words, and NGram searches for prevalence of whole words.

There are lovely corpora of words with frequency data attached (http://norvig.com/ngrams/) which are completely free, but these contain misspellings, which renders them virtually useless by themselves.

Finally, there are the beautiful top tier word frequency data sets (http://www.corpusdata.org/) with perfectly parsed data from the entirety of Wikipedia. This, however, costs money, so I’m ruling it out.

[Note: I could have downloaded all of Wikipedia and turned it into a text file on my own, but this seemed hard, and I have better things to do with my life]

My solution was to combine the free options: use a dictionary package, remove the misspelled words from the free frequency data, and perform the calculation on that (here’s a link to the data for the curious: http://norvig.com/ngrams/count_1w.txt).
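The combined pipeline can be sketched as follows. Here `freqs` stands in for the word-frequency data (e.g. parsed from count_1w.txt) and `dictionary` for a spell-check word list; both are assumptions about the data format, and the toy values in the usage note are not real corpus numbers:

```python
import re

def weighted_counts(freqs, dictionary):
    """Frequency-weighted counts of 'ie', 'ei', 'cie', 'cei',
    skipping words absent from the dictionary (likely misspellings).

    freqs: dict mapping word -> corpus frequency
    dictionary: set of correctly spelled words
    """
    counts = {"ie": 0, "ei": 0, "cie": 0, "cei": 0}
    for word, n in freqs.items():
        if word not in dictionary:
            continue  # drop misspellings from the free frequency data
        for m in re.finditer(r"ie|ei", word):
            pair = m.group()
            counts[pair] += n
            if m.start() > 0 and word[m.start() - 1] == "c":
                counts["c" + pair] += n
    return counts
```

For instance, with `freqs = {"their": 100, "receive": 10, "science": 5, "teh": 50}` and a dictionary lacking “teh”, the result is `{"ie": 5, "ei": 110, "cie": 5, "cei": 10}`.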

Finally, after all this, I found an accuracy of…

I before E except after C: 76.4%

Now this doesn’t seem like an entirely terrible rule after all.

This is not to say that this is a particularly desirable score, but it’s markedly better than the “over half of cases wrong” accusation that is thrown around so often.

Something interesting (which has been noticed by others) is that the rule “I before E” with no reference to C at all has a higher accuracy:

I before E: 80.99%

Let's see if we can do better. Is there a letter that can take C’s place to improve accuracy?

C comes in near the bottom, in 21st place. That’s not good. You can literally choose a letter at random to insert into “I before E except after _”, and have a ~77% chance at a better spelling rule.
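Sweeping every letter through the rule template is a one-liner once the per-letter accuracy is defined. This is a sketch on toy frequencies (the real computation uses the frequency-weighted corpus counts described above):

```python
import re

def rule_accuracy(freqs, letter):
    """Frequency-weighted accuracy of 'I before E except after <letter>'."""
    correct = total = 0
    for word, n in freqs.items():
        for m in re.finditer(r"ie|ei", word):
            total += n
            prev = word[m.start() - 1] if m.start() > 0 else ""
            predicted = "ei" if prev == letter else "ie"
            correct += n * (m.group() == predicted)
    return correct / total

# Toy frequencies, purely illustrative:
freqs = {"their": 3, "friend": 2, "receive": 1}
ranking = sorted("abcdefghijklmnopqrstuvwxyz",
                 key=lambda c: rule_accuracy(freqs, c),
                 reverse=True)
```

Even on this tiny sample, H outranks C, driven (as on the real data) by the frequency of ‘their’.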

On the positive side, H seems to be the clear winner here, at 86.8% accurate. But a lot of this advantage is simply due to the existence of the word ‘their’. If I incorporate this fact, I can produce the most accurate variation I’ve seen yet.

Summing up, I would like to propose a permanent replacement for ‘I before E except after C’ (which yielded a measly 76.4% accuracy):

I before E, also ‘their’: 87.9%