Ho-Chunk spell checker

A spell-checker for a small corpus

Ho-Chunk (Hoocąk) is a Siouan language without a traditional writing system, although it has been written in a variety of ways by people in and outside the tribe. Such writing systems have included various phonetic approximations in early settler wordlists, the adoption of the Great Lakes Algonquin syllabics by Ho-Chunks from the Fox [1], and more recently a version of the International Phonetic Alphabet. Because so many systems have been employed over time and because the language was traditionally only spoken, current language learners are often encouraged to either just spell words as they hear them or ignore diacritical marks. Of course, this is exacerbated by larger structural issues like the lack of a standard Ho-Chunk language keyboard for Android or iOS.

It seems like the development of a spell-checker could serve to standardize newly written documents and possibly older historical texts. To start, it’s worth reviewing other spell-checker solutions.

Norvig Solution

A widely known solution was given by Peter Norvig in 2007 [2]; some of the process is repeated here. Essentially, we seek the most likely candidate \(c\) for a given (possibly misspelled) word \(w\): $$ \argmax_{c} P(c|w) $$

By Bayes' theorem: $$ P(c|w) = P(w|c)P(c)/P(w) $$ And because \(P(w)\) is the same for every candidate, we can instead find $$ \argmax_{c} P(w|c)P(c) $$

We proceed by making an estimate of \(P(c)\) by looking at a large corpus of text. We create a look-up table mapping every word that appears in the text to the number of times it appears, its count. These words will serve as our candidates. Because all of our candidates are drawn from the same text, a candidate's probability is simply proportional to its count.
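As a rough sketch of this counting step (assuming the corpus has been gathered into a single UTF-8 text file, here called corpus.txt purely for illustration), a Counter does most of the work:

```python
import re
from collections import Counter

def words(text):
    # Lower-case and split into runs of word characters; in Python 3,
    # \w matches Unicode letters, so characters like ą and ų are kept.
    return re.findall(r'\w+', text.lower())

# Build the dictionary of word counts from the corpus.
WORDS = Counter(words(open('corpus.txt', encoding='utf-8').read()))

def P(word, N=None):
    # Probability of a word, proportional to its count in the corpus.
    N = N or sum(WORDS.values())
    return WORDS[word] / N
```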

In practice it would be impractical to consider how each word in our text could be misspelled to give the word we are considering, so instead we restrict ourselves to candidate words that are close to the word under consideration. Here closeness is the number of edits needed to reach the candidate (the Damerau-Levenshtein distance [3]). If \(w\) is already in our dictionary, it needs \(0\) edits. Norvig defines a single edit as any one of the following: the deletion of one letter, the insertion of one letter, the transposition of two adjacent letters, or the replacement of one letter with any letter from the alphabet. If any single edit of \(w\) leads to a known word in our dictionary, we include that word as a candidate. We can find \(2\)-edit candidates by applying all possible edits to every single edit of \(w\) and keeping the results that are in our dictionary, and we repeat in the same manner for larger numbers of edits.
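A sketch of the single-edit generation, closely following the structure of Norvig's edits1 function; the ALPHABET string below is only a placeholder and would need to be replaced with the actual Ho-Chunk letter inventory:

```python
# Placeholder alphabet for illustration only; a real implementation would
# list the full Ho-Chunk inventory, including nasal vowels and other letters.
ALPHABET = 'aąceǧhiįjkmnprsštuųwxzž'

def edits1(word):
    # All strings exactly one edit away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # All strings two edits away: apply edits1 to every single-edit result.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
```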

Finally, we must determine the probability of typing \(w\) given a candidate. Here Norvig uses a greatly simplified model: we assume that any \(0\)-edit candidate is infinitely more probable than any \(1\)-edit candidate, any \(1\)-edit candidate is infinitely more probable than any \(2\)-edit candidate, and so on.

So in practice, if \(w\) is in our dictionary, we treat it as the correct candidate. If not, we consider the viable \(1\)-edit candidate with the highest count. If there is no viable \(1\)-edit candidate, we consider viable \(2\)-edit candidates, choosing the one with the highest count. Because larger numbers of edits require more computation, we stop the process at \(2\) and treat the word as correct if no viable \(2\)-edit candidates are available.
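Putting the pieces together, the selection step might look like the following sketch, reusing WORDS, P, edits1, and edits2 from above:

```python
def known(words):
    # Keep only the candidates that actually appear in the corpus.
    return {w for w in words if w in WORDS}

def candidates(word):
    # Prefer 0-edit, then 1-edit, then 2-edit candidates; otherwise give up
    # and return the word itself as if it were correct.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # Among the surviving candidates, pick the one with the highest count.
    return max(candidates(word), key=P)
```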

The Python code for this method is available as a Jupyter notebook [4].

Symspell Solution

A separate algorithm by Wolf Garbe improves on the speed of the Norvig method by roughly 1000\(\times\) [5]. The process is streamlined by considering only deletions as edits and by creating a look-up table mapping every single-deletion edit of each dictionary word to the dictionary words that could give rise to it. For instance, in Ho-Chunk the word tee (lake) gives rise to the single-deletion strings te and ee. Under the entry ee in our deletion dictionary we keep the other words that would give rise to it by a single deletion, so that we'd have {ee : ree, pee, cee, nee, hee, xee}.
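A minimal sketch of building that deletion look-up table (the helper names here are my own, and the hashing and prefix-pruning optimizations of the real SymSpell implementation are left out):

```python
from collections import defaultdict

def single_deletes(word):
    # All strings obtained by deleting exactly one letter from `word`.
    return {word[:i] + word[i+1:] for i in range(len(word))}

def build_delete_index(dictionary_words):
    # Map each single-deletion string back to the dictionary words that
    # could have produced it, e.g. 'ee' -> {'tee', 'ree', 'pee', ...}.
    index = defaultdict(set)
    for w in dictionary_words:
        for d in single_deletes(w):
            index[d].add(w)
    return index

index = build_delete_index(['tee', 'ree', 'pee', 'cee', 'nee', 'hee', 'xee'])
print(index['ee'])   # {'tee', 'ree', 'pee', 'cee', 'nee', 'hee', 'xee'}
```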

Our candidates \(c\) for a given word \(w\) are then (a code sketch follows the table):

| Comparison | Edits |
|---|---|
| \(c\) == \(w\) | no edits |
| \(c\) == deletion(\(w\)) | deletion of a letter |
| deletion(\(c\)) == \(w\) | addition of a letter |
| deletion(\(c\)) == deletion(\(w\)) | replacement/transposition |
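As a simplified illustration (not the full SymSpell algorithm, which also tracks edit distances and dictionary prefixes), those four comparisons can be collapsed into one look-up against the deletion index sketched above, assuming dictionary is a set of known words:

```python
def symspell_candidates(word, dictionary, delete_index):
    # `dictionary` is a set of known words; `delete_index` maps each
    # single-deletion string to the dictionary words that produce it.
    candidates = set()
    if word in dictionary:                              # c == w
        candidates.add(word)
    candidates |= single_deletes(word) & dictionary     # c == deletion(w)
    candidates |= delete_index.get(word, set())         # deletion(c) == w
    for d in single_deletes(word):                      # deletion(c) == deletion(w)
        candidates |= delete_index.get(d, set())
    return candidates
```

The resulting candidates can then be ranked by corpus count, exactly as in the Norvig method.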

Application

A problem is readily apparent: Ho-Chunk does not have a large corpus of digitized text from which to estimate word frequencies. For a text corpus, so far I have included the weather reports given by Hųųwąxetega (~8000 words), my own stories found on this website (~5000 words), and the example sentences from the Hoocąk lexicon (~24000 words). This is about 30\(\times\) fewer words than Norvig used, and the corpus is a bit biased towards the weather. Ideally this initial corpus could be used to bootstrap other known texts written in different spelling conventions.

There are some historical sources that would constitute a fairly large textual corpus: The Four Gospels, Acts, Genesis, and Exodus translated into the Winnebago Language and the stories transcribed by Radin in The Winnebago Tribe. The former is a phonetic transcription and the latter vary depending on the transcriber, although many have been rewritten and put online in something close to the IPA system.

Ideas

A couple of ideas for expanding the corpus and dealing with more common spelling errors include:

A work in progress

This is currently still a work in progress and can be found on GitHub. The bulk of the source code comes from [4], but I have added some methods as suggested by the Symspell algorithm. In my implementation, the Symspell method appears to be about 10-100\(\times\) faster and about as accurate (~95%) on single-edit words generated from known words; on two-edit words the accuracy drops to ~85%.

Data

Obviously, data is the key component of this method. If you would like to request a copy of the data, feel free to contact me.



  1. Great Lakes Algonquin Syllabics
  2. How to Write a Spelling Corrector, Peter Norvig (2007).
  3. Damerau-Levenshtein distance
  4. Norvig Jupyter Notebook
  5. 1000x Faster Spelling Correction Algorithm, Wolf Garbe (2012).