Ho-Chunk spell checker

A spell-checker for a small corpus

Ho-Chunk (Hoocąk) is a Siouan language without a traditional writing system, although it has been written in a variety of ways by people in and outside the tribe. Such writing systems have included various phonetic approximations in early settler wordlists, the adoption of the Great Lakes Algonquin syllabics by Ho-Chunks from the Fox [1], and more recently a version of the International Phonetic Alphabet. Because so many systems have been employed over time and because the language was traditionally only spoken, current language learners are often encouraged to either just spell words as they hear them or ignore diacritical marks. Of course, this is exacerbated by larger structural issues like the lack of a standard Ho-Chunk language keyboard for Android or iOS.

It seems like the development of a spell-checker could serve to standardize newly written documents and possibly older historical texts. To start, it’s worth reviewing other spell-checker solutions.

Norvig Solution

A widely known solution was given by Peter Norvig in 2007 [2]; some of the process is repeated here. Essentially, we seek the most likely candidate \(c\) for a given (possibly misspelled) word \(w\): $$ \argmax_{c} P(c|w) $$

By Bayes' theorem: $$ P(c|w) = P(w|c)P(c)/P(w) $$ And because \(P(w)\) is the same for every candidate, we can instead find $$ \argmax_{c} P(w|c)P(c) $$

We proceed by making an estimate of \(P(c)\) by looking at a large corpus of text. We create a look-up table mapping every word that appears in the text to the number of times it appears, its count. These words will serve as our candidates. Because all of our candidates are drawn from the same text, a candidate's probability is simply proportional to its count.
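As a rough sketch of this counting step (assuming the corpus has been gathered into a single UTF-8 text file, here called corpus.txt purely for illustration), a Counter does most of the work:

```python
import re
from collections import Counter

def words(text):
    # Lower-case and split into runs of word characters; in Python 3,
    # \w matches Unicode letters, so characters like ą and ų are kept.
    return re.findall(r'\w+', text.lower())

# Build the dictionary of word counts from the corpus.
WORDS = Counter(words(open('corpus.txt', encoding='utf-8').read()))

def P(word, N=None):
    # Probability of a word, proportional to its count in the corpus.
    N = N or sum(WORDS.values())
    return WORDS[word] / N
```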

In practice it would be impractical to consider how each word in our text could be misspelled to give the word we are considering, so instead we restrict ourselves to candidate words that are close to the word under consideration. Here closeness is the number of edits needed to reach the candidate (the Damerau-Levenshtein distance [3]). If \(w\) is already in our dictionary, it needs \(0\) edits. Norvig defines a single edit as any one of the following: the deletion of one letter, the insertion of one letter, the transposition of two adjacent letters, or the replacement of one letter with any letter from the alphabet. If any single edit of \(w\) leads to a known word in our dictionary, we include that word as a candidate. We can find \(2\)-edit candidates by applying all possible edits to every single edit of \(w\) and keeping the results that are in our dictionary, and we repeat in the same manner for larger numbers of edits.
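A sketch of the single-edit generation, closely following the structure of Norvig's edits1 function; the ALPHABET string below is only a placeholder and would need to be replaced with the actual Ho-Chunk letter inventory:

```python
# Placeholder alphabet for illustration only; a real implementation would
# list the full Ho-Chunk inventory, including nasal vowels and other letters.
ALPHABET = 'aąceǧhiįjkmnprsštuųwxzž'

def edits1(word):
    # All strings exactly one edit away from `word`.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # All strings two edits away: apply edits1 to every single-edit result.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
```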

Finally, we must determine the probability of typing \(w\) given a candidate. Here Norvig uses a greatly simplified model: we assume that any \(0\)-edit candidate is infinitely more probable than any \(1\)-edit candidate, any \(1\)-edit candidate is infinitely more probable than any \(2\)-edit candidate, and so on.

So in practice, if \(w\) is in our dictionary, we treat it as the correct candidate. If not, we consider the viable \(1\)-edit candidate with the highest count. If there is no viable \(1\)-edit candidate, we consider viable \(2\)-edit candidates, choosing the one with the highest count. Because larger numbers of edits require more computation, we stop the process at \(2\) and treat the word as correct if no viable \(2\)-edit candidates are available.
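Putting the pieces together, the selection step might look like the following sketch, reusing WORDS, P, edits1, and edits2 from above:

```python
def known(words):
    # Keep only the candidates that actually appear in the corpus.
    return {w for w in words if w in WORDS}

def candidates(word):
    # Prefer 0-edit, then 1-edit, then 2-edit candidates; otherwise give up
    # and return the word itself as if it were correct.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # Among the surviving candidates, pick the one with the highest count.
    return max(candidates(word), key=P)
```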

The Python code for this method is available as a Jupyter notebook [4].

Symspell Solution

A separate algorithm by Wolf Garbe improves on the speed of the Norvig method by roughly 1000\(\times\) [5]. The process is streamlined by considering only deletions as edits and by creating a look-up table mapping every single-deletion edit of each dictionary word to the dictionary words that could give rise to it. For instance, in Ho-Chunk the word tee (lake) gives rise to the single-deletion strings te and ee. Under the entry ee in our deletion dictionary we keep the other words that would give rise to it by a single deletion, so that we'd have {ee : ree, pee, cee, nee, hee, xee}.
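A minimal sketch of building that deletion look-up table (the helper names here are my own, and the hashing and prefix-pruning optimizations of the real SymSpell implementation are left out):

```python
from collections import defaultdict

def single_deletes(word):
    # All strings obtained by deleting exactly one letter from `word`.
    return {word[:i] + word[i+1:] for i in range(len(word))}

def build_delete_index(dictionary_words):
    # Map each single-deletion string back to the dictionary words that
    # could have produced it, e.g. 'ee' -> {'tee', 'ree', 'pee', ...}.
    index = defaultdict(set)
    for w in dictionary_words:
        for d in single_deletes(w):
            index[d].add(w)
    return index

index = build_delete_index(['tee', 'ree', 'pee', 'cee', 'nee', 'hee', 'xee'])
print(index['ee'])   # {'tee', 'ree', 'pee', 'cee', 'nee', 'hee', 'xee'}
```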

Our candidates \(c\) for a given word \(w\) are then (a code sketch follows the table):

| Comparison | Edits |
|---|---|
| \(c\) == \(w\) | no edits |
| \(c\) == deletion(\(w\)) | deletion of a letter |
| deletion(\(c\)) == \(w\) | addition of a letter |
| deletion(\(c\)) == deletion(\(w\)) | replacement/transposition |
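As a simplified illustration (not the full SymSpell algorithm, which also tracks edit distances and dictionary prefixes), those four comparisons can be collapsed into one look-up against the deletion index sketched above, assuming dictionary is a set of known words:

```python
def symspell_candidates(word, dictionary, delete_index):
    # `dictionary` is a set of known words; `delete_index` maps each
    # single-deletion string to the dictionary words that produce it.
    candidates = set()
    if word in dictionary:                              # c == w
        candidates.add(word)
    candidates |= single_deletes(word) & dictionary     # c == deletion(w)
    candidates |= delete_index.get(word, set())         # deletion(c) == w
    for d in single_deletes(word):                      # deletion(c) == deletion(w)
        candidates |= delete_index.get(d, set())
    return candidates
```

The resulting candidates can then be ranked by corpus count, exactly as in the Norvig method.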

Application

A problem is readily apparent: Ho-Chunk does not have a large corpus of digitized text from which to estimate word frequencies. For a text corpus, so far I have included the weather reports given by Hųųwąxetega (~8000 words), my own stories found on this website (~5000 words), and the example sentences from the Hoocąk lexicon (~24000 words). This is about 30\(\times\) fewer words than Norvig used, and the corpus is a bit biased towards the weather. Ideally this initial corpus could be used to bootstrap other known texts written in different spelling conventions.

There are some historical sources that would constitute a fairly large textual corpus: The Four Gospels, Acts, Genesis, and Exodus translated into the Winnebago Language and the stories transcribed by Radin in The Winnebago Tribe. The former is a phonetic transcription and the latter vary depending on the transcriber, although many have been rewritten and put online in something close to the IPA system.

Ideas

A couple of ideas for expanding the corpus and dealing with more common spelling errors include:

A work in progress

This is currently still a work in progress and can be found on GitHub. The bulk of the source code comes from [4], but I have added some methods as suggested by the Symspell algorithm. In my implementation, the Symspell method appears to be about 10-100\(\times\) faster and about as accurate (~95%) on single-edit words generated from known words; on two-edit words the accuracy drops to ~85%.

Data

Obviously, data is the key component of this method. If you would like to request a copy of the data, feel free to contact me.



  1. Great Lakes Algonquin Syllabics
  2. How to Write a Spelling Corrector, Peter Norvig (2007).
  3. Damerau-Levenshtein distance
  4. Norvig Jupyter Notebook
  5. 1000x Faster Spelling Correction Algorithm, Wolf Garbe (2012).