spell checking

How to create spell check?

Have you ever wondered what the underlying technology and math behind a spell checker are? I did, so I even built my own spell checker. Also, with this technology in mind, you can do many other things. It is useful to apply it to a voice recognition api. If you are getting commands from voice, and you cannot match the word someone said perfectly, maybe after checking how distant it is to commands, you can do something reasonable. This thing, I would like now to talk about, can have many uses. For example when what you want to create application with voice commands, you may use some voice recognition API or some voice recognition technology. But because of noice or user’s bad English command that technology you are using recognized is not what you are expecting as possible command. Then you may use edit distance to assume what speaker wanted to say.

Edit distance

Main problem here is to calculate the distance between the word you get, and possible words or a word you are expecting. So we are measuring the distance between the two words or strings by calculating how many edit operations are needed to be applied to one string to become the other one. Edit operations are character insertions, character deletion and character change. We could assign to all of these operations some value, for example for insertions and deletions, usual value is 1, and for editing is 2 (it is combined from deletion and insertions). Then we calculate this sum. This type of calculating edit distance is referred to as Levenshtein edit distance. Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

Mathematically, the Levenshtein distance between two strings a, b is given by \operatorname{lev}_{a,b}(|a|,|b|) where

\qquad\operatorname{lev}_{a,b}(i,j) = \begin{cases}   \max(i,j) &, \min(i,j)=0 \\   \min \begin{cases}           \operatorname{lev}_{a,b}(i-1,j) + 1 \\           \operatorname{lev}_{a,b}(i,j-1) + 1 \\           \operatorname{lev}_{a,b}(i-1,j-1) + [a_i \neq b_j]        \end{cases} &, \text{ else} \end{cases}

Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion and the third to match or mismatch, depending on whether the respective symbols are the same.

For example, the Levenshtein distance between “kitten” and “sitting” is 3, since the following three edits change one into the other, and there isn’t way to do it with a fewer than three edits:

  1. kitten → sitten (substitution of “s” for “k”)
  2. sitten → sittin (substitution of “i” for “e”)
  3. sittin → sitting (insertion of “g” at the end).

The Levenshtein distance has several simple upper and lower bounds that are useful in applications which are applied with many of them. These include:

  • It is always at least the difference of the sizes of the two strings.
  • It is at most the length of the longer string.
  • It is zero if and only if the strings are equal.
  • If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
  • The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

Minimum edit distance

When it comes to creating a spell checker, we need a bit more than just the edit distance between 2 words or 2 strings. What we need to calculate is the minimum edit distance for that language. We need to have a corpus of words of a particular language with most of the forms, or a mechanism that knows how to create all the forms of a word. Then we need to calculate the edit distance of typed word and similar word (if same form does not exist). We need to find the minimum edit distances for typed word, and that would be words that we can recommend to a user of our spell checking system. It would be the same for other applications. If we are looking for a word, and we don’t get a match, we will look for a word that has the minimal edit distance to that word, and most probably the noise made our expected word look that different. Still we might make errors, but it also depends on the corpus we are comparing with, and the quality of other technology used.


Where to go from here?

I hope that I was clear when explaining these issues. If you have any problems understanding it, please try to learn more,  for example on wikipedia. Also, if you still have any problems, you may post a comment or contact me.

There are also some advanced topics, for example, there are further generalized DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation’s cost depend on where it is applied.

Born in Bratislava, Slovakia, lived in Belgrade, Serbia, now living in Manchester, UK, and visitng the world. Nikola is a great enthusiast of AI, natural language processing, machine learning, web application security, open source, mobile and web technologies. Looking forward to create future. Nikola has done PhD in natural language processing and machine learning at the University of Manchester where he worked for 2 years. In 2020, Nikola moved to Berlin and works in Bayer Pharma R&D as a computational scientist.

Twitter LinkedIn Google+ YouTube Xing  

Liked it? Take a second to support Nikola Milošević on Patreon!

Leave a Reply

Your email address will not be published. Required fields are marked *