I thought I’d posted about this project a while back (2006), but I can’t seem to find any entries now. So I must have meant to, but never actually did it.
My employer needed a spelling checker that let them add the terms for their particular needs, but didn’t allow just anyone to do it. After looking into several existing server-side spelling checkers, we couldn’t find a simple solution, so I developed one.
So why am I bringing it up now? Well, we’re thinking of putting it into a project at the OpenNTF community. I’ll let OpenNTF explain what it is; the following is from their About page at OpenNTF.org:
“OpenNTF is a site devoted to getting groups of individuals all over the world to collaborate on Lotus Notes/Domino applications and release them as open source. Our mission is ‘to provide a framework for the community to develop open source applications for IBM Lotus Notes and Domino which may be freely distributed.’ Using open source applications can help organizations reduce the costs of software development and maintenance.”
The project was interesting in many respects. I learned how SoundEx codes are created and used, and also how Metaphones are used. SoundEx codes are the simpler of the two. Words that sound alike will generally have the same SoundEx code. A SoundEx code is generated by taking the first letter of the word and then removing all the vowels from the rest of the word. A 3-digit number is then generated to represent the remaining letters. So, for example, the SoundEx for “Chuck” is “C200” and “Chuck’s” is also “C200”. This helps when a word is not recognized: the suggestions for “Chck” would include both “Chuck” and “Chuck’s”. The problem with simply using SoundEx is that it won’t suggest words that begin with other letters. So we also use Metaphone codes or, more precisely, Double-Metaphone codes.
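To make the description above concrete, here is a minimal sketch in Python. It is not the code from the project, and the function name `simple_soundex` is made up for illustration. It follows the simplified description literally (keep the first letter, drop the vowels, turn what is left into three digits). The standard American SoundEx has extra rules about the first letter and about letters separated by vowels or by H and W, so a full implementation may give different codes for some words.

```python
# Letter groups that share a digit in SoundEx.
SOUNDEX_DIGITS = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        SOUNDEX_DIGITS[letter] = digit


def simple_soundex(word):
    """Simplified SoundEx: first letter plus three digits (assumption: ASCII letters only)."""
    # Ignore apostrophes, spaces and anything else that is not A-Z.
    letters = [ch for ch in word.upper() if ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
    if not letters:
        return ""
    digits = []
    for ch in letters[1:]:
        digit = SOUNDEX_DIGITS.get(ch)  # None for vowels and for H, W, Y
        if digit is None:
            continue
        # Collapse runs of the same digit into one.
        if not digits or digits[-1] != digit:
            digits.append(digit)
    # Keep at most three digits and pad with zeros.
    return letters[0] + "".join(digits)[:3].ljust(3, "0")


print(simple_soundex("Chuck"), simple_soundex("Chuck's"), simple_soundex("Chck"))
```

All three words end up in the same bucket, which is why a lookup by SoundEx code can offer “Chuck” for the misspelled “Chck”.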
A Metaphone is another “encoding” method, using a phonetic algorithm, but it gives letters that sound similar the same values. For example, in English, “ph” has the same sound as “f”, and sometimes so does “gh”. Metaphones can be more tedious to create because there is a lot of “if this is in there, then do that; otherwise check if that is in there and do this”. I was able to find a pretty good Metaphone generation routine for Visual Basic and adjust it for my needs. The Metaphones for “Chuck” and “Chuck’s” are different, being “XK” and “XKS” respectively. However, “XK” is also the Metaphone code for these words: swage, swag, check, cheek, cheeky, chic, chichi, chick, chock, choke, choky, shaky, shakya, sheikh, shock, and shuck.
So a good spelling checker needs to combine at least these two encoding methods to return a list that may contain the actual word you meant to use. Simply using the SoundEx code or the Metaphone codes isn’t enough. After doing some research on the web, I added a couple more algorithms for finding words. I found a paper on the net titled “SSCS: A Smart Spell Checker System Implementation Using Adaptive Software Architecture” by Deepak Seth and Mieczyslaw M. Kokar. In it they describe a few algorithms for adding, removing, doubling and shifting letters in words to find alternatives.
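As a rough illustration of the letter-editing idea, here is a small Python sketch. It is my own simplification, not the algorithms from the paper and not the project’s code. The names `candidate_edits` and `suggest_from_edits` are hypothetical, and I am assuming that “shifting” means swapping two neighbouring letters.

```python
import string


def candidate_edits(word, alphabet=string.ascii_lowercase):
    """Return every string one simple edit away from `word`."""
    word = word.lower()
    candidates = set()
    # Adding: insert each letter of the alphabet at each position.
    for i in range(len(word) + 1):
        for letter in alphabet:
            candidates.add(word[:i] + letter + word[i:])
    for i in range(len(word)):
        # Removing: drop the letter at position i.
        candidates.add(word[:i] + word[i + 1:])
        # Doubling: repeat the letter at position i (a special case of adding).
        candidates.add(word[:i + 1] + word[i] + word[i + 1:])
    # Shifting (assumed here to mean swapping neighbouring letters).
    for i in range(len(word) - 1):
        candidates.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    candidates.discard(word)
    return candidates


def suggest_from_edits(word, vocabulary):
    """Keep only the edited strings that are real words in the vocabulary."""
    return sorted(candidate_edits(word) & set(vocabulary))


vocabulary = {"chuck", "check", "chick", "shock", "the"}
print(suggest_from_edits("chck", vocabulary))
print(suggest_from_edits("teh", vocabulary))
```

Most of the generated strings are nonsense, so the vocabulary lookup does the real work; the edits only decide which words get looked up.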
Then there is the question of how to narrow down the suggestions so that you don’t return hundreds or thousands of words. Another algorithm is used to filter the suggestions so they make more sense, and to weigh them so that the word you’re actually after is closer to the top of the list. By computing the Levenshtein distance (the number of changes required to go from one word to another), you can weigh each suggestion against the unrecognized word and filter the list. The fewer keystrokes it takes to go from the unrecognized word to the suggested word, the higher its position in the suggestions list.
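Here is a minimal Python sketch of that ranking step, again not the project’s code. `levenshtein` is the usual dynamic-programming version of the distance; `rank_suggestions` and its `max_distance` and `limit` parameters are names and cut-offs I made up for illustration.

```python
def levenshtein(a, b):
    """Number of single-letter inserts, deletes and substitutions to turn a into b."""
    # previous[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def rank_suggestions(unknown, candidates, max_distance=2, limit=10):
    """Drop far-away words, then list the closest ones first."""
    scored = [(levenshtein(unknown.lower(), word.lower()), word)
              for word in candidates]
    kept = sorted(pair for pair in scored if pair[0] <= max_distance)
    return [word for _, word in kept[:limit]]


print(rank_suggestions("chck", ["shock", "chuck", "sheikh", "check"]))
```

With these made-up cut-offs, “sheikh” is dropped as too far from “chck”, while “check” and “chuck” (one change each) come ahead of “shock” (two changes). Ties are broken alphabetically here, which is just a side effect of sorting the (distance, word) pairs.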
So after I figured out some encoding for finding words, I needed a place to store and organize the vocabulary. Being a Domino developer, I decided to simply use a Notes database file and organize the words using views. This works great when you’re using a fast server and have a strong connection. It’s also very easy to create the user interface for those who will be maintaining the vocabulary lists.