display | more...
Designing a spell checker, or at least something that comes close isn't incredibly difficult. The main challenge in doing this requires the realization of what is involved in a spell checker. For me, this came in a biology class in college that was discussing the possible mutations in DNA. Realizing that the base pairs of DNA and a string of characters isn't that far apart, the following idea was born.

  1. Given two strings where 'Orig' is the word being tested for correct spelling and the 'Test' is the word that is the word being examined for a match against 'Orig'.

  2. Create a integer array called 'Work' - the length of this is the greater of the two lengths for Orig and Test.
    • This should be initialized to a value of '-10'. This initial value designates 'Empty'. Any value will do, though it needs to be more than 2 units away from 0.

  3. Iterate through the array 'Test', character by character from the start.
    1. Let 'i' be the position of the character in the 'Test' word.
    2. Iterate through the 'Orig' word and compare the test character with the character in 'Orig'.
      1. Let 'j' be the position of the match (if any) that is found in Orig with the test character.
      2. Assign Work[j] = i. This identifies the position within the 'Orig' word where the 'Test' character matched.
      3. Furthermore, change Orig[j] to a null of some sort - not necessarily the C null, a space will do.
  4. After the loop through 'Test', examine 'Work'. Recall that 'Work' is an integer array and contains the positions of various matches.
    • The 'Score' for the word is initially 0.
    • For each match (not 'Empty') increment the score by 1.
    • For each pattern of {1,2} - where two numbers are in sequence, this is a perfect match. Increment the score by 15.
    • For each pattern of {1,3} - where two numbers are separated by a single missing value - this is the mark of a deletion. In the {1,3} case, the {2} character was deleted and the 3 shifted down. Increment the Score by 1.
    • For each pattern of {2,1} - where two numbers are in sequence, but reversed, this is the mark of a reversal or transposition (such as typing 'hte' rather than 'the'). Increment the score by 3.
    • For each pattern of {1,?,2} - where there is a gap between two values in adjacent sequence, this is the mark of an insertion. Here, the '?' (representing any value) was inserted between the '1' and '2', pushing the '2' back one. Increment the score by 5.
    • For each pattern of {1,?,3} - where there is a gap between two values in sequence, this is the mark of a substitution. A substitution occurs when one letter is substituted for another. Increment the score by 3.
    • For each spot in the Work array with the value of 'Empty', decrement the score by 2.

    Ok, so thats the essence of it. How does this work in actuality?

    • Orig: fubar
    • Test: foobar
    • Work: _,_,_,_,_,_
    • Iterating through Test we get:
      1. (f-0): 0,_,_,_,_,_
      2. (o-1): 0,_,_,_,_,_
      3. (o-2): 0,_,_,_,_,_
      4. (b-3): 0,_,3,_,_,_
      5. (a-4): 0,_,3,4,_,_
      6. (r-5): 0,_,3,4,5,_ (final)
    • Iterating through Work we get:
      • Matches: 4
      • 3,4: 15
      • 4,5: 15
      • 3,?,5: 3
      • empty: -4
      • Final Score: 33

    The benchmarks are fairly reasonable too:
    On a low end Pentium, examining the word 'sophlices', examining a 111,284 word dictionary file (1,100,059 characters total), this runs in about 3 seconds (with some other code to return the best '5' matches). Of course, this could be optimized to say "if the word exists exactly in the dictionary file don't try to fix it". By no means is my implementation of this the best possible one. In addition, by removing the overhead of reading the dictonary file off of disk for each comparison, a fair amount of time will be saved.

    The results returned from the above search are (in order of highest score to lowest, best 5):

    1. Sophocles
    2. splices
    3. slices
    4. basophile
    5. licensor

    Several Most of the assumptions made here are untested and just guesses about how the code should work. The numbers are rather arbitrary that weight to particular types - feel free to experiment.

Log in or register to write something here or to contact authors.