What now seems like long ago, I was writing my first
CGI. This was
before I even heard of
perl (nor had most of the world) and
CGIs were
written in
C. This CGI was a
jargon file searcher
and I knew that
my
seplling sucked and I needed a spelling
approximator
in it to find near matches.
While pondering this in a class titled "Life In The Past",
the lecture was about the genetic code and how mutations happen.
Within a strand of DNA, there are several types of mutations that can
happen. It is possible for a base pair to be deleted, or inserted,
or one substituted for the other. A substitution in the gene may not
actually cause a mutation because the same amino acid is represented
by several sequences. However, a insertion or deletion of a base pair
will cause the entire sequence to be shifted, possibly coding a
radically different proteins.
At this point, I realized that a string of characters is not that different
from a string of base pairs (well, a bit larger
alphabet) and these
are the mistakes that happen in spelling.
Well, there's actually one more type of spelling error that people make,
and that is the transposition of two characters 'hte' instead of 'the' -
its not that uncommon.
- Given two lower case strings (the 'key' and the 'matcher')
- Allocate an array of integers that is the length of the longest one,
initialized to '-10'
- Iterate through the matcher char by char:
- 'mpos' is the position of the matching character in the matching string
- 'kpos' is the first position of the matching character in the key string
- Array[kpos] = mpos;
- The character at 'kpos' is set to a space or null (thus preventing
it from being matched again.
At this point, there is an array with a bunch of numbers in it.
Walk the array looking for the patterns:
- 1 2 - a perfect match (15 points)
- 1 3 - a deleted character (1 point)
- 2 1 - transposed characters (3 points)
- 1 ? 2 - an inserted character (5 points)
- 1 ? 3 - a substituted character (3 points)
Count the number of matches (an array value > 0) in the array
(1 point each)
Count the number of '-10' in the array (-2 points each)
Return the sum of the points
What you say!!
And all this means?
Lets take two words - 'foobar' and 'fubar' and get a number with 'fubar'
as the key.
matcher: foobar
key: fubar
array: 1_456_
scores:
pat score
45 = 15
45 = 15
4?6 = 3
match = 4
nomatch = -4
------------
33
The value of this is in comparing the key to itself. This number can
be calculated from the length (l) of the word (assume l > 2) :
(15 * (l-1)) + (3 * (l-2)) + l
Thus, for a five character word ('foobar'), the value would be:
(15 * 4) + (3 * 3) + 5 = 74
So how fast is this? And how good are the
matches?
Realize the time I am measuring is that of user time on a Unix system.
This more closely represents the actual amount of time spent
crunching
rather than opening files (part of the system time). Because a Unix
machine is preemptive multitasking, the wall time (how the 'apparent'
time to run) is not the best indicator because other things happen (such
as the java chatterbox).
Run against the standard linux dictionary (45407 words) on a
moderate speed
system (so its 3 years old) looking for the word 'thisisatest',
it takes 0.8 user seconds to come up with the list:
- distastes
- substrates
- thistle
- instigates
- dissipates
Run against a larger dictionary, the time scales linearly -
the 111284 word dictionary takes about three times longer
(2.1 user seconds) and produces the list:
- sophisticates
- thiosulfate
- thionate
- thiophosphate
- distastes
So maybe 'thisisatest' is not the best choice - still, you can 'see' it
trying to get a match with a word that does not exist. Furthermore, the
entire thing should have a look at the values assigned and the 'key'
vs 'matcher' role.