display | more...

The goal of the Levenshtein distance, named after the Russian scientist Vladimir Levenshtein (1) (also called edit distance or edition distance) is to measure the similarities between two strings (sequences of elements).

One string can be transformed into another thanks to three operations : insertion of a character, deletion of a character and substitution of a character by another. A sequence of operations which transform one string into another is called an edit sequence. The Levenshtein distance is the smallest edit sequence. The algorithm described below gives both the distance and all the shortest edit sequences.

The most frequent usage of this algorithm is probably spell checking. A basic spell checker will lookup every word of a text in a dictionary. If the word is present then it is correctly spelled. Otherwise it gives the user a list of words of the dictionary, those with the smallest distance to the misspelled word. Search engines such as Google also use this method (or an optimized version of it) on keywords submitted to give more accurate results.

Since this method measures similarity, it is also used in the areas of shape recognition, speech recognition (probably using soundex techniques as well), plagiarism detection or even in the field of Computational Biology with DNA analysis (DNA is a sequence of A, C, G and T proteins).

The basic exponential algorithm

Suppose you wish to compute the distance d(s1,s2) between two strings s1 and s2. The idea is to express d(s1,s2) as a function of d(p,q) where p and q are smaller substrings of s1 and s2 in order to get a recursive algorithm (which will end since a string has a positive length).

  • If s1=="" (resp. s2==""), the distance is obviously the number of characters of s2 (resp. s1).
  • If s1!="" and s2!="", then those strings can be written s1=s1'+c1 and s2=s2'+c2 where c1 (resp. c2) is the last character of s1 (resp. s2).
    • If c1==c2, then obviously d(s1,s2) = d(s1',s2')
    • If c1!=c2, then d(s1,s2) = min( d(s1',s2'+c2), d(s1'+c1,s2') )

The last point needs to be elaborated on. If the last character of s1 and s2 isn't the same, then the shortest edit sequence from s1 to s2 involves either deleting the last character of s1 (ie. d(s1',s2'+c2)) or deleting the last character of s2 (ie. d(s1'+c1,s2')). Since we are interested in the minimum, this explains the use of the min function.

Because it is needed to compute two distances on shorter strings at almost each step, the complexity of this algorithm is exponential which renders it only of theoretical interest.

A nicer algorithm : Needleman Wunsch algorithm

There exists a dynamic programming version of the algorithm known as the Needleman Wunsch algorithm which has a complexity of O(n2) if n is the size of the strings s1 and s2 which makes it far more practicable than the above algorithm obtained by induction.

The idea is to notice that in the above algorithm some distance computations can be performed several times. Say you wish to compute d(bbcd, abdc), then since the last character differs, you will need d(bbcd,abd) and d(bbc,abdc). The first leads to compute d(bb,ab) and d(bbc,a). The second d(b,abd) and d(bb, ab). As you can see, d(bb, ab) is computed more than once which is a waste of time.

Notations : s(i..j) is the substring formed of characters i through j of string s. s(i) is the character number i of s.

The enhanced algorithm proceeds by filling a two dimensional matrix M, where M(i,j) is the distance between substrings s1(1..i) and s2(1..j).

  • Create a matrix M of size (|s1|+1, |s2|+1).
  • M(i,0) = i for i=0..|s1| since we know those distances
  • M(0,j) = j for j=0..|s2| since we also know those distances
  • For the rest of the matrix, M(i,j) = min of :
    • M(i-1, j-1) + 1 if s1(i)!=s2(j), M(i-1,j-1) otherwise. This corresponds to the substitution of characters s1(i) and s2(i).
    • M(i-1, j) + 1. This corrsponds to the insertion of character s1(i) or equivalently the deletion of s2(j).
    • M(i,j-1) + 1. This corresponds to the insertion of character s2(i) or equivalently the deletion of s1(i).

Levenshtein distance in action

One important thing to notice is that the costs of a deletion, insertion or substitution can be changed to have a different value than 1. For example say you wish to correct typos, it might be a good idea to suppose that missing characters are very frequent (you don't press the key hard enough because you type too fast), substitutions are frequent (you press the wrong key) and insertions are rare. The cost of a deletion would then be 3, that of a substitution 2 and that of an insertion 1.

You can even further extend this method by assigning variable costs for substitution based on the metric distance between the keys.

The Unix command diff uses a Levenshtein-distance-like algorithm to produce edit scripts. The typical use of diff is to apply patches to source code files. This is done by applications such as CVS. Instead of packaging the whole new file, the CVS server only sends a small diff file which instructs the CVS client which lines to remove, add or modify in the source code.

The Smith Waterman algorithm is a nice variant which compares all substrings and returns the two with the highest similarity score and runs in O(n2). Thanks to ariels for that addition

Closing words...

Levenshtein distance algorithm isn't the best to compute the distance or find the shortest sequence of operations needed to transform one string into another. However it is a good starting point since it is really easy to implement.

The algorithm described above also gives the edit sequences. Follow a path from the lower right corner to the upper left one, only moving north, west and northwest (insertion, deletion and substitution) with one constraint : the sequence of numbers you land on must be decreasing.


Sources :
(1) V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4) p845-848, 1965
http://www.csse.monash.edu.au/~lloyd/tildeAlgDS/Dynamic/Edit/ - a very nice web page
http://www.merriampark.com/ld.htm - has a Java applet that implements the algorithm, C++ and VB source code as well