The Knuth-Morris-Pratt algorithm is a method for quickly finding exact matches between a smaller string, often called the pattern, and a larger string, often called the text. It was developed by Donald Knuth, James Morris, and Vaughan Pratt, who published it jointly in 1977.
The Problem
The problem is succinctly stated as follows: given a (short) pattern and a (long) text, both strings, determine whether the pattern appears somewhere in the text.
For example, if we have the following pattern and text:
Pattern: walrus
Text: I am the eggman they are the eggmen I am the walrus
... we want to determine whether walrus appears in the text I am the eggman they are the eggmen I am the walrus as efficiently as possible.
Naive String Matching
Knuth-Morris-Pratt is mostly a series of improvements on the most naive method of executing this comparison, so let's begin by addressing this naive method.
Naive Method Pseudocode
Pseudocode for the naive string matching method looks something like this:
P[] = pattern;
T[] = text;
for (i = 0; T[i] != EndOfString; i++)
{
    for (j = 0; (T[i+j] != EndOfString) and (P[j] != EndOfString)
                and (T[i+j] == P[j]); j++)
    {
        do nothing;
    }
    if (P[j] == EndOfString)
        found a match;
}
Naive Method Described
The naive method is described step by step in English below (a C sketch of the same idea follows the list):
1. We have a pattern contained in the array P[] and a text contained in the array T[].
2. We also have an integer i, which we set to value 0 (such that P[0] is the first character in the pattern and T[0] is the first character in the text).
3. While we haven't reached the end of the text, we repeat over and over steps 4 through 7.
4. We have an integer j, which we set to value 0 for the reason described in step 2.
5. There are three conditions that have to be met:
5a. T[i+j] (the character in the text that is i + j characters into the string) is not the end of the text.
5b. P[j] (the character in the pattern that is j characters into the string) is not the end of the pattern.
5c. T[i+j] is exactly the same as P[j].
If all of these conditions are met, we simply add 1 to j and check these conditions again; otherwise, we go on to step 6.
6. If P[j] is the end of the pattern, then we have found a match. This is because exact matches were found for the entire length of P, allowing j to keep growing until it reached the end of the pattern and condition 5b failed.
7. If P[j] is NOT the end of the pattern, we add 1 to i and return to step 4, going again through the loop described in step 3.
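For readers who prefer real code to pseudocode, here is a minimal sketch of the same naive method in C. The function name naive_match and its return convention (the index of the first match, or -1 if there is none) are not part of the original pseudocode; they are just choices made for this illustration.
#include <stdio.h>

/* Return the index in text where pattern first appears, or -1 if it never does. */
int naive_match(const char *pattern, const char *text)
{
    for (int i = 0; text[i] != '\0'; i++)
    {
        int j = 0;
        /* Walk forward while the pattern keeps matching the text at offset i. */
        while (text[i + j] != '\0' && pattern[j] != '\0' && text[i + j] == pattern[j])
            j++;
        if (pattern[j] == '\0')
            return i;            /* reached the end of the pattern: found a match */
    }
    return -1;                   /* reached the end of the text: no match */
}

int main(void)
{
    const char *text = "I am the eggman they are the eggmen I am the walrus";
    printf("%d\n", naive_match("walrus", text));   /* prints 45, where "walrus" begins */
    return 0;
}
Unlike the pseudocode, which simply notes every match it finds, this sketch stops at the first one; that keeps the example short without changing the pattern of comparisons.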
Naive Method Example
Let's say we have the following pattern and text:
Pattern: mask
Text: Under the mask
We would start by setting i to 0 and j to 0, so then P[j] = P[0] = "m" and T[i+j] = T[0] = "U". Since P[j] is not equal to T[i+j], we then add one to i and compare again. The steps are summarized below:
i = 1, j = 0, P[j] = P[0] = "m", T[i+j] = T[1] = "n"; "m" does not equal "n"; so we add 1 to i
i = 2, j = 0, P[j] = P[0] = "m", T[i+j] = T[2] = "d"; "m" does not equal "d"; so we add 1 to i
i = 3, j = 0, P[j] = P[0] = "m", T[i+j] = T[3] = "e"; "m" does not equal "e"; so we add 1 to i
i = 4, j = 0, P[j] = P[0] = "m", T[i+j] = T[4] = "r"; "m" does not equal "r"; so we add 1 to i
i = 5, j = 0, P[j] = P[0] = "m", T[i+j] = T[5] = " "; "m" does not equal " "; so we add 1 to i
i = 6, j = 0, P[j] = P[0] = "m", T[i+j] = T[6] = "t"; "m" does not equal "t"; so we add 1 to i
i = 7, j = 0, P[j] = P[0] = "m", T[i+j] = T[7] = "h"; "m" does not equal "h"; so we add 1 to i
i = 8, j = 0, P[j] = P[0] = "m", T[i+j] = T[8] = "e"; "m" does not equal "e"; so we add 1 to i
i = 9, j = 0, P[j] = P[0] = "m", T[i+j] = T[9] = " "; "m" does not equal " "; so we add 1 to i
When i = 10, things get interesting because the pattern and the text match each other; both are equal to "m". So then we add 1 to j and continue comparing...
i = 10, j = 0, P[j] = P[0] = "m", T[i+j] = T[10] = "m"; "m" does equal "m"; so we add 1 to j
i = 10, j = 1, P[j] = P[1] = "a", T[i+j] = T[11] = "a"; "a" does equal "a"; so we add 1 to j
i = 10, j = 2, P[j] = P[2] = "s", T[i+j] = T[12] = "s"; "s" does equal "s"; so we add 1 to j
i = 10, j = 3, P[j] = P[3] = "k", T[i+j] = T[13] = "k"; "k" does equal "k"; so we add 1 to j
When j = 4, then we've reached the end of the pattern. At this point, we are able to break the loop because we've found a complete match for the pattern in the text!
i = 10, j = 4, P[j] = P[4] = EndOfString; P[j] is the end of the pattern,
so we've found an exact match!
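If you plug this example into the naive_match sketch shown earlier, it reports the same result the trace arrives at:
/* assumes the illustrative naive_match() from the sketch above */
printf("%d\n", naive_match("mask", "Under the mask"));   /* prints 10, the value of i at the match */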
Knuth-Morris-Pratt Improves Naive String Matching
Knuth-Morris-Pratt incorporates a pair of clever techniques to improve upon the above algorithm, which in the worst case runs in O(mn) time, where m is the length of the pattern and n is the length of the text: you may have to look at nearly every character of the pattern once for every character of the text. (A classic worst case is searching for a pattern like "aaab" in a long text consisting entirely of "a"s, where the naive method re-compares most of the pattern at every position.) KMP improves this to O(n), meaning the number of character comparisons grows only with the length of the text, plus a single preprocessing pass over the pattern. How does this time improvement occur?
KMP Improvement 1: Skipping Outer Iterations
The first trick employed by KMP eliminates many iterations of the outer loop by making it possible to advance i by more than 1 at a time, rather than always going through the "add 1 to i" step.
We can do this in many cases by computing an overlap. The overlap of two strings x and y is the longest word that is a suffix of x and a prefix of y, but is not the entire string x or y itself.
Example of Overlap
To understand the overlap concept, one needs to see it in action. Say we have two strings x and y, described below:
String x: bonu
String y: onus
In this case, a prefix of y is "onu", whereas a suffix of x is "onu". These two strings are equal and thus indicate that x and y have an overlap of length 3.
To make this clearer, look at the possible ways of lining up a suffix of x with a prefix of y:
Suffix of string x: bonu
Prefix of string y:    onus
Trying an overlap of 1: "u" against "o" -- no match
Suffix of string x: bonu
Prefix of string y:   onus
Trying an overlap of 2: "nu" against "on" -- no match
Suffix of string x: bonu
                     |||
Prefix of string y:  onus
Trying an overlap of 3: "onu" against "onu" -- a match!
Choose the longest match, so they have an overlap score of 3
Here's another example:
String x: fif
String y: fifo
Suffix of string x: fif
                    |||
Prefix of string y: fifo
Trying an overlap of 3: this match uses the whole of string x, so the definition does not allow it
Suffix of string x: fif
Prefix of string y:  fifo
Trying an overlap of 2: "if" against "fi" -- no match
Suffix of string x: fif
                      |
Prefix of string y:   fifo
Trying an overlap of 1: "f" against "f" -- a match!
Choose the longest match that is allowed, so they have an overlap score of 1
In this case, a suffix of x is "f" and a prefix of y is "f". These two pieces are equal and thus indicate that x and y have an overlap of length 1.
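Here is a minimal sketch in C of an overlap function that follows the definition above: the longest string that is a suffix of x and a prefix of y without being the whole of either. The function name and the brute-force approach are just for illustration; a real KMP implementation precomputes these values for every prefix of the pattern rather than recomputing them from scratch.
#include <stdio.h>
#include <string.h>

/* Length of the longest string that is a suffix of x and a prefix of y,
   excluding the entire x or the entire y. */
int overlap(const char *x, const char *y)
{
    int xlen = (int)strlen(x);
    int ylen = (int)strlen(y);
    int best = 0;
    /* Try every candidate length shorter than both strings, keeping the longest match. */
    for (int len = 1; len < xlen && len < ylen; len++)
    {
        if (strncmp(x + xlen - len, y, len) == 0)
            best = len;
    }
    return best;
}

int main(void)
{
    printf("%d\n", overlap("bonu", "onus"));   /* prints 3 */
    printf("%d\n", overlap("fif", "fifo"));    /* prints 1 */
    return 0;
}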
Adding Overlaps To The Naive Algorithm
First, let's look at the pseudocode implementation:
P[] = pattern;
T[] = text;
i = 0;
while (T[i] != EndOfString)
{
    for (j = 0; (T[i+j] != EndOfString) and (P[j] != EndOfString)
                and (T[i+j] == P[j]); j++)
    {
        do nothing;
    }
    if (P[j] == EndOfString)
        found a match;
    i = i + max(1, j - overlap(P[0..j-1], P[]));
}
Much of the program remains the same. The outer "for" loop has become a "while" loop, which makes the new step at the bottom easier to see: while we haven't reached the end of the text, we do the work inside the loop and then explicitly update i. Everything else is unchanged except for that final line, which replaces the plain "add 1 to i" of the naive method.
Each time through the loop, i is advanced as far as possible in order to reduce the number of passes through the loop. i always grows by at least 1; when part of the pattern was matched before the mismatch, it can grow by more. In that case, we compute the overlap between the piece of the pattern that was matched and the entire pattern, and advance i by j minus that overlap. For example...
Text:    ... devote ...
             ||||
Pattern:     devor
Since there are four matches, j = 4 and so we would execute overlap(P[0..3], P[]):
String x: devo
String y: devor
No suffix of "devo" is also a prefix of "devor" (other than the whole of "devo", which the definition does not allow), so
overlap(P[0..3], P[]) = 0
and so we could bump up i a lot in the overall algorithm:
i = i + max(1, j - overlap(P[0..j-1], P[]));
i = i + max(1, 4 - 0);
i = i + max(1, 4);
i = i + 4;
For long patterns with little internal repetition, this trick can often save a lot of work!
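As a sketch, the pseudocode above might translate into C as follows. The helper overlap_with_self plays the role of overlap(P[0..j-1], P[]): it takes the pattern and the number of characters j that matched, and returns the length of the longest proper suffix of that matched piece that is also a prefix of the whole pattern. As before, the function names are just illustrative.
#include <stdio.h>
#include <string.h>

/* Longest proper suffix of the first j characters of p that is also a prefix of p. */
int overlap_with_self(const char *p, int j)
{
    for (int len = j - 1; len > 0; len--)
    {
        if (strncmp(p + j - len, p, len) == 0)
            return len;          /* longest match wins, so stop at the first one found */
    }
    return 0;
}

/* Naive matching plus KMP improvement 1: after a mismatch, advance i past the
   positions that the overlap tells us cannot start a match. */
int match_with_skip(const char *pattern, const char *text)
{
    int i = 0;
    while (text[i] != '\0')
    {
        int j = 0;
        while (text[i + j] != '\0' && pattern[j] != '\0' && text[i + j] == pattern[j])
            j++;
        if (pattern[j] == '\0')
            return i;            /* found a match */
        int skip = j - overlap_with_self(pattern, j);
        i += (skip > 1) ? skip : 1;          /* i = i + max(1, j - overlap) */
    }
    return -1;
}

int main(void)
{
    printf("%d\n", match_with_skip("mask", "Under the mask"));   /* prints 10 */
    return 0;
}
Dropping this function into the earlier main in place of naive_match gives the same answers; it just gets there in fewer passes through the outer loop.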
KMP Improvement 2: Skipping Inner Iterations
Now we want to address the possibility of skipping some iterations in the inner for loop which controls the comparisons. Already, we have eliminated a degree of redundancy, but we can eliminate more redundancy by expanding on the use of the overlap.
Let's look at an example:
Text:    ... fifedom ...
Pattern:     fift
From the above improvement, we know that we can make the following skip:
Text:    ... fifedom ...
             |||
Pattern:     fift
overlap of 1 and match length of 3
so we can increase i by 2
so the next comparison is
Text:    ... fifedom ...
               |
Pattern:       fift
But this is redundant. We already know that the first f of the pattern matches the second f in fifedom; that is exactly what the overlap of 1 told us. The comparison of f's has, in effect, already happened.
Adding Overlaps To The Inner Loop
Let's look at some pseudocode again, as this is now a true example of the Knuth-Morris-Pratt algorithm:
P[] = pattern;
T[] = text;
i = 0;
k = 0;
while (T[i] != EndOfString)
{
    for (j = k; (T[i+j] != EndOfString) and (P[j] != EndOfString)
                and (T[i+j] == P[j]); j++)
    {
        do nothing;
    }
    if (P[j] == EndOfString)
        found a match;
    k = overlap(P[0..j-1], P[]);
    i = i + max(1, j - k);
}
The big change from the previous version is that the value of the overlap is now calculated and stored in k. But there is one other small change: rather than starting at 0, the inner loop now starts by setting j equal to k. This is because we already know that the text and the pattern match for the first k characters of the new alignment, so there is no need to re-test them!
In the above example, where we are now making this comparison:
Text:    ... fifedom ...
               |
Pattern:       fift
... we would be able to skip it, because j starts out being equal to 1, which is the value of the overlap from the previous step (the overlap of "fif" and "fift").
So, we can automatically skip it and start by making this comparison instead:
Text:    ... fifedom ...
                |
Pattern:       fift
... which saves us an iteration through the loop.
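Putting both improvements together, here is a minimal, self-contained sketch of this version of the algorithm in C. Like the earlier sketches, it stops at the first match, and for simplicity it recomputes the overlap with a brute-force helper each time around the loop; a production implementation would precompute these overlap values for every prefix of the pattern (the classic KMP "failure function") before scanning the text. The function names are, once again, just illustrative.
#include <stdio.h>
#include <string.h>

/* Longest proper suffix of the first j characters of p that is also a prefix of p. */
int overlap_with_self(const char *p, int j)
{
    for (int len = j - 1; len > 0; len--)
    {
        if (strncmp(p + j - len, p, len) == 0)
            return len;
    }
    return 0;
}

/* Knuth-Morris-Pratt matching as developed in this article:
   i is the current alignment of the pattern against the text,
   j walks along the pattern, and k remembers how many characters
   of the current alignment are already known to match. */
int kmp_match(const char *pattern, const char *text)
{
    int i = 0;
    int k = 0;
    while (text[i] != '\0')
    {
        int j;
        /* Start at k: the first k characters of this alignment already matched. */
        for (j = k; text[i + j] != '\0' && pattern[j] != '\0' && text[i + j] == pattern[j]; j++)
            ;                                  /* do nothing */
        if (pattern[j] == '\0')
            return i;                          /* found a match */
        k = overlap_with_self(pattern, j);
        i += (j - k > 1) ? (j - k) : 1;        /* i = i + max(1, j - k) */
    }
    return -1;
}

int main(void)
{
    const char *text = "I am the eggman they are the eggmen I am the walrus";
    printf("%d\n", kmp_match("walrus", text));   /* prints 45, just like the naive sketch */
    return 0;
}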
The Overall Benefit
With these two improvements, we have to make at most about twice as many comparisons as there are characters in the text, which is significantly better than the naive case, where in the worst case we make as many comparisons as there are characters in the pattern for every character in the text.
This is because every comparison falls into one of two groups: matches and mismatches. A character of the text that is successfully matched never has to be compared again, so there are at most as many matching comparisons as there are characters in the text. Every mismatch forces i to increase by at least 1, and i can only increase as far as the end of the text, so there are at most that many mismatching comparisons as well. Thus, in the absolute worst case, there are only about twice as many comparisons as there are characters in the text, and even this limit is rarely reached.
Thus, Knuth-Morris-Pratt is a major speed improvement over the original naive exact string matching algorithm.