The Problem
Consider two short files with the following content:
- File 1: Mary had a little lamb whose fleece was white as snow
- File 2: Mary had a dog and Mary had a little lamb with fleece as white as snow
The exercise is to find the minimal set of changes to go from one file to the other. This is often needed, for example, to create patches for programs. A patch is an update, and it should be as small as possible so as to accurately reflect the extent of a change.
Common Solutions
Many file comparison programs match file streams as early as possible
and then use a limited-size window to try to re-synchronize. A possible
output from such a strategy, for my example, would be:
- 3 words match (Mary had a)
- discard 5 words (little lamb whose fleece was)
- insert 10 words (dog and Mary had a little lamb with fleece as)
- 3 words match (white as snow)
This is correct but obviously far from optimal.
My Strategy
1. Take the entire texts to be compared and "slide" them against each other (i.e. start comparing with different offsets into the files) until a relative offset is found that has more matches than any other (or is tied for first place).
For the example, if we start 5 words into the second file, we get a match like this (matching words in stronger type):

**Mary had a little lamb** whose **fleece** was **white as snow**
Mary had a dog and **Mary had a little lamb** with **fleece** as **white as snow**
(that's 9 matching words, the maximum possible for this sample).
2. Find the longest consecutive run of matching items. This may well be in the middle of everything, so there will usually be text in one or both files before it and after it. In our example, **Mary had a little lamb** wins hands down.
3. Identify the chunks of text from both files before this match and the chunks of text from both files after this match. Apply this same strategy to each set, starting with (1).

In our example, the "before" set consists of –nothing– against "Mary had a dog and". That's easy: it's a 5-word insert, and we're finished. The "after" set is "whose fleece was white as snow" against "with fleece as white as snow", which will need another two rounds or so of this algorithm. Assuming we do our bookkeeping right, we end up with 5 inserted words and two replacements for the example, which is the optimum result. (A rough sketch of this recursion, in code, follows this list.)
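To make the three steps concrete, here is a minimal Python sketch of the recursion, working on lists of words just as the example does. It is only an illustration under my own naming and output conventions, not the author's UNISYS program (which works on CRC-hashed lines and a smarter search order, as described under Implementation Details below); a plain full scan over offsets stands in for the "sliding" here.

```python
def best_alignment(a, b):
    """Step 1: try every relative offset of a against b and return the
    offset with the most position-by-position matches."""
    best_count, best_offset = -1, 0
    for offset in range(-(len(a) - 1), len(b)):
        count = sum(1 for i in range(len(a))
                    if 0 <= i + offset < len(b) and a[i] == b[i + offset])
        if count > best_count:
            best_count, best_offset = count, offset
    return best_count, best_offset


def longest_run(a, b, offset):
    """Step 2: find the longest consecutive run of matches at this offset.
    Returns (start_in_a, start_in_b, length)."""
    best_len, best_start = 0, 0
    run = 0
    for i in range(len(a)):
        j = i + offset
        if 0 <= j < len(b) and a[i] == b[j]:
            run += 1
            if run > best_len:
                best_len, best_start = run, i - run + 1
        else:
            run = 0
    return best_start, best_start + offset, best_len


def diff(a, b):
    """Step 3: split both sequences around the longest common run and
    recurse on the 'before' and 'after' chunks."""
    if not a and not b:
        return []
    if not a:
        return [("insert", b)]
    if not b:
        return [("delete", a)]
    count, offset = best_alignment(a, b)
    if count == 0:
        return [("replace", a, b)]
    i, j, length = longest_run(a, b, offset)
    return (diff(a[:i], b[:j])
            + [("match", a[i:i + length])]
            + diff(a[i + length:], b[j + length:]))


if __name__ == "__main__":
    file1 = "Mary had a little lamb whose fleece was white as snow".split()
    file2 = "Mary had a dog and Mary had a little lamb with fleece as white as snow".split()
    for op in diff(file1, file2):
        print(op)
```

Run on the example, this prints a five-word insert, the matched run "Mary had a little lamb", two single-word replacements (whose/with and was/as), and the remaining matched words, in line with the optimum claimed above.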
Discussion
This solution is admittedly brute force. Many computer scientists will
scream at this.
I'm not sure, but I think the worst-case behavior for this algorithm is O(n⁴).
That's grounds for defenestrating one of their number from the highest
ivory tower. For shame!
My response is: I know, and I don't care.
This is a typical case of know your problem domain:
this algorithm is intended to work on program source files, and source files
tend to have a maximum size well within this algorithm's ability to
complete in a tolerable time frame on modern hardware.
Implementation Details
- My example deals with individual words, but program patches deal in
lines. This algorithm calls for many comparisons, and we
certainly don't want to compare entire lines against one another,
character by character. So my program computes a CRC value for each
line. Thus, each line is represented by a single 32-bit number.
With a probability of billions to one, each number corresponds uniquely
to a line with a certain content. When all the shuffling is done, the
program compares matched lines by content, just to avoid those
off-chance false positives. During the matching phase, working with
numbers is of course much faster than working with lines of text.
- I make casual mention of "sliding" the files against one another.
This wants to be done right! A careful choice of strategy here will go a
long way toward preventing the program from displaying its worst-case
behavior. Just as the A* algorithm for
shortest path finding starts out by looking in the direction of the
target, my approach starts with mutual file offsets of 0. A single
run-through will identify identical files, and I'm done. Failing this, I
increase the offset one by one, alternating between files. So I compare:
all of file 1 against file 2 starting at line 2;
all of file 2 against file 1 starting at line 2;
all of file 1 against file 2 starting at line 3;
all of file 2 against file 1 starting at line 3;
and so on. As soon as all this sliding leaves me comparing fewer lines
than the maximum number that matched so far, I know it will never get
any better, so I can stop. (A rough sketch of this search, together with
the CRC trick from the previous point, follows this list.)
- Sooner or later, someone will try to use a program for a job it's not
intended for. There is no source file at my workplace longer than about
10,000 lines, and I believe that any programmer who writes a module
longer than this needs to be taken out and shot. However, there are
various data files around which are much bigger. Therefore, to keep
people from abusing my program, I've hardwired a limit of 15,000 lines
into the program; it refuses to process files larger than this, although
it could. This is a design decision based on the program's expected
problem domain.
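As a concrete illustration of the first two points above, here is a small Python sketch. It is not the UNISYS program, and zlib.crc32 merely stands in for whatever CRC routine that machine provides; the sample lines at the bottom are invented. Each line is reduced to a 32-bit number, and the offset search walks outward from 0, alternating which file is shifted, giving up as soon as the shrinking overlap can no longer beat the best match count already found.

```python
import zlib


def line_keys(lines):
    """Represent each line by its CRC-32 so the matching phase compares
    32-bit numbers instead of whole lines of text."""
    return [zlib.crc32(line.encode("utf-8")) for line in lines]


def count_matches(a, b, offset):
    """Number of position-by-position matches of a against b shifted by offset."""
    return sum(1 for i in range(len(a))
               if 0 <= i + offset < len(b) and a[i] == b[i + offset])


def slide(a, b):
    """Start at a mutual offset of 0 and work outward, alternating between
    shifting one file and the other.  Stop once the remaining overlap can
    no longer beat the best match count seen so far."""
    best_count, best_off = count_matches(a, b, 0), 0
    shift = 1
    while True:
        # Largest possible number of aligned pairs at this shift, in either direction.
        overlap = max(min(len(a), len(b) - shift),
                      min(len(b), len(a) - shift), 0)
        if overlap <= best_count:
            break  # sliding further can never produce more matches
        for off in (shift, -shift):  # file 1 against shifted file 2, then the reverse
            count = count_matches(a, b, off)
            if count > best_count:
                best_count, best_off = count, off
        shift += 1
    return best_off, best_count


if __name__ == "__main__":
    # Toy stand-ins for two versions of a source file.
    old = ["int main(void)", "{", "    return 0;", "}"]
    new = ["/* a new comment */", "int main(void)", "{", "    return 0;", "}"]
    print(slide(line_keys(old), line_keys(new)))  # -> (1, 4): shift of 1, 4 matching lines
```

Lines matched this way would still be compared by content afterwards, as described above, to catch the rare CRC collision.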
Results
My program, based on this algorithm, has been in use for about five
years at my workplace and has largely replaced its predecessor, a
systems utility developed at the University of Maryland. It runs on a
powerful UNISYS 2200 mainframe and usually yields results in less than
5 seconds, rarely requiring more than 20. The older program was
known for occasional catastrophic failures, while mine consistently
produces results in agreement with "common sense".
Ariels has graciously pointed me at some prior art from the "real" computer scientists, currently finding extensive use in bioinformatics:
- There's the Needleman-Wunsch algorithm, which tallies up match scores in an m × n matrix (a toy version of that matrix is sketched after this list). The algorithm is elegant; I envy the authors for their smarts. Alas, my creaky mainframe doesn't have virtual memory, so Needleman and Wunsch would have crashed and burned long before reaching my arbitrary limit of 15,000 lines. Also, it looks to me like the amount of work done is O(n³).
- The Smith-Waterman algorithm looks like a cross between Needleman-Wunsch and mine: It pinpoints local matches but allows much finer weighting of penalties for differences. Again, at the heart of this algorithm is an array that only a RAM chip manufacturer could love.
- Nowadays, a popular "standard" differencing tool is GNU diff, which by default runs an algorithm by Eugene Myers. At the time I wrote this program, I was aware of diff but not the origin of its algorithm. I was then and still am loath to try to analyze diff's source code. However, diff's documentation warns that diff will by default try to cut corners by ignoring the initial and final set of matching lines, which may yield a less than optimum result. Of course, there is an option to prevent this behavior. Also, for files too large to fit in memory, diff will delegate the job to an older, multi-pass algorithm. It seems like the authors of diff have thought of everything, but that's precisely the kind of headache my brute force approach blissfully sidesteps.
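For the curious, here is a tiny Python illustration of the Needleman-Wunsch idea in its common textbook form with a simple linear gap penalty; the +1/-1/-1 scoring is my own arbitrary choice, not the original paper's. The thing to notice is the full (m+1) × (n+1) score matrix, which is exactly the memory appetite complained about above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the full (m+1) x (n+1) score matrix."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # aligning a prefix of a against nothing
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, n + 1):          # aligning a prefix of b against nothing
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[m][n]


if __name__ == "__main__":
    f1 = "Mary had a little lamb whose fleece was white as snow".split()
    f2 = "Mary had a dog and Mary had a little lamb with fleece as white as snow".split()
    print(needleman_wunsch(f1, f2))  # optimal global alignment score for the example
```

Smith-Waterman fills a matrix of the same shape but clamps negative scores at zero and reads the best cell anywhere in it, which is what makes it a local rather than a global aligner.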
Ironically, they finally got a C compiler working on the 2200 last year. This means that it is now an option to port the actual diff code.
Alas, all this new information won't inspire me to write a new version of my cheerfully ponderous sub-optimal optimal difference utility: I have other projects keeping me busy and I heed Alf's Axiom: "If it's not broken, don't kick it."