Burrows-Wheeler is unusual in the field of data compression in that it doesn't
compress the data in any way at all: it just re-orders it. However, this re-ordered data has
two very attractive and useful properties.
- It is possible to cheaply recover the original data from the re-ordered output.
- The re-ordered data tends to much easier to compress than the original data.
Burrows-Wheeler is therefore used as a catalyst for actual compression algorithms such
as run-length encoding, move-to-front buffering and Huffman coding.
No code or implementation-specific details will be given here, instead I will use an
abstract overview of the algorithm to encode an example string: XYYXYX0 where
0 is an end of file marker that both parties know will be unique, and last, in
the original string. I will also do more work than is strictly required, in order to make
the algorithm clearer, and to help show why it works.
The first step is to create a matrix of all the possible rotations of the string:
X Y Y X Y X 0
Y Y X Y X 0 X
Y X Y X 0 X Y
X Y X 0 X Y Y
Y X 0 X Y Y X
X 0 X Y Y X Y
0 X Y Y X Y X
Now this matrix's rows are sorted according to a lexicographic ordering (the order
you get in a dictionary, i.e. sort by first character, then second etc.). For the sake of
argument, the end of file marker comes before X in ordering, which comes before
Y:
0 X Y Y X Y X
X 0 X Y Y X Y
X Y X 0 X Y Y
X Y Y X Y X 0
Y X 0 X Y Y X
Y X Y X 0 X Y
Y Y X Y X 0 X
The last column of this matrix is then output as the re-ordered data.
Why does this aid compression?
Because the rows are simply rotations of the same string, even in the sorted
matrix, it is important to remember that the characters in the last column are actually
one place before the character in the first column.
In English, every letter of the alphabet is very likely to have some letters before it,
and very unlikely to have others before it. For example, s or a will occur
very often before t, while yt, bt and ct will hardly ever
appear.
As you can see in the sorted matrix, the first column is split into partitions
according to the character. Therefore, for the partition of rows starting with
t, for example, there will be lots of s's and a's in the last
column, and barely any y's, b's or c's. It is this fact that
makes compression algorithms work better, long runs of the same character are easy to compress, as is text with a smaller number of symbols
present.
How is it possible to regain the original string?
This re-ordered string is now sent off to a variety of compression algorithms, all of
which can be reversed by the recipient of the compressed data. However, how is that a
seemingly arbitrary permutation of a string can be reversed by someone with no knowledge
of the original data?
The tricky part is realising that in order to rebuild the entire matrix, only two
columns of data are required: the transmitted column and the first column of the sorted
matrix, which is very easy to extract from the transmitted column itself - we just have to
sort it!
transmitted sorted
+-+ +-+
0 X Y Y X Y|X| |0|X Y Y X Y X
X 0 X Y Y X|Y| |X|0 X Y Y X Y
X Y X 0 X Y|Y| |X|Y X 0 X Y Y
X Y Y X Y X|0| |X|Y Y X Y X 0
Y X 0 X Y Y|X| |Y|X 0 X Y Y X
Y X Y X 0 X|Y| |Y|X Y X 0 X Y
Y Y X Y X 0|X| |Y|Y X Y X 0 X
+-+ +-+
Notice that in the sorted column, the partitions are sorted within themselves, as well
as the partitions being sorted with respect to one other.
For example, the Y partition is sorted according to the next character and beyond,
seeing as the first characters are all the same. However, the Y's in the last
column are also sorted according to the character that follows them and beyond; it just so
happens that due to the rotation, the next character has been written at the start of the
row.
The order of the Y's in the sorted column is the same as the order of the
Y's in the transmitted column. The same is true for the X's!
So the second Y down in the sorted column is the same as the second Y
down in the transmitted column! This can be shown by connecting the Y's that are in
the same position in the string together:
+-+ +-+
0 X Y Y X Y|X| |0|X Y Y X Y X
X 0 X Y Y X|Y|------o |X|0 X Y Y X Y
X Y X 0 X Y|Y|----o | |X|Y X 0 X Y Y
X Y Y X Y X|0| | | |X|Y Y X Y X 0
Y X 0 X Y Y|X| | o-|Y|X 0 X Y Y X
Y X Y X 0 X|Y|-o o---|Y|X Y X 0 X Y
Y Y X Y X 0|X| o------|Y|Y X Y X 0 X
+-+ +-+
As you can see, none of the lines cross - the Y's are in the same order. This not
just a coincidence either; the same operation and test could be performed on a matrix of
any size, with any number of characters, but the order of each character in the sorted and
transmitted columns would always be the same.
OK, so now we know that fact, how to we build the string from two columns? The key is
again that the rows are rotations of the same thing. The n character in the
last row must be followed by the n character in the sorted column. At this point we
could be stuck as we don't know how to find the character after this. However, because of
identical ordering shown above, we know how to get from the sorted column back to the same
character in the transmitted column. Therefore, what is required is:
- Start with some character: choose the character with index n in the transmitted
column for example.
- Determine the first character of the original string by looking at the corresponding
character in the sorted column: look at index n in the sorted column.
- Find the same character in the transmitted column using the ordering of the symbols:
if the second character is the m instance of it, look for the m instance of
it in the transmitted column.
- Use this character as the starting point of Step 1 and repeat until you loop round to
the character you started with
All that remains now is to turn this into some sort of an algorithm!
As the recipient of the re-ordered string, we now know how to recover the original
information, using the steps outlined above.
First, the sorted column is reconstructed by sorting the transmitted string.
transmitted string:
X Y Y 0 X Y X
sorted string:
0 X X X Y Y Y
Then, a unique name can be given to each symbol to avoid confusion and help us use the
ordering of the characters.
sorted string:
00 X0 X1 X2 Y0 Y1 Y2
transmitted string:
X0 Y0 Y1 00 X1 Y2 X2
Remember that due to the order of the symbols, X2 in the sorted string really is
the same character as X2 in the transmitted string.
Now, the three steps outlined above can be followed, starting with 00 as the initial symbol.
00 corresponds to X2: output X
X2 corresponds to Y2: output Y
Y2 corresponds to Y1: output Y
Y1 corresponds to X1: output X
X1 corresponds to Y0: output Y
Y0 corresponds to X0: output X
X0 corresponds to 00: output 0
And that's the end of the algorithm - we have reached the symbol we started with:
0
If the recipient doesn't start with 0 as the initial character, there is no
problem. Instead of getting the original string as above, some rotation of it would be
produced, which can easily be reversed by rotating the string until the end of file marker
is at the end, where it should be.
Conclusion
Obviously, the efficacy of any compression algorithm depends entirely on its input;
even some supposedly hopeless methods will outperform the `best' algorithms on certain data.
However, *nix's bzip2 compression programme, which uses Burrows-Wheeler and Huffman
coding, normally creates files 10%-20% smaller than the standard gzip programme. This
considerable improvement in compression does come at the cost of encode/decode times increased
by a factor of about 2 or 3.
Despite this, most linux kernels are now distributed in the bzip2 format, probably
because it will only be compressed and decompressed once, and size is everything when
moving large amounts of data over the Internet.
Sources:
University of Cambridge, Computer Laboratory