Burrows-Wheeler block-sorting transformation algorithm (idea) by jam-master jim

Burrows-Wheeler is unusual in the field of data compression in that it doesn't compress the data in any way at all: it just re-orders it. However, this re-ordered data has two very attractive and useful properties.

It is possible to cheaply recover the original data from the re-ordered output.
The re-ordered data tends to much easier to compress than the original data.

Burrows-Wheeler is therefore used as a catalyst for actual compression algorithms such as run-length encoding, move-to-front buffering and Huffman coding.

Encoding

No code or implementation-specific details will be given here, instead I will use an abstract overview of the algorithm to encode an example string: XYYXYX0 where 0 is an end of file marker that both parties know will be unique, and last, in the original string. I will also do more work than is strictly required, in order to make the algorithm clearer, and to help show why it works.
The first step is to create a matrix of all the possible rotations of the string:

X Y Y X Y X 0
Y Y X Y X 0 X
Y X Y X 0 X Y
X Y X 0 X Y Y
Y X 0 X Y Y X
X 0 X Y Y X Y
0 X Y Y X Y X

Now this matrix's rows are sorted according to a lexicographic ordering (the order you get in a dictionary, i.e. sort by first character, then second etc.). For the sake of argument, the end of file marker comes before X in ordering, which comes before Y:

0 X Y Y X Y X
X 0 X Y Y X Y
X Y X 0 X Y Y
X Y Y X Y X 0
Y X 0 X Y Y X
Y X Y X 0 X Y
Y Y X Y X 0 X

The last column of this matrix is then output as the re-ordered data.

Why does this aid compression?

Because the rows are simply rotations of the same string, even in the sorted matrix, it is important to remember that the characters in the last column are actually one place before the character in the first column.
In English, every letter of the alphabet is very likely to have some letters before it, and very unlikely to have others before it. For example, s or a will occur very often before t, while yt, bt and ct will hardly ever appear.

As you can see in the sorted matrix, the first column is split into partitions according to the character. Therefore, for the partition of rows starting with t, for example, there will be lots of s's and a's in the last column, and barely any y's, b's or c's. It is this fact that makes compression algorithms work better, long runs of the same character are easy to compress, as is text with a smaller number of symbols present.

How is it possible to regain the original string?

This re-ordered string is now sent off to a variety of compression algorithms, all of which can be reversed by the recipient of the compressed data. However, how is that a seemingly arbitrary permutation of a string can be reversed by someone with no knowledge of the original data?

The tricky part is realising that in order to rebuild the entire matrix, only two columns of data are required: the transmitted column and the first column of the sorted matrix, which is very easy to extract from the transmitted column itself - we just have to sort it!

       transmitted  sorted
           +-+        +-+
0 X Y Y X Y|X|        |0|X Y Y X Y X
X 0 X Y Y X|Y|        |X|0 X Y Y X Y
X Y X 0 X Y|Y|        |X|Y X 0 X Y Y
X Y Y X Y X|0|        |X|Y Y X Y X 0
Y X 0 X Y Y|X|        |Y|X 0 X Y Y X
Y X Y X 0 X|Y|        |Y|X Y X 0 X Y
Y Y X Y X 0|X|        |Y|Y X Y X 0 X
           +-+        +-+

Notice that in the sorted column, the partitions are sorted within themselves, as well as the partitions being sorted with respect to one other.
For example, the Y partition is sorted according to the next character and beyond, seeing as the first characters are all the same. However, the Y's in the last column are also sorted according to the character that follows them and beyond; it just so happens that due to the rotation, the next character has been written at the start of the row.

The order of the Y's in the sorted column is the same as the order of the Y's in the transmitted column. The same is true for the X's!

So the second Y down in the sorted column is the same as the second Y down in the transmitted column! This can be shown by connecting the Y's that are in the same position in the string together:

           +-+        +-+
0 X Y Y X Y|X|        |0|X Y Y X Y X
X 0 X Y Y X|Y|------o |X|0 X Y Y X Y
X Y X 0 X Y|Y|----o | |X|Y X 0 X Y Y
X Y Y X Y X|0|    | | |X|Y Y X Y X 0
Y X 0 X Y Y|X|    | o-|Y|X 0 X Y Y X
Y X Y X 0 X|Y|-o  o---|Y|X Y X 0 X Y
Y Y X Y X 0|X| o------|Y|Y X Y X 0 X
           +-+        +-+

As you can see, none of the lines cross - the Y's are in the same order. This not just a coincidence either; the same operation and test could be performed on a matrix of any size, with any number of characters, but the order of each character in the sorted and transmitted columns would always be the same.

OK, so now we know that fact, how to we build the string from two columns? The key is again that the rows are rotations of the same thing. The n character in the last row must be followed by the n character in the sorted column. At this point we could be stuck as we don't know how to find the character after this. However, because of identical ordering shown above, we know how to get from the sorted column back to the same character in the transmitted column. Therefore, what is required is:

Start with some character: choose the character with index n in the transmitted column for example.
Determine the first character of the original string by looking at the corresponding character in the sorted column: look at index n in the sorted column.
Find the same character in the transmitted column using the ordering of the symbols: if the second character is the m instance of it, look for the m instance of it in the transmitted column.
Use this character as the starting point of Step 1 and repeat until you loop round to the character you started with

All that remains now is to turn this into some sort of an algorithm!

Decoding

As the recipient of the re-ordered string, we now know how to recover the original information, using the steps outlined above.
First, the sorted column is reconstructed by sorting the transmitted string.

transmitted string:

 X Y Y 0 X Y X

sorted string:

 0 X X X Y Y Y

Then, a unique name can be given to each symbol to avoid confusion and help us use the ordering of the characters.

sorted string:

 00 X0 X1 X2 Y0 Y1 Y2

transmitted string:

 X0 Y0 Y1 00 X1 Y2 X2

Remember that due to the order of the symbols, X2 in the sorted string really is the same character as X2 in the transmitted string.
Now, the three steps outlined above can be followed, starting with 00 as the initial symbol.

00 corresponds to X2: output X
X2 corresponds to Y2: output Y
Y2 corresponds to Y1: output Y
Y1 corresponds to X1: output X
X1 corresponds to Y0: output Y
Y0 corresponds to X0: output X
X0 corresponds to 00: output 0

And that's the end of the algorithm - we have reached the symbol we started with: 0

If the recipient doesn't start with 0 as the initial character, there is no problem. Instead of getting the original string as above, some rotation of it would be produced, which can easily be reversed by rotating the string until the end of file marker is at the end, where it should be.

Conclusion

Obviously, the efficacy of any compression algorithm depends entirely on its input; even some supposedly hopeless methods will outperform the `best' algorithms on certain data. However, *nix's bzip2 compression programme, which uses Burrows-Wheeler and Huffman coding, normally creates files 10%-20% smaller than the standard gzip programme. This considerable improvement in compression does come at the cost of encode/decode times increased by a factor of about 2 or 3.
Despite this, most linux kernels are now distributed in the bzip2 format, probably because it will only be compressed and decompressed once, and size is everything when moving large amounts of data over the Internet.

Sources:
University of Cambridge, Computer Laboratory

compression	run-length encoding	bzip2	Huffman coding
Compress	column	Matrix	rotation
catalyst	data compression	permutation	decode
dictionary	algorithm	Linux	kernel
*nix	Lexical	alphabet	Partition
gzip	output	Compression ratio	Effervescence