Burrows-Wheeler block-sorting transformation algorithm

The Burrows-Wheeler Block-sorting transformation (henceforth BWT, to save my keyboard) is a reversible, lossless data manipulation algorithm originally discovered by David Wheeler at Cambridge in 1983 but not described until a paper written jointly with Mike Burrows for DEC in 1994. Although the algorithm is unavoidably slightly expansive, is sometimes used as a pre-processor in compression software (most notably bzip2) because its output is (usually) significantly more compressible than its input.

Sadly, the reason why this is so is not immediately apparrent when the algorithm is applied to short strings. Nevertheless, since following through an example is the easiest way to explain how the algorithm works, I will demonstrate its application on the word "EFFERVESCENCE"

Stage 1: The Sort

A data block of length n is imagined as a cycle of the same circumference, so that the last byte of the original string is immediately followed by the first byte. This is then expanded into an n x n matrix, with each row shifted by one relative to the previous row:



E F F E R V E S C E N C E
F F E R V E S C E N C E E
F E R V E S C E N C E E F
E R V E S C E N C E E F F
R V E S C E N C E E F F E
V E S C E N C E E F F E R
E S C E N C E E F F E R V
S C E N C E E F F E R V E
C E N C E E F F E R V E S
E N C E E F F E R V E S C
N C E E F F E R V E S C E
C E E F F E R V E S C E N
E E F F E R V E S C E N C

The rows of this matrix are then sorted into lexical order:


C E E F F E R V E S C E N
C E N C E E F F E R V E S
E E F F E R V E S C E N C
E F F E R V E S C E N C E
E N C E E F F E R V E S C
E R V E S C E N C E E F F
E S C E N C E E F F E R V
F E R V E S C E N C E E F
F F E R V E S C E N C E E
N C E E F F E R V E S C E
R V E S C E N C E E F F E
S C E N C E E F F E R V E
V E S C E N C E E F F E R

We take as our output of this stage the final column of this matrix, to which is appended an integer containing the zero-based index of the original first byte of the data block (This is the cause of the slight expansion mentioned above, which is sadly unavoidable as we need to know this number in order to reverse the transformation). In our example here, therefore, our output from stage one of the process is:

"NSCECFVFEEEER",8

Stage two: the move-to-front coder

The move-to-front coder is a simple locally adaptive one-for-one coding mechanism. You start with a stack containing the ASCII character set, and then, for each entry in your string, you return that particular byte's current position in the stack and then move it to position zero in the stack, pushing everything underneath it down as you go. In Hexadecimal notation, the output of encoding the string above, and the final uncompressed output of the BWT, is:

4D, 52, 44, 46, 1, 47, 55, 1, 3, 0, 0, 0, 53, F

Undoing the procedure

Reversing the Move-to-front coding is a simple procedure that simply involves applying the previous section again in reverse. Once this is done, however, there is no simple unsort() function that can transform "NSCECFVFEEEER",8 back to "EFFERVESCENT". To do this we need to know the transformation vector T{} which tells us which rows of the initial matrix ended up where. Whe begin by determining which character comes after each character of the input in the initial string. Because this is simply the first column of the initial matrix, which, by definition, was sorted into lexical order, we can determine this simply by sorting the string:


N C 
S C 
C E 
E E 
C E 
F E 
V E 
F F 
E F 
E N 
E R 
E S 
R V

From this, whe can calculate the transformation vector. the 'N' in row 0 of column 1 clearly moves to row 9 of column 2, so we can say that T{9}=0

Difficulty arises, however, when considering which of the 'C's, 'F's, and 'E's to match with which. Fortunately, however, the choice is not ambiguous as the 'Es' in the first column also appear in sorted order, as they all begin with the same character and are sorted on the second character. Thus the 'F' in row 5 of column 1 matches that in row 7 of column 2, and so on. Following this through to its conclusion, we end up with the vector:

T = { 2, 4, 3, 8, 9, A, B, 5, 7, 0, C, 1, 6 }

From this vector and the output of the sort, we can work out the initial input of the process as follows: Let I₀ be the integer we appended to the stage 1 output, and then let Output₁ equal the I₀th character of the input string and I₁ equal T{I₀}. Thus:


I0 = 8

O1 = 'E', I1 = 7
O2 = 'F', I2 = 5
O3 = 'F', I3 = A
O4 = 'E', I4 = C

... and so on, until the original input "EFFERVESCENCE" is restored.

So what's the point?

Well, like I said at the start, it's sort of hard to see with only a short string. However, if you look at the output of the move-to-front coder in stage 2, you may notice the beginnings of a definite skew towards low numbers in general and the number 0 in particular. As the size of the block being sorted increases and repeating patterns become more frequent, this skew increases dramatically. And, not by any form of coincedence at all seeing as this is why the algorithm was designed, this same skew is precisely what an entropy-based compression utility requires in order to achieve a good compression ratio.

The choice of compression algorithm used with the BWT is almost limitless. Sadly, however, the highly efficient arithmetic compressor reccommended by Burrows and Wheeler in their paper is covered by several software patents which limit its utility. The most popular implementation of the BWT, the open-source program bzip2, therefore uses the slightly less efficient, but still quite good and patent-free, Huffman coding method.

Sources:
"A Block-sorting Lossless Data Compression Algorithm", M. Burrows, D.J. Wheeler, 10 May 1994: ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.Z (Compressed PostScript)
"Data Compression with the Burrows-Wheeler Transform", article by Mark Nelson for Dr. Dobbs Journal, Sept. 1996: http://www.dogma.net/markn/articles/bwt/bwt.htm
The comp.compression FAQ: http://www.faqs.org/faqs/compression-faq/part2/section-9.html

Burrows-Wheeler is unusual in the field of data compression in that it doesn't compress the data in any way at all: it just re-orders it. However, this re-ordered data has two very attractive and useful properties.

It is possible to cheaply recover the original data from the re-ordered output.
The re-ordered data tends to much easier to compress than the original data.

Burrows-Wheeler is therefore used as a catalyst for actual compression algorithms such as run-length encoding, move-to-front buffering and Huffman coding.

Encoding

No code or implementation-specific details will be given here, instead I will use an abstract overview of the algorithm to encode an example string: XYYXYX0 where 0 is an end of file marker that both parties know will be unique, and last, in the original string. I will also do more work than is strictly required, in order to make the algorithm clearer, and to help show why it works.
The first step is to create a matrix of all the possible rotations of the string:

X Y Y X Y X 0
Y Y X Y X 0 X
Y X Y X 0 X Y
X Y X 0 X Y Y
Y X 0 X Y Y X
X 0 X Y Y X Y
0 X Y Y X Y X

Now this matrix's rows are sorted according to a lexicographic ordering (the order you get in a dictionary, i.e. sort by first character, then second etc.). For the sake of argument, the end of file marker comes before X in ordering, which comes before Y:

0 X Y Y X Y X
X 0 X Y Y X Y
X Y X 0 X Y Y
X Y Y X Y X 0
Y X 0 X Y Y X
Y X Y X 0 X Y
Y Y X Y X 0 X

The last column of this matrix is then output as the re-ordered data.

Why does this aid compression?

Because the rows are simply rotations of the same string, even in the sorted matrix, it is important to remember that the characters in the last column are actually one place before the character in the first column.
In English, every letter of the alphabet is very likely to have some letters before it, and very unlikely to have others before it. For example, s or a will occur very often before t, while yt, bt and ct will hardly ever appear.

As you can see in the sorted matrix, the first column is split into partitions according to the character. Therefore, for the partition of rows starting with t, for example, there will be lots of s's and a's in the last column, and barely any y's, b's or c's. It is this fact that makes compression algorithms work better, long runs of the same character are easy to compress, as is text with a smaller number of symbols present.

How is it possible to regain the original string?

This re-ordered string is now sent off to a variety of compression algorithms, all of which can be reversed by the recipient of the compressed data. However, how is that a seemingly arbitrary permutation of a string can be reversed by someone with no knowledge of the original data?

The tricky part is realising that in order to rebuild the entire matrix, only two columns of data are required: the transmitted column and the first column of the sorted matrix, which is very easy to extract from the transmitted column itself - we just have to sort it!

       transmitted  sorted
           +-+        +-+
0 X Y Y X Y|X|        |0|X Y Y X Y X
X 0 X Y Y X|Y|        |X|0 X Y Y X Y
X Y X 0 X Y|Y|        |X|Y X 0 X Y Y
X Y Y X Y X|0|        |X|Y Y X Y X 0
Y X 0 X Y Y|X|        |Y|X 0 X Y Y X
Y X Y X 0 X|Y|        |Y|X Y X 0 X Y
Y Y X Y X 0|X|        |Y|Y X Y X 0 X
           +-+        +-+

Notice that in the sorted column, the partitions are sorted within themselves, as well as the partitions being sorted with respect to one other.
For example, the Y partition is sorted according to the next character and beyond, seeing as the first characters are all the same. However, the Y's in the last column are also sorted according to the character that follows them and beyond; it just so happens that due to the rotation, the next character has been written at the start of the row.

The order of the Y's in the sorted column is the same as the order of the Y's in the transmitted column. The same is true for the X's!

So the second Y down in the sorted column is the same as the second Y down in the transmitted column! This can be shown by connecting the Y's that are in the same position in the string together:

           +-+        +-+
0 X Y Y X Y|X|        |0|X Y Y X Y X
X 0 X Y Y X|Y|------o |X|0 X Y Y X Y
X Y X 0 X Y|Y|----o | |X|Y X 0 X Y Y
X Y Y X Y X|0|    | | |X|Y Y X Y X 0
Y X 0 X Y Y|X|    | o-|Y|X 0 X Y Y X
Y X Y X 0 X|Y|-o  o---|Y|X Y X 0 X Y
Y Y X Y X 0|X| o------|Y|Y X Y X 0 X
           +-+        +-+

As you can see, none of the lines cross - the Y's are in the same order. This not just a coincidence either; the same operation and test could be performed on a matrix of any size, with any number of characters, but the order of each character in the sorted and transmitted columns would always be the same.

OK, so now we know that fact, how to we build the string from two columns? The key is again that the rows are rotations of the same thing. The n character in the last row must be followed by the n character in the sorted column. At this point we could be stuck as we don't know how to find the character after this. However, because of identical ordering shown above, we know how to get from the sorted column back to the same character in the transmitted column. Therefore, what is required is:

Start with some character: choose the character with index n in the transmitted column for example.
Determine the first character of the original string by looking at the corresponding character in the sorted column: look at index n in the sorted column.
Find the same character in the transmitted column using the ordering of the symbols: if the second character is the m instance of it, look for the m instance of it in the transmitted column.
Use this character as the starting point of Step 1 and repeat until you loop round to the character you started with

All that remains now is to turn this into some sort of an algorithm!

Decoding

As the recipient of the re-ordered string, we now know how to recover the original information, using the steps outlined above.
First, the sorted column is reconstructed by sorting the transmitted string.

transmitted string:

 X Y Y 0 X Y X

sorted string:

 0 X X X Y Y Y

Then, a unique name can be given to each symbol to avoid confusion and help us use the ordering of the characters.

sorted string:

 00 X0 X1 X2 Y0 Y1 Y2

transmitted string:

 X0 Y0 Y1 00 X1 Y2 X2

Remember that due to the order of the symbols, X2 in the sorted string really is the same character as X2 in the transmitted string.
Now, the three steps outlined above can be followed, starting with 00 as the initial symbol.

00 corresponds to X2: output X
X2 corresponds to Y2: output Y
Y2 corresponds to Y1: output Y
Y1 corresponds to X1: output X
X1 corresponds to Y0: output Y
Y0 corresponds to X0: output X
X0 corresponds to 00: output 0

And that's the end of the algorithm - we have reached the symbol we started with: 0

If the recipient doesn't start with 0 as the initial character, there is no problem. Instead of getting the original string as above, some rotation of it would be produced, which can easily be reversed by rotating the string until the end of file marker is at the end, where it should be.

Conclusion

Obviously, the efficacy of any compression algorithm depends entirely on its input; even some supposedly hopeless methods will outperform the `best' algorithms on certain data. However, *nix's bzip2 compression programme, which uses Burrows-Wheeler and Huffman coding, normally creates files 10%-20% smaller than the standard gzip programme. This considerable improvement in compression does come at the cost of encode/decode times increased by a factor of about 2 or 3.
Despite this, most linux kernels are now distributed in the bzip2 format, probably because it will only be compressed and decompressed once, and size is everything when moving large amounts of data over the Internet.

Sources:
University of Cambridge, Computer Laboratory

compression	run-length encoding	bzip2	Huffman coding
Compress	column	Matrix	rotation
catalyst	data compression	permutation	decode
dictionary	algorithm	Linux	kernel
*nix	Lexical	alphabet	Partition
gzip	output	Compression ratio	Effervescence