The Burrows-Wheeler Block-sorting transformation (henceforth BWT, to save my keyboard) is a reversible, lossless data manipulation algorithm originally discovered by David Wheeler at Cambridge in 1983 but not described until a paper written jointly with Mike Burrows for DEC in 1994. Although the algorithm is unavoidably slightly expansive, is sometimes used as a pre-processor in compression software (most notably bzip2) because its output is (usually) significantly more compressible than its input.

Sadly, the reason why this is so is not immediately apparrent when the algorithm is applied to short strings. Nevertheless, since following through an example is the easiest way to explain how the algorithm works, I will demonstrate its application on the word "EFFERVESCENCE"

Stage 1: The Sort

A data block of length n is imagined as a cycle of the same circumference, so that the last byte of the original string is immediately followed by the first byte. This is then expanded into an n x n matrix, with each row shifted by one relative to the previous row:


E F F E R V E S C E N C E
F F E R V E S C E N C E E
F E R V E S C E N C E E F
E R V E S C E N C E E F F
R V E S C E N C E E F F E
V E S C E N C E E F F E R
E S C E N C E E F F E R V
S C E N C E E F F E R V E
C E N C E E F F E R V E S
E N C E E F F E R V E S C
N C E E F F E R V E S C E
C E E F F E R V E S C E N
E E F F E R V E S C E N C

The rows of this matrix are then sorted into lexical order:

C E E F F E R V E S C E N
C E N C E E F F E R V E S
E E F F E R V E S C E N C
E F F E R V E S C E N C E
E N C E E F F E R V E S C
E R V E S C E N C E E F F
E S C E N C E E F F E R V
F E R V E S C E N C E E F
F F E R V E S C E N C E E
N C E E F F E R V E S C E
R V E S C E N C E E F F E
S C E N C E E F F E R V E
V E S C E N C E E F F E R

We take as our output of this stage the final column of this matrix, to which is appended an integer containing the zero-based index of the original first byte of the data block (This is the cause of the slight expansion mentioned above, which is sadly unavoidable as we need to know this number in order to reverse the transformation). In our example here, therefore, our output from stage one of the process is:

"NSCECFVFEEEER",8

Stage two: the move-to-front coder

The move-to-front coder is a simple locally adaptive one-for-one coding mechanism. You start with a stack containing the ASCII character set, and then, for each entry in your string, you return that particular byte's current position in the stack and then move it to position zero in the stack, pushing everything underneath it down as you go. In Hexadecimal notation, the output of encoding the string above, and the final uncompressed output of the BWT, is:

4D, 52, 44, 46, 1, 47, 55, 1, 3, 0, 0, 0, 53, F

Undoing the procedure

Reversing the Move-to-front coding is a simple procedure that simply involves applying the previous section again in reverse. Once this is done, however, there is no simple unsort() function that can transform "NSCECFVFEEEER",8 back to "EFFERVESCENT". To do this we need to know the transformation vector T{} which tells us which rows of the initial matrix ended up where. Whe begin by determining which character comes after each character of the input in the initial string. Because this is simply the first column of the initial matrix, which, by definition, was sorted into lexical order, we can determine this simply by sorting the string:

N C 
S C 
C E 
E E 
C E 
F E 
V E 
F F 
E F 
E N 
E R 
E S 
R V

From this, whe can calculate the transformation vector. the 'N' in row 0 of column 1 clearly moves to row 9 of column 2, so we can say that T{9}=0

Difficulty arises, however, when considering which of the 'C's, 'F's, and 'E's to match with which. Fortunately, however, the choice is not ambiguous as the 'Es' in the first column also appear in sorted order, as they all begin with the same character and are sorted on the second character. Thus the 'F' in row 5 of column 1 matches that in row 7 of column 2, and so on. Following this through to its conclusion, we end up with the vector:

T = { 2, 4, 3, 8, 9, A, B, 5, 7, 0, C, 1, 6 }

From this vector and the output of the sort, we can work out the initial input of the process as follows: Let I0 be the integer we appended to the stage 1 output, and then let Output1 equal the I0th character of the input string and I1 equal T{I0}. Thus:


I0 = 8

O1 = 'E', I1 = 7
O2 = 'F', I2 = 5
O3 = 'F', I3 = A
O4 = 'E', I4 = C

... and so on, until the original input "EFFERVESCENCE" is restored.

So what's the point?

Well, like I said at the start, it's sort of hard to see with only a short string. However, if you look at the output of the move-to-front coder in stage 2, you may notice the beginnings of a definite skew towards low numbers in general and the number 0 in particular. As the size of the block being sorted increases and repeating patterns become more frequent, this skew increases dramatically. And, not by any form of coincedence at all seeing as this is why the algorithm was designed, this same skew is precisely what an entropy-based compression utility requires in order to achieve a good compression ratio.

The choice of compression algorithm used with the BWT is almost limitless. Sadly, however, the highly efficient arithmetic compressor reccommended by Burrows and Wheeler in their paper is covered by several software patents which limit its utility. The most popular implementation of the BWT, the open-source program bzip2, therefore uses the slightly less efficient, but still quite good and patent-free, Huffman coding method.


Sources:
"A Block-sorting Lossless Data Compression Algorithm", M. Burrows, D.J. Wheeler, 10 May 1994: ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.Z (Compressed PostScript)
"Data Compression with the Burrows-Wheeler Transform", article by Mark Nelson for Dr. Dobbs Journal, Sept. 1996: http://www.dogma.net/markn/articles/bwt/bwt.htm
The comp.compression FAQ: http://www.faqs.org/faqs/compression-faq/part2/section-9.html