Statistical compression algorithms are a class of lossless compression
algorithms based on the probability that certain characters will
occur.
RLE
RLE, or run-length encoding, is perhaps the most primitive of these
sorts of algorithms. It is based on the assumption that characters in
the input stream will occur in clumps. This assumption is rarely true
in things like text, but more often is true when considering
bitmapped images, especially when using indexed color schemes (as in
GIF and BMP). It turns out that RLE was one of the few forms of
compression that Windows 3.0 knew about, which was fine, because many
displays at the time could only show 16 colors (and therefore images
tended to have large areas of a single color).
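Here's a rough sketch of the idea in Python; the (count, value) byte-pair format below is just for illustration, and isn't the format Windows or BMP actually use:

    # Run-length encode as (count, value) byte pairs; runs are capped at 255
    # so the count fits in one byte.
    def rle_encode(data):
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out.append(run)
            out.append(data[i])
            i += run
        return bytes(out)

    def rle_decode(data):
        out = bytearray()
        for count, value in zip(data[0::2], data[1::2]):
            out.extend([value] * count)
        return bytes(out)

    # "Clumpy" input, like a scanline of indexed-color pixels, compresses well:
    assert rle_decode(rle_encode(b"aaaaaaaabbbbcccc")) == b"aaaaaaaabbbbcccc"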
Shannon-Fano Coding
Devised independently by Claude Shannon and Robert Fano around 1949,
Shannon-Fano coding is a way of assigning variable-length bit patterns
to characters such that no code is a prefix of another, so an encoded
stream can be decoded unambiguously. This encoding is usually more
efficient than a fixed-length encoding, but it's not too hard to do better.
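A small sketch of the usual top-down construction: sort the symbols by frequency, split them into two groups of roughly equal total weight, hand one group a leading 0 and the other a leading 1, and recurse:

    from collections import Counter

    def shannon_fano(symbols):
        """symbols: a list of (symbol, count) pairs, sorted by descending count."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(count for _, count in symbols)
        # Pick the split point that makes the two halves' weights as equal
        # as possible.
        best_split, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(symbols)):
            running += symbols[i - 1][1]
            diff = abs(total - 2 * running)   # |right weight - left weight|
            if diff < best_diff:
                best_split, best_diff = i, diff
        codes = {}
        for sym, code in shannon_fano(symbols[:best_split]).items():
            codes[sym] = "0" + code
        for sym, code in shannon_fano(symbols[best_split:]).items():
            codes[sym] = "1" + code
        return codes

    freqs = Counter("this is an example of shannon fano coding")
    table = shannon_fano(sorted(freqs.items(), key=lambda kv: -kv[1]))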
Huffman Coding
A method attributed to David Huffman.
The basic idea is that you have some sort of probability distribution
for how often characters occur in the input stream. From this, you can
build a binary tree that tells you which bit pattern to use to encode
each character.
If each character has a probability of appearing that's equal to
1/2^n for some n, Huffman coding is optimal.
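A minimal sketch of the tree-building step in Python; the tie-breaking counter is only there to keep Python's tuple comparisons happy and isn't part of the algorithm:

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_codes(text):
        freqs = Counter(text)
        tie = count()
        # Each heap entry is (weight, tiebreak, tree); a tree is either a
        # single symbol or a (left, right) pair of subtrees.
        heap = [(w, next(tie), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Repeatedly merge the two lowest-weight trees.
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):        # internal node: recurse
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                              # leaf: a single symbol
                codes[tree] = prefix or "0"
        walk(heap[0][2])
        return codes

    print(huffman_codes("abracadabra"))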
Arithmetic Coding
Arithmetic coding is a further improvement of these techniques. It
turns out that it's possible to think of any stream of binary digits as
encoding a number between 0 and 1. (Just imagine sticking a "0." in
front of the digits, and you've got a fractional binary number.)
With this in mind, it's possible to divide up the interval between 0 and
1 on the real number line into segments according to which letter is
supposed to come next. For example, if 'a' occurred half the time, and
'b' and 'c' both occurred a quarter of the time, you can split up the
interval [0,1) into the intervals [0,0.5) for "a", [0.5,0.75)
for "b", and [0.75,1) for "c". For the next letter, simply continue
subdividing the interval for the given message, and what you're
effectively doing is finding the interval that corresponds to your
message. The last step is to convert that interval into a binary number
somewhere in that range, and there's your message. (Most implementations
do this conversion as they narrow down the interval, using integer
arithmetic, which tends to be much faster than floating-point arithmetic.)
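Here's a toy sketch of that interval narrowing, using the a/b/c model above and plain floats; as just noted, real coders use scaled integer arithmetic and renormalize as they go, and passing the message length to the decoder stands in for the end-of-file symbol mentioned below:

    # Probabilities: 'a' half the time, 'b' and 'c' a quarter each.
    MODEL = {"a": (0.0, 0.5), "b": (0.5, 0.75), "c": (0.75, 1.0)}

    def encode_interval(message):
        low, high = 0.0, 1.0
        for ch in message:
            span = high - low
            ch_low, ch_high = MODEL[ch]
            # Zoom into the sub-interval assigned to this character.
            low, high = low + span * ch_low, low + span * ch_high
        return low, high

    def decode_interval(value, length):
        out, low, high = [], 0.0, 1.0
        for _ in range(length):
            span = high - low
            pos = (value - low) / span
            for ch, (ch_low, ch_high) in MODEL.items():
                if ch_low <= pos < ch_high:
                    out.append(ch)
                    low, high = low + span * ch_low, low + span * ch_high
                    break
        return "".join(out)

    low, high = encode_interval("abca")
    # Any number in [low, high) decodes back to the message (given its length).
    assert decode_interval(low, 4) == "abca"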
One of the strange consequences of this is that you can actually encode
a single character in less than one bit (amortized over time, of
course). In order to remove ambiguity, it might also be necessary to
encode an "end-of-file" character in addition to all the possible characters
you might be encoding. Unfortunately, this advantage comes at the cost
of increased computation. In practice, Huffman codes are more common,
even though they don't compress quite as well, because they are easier
to compute (both from the standpoint of getting the code right and the
amount of processor power needed to encode and decode the streams).
Adaptive vs. Static Encodings
One problem with the previous three encodings is that the decoder needs
to know the probability distribution before it starts the decoding
process. It's also a problem from the encoder's point of view: if it
doesn't know the probability distribution a priori, it needs
to scan through the entire input stream and compute the probabilities
before rewinding to the beginning and encoding the stream. If you're
encoding a real-time stream, and you don't know the probability
distribution, this is less than ideal.
One way to solve this is to compress in chunks. You read in a whole
chunk, calculate the statistics, dump the probability distribution, dump
the compressed chunk, and repeat, each time resetting the statistics to
zero.
Another solution is to use an adaptive scheme. In an adaptive scheme,
you assume at the beginning that the probability distribution is
uniform. That is, every character has an equal chance of occurring.
Each time you encode a symbol, you update the probability of that
symbol, and readjust whatever coding mechanism you're using. This sort
of scheme has the advantage that you don't even have to know how much
data you're going to compress: you just take it as it comes in. It also
deals well with input streams that change characteristics over time
(especially if you do something clever like only keep track of the last
n bytes when calculating the probability distribution).
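A sketch of the model-updating half of such a scheme (the actual coder is omitted); both encoder and decoder start from uniform counts, and the fixed-size window implements the "only keep track of the last n bytes" trick:

    from collections import Counter, deque

    class AdaptiveModel:
        def __init__(self, alphabet, window=4096):
            # Start every symbol at count 1, i.e. a uniform distribution.
            self.counts = Counter({sym: 1 for sym in alphabet})
            self.history = deque()
            self.window = window

        def probability(self, sym):
            return self.counts[sym] / sum(self.counts.values())

        def update(self, sym):
            # Slide the window: forget the oldest symbol once it's full.
            if len(self.history) == self.window:
                old = self.history.popleft()
                if self.counts[old] > 1:
                    self.counts[old] -= 1
            self.history.append(sym)
            self.counts[sym] += 1

    model = AdaptiveModel(bytes(range(256)))
    for sym in b"the quick brown fox":
        p = model.probability(sym)   # a coder would use p before updating
        model.update(sym)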
References
- The Data Compression Book, by Mark Nelson and Jean-Loup Gailly.
- The Shannon-Fano coding node, by alisdair.
- A /msg from kozmund (about using integer math for arithmetic coding).