Statistical compression algorithms are a class of lossless compression
algorithms based on the probability that certain characters will
occur.
RLE
RLE, or run-length encoding, is perhaps the most primitive of these
sorts of algorithms. It is based on the assumption that characters in
the input stream will occur in clumps. This assumption is rarely true
in things like text, but more often is true when considering
bitmapped images, especially when using indexed color schemes (as in
GIF and BMP). It turns out that RLE was one of the few forms of
compression that Windows 3.0 knew about, which was fine, because many
displays at the time could only show 16 colors (and therefore images
tended to have large areas of a single color).
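Here's a rough sketch of the idea in Python; the (count, value) byte-pair format below is just for illustration, and isn't the format Windows or BMP actually use:

    # Run-length encode as (count, value) byte pairs; runs are capped at 255
    # so the count fits in one byte.
    def rle_encode(data):
        out = bytearray()
        i = 0
        while i < len(data):
            run = 1
            while i + run < len(data) and data[i + run] == data[i] and run < 255:
                run += 1
            out.append(run)
            out.append(data[i])
            i += run
        return bytes(out)

    def rle_decode(data):
        out = bytearray()
        for count, value in zip(data[0::2], data[1::2]):
            out.extend([value] * count)
        return bytes(out)

    # "Clumpy" input, like a scanline of indexed-color pixels, compresses well:
    assert rle_decode(rle_encode(b"aaaaaaaabbbbcccc")) == b"aaaaaaaabbbbcccc"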
Shannon-Fano Coding
Devised independently by Claude Shannon and Robert Fano around 1949,
Shannon-Fano coding is a way of assigning variable-length bit patterns
to characters such that no code is a prefix of another, so an encoded
stream can be decoded unambiguously. This encoding is usually more
efficient than a fixed-length encoding, but it's not too hard to do better.
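A small sketch of the usual top-down construction: sort the symbols by frequency, split them into two groups of roughly equal total weight, hand one group a leading 0 and the other a leading 1, and recurse:

    from collections import Counter

    def shannon_fano(symbols):
        """symbols: a list of (symbol, count) pairs, sorted by descending count."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(count for _, count in symbols)
        # Pick the split point that makes the two halves' weights as equal
        # as possible.
        best_split, best_diff, running = 1, float("inf"), 0
        for i in range(1, len(symbols)):
            running += symbols[i - 1][1]
            diff = abs(total - 2 * running)   # |right weight - left weight|
            if diff < best_diff:
                best_split, best_diff = i, diff
        codes = {}
        for sym, code in shannon_fano(symbols[:best_split]).items():
            codes[sym] = "0" + code
        for sym, code in shannon_fano(symbols[best_split:]).items():
            codes[sym] = "1" + code
        return codes

    freqs = Counter("this is an example of shannon fano coding")
    table = shannon_fano(sorted(freqs.items(), key=lambda kv: -kv[1]))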
Huffman Coding
A method attributed to David Huffman.
The basic idea is that you have some sort of probability distribution
for how often characters occur in the input stream. From this, you can
build a binary tree that tells you which bit pattern to use to encode
each character.
If each character has a probability of appearing that's equal to
1/2^n for some n, Huffman coding is optimal.
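A minimal sketch of the tree-building step in Python; the tie-breaking counter is only there to keep Python's tuple comparisons happy and isn't part of the algorithm:

    import heapq
    from collections import Counter
    from itertools import count

    def huffman_codes(text):
        freqs = Counter(text)
        tie = count()
        # Each heap entry is (weight, tiebreak, tree); a tree is either a
        # single symbol or a (left, right) pair of subtrees.
        heap = [(w, next(tie), sym) for sym, w in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Repeatedly merge the two lowest-weight trees.
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):        # internal node: recurse
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                              # leaf: a single symbol
                codes[tree] = prefix or "0"
        walk(heap[0][2])
        return codes

    print(huffman_codes("abracadabra"))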
Arithmetic Coding
Arithmetic coding is a further improvement of these techniques. It
turns out that it's possible to think of any stream of binary digits as
encoding a number between 0 and 1. (Just imagine sticking a "0." in
front of the digits, and you've got a fractional binary number.)
With this in mind, it's possible to divide up the interval between 0 and
1 on the real number line into segments according to which letter is
supposed to come next. For example, if 'a' occurred half the time, and
'b' and 'c' both occurred a quarter of the time, you can split up the
interval [0,1) into the intervals [0,0.5) for "a", [0.5,0.75)
for "b", and [0.75,1) for "c". For the next letter, simply continue
subdividing the interval for the given message, and what you're
effectively doing is finding the interval that corresponds to your
message. The last step is to convert that interval into a binary number
somewhere in that range, and there's your message. (Most implementations
do this conversion as they narrow down the interval, using integer
arithmetic, which tends to be much faster than floating-point arithmetic.)
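Here's a toy sketch of that interval narrowing, using the a/b/c model above and plain floats; as just noted, real coders use scaled integer arithmetic and renormalize as they go, and passing the message length to the decoder stands in for the end-of-file symbol mentioned below:

    # Probabilities: 'a' half the time, 'b' and 'c' a quarter each.
    MODEL = {"a": (0.0, 0.5), "b": (0.5, 0.75), "c": (0.75, 1.0)}

    def encode_interval(message):
        low, high = 0.0, 1.0
        for ch in message:
            span = high - low
            ch_low, ch_high = MODEL[ch]
            # Zoom into the sub-interval assigned to this character.
            low, high = low + span * ch_low, low + span * ch_high
        return low, high

    def decode_interval(value, length):
        out, low, high = [], 0.0, 1.0
        for _ in range(length):
            span = high - low
            pos = (value - low) / span
            for ch, (ch_low, ch_high) in MODEL.items():
                if ch_low <= pos < ch_high:
                    out.append(ch)
                    low, high = low + span * ch_low, low + span * ch_high
                    break
        return "".join(out)

    low, high = encode_interval("abca")
    # Any number in [low, high) decodes back to the message (given its length).
    assert decode_interval(low, 4) == "abca"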
One of the strange consequences of this is that you can actually encode
a single character in less than one bit (amortized over time, of
course). In order to remove ambiguity, it might also be necessary to
encode an "end-of-file" character in addition to all the possible characters
you might be encoding. Unfortunately, this advantage comes at the cost
of increased computation. In practice, Huffman codes are more common,
even though they don't compress quite as well, because they are easier
to compute (both from the standpoint of getting the code right and the
amount of processor power needed to encode and decode the streams).
Adaptive vs. Static Encodings
One problem with the previous three encodings is that the decoder needs
to know the probability distribution before it starts the decoding
process. It's also a problem from the encoder's point of view: if it
doesn't know the probability distribution a priori, it needs
to scan through the entire input stream and compute the probabilities
before rewinding to the beginning and encoding the stream. If you're
encoding a real-time stream, and you don't know the probability
distribution, this is less than ideal.
One way to solve this is to compress in chunks. You read in a whole
chunk, calculate the statistics, dump the probability distribution, dump
the compressed chunk, and repeat, each time resetting the statistics to
zero.
Another solution is to use an adaptive scheme. In an adaptive scheme,
you assume at the beginning that the probability distribution is
uniform. That is, every character has an equal chance of occurring.
Each time you encode a symbol, you update the probability of that
symbol, and readjust whatever coding mechanism you're using. This sort
of scheme has the advantage that you don't even have to know how much
data you're going to compress: you just take it as it comes in. It also
deals well with input streams that change characteristics over time
(especially if you do something clever like only keep track of the last
n bytes when calculating the probability distribution).
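A sketch of the model-updating half of such a scheme (the actual coder is omitted); both encoder and decoder start from uniform counts, and the fixed-size window implements the "only keep track of the last n bytes" trick:

    from collections import Counter, deque

    class AdaptiveModel:
        def __init__(self, alphabet, window=4096):
            # Start every symbol at count 1, i.e. a uniform distribution.
            self.counts = Counter({sym: 1 for sym in alphabet})
            self.history = deque()
            self.window = window

        def probability(self, sym):
            return self.counts[sym] / sum(self.counts.values())

        def update(self, sym):
            # Slide the window: forget the oldest symbol once it's full.
            if len(self.history) == self.window:
                old = self.history.popleft()
                if self.counts[old] > 1:
                    self.counts[old] -= 1
            self.history.append(sym)
            self.counts[sym] += 1

    model = AdaptiveModel(bytes(range(256)))
    for sym in b"the quick brown fox":
        p = model.probability(sym)   # a coder would use p before updating
        model.update(sym)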
References
- The Data Compression Book, by Mark Nelson and Jean-Loup Gailly.
- The Shannon-Fano coding node, by alisdair.
- A /msg from kozmund (about using integer math for arithmetic coding).