It's fairly common to hear that some compression scheme is a variant of
Lempel-Ziv. Since pretty much any sliding window compression scheme
or any dictionary-based compression scheme can be claimed to be a
variant of LZ, this leaves an awful lot of difference between the
The original algorithm
In 1977, Jakob Ziv and Abraham Lempel published their first algorithm,
which is now known as LZ77. One advantage this algorithm had over
existing algorithms was that it was adaptive: as the compression
process continues, the algorithm gets better and better about
compressing the input stream.
Essentially, the way LZ77 works is to encode pointers to earlier in the
input stream. As a motivating example, if we wanted to encode the
string "Welcome to Everything@Everything2.com", we could encode it as
"Welcome to Everything@<11,10>.com", which is three characters
shorter. (Hopefully, we can come up with some better way of encoding
things than writing the ASCII representations of numbers, but it
demonstrates the point.)
Once we've noticed that we might be able to compress data by encoding
pointers to earlier strings, we need to figure out how we're going to
write the pointers. In LZ77, every time we write some output, we're
writing out three pieces of information:
- a pointer to where the string begins
- the length of the string
- the character that comes after the string.
The paper includes various arguments for how many bits should be used to
encode the pointer and the length; in the example that follows, I'll
assume that we're using 4-bit numbers.
So, let's say we're in the middle of compression, and we're encoding
To be, or not to be: that is the question
x indicates how far we've gotten in the compression.
We look back in the last 16 characters we've encoded, and try and find
the longest string that is a prefix of 'o be: that is...'. The longest
such string is the 4-character long string that starts at location 2 ('o
be'). In this case, we'll write out the values: (2,4,':'). In this
case, it only takes 16 bits to encode 5 characters.
It should be pretty obvious now that using such small numbers to encode
the pointers and lengths isn't going to buy you very much: pretty often
we're not going to find any strings that are prefixes, so we'll have to
encode something like (0,0,'q'), meaning that we're encoding a 0-length
string followed by the letter q. This takes 2 bytes to describe a
single character, which is clearly suboptimal.
It's also important to know how to begin the encoding process: what do
we do at the beginning of a bunch of text? LZ77 assumes that the
beginning of the text is preceded by null-characters.
It turns out that with appropriate choices for the number of bits used
to encode the pointers and lengths, LZ77 does about as well as many
other compression schemes that needed a little bit more information
about what they were encoding.
Algorithms based on LZ77
The LZSS algorithm was published in 1982 by James Storer and Thomas
Szymanski. They noticed that in LZ77, it was common to encode 0-length
string prefixes as described above. Rather than always encoding a
pointer, only encode a pointer if we found a string that will be longer
than the pointer itself. This also means that we'll need some way to
distinguish between pointers and regular characters, so we can send a
bit before each symbol to tell whether it's a pointer or a character.
In the case above, we'd be outputting 17 bits every time we output a
pointer, but only 9 every time we output a character.
This generally achieves a better compression ratio than LZ77, and is
about the same difficulty to encode or decode.
Modifications of LZSS
Variants of LZSS typically apply additional compression to the output of
the LZSS compressor.
LZO was designed as a real-time compression library. The O stands for
Oberhumer, after the author (Markus Franz Xaver Johannes Oberhumer). It
appears that LZO actually implements many different variations, and
there's limited documentation on the web site about what these
variations are, but it's open source, so if you're really interested,
use the source.
Typically, when people refer to "Lempel Ziv," they mean this
The original algorithm
LZ78 abandons the idea of the sliding window. Instead, the algorithm
keeps track of a dictionary of previously seen phrases. When an entry
in the dictionary is seen, it outputs the dictionary index instead.
This eliminates the need to output the length of the encoded string,
since it's part of the dictionary.
In LZ78, we find the largest sequence of characters that's already in
the dictionary. We then add that sequence of characters, plus the next
character in the input stream, into the dictionary. So, to encode the
word "tattle" we'd do the following steps:
- Start out with the empty string "" in the dictionary.
- The longest prefix of "tattle" we've got in the dictionary is "".
Output the dictionary index of "", followed by the code for the
character "t", and add "t" to the dictionary.
- The longest prefix of "attle" we've got is "". Output the
dictionary index of "", followed by the code for the character "a". Add
"a" to the dictionary.
- Now we've got a different prefix... "ttle" starts with "t". Output
its dictionary index, followed by the code for the character "t". Add
"tt" to the dictionary.
The most common modification of LZ78 is LZW, first described by Terry
Welch in 1984. LZW initializes the first 256 entries of the dictionary
with all 256 possible 8-bit values. The LZW algorithm is used in the
TIFF and GIF graphic formats, as well as in the v.42bis modem
specification. The patent for LZW is owned by Unisys, who didn't
assert their ownership of the patent until after Compuserve had made
the GIF file format standard.
LZFG is an algorithm first described by Edward R. Fiala and Daniel H.
Greene in 1989. Rather than storing strings in a dictionary, they
store the strings in a suffix tree. Each branch of the tree at each
level corresponds to a unique letter (note that suffix trees are not
binary trees: if we were encoding a suffix tree of 8-bit characters,
we could have as many as 256 branches at each node.) Leaves are
collapsed to form suffixes. An example of a suffix tree follows:
A description of how to make such a data structure efficient is beyond
the scope of this write-up. The main advantage of this algorithm is
that it uses suffix trees and a specialized encoding scheme to indicate
strings in the tree to achieve better compression. The biggest
advantage of this scheme over LZ78 is that only a single copy of each
sequence is inserted into the tree, thereby saving storage space.
- http://wombat.doc.ic.ac.uk/foldoc/ - free online dictionary of
p490-fiala.pdf - LZFG
- http://www.oberhumer.com/opensource/lzo/ - the lzo homepage
- The Data Compression Book, by Mark Nelson and Jean-Loup
- Various other sites found via Google