Huffman codes are a widely used and very effective technique for compressing data: savings of 20% to 90% are typical, depending on the characteristics of the file being compressed.

In order to understand what Huffman codes are and how they can compress data so well, we need to consider the underlying problem of designing a binary character code: a code in which each character in the alphabet is represented by a unique binary string (i.e., a unique sequence of 0's and 1's). Huffman codes are a particular kind of binary character code that is especially well suited to compression.

For a simple example of a binary character code, let's assume we have a text file 100,000 characters in length. The entire file consists of just the letters a, b, c, d, e, and f, so the alphabet size of the file is 6.

Since our alphabet has size 6, we only need six different binary strings (repeated over and over again, in whatever order the characters appear) to represent this text file on an electronic system. For simplicity's sake, let's assume the binary strings should all be the same length, for ease of processing.

If that is the case, each character must be represented with a binary string of length 3: length 2 provides only four unique strings, not enough to represent six letters, while length 4 wastes space on unused strings. Let's take a look at the following table, summarizing the contents of this file:

                             |  a  |  b  |  c  |  d  |  e  |  f  |
-----------------------------|-----|-----|-----|-----|-----|-----|
Frequency (in thousands)     | 53  | 11  | 10  | 14  |  7  |  5  |
Fixed-length codeword        | 000 | 001 | 010 | 011 | 100 | 101 |
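
To double-check the codeword-length claim, here's a quick Python sketch; the 6-letter alphabet is the one from the table above, and the rest is just arithmetic:

```python
import math

alphabet_size = 6

# A binary string of length L can take 2**L distinct values:
# length 2 gives only 4 strings (too few for 6 letters),
# while length 3 gives 8, the smallest power of two >= 6.
print(2 ** 2, 2 ** 3)                       # 4 8
print(math.ceil(math.log2(alphabet_size)))  # 3 bits per character
```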

Using this scheme, it would take 300,000 (i.e., 3 bits/character x 100,000 characters) bits to store the entire file. This is pretty good, but we can easily do better with a variable-length code. Let's use the following variable-length strings to represent the data file:

                             |   a   |   b   |   c   |   d   |   e   |   f   |
-----------------------------|-------|-------|-------|-------|-------|-------|
Frequency (in thousands)     |  53   |  11   |  10   |  14   |   7   |   5   |
Fixed-length codeword        |  000  |  001  |  010  |  011  |  100  |  101  |
Variable-length codeword     |  0    |  101  |  100  |  111  |  1101 |  1100 |

With this variable-length code, it would take 53,000 x 1 + (11,000 + 10,000 + 14,000) x 3 + (7,000 + 5,000) x 4 = 206,000 bits to store the file, a savings of over 31% in storage space.
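
Both totals are easy to verify in a few lines of Python. This sketch just multiplies each character's frequency from the table by the length of its codeword:

```python
# Frequencies in thousands of characters, from the table above.
freqs = {'a': 53, 'b': 11, 'c': 10, 'd': 14, 'e': 7, 'f': 5}

fixed = {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100', 'f': '101'}
variable = {'a': '0', 'b': '101', 'c': '100', 'd': '111', 'e': '1101', 'f': '1100'}

def total_bits(code):
    # frequency (in thousands) x 1000 x codeword length, summed over the alphabet
    return sum(freqs[ch] * 1000 * len(word) for ch, word in code.items())

print(total_bits(fixed))     # 300000
print(total_bits(variable))  # 206000
```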

This seems impressive, but there is a problem: how can we tell the variable-length codewords apart when decoding? With fixed-length codewords it's easy: every third bit marks a boundary, so the decoder always knows where one codeword ends and the next begins. With variable-length codewords, length alone can't delimit them; if, say, 10 and 101 were both codewords, a decoder that had just read the bits 10 couldn't tell whether a codeword had ended.

The solution is to make sure that the variable-length code we use is a prefix code. In other words, we have to make sure that no codeword in the code is a prefix of another codeword. For example, since in the code above 0 is used to represent a, no other codeword can begin with 0. Since 101, 100, and 111 are used to represent b, c, and d, no other codewords can start with 101, 100, or 111, meaning the remaining codewords must all start with 110, which the other two do (1101 and 1100 for e and f, respectively).
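
The prefix property is easy to check mechanically. Here's a small sketch using a simple quadratic scan (for a large code, sorting the codewords or building a trie would be faster):

```python
def is_prefix_code(codewords):
    # A code is a prefix code if no codeword begins another, longer codeword.
    for w1 in codewords:
        for w2 in codewords:
            if w1 != w2 and w2.startswith(w1):
                return False
    return True

print(is_prefix_code(['0', '101', '100', '111', '1101', '1100']))  # True
print(is_prefix_code(['0', '01', '100']))  # False: '0' is a prefix of '01'
```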

A Huffman code is, then, an optimal prefix code: among all prefix codes for a given set of character frequencies, it minimizes the total number of bits needed to encode the file. The most frequent characters get the shortest codewords, and codeword lengths grow as character frequencies shrink. A Huffman code is usually generated by constructing a binary tree using a simple greedy algorithm, which can be demonstrated as follows.

Let's say we need to construct a binary tree using the characters from the file shown above:

(a: 53)  (b: 11)  (c: 10)  (d: 14)  (e: 7)  (f: 5)

Using a greedy algorithm to build a tree out of these means that at each step, we take the two trees whose roots have the lowest values and merge them under a new root whose value is the sum of the two. We don't re-balance the tree; we leave it as is. Let's get going; first, we merge e and f:

(a: 53)  (b: 11)  (c: 10)  (d: 14)       (12)
                                         /  \
                                        /    \
                                    (e: 7)  (f: 5)

Next, we choose the two roots with the lowest values and merge again. This time, c (10) and b (11) are the two lowest:

(a: 53)        (21)        (d: 14)       (12)
               /  \                      /  \
              /    \                    /    \
         (b: 11)  (c: 10)           (e: 7)  (f: 5)

Now, d and the tree made up of e and f have the two lowest root values, so we merge them as follows (note that swapping the left and right children of any node yields an equivalent code, so the orientation of each subtree doesn't matter during construction as long as parent-child relations are preserved):

(a: 53)        (21)                      (26)
               /  \                      /  \
              /    \                    /    \
         (b: 11)  (c: 10)         (d: 14)    (12)
                                             /  \
                                            /    \
                                        (e: 7)  (f: 5)

Merging 21 and 26 into a tree with root 47, then merging that tree with a, gives our final tree:

                    (100)
                    /   \
                   /     \
              (a: 53)    (47)
                         /  \
                        /    \
                       /      \
                      /        \
                   (21)        (26)
                   /  \        /  \
                  /    \      /    \
             (b:11) (c:10) (12)  (d:14)
                           /  \
                          /    \
                      (e: 7)  (f: 5)

Once you have assembled this tree, assigning a 0 to each left branch and a 1 to each right branch reads off your Huffman codes: each character's codeword is the sequence of bits on the path from the root to its leaf. The codes are built optimally, for the least frequent characters are the furthest from the root and thus have the longest codewords. Also, because every character sits at a leaf, no codeword's path continues on to another character, so no codeword is a prefix of another. (Read off this particular tree, the codewords match the table above up to swapped siblings; for example, b gets 100 here rather than 101, which changes nothing about codeword lengths or decodability.)
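
The whole procedure fits in a short Python sketch: build the tree with a priority queue, then walk it assigning 0's and 1's. The heapq-based tuple layout here is one implementation choice among many, and note that ties in the queue may be broken differently than in the hand-drawn merges above, so the exact bitstrings can differ from the table even though the codeword lengths, and hence the 206,000-bit total, come out the same:

```python
import heapq

def huffman_codes(freqs):
    # One single-leaf "tree" per character; each heap entry is
    # (frequency, tie-breaking counter, tree), so tuples always compare.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    # Greedy step: repeatedly merge the two lowest-frequency trees
    # under a new root whose frequency is their sum.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    root = heap[0][2]

    # Walk the finished tree: 0 for each left branch, 1 for each right.
    codes = {}
    def assign(node, bits):
        if isinstance(node, str):         # leaf: a single character
            codes[node] = bits or '0'     # guard for a one-character alphabet
        else:
            left, right = node
            assign(left, bits + '0')
            assign(right, bits + '1')
    assign(root, '')
    return codes

print(huffman_codes({'a': 53, 'b': 11, 'c': 10, 'd': 14, 'e': 7, 'f': 5}))
```

Run on the frequency table above, this produces codeword lengths of 1, 3, 3, 3, 4, and 4 bits for a through f, matching the 206,000-bit total computed earlier.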

Huffman codes, usually assembled from a binary tree as above, are used all over modern computing to compress data. Whenever you listen to an MP3 or open a zip file, you are relying on Huffman codes.