6 Mar 1996

Outline

Binary prefix codes
Huffman trees

Binary prefix codes

A binary prefix code is a code in which every symbol (for example, letters of the alphabet) is assigned a distinct binary code, and no code is a prefix of any other code. For example, the following is one such code: b, 00; r, 0100; u, 0101; c, 011; e, 1.

One way to construct a binary prefix code is to use a binary tree. Begin by drawing a binary tree in which every internal node has two children, and every edge is labeled either 0 or 1. Label each leaf node with one of the symbols that is to be encoded (for example, letters of the alphabet). (Note that some legitimate binary prefix codes correspond to trees in which some internal nodes have only one child.) The above example, then corresponds to

Note that this is not a binary search tree.

To encode a symbol, you start at the symbol's leaf, and then print the path downwards. So to encode `c', you find `c' (this can be found quickly using an array and then print out the path. The first node below the root is to the left, so you print a zero bit. The next node is to the right, so you print a one bit. And then `c' is to the right of that, so you print another one bit.

To decode a string, you begin at the root and go to the left or right as you encounter zeroes and ones. If the bit string is 0001011, you first go to the left. This isn't a leaf, so based on the next zero you go to the left again. Since this is a leaf, you output `b' and start at the top again. The next bit is a zero, so you go to the left. That's not a leaf, so you read the next bit (a one) and go to the right. Still not a leaf, you say, so you take the next bit (a zero) and go to the left. That still isn't a leaf, so you take the next bit, a one, which indicates that you should go to the right. Now we're finally at a leaf, and we output `u'. Then you start at the root again.

Try proving that no code is a prefix of any other code if you use this tree representation.

Suppose that the code for symbol s_1 is a prefix of the code for symbol s_2. For each symbol, there is unique path from the root to the leaf that is labeled with that symbol, and the edges on that path specify the code for that symbol. Consider the path for s_2. Since it is a prefix of the code for s_1, it must pass through the leaf labeled s_1, which is a contradiction, since a leaf has no children.

(The no-prefix property also implies that every symbol has a distinct code.)

Huffman Trees

A Huffman tree is tree in which the average string length is minimum when frequency of occurrence is taken into account.

For example, suppose that we want to compress the following file:

abcabacababbadabba

The frequencies of the characters are:

a: 8
b: 7
c: 2
d: 1
total number of characters: 18

The Huffman tree for this file is: (We'll explain how to make the tree next.)

     *
   0/ \1
   a   *
     0/ \1
     *   b
   0/ \1
   c   d

The binary prefix code maps the letters to strings as follows:

a -> 0
b -> 11
c -> 100
d -> 101

string to encode:  a b  c   a b  a c   a b  a b  b  a d   a b  b  a
compressed output: 0 11 100 0 11 0 100 0 11 0 11 11 0 101 0 11 11 0

Now, to decompress the output: Suppose that we want to decompress the following string:

110101100011

You should get ``badcab''.

For this code, average string length is:

1*(8/18)+2*(7/18)+3*(2/18)+3*(1/18) = 1.72

Notice that this is better than the average string length of 2 given by a complete binary tree:

      *
    /   \
  *       *
 / \     / \
a   b   c   d

How do we build a Huffman tree?

Start with a set of nodes, one for each character (e.g., 'a', 'b', 'c', 'd'), each of which has a value equal to its frequency.
If there is only one node in the set, terminate.
Find and remove the pair of nodes in the set with minimum values.
Insert into the set a new node with this pair as children, whose value is the sum of their values.
Goto step two.

In this example, we start by merging c and d, to form a new internal node with value 3. Then we merge the new internal node with b, to form a new node with value 10. Finally, we merge a with this last internal node to form the root, which has value 18.

Some things to think about:

Why does this algorithm produce the code with the minimum average string length. (This proof is by induction)
A compression algorithm cannot guarantee that it will produce a compressed file that is shorter than the original. (Otherwise, we'd run it repeatedly - what would we end up with?)
A compression algorithm cannot sometimes produce a file that is shorter than the original, but also guarantee that it will never produce a compressed file that is longer than the original.

This is a little more subtle!

Suppose that this were true, and let n be a value such that there is some input file of length n that, when compressed, yields an output file of length less than n. Consider all possible input files of length 1 through n bytes. There are a total of 256^1 + 256^2 + 256^3 + ... + 256^n such files. Since the compressor sometimes reduces the length of a file (and never increases the length of a file) it can't output all 256^n files of length n. Therefore, if we run all of the different files of length 1 through n through the compressor, we'll get fewer than 256^1 + 256^2 + ... + 256^n different output files. This means that there must be two different input files that, when compressed, result in the same output file!