A binary prefix code is a code in which every symbol (for example, letters of the alphabet) is assigned a distinct binary code, and no code is a prefix of any other code. For example, the following is one such code: b, 00; r, 0100; u, 0101; c, 011; e, 1.
One way to construct a binary prefix code is to use a binary tree. Begin by drawing a binary tree in which every internal node has two children, and every edge is labeled either 0 or 1. Label each leaf node with one of the symbols that is to be encoded (for example, letters of the alphabet). (Note that some legitimate binary prefix codes correspond to trees in which some internal nodes have only one child.) The above example, then corresponds to
*
0/ \1
* e
0/ \1
b *
0/ \1
* c
0/ \1
r u
Note that this is not a binary search tree.
To encode a symbol, you start at the symbol's leaf, and then print the path downwards. So to encode `c', you find `c' (this can be found quickly using an array and then print out the path. The first node below the root is to the left, so you print a zero bit. The next node is to the right, so you print a one bit. And then `c' is to the right of that, so you print another one bit.
To decode a string, you begin at the root and go to the left or right as you encounter zeroes and ones. If the bit string is 0001011, you first go to the left. This isn't a leaf, so based on the next zero you go to the left again. Since this is a leaf, you output `b' and start at the top again. The next bit is a zero, so you go to the left. That's not a leaf, so you read the next bit (a one) and go to the right. Still not a leaf, you say, so you take the next bit (a zero) and go to the left. That still isn't a leaf, so you take the next bit, a one, which indicates that you should go to the right. Now we're finally at a leaf, and we output `u'. Then you start at the root again.
Try proving that no code is a prefix of any other code if you use this tree representation.
Suppose that the code for symbol s_1 is a prefix of the code for symbol s_2. For each symbol, there is unique path from the root to the leaf that is labeled with that symbol, and the edges on that path specify the code for that symbol. Consider the path for s_2. Since it is a prefix of the code for s_1, it must pass through the leaf labeled s_1, which is a contradiction, since a leaf has no children.
(The no-prefix property also implies that every symbol has a distinct code.)
A Huffman tree is tree in which the average string length is minimum when frequency of occurrence is taken into account.
For example, suppose that we want to compress the following file:
abcabacababbadabbaThe frequencies of the characters are:
a: 8 b: 7 c: 2 d: 1 total number of characters: 18The Huffman tree for this file is: (We'll explain how to make the tree next.)
*
0/ \1
a *
0/ \1
* b
0/ \1
c d
The binary prefix code maps the letters to strings as follows:
a -> 0 b -> 11 c -> 100 d -> 101 string to encode: a b c a b a c a b a b b a d a b b a compressed output: 0 11 100 0 11 0 100 0 11 0 11 11 0 101 0 11 11 0
Now, to decompress the output: Suppose that we want to decompress the following string:
110101100011You should get ``badcab''.
For this code, average string length is:
1*(8/18)+2*(7/18)+3*(2/18)+3*(1/18) = 1.72Notice that this is better than the average string length of 2 given by a complete binary tree:
*
/ \
* *
/ \ / \
a b c d
How do we build a Huffman tree?
In this example, we start by merging c and d, to form a new internal node with value 3. Then we merge the new internal node with b, to form a new node with value 10. Finally, we merge a with this last internal node to form the root, which has value 18.
Some things to think about:
This is a little more subtle!
Suppose that this were true, and let n be a value such that there is some input file of length n that, when compressed, yields an output file of length less than n. Consider all possible input files of length 1 through n bytes. There are a total of 256^1 + 256^2 + 256^3 + ... + 256^n such files. Since the compressor sometimes reduces the length of a file (and never increases the length of a file) it can't output all 256^n files of length n. Therefore, if we run all of the different files of length 1 through n through the compressor, we'll get fewer than 256^1 + 256^2 + ... + 256^n different output files. This means that there must be two different input files that, when compressed, result in the same output file!