Normally we store data using a fixed-length code (such as ASCII). This is simple, but assigning shorter codes to more frequent characters uses less space.
We examine a compression algorithm called Huffman encoding.
We can regard ASCII as a tree.
                        *
                     0/   \1
                    *       *
                  0/ \1   0/ \1
                  .   .   .   .
                  .   (eight levels of 0/1 edges;  .
                  .    every character is a leaf   .
                  .    at depth 8)                 .
                 /\   /\          /\   /\
             ... 0 1  2 3  ...    @ A  B C ...
0 <=> 00110000
A <=> 01000001
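Concretely, the path to a character in this tree is just its code point written as 8 bits. A small Python sketch (illustrative, not part of the original notes):

```python
# The "path" to a character in the ASCII tree is its code
# point written as 8 bits (0 = left edge, 1 = right edge).
def ascii_bits(ch: str) -> str:
    return format(ord(ch), "08b")

print(ascii_bits("0"))  # 00110000
print(ascii_bits("A"))  # 01000001
```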
Because every character is a leaf, no code word is a prefix of another; we call this a prefix code.
For tree T it takes depth[T](x) bits to represent x.
Given some text, compute the number of occurrences of each letter x. Call this freq(x).
PGSS is exhausting but exhilarating.
  _ 4   i 4   t 3   a 3   s 2   n 2   h 2   g 2   x 2   e 2   u 2   S 2
  r 1   l 1   b 1   P 1   G 1   . 1
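These counts are easy to check mechanically; a quick Python sketch (writing _ for the space, as above):

```python
from collections import Counter

# Frequency of each character in the example string.
text = "PGSS_is_exhausting_but_exhilarating."
freq = Counter(text)

print(freq["_"], freq["i"], freq["t"], freq["s"], freq["S"])  # 4 4 3 2 2
```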
A prefix code tree T takes

  bits[T] = sum over characters c of freq(c) * depth[T](c)

bits to represent the string.
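In code this cost is a one-line sum. A sketch (the dictionary representation is mine, not from the notes):

```python
# bits[T]: total bits to encode the text, given each character's
# frequency and its depth (= code length) in the tree T.
def bits(freq: dict, depth: dict) -> int:
    return sum(freq[c] * depth[c] for c in freq)

# Toy example: 'a' (freq 3) at depth 1, 'b' and 'c' (freq 1) at depth 2.
print(bits({"a": 3, "b": 1, "c": 1}, {"a": 1, "b": 2, "c": 2}))  # 7
```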
The Huffman encoding finds the optimal T.
Start with a forest of one-node trees, one per character, each weighted by its frequency.
  _  i  t  a  x  u  s  n  h  g  e  S  r  l  b  P  G  .
  4  4  3  3  2  2  2  2  2  2  2  2  1  1  1  1  1  1
Combine the two lightest trees.
                                      /\
  _  i  t  a  x  u  s  n  h  g  e  S  G .  r  l  b  P
  4  4  3  3  2  2  2  2  2  2  2  2   2   1  1  1  1
Repeat until one tree left.
                                      /\   /\   /\
  _  i  t  a  x  u  s  n  h  g  e  S  G .  b P  r l
  4  4  3  3  2  2  2  2  2  2  2  2   2    2    2
            /\            /\
           /  \          /  \
          /\   /\       /\   S    /\   /\   /\
  _   i   b P  r l      G .       g e  n h  u s   t  a  x
  4   4       4            4       4    4    4    3  3  2
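The whole procedure is short when the forest is kept in a binary heap. This Python sketch (mine; the representation details are illustrative) runs it on the example string:

```python
import heapq
from collections import Counter

def huffman(freq):
    """Build a prefix code {char: bitstring} by repeatedly
    merging the two lightest trees, as in the trace above."""
    # Each heap entry: (weight, tie-breaker, {char: code-so-far}).
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # lightest tree
        f2, _, t2 = heapq.heappop(heap)  # second lightest
        merged = {c: "0" + code for c, code in t1.items()}
        merged.update({c: "1" + code for c, code in t2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

freq = Counter("PGSS_is_exhausting_but_exhilarating.")
code = huffman(freq)
total = sum(freq[c] * len(code[c]) for c in freq)
print(total)  # 146
```

The exact code words depend on how ties are broken, but every Huffman tree for these frequencies costs the same total number of bits.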
The final tree, writing each node's 0-child before its 1-child:

  ( ((((u s) t) (a x)) (_ i))   ((((b P) (r l)) (S (G .))) ((g e) (n h))) )

Reading off the path from the root to each leaf gives the code (for example, _ sits at 0-then-1-then-0, so _ = 010):

  _  010    i  011    t  0001   a  0010   x  0011   S  1010
  g  1100   e  1101   n  1110   h  1111   u  00000  s  00001
  b  10000  P  10001  r  10010  l  10011  G  10110  .  10111
  10001 10110 1010 1010 010 011 00001 010 1101 0011 1111 0010 00000
    P     G    S    S    _   i    s    _   e    x    h    a    u

  00001 0001 011 1110 1100 010 10000 00000 0001 010 1101 0011 1111 011
    s    t    i   n    g    _    b     u    t    _   e    x    h    i

  10011 0010 10010 0010 0001 011 1110 1100 10111
    l    a     r    a    t    i   n    g    .
The string takes 36 x 8 = 288 bits with ASCII, but only 146 bits with the Huffman encoding: a savings of nearly 50%!
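To confirm that such a table of code words really forms a prefix code, we can encode and then decode greedily. A Python sketch (the table follows one valid Huffman tree for these frequencies; other tie-breakings give different but equally good codes):

```python
# One valid Huffman code for the example string's frequencies.
code = {
    "_": "010", "i": "011", "t": "0001", "a": "0010", "x": "0011",
    "S": "1010", "g": "1100", "e": "1101", "n": "1110", "h": "1111",
    "u": "00000", "s": "00001", "b": "10000", "P": "10001",
    "r": "10010", "l": "10011", "G": "10110", ".": "10111",
}

def decode(bits, code):
    # Prefix property: the first code word that matches is the match.
    inv = {v: k for k, v in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ""
    return "".join(out)

text = "PGSS_is_exhausting_but_exhilarating."
bits = "".join(code[c] for c in text)
print(len(bits))                   # 146
print(decode(bits, code) == text)  # True
```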
Lemma: If x and y occur least frequently, some optimal tree has x and y as siblings.
Let T be an optimal tree, and say a and b are siblings at the greatest depth in T.
               *
             /   \
           ...   ...
           / \      \
          x   y      *
                    / \
                   a   b
Then freq(x) <= freq(a) and depth[T](x) <= depth[T](a). Let T' be T
with a and x swapped. We show that T' is at least as good as T.
  bits[T] - bits[T'] = freq(a) (depth[T](a) - depth[T](x))
                       + freq(x) (depth[T](x) - depth[T](a))
                     = (freq(a) - freq(x)) (depth[T](a) - depth[T](x))
                     >= 0
So bits[T'] <= bits[T].
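The factored form makes the sign easy to check with toy numbers (the specific values here are mine, purely for illustration):

```python
# Hypothetical values with freq(x) <= freq(a) and depth(x) <= depth(a).
freq_a, freq_x = 5, 1
d_a, d_x = 4, 2

lhs = freq_a * (d_a - d_x) + freq_x * (d_x - d_a)
rhs = (freq_a - freq_x) * (d_a - d_x)
print(lhs, rhs)  # 8 8 -- equal, and nonnegative as claimed
```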
By the same reasoning we can swap y and b to make x and y siblings.
Lemma: Say T is optimal for alphabet A. Choose siblings x and y in T with parent z. Let T' be T with x and y removed, so that z becomes a leaf. Let freq(z) = freq(x) + freq(y). Then T' is optimal for A-{x,y}+{z}.
  bits[T'] = bits[T] - freq(x) depth[T](x) - freq(y) depth[T](y)
             + freq(z) depth[T](z)
           = bits[T] - freq(z) depth[T](x) + freq(z) depth[T](z)
           = bits[T] - freq(z)

since depth[T](x) = depth[T](y) and depth[T](z) = depth[T](x) - 1.
Say U' is better than T' for A-{x,y}+{z}. Replace the leaf z in U' with an internal node whose children are x and y to get U. Then

  bits[U] = bits[U'] + freq(z) < bits[T'] + freq(z) = bits[T]

(The first step follows from the same relationship between bits[U] and bits[U'] as between bits[T] and bits[T'].) But bits[U] < bits[T] contradicts the optimality of T, so U' cannot exist, and T' is optimal.