Data Compression

PGSS Computer Science Core Slides

Normally we store data using a fixed-length code (such as ASCII). This is simple, but assigning shorter codes to more frequent characters uses space more efficiently.

We examine a compression algorithm called Huffman encoding.

Prefix codes

We can regard ASCII as a tree.

             *
           0/ :
           /
          *______
        0/       \__________1__
        /                      \
       *                        *
      : \1                    0/ :
         \                    /
          *                   *
         : \1               0/ :
            \               /
             *              *
           0/ :           0/ :
           /              /
           *              *
         0/ :           0/ :
         /              /
        /\             /\
      0/  \1         0/  \1
      /    \         /    \
    0/\1  0/\1     0/\1  0/\1

... 0  1  2  3 ... @  A  B  C ...

0 <=> 00110000
A <=> 01000001
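In other words, a character's 8-bit ASCII value spells out its root-to-leaf path in the tree. A quick check in Python:

```python
# Print each character's 8-bit ASCII code word
# (0 = left branch, 1 = right branch in the tree above).
for ch in '0A':
    print(ch, format(ord(ch), '08b'))
```

This prints 0 00110000 and A 01000001, matching the codes above.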

Because every character sits at a leaf, no character's code is a prefix of another's; we call such a code a prefix code.

For tree T it takes depth[T](x) bits to represent x.
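Decoding a prefix code needs no lookahead: follow the bits down from the root and emit a character at each leaf. A minimal sketch, using a made-up three-letter tree (not the ASCII tree above):

```python
# Internal nodes are (left, right) pairs; leaves are characters.
# In this toy tree the codes are a = 0, b = 10, c = 11.
tree = ('a', ('b', 'c'))

def decode(tree, bits):
    out, node = [], tree
    for bit in bits:
        node = node[int(bit)]         # 0 = left child, 1 = right child
        if isinstance(node, str):     # reached a leaf: emit, restart at root
            out.append(node)
            node = tree
    return ''.join(out)

print(decode(tree, '010110'))  # abca
```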

Letter frequency

Given some text, compute the number of occurrences of each letter x. Call this freq(x).

           PGSS is exhausting but exhilarating.
   _ 4    s 2    r 1
   i 4    n 2    l 1
   t 3    h 2    b 1
   a 3    g 2    P 1
   x 2    e 2    G 1
   u 2    S 2    . 1
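These counts are easy to reproduce with Python's standard library:

```python
from collections import Counter

# Tally the occurrences of each character in the sample sentence.
freq = Counter("PGSS is exhausting but exhilarating.")

print(freq[' '], freq['i'], freq['t'], freq['.'])  # 4 4 3 1
```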

A prefix code tree T takes

          bits[T] = sum freq(c) depth[T](c)
                     c
bits to represent the string.
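The cost formula is a one-line sum. A sketch with toy frequencies and depths (not the table above):

```python
# bits[T] = sum over characters c of freq(c) * depth[T](c)
def bits(freq, depth):
    return sum(freq[c] * depth[c] for c in freq)

freq  = {'a': 4, 'b': 2, 'c': 1}
depth = {'a': 1, 'b': 2, 'c': 2}   # codes a = 0, b = 10, c = 11
print(bits(freq, depth))           # 4*1 + 2*2 + 1*2 = 10
```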

The Huffman encoding finds an optimal T.

Building the Tree

Start with many trees with weight according to freq.

_ i t a x u s n h g e S r l b P G .
4 4 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1

Combine the two lightest trees.

                         /\
_ i t a x u s n h g e S G  . r l b P
4 4 3 3 2 2 2 2 2 2 2 2  2   1 1 1 1

Repeat until one tree left.

                         /\   /\   /\
_ i t a x u s n h g e S G  . b  P r  l
4 4 3 3 2 2 2 2 2 2 2 2  2    2    2



        /\
       /  \
      /    \    /\
     /\    /\  S /\   /\   /\   /\
_ i b  P  r  l  G  . g  e n  h u  s t a x
4 4     4        4    4    4    4   3 3 2
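The repeated "combine the two lightest trees" step is exactly what a binary heap provides. A standard implementation sketch (its tie-breaking may differ from the slides, so individual code words can vary, but the total bit count cannot):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {character: code word} for a Huffman tree of text."""
    # One (weight, tiebreak, tree) entry per character; the tiebreak
    # integer keeps heapq from ever comparing two trees directly.
    heap = [(n, i, ch) for i, (ch, n) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:                    # combine the two lightest trees
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(node, prefix):                 # read code words off the tree
        if isinstance(node, str):
            codes[node] = prefix or '0'     # lone-character edge case
        else:
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
    walk(heap[0][2], '')
    return codes

text = "PGSS is exhausting but exhilarating."
codes = huffman_codes(text)
print(sum(len(codes[c]) for c in text))     # 146
```

Any Huffman tree for these frequencies costs the same 146 bits in total, even though ties can be broken differently along the way.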

Finished Tree

         /\_____________
        /              /\
       /\             /  \___
      / /\           /       \
     / _  i       __/_        \____
    /\           /    \       /    \
   /  \       __/_    /\     /\    /\
  /\   \     /    \  S /\   g  e  n  h
 /\ t  /\   /\    /\  G  .
u  s  a x  b  P  r  l
10001 10110 1010 1010 010 011 00001 010 1101 0011 1111 0010 00000
  P     G    S    S    _   i    s    _   e    x    h    a     u

00001 0001 011 1110 1100 010 10000 00000 0001 010 1101 0011 1111 011
  s    t    i   n    g    _    b     u    t    _   e    x    h    i

10011 0010 10010 0010 0001 011 1110 1100 10111
  l    a     r    a    t    i   n    g    .

The string takes 288 bits with ASCII (36 characters at 8 bits each), but only 146 bits with the Huffman encoding: a savings of nearly 50%!

Optimality, Part I

Lemma: If x and y occur least frequently, some optimal tree has x and y as siblings.

Let T be an optimal tree, and say a and b are the deepest siblings in T.

    *
   : :
  *   *
 /     \
x       y
    *
   / \
  a   b
Then freq(x) <= freq(a) and depth[T](x) <= depth[T](a). Let T' be T with a and x swapped. We show that T' is at least as good as T.

  bits[T] - bits[T'] = freq(a) (depth[T](a) - depth[T](x))
                        + freq(x) (depth[T](x) - depth[T](a))
                     = (freq(a) - freq(x)) (depth[T](a) - depth[T](x))
                     >= 0
So bits[T'] <= bits[T].

By the same reasoning we can then swap y and b, making x and y siblings in a tree that is still optimal.

Optimality, Part II

Lemma: Say T is optimal for alphabet A. Choose siblings x and y in T with parent z. Let T' be T without x and y. Let freq(z) = freq(x) + freq(y). Then T' is optimal for A-{x,y}+{z}.

  bits[T'] = bits[T] - freq(x) depth[T](x) - freq(y) depth[T](y)
               + freq(z) depth[T](z)
           = bits[T] - freq(z) depth[T](x) + freq(z) depth[T](z)
           = bits[T] - freq(z)

Say U' is better than T' for A-{x,y}+{z}. Replace z in U' with

 /\
x  y
to get U. We see that
bits[U] = bits[U'] + freq(z) < bits[T'] + freq(z) = bits[T]
(The first step follows from a relationship like the one between bits[T] and bits[T'].) But bits[U] < bits[T] contradicts the optimality of T, so no such U' exists, and T' is optimal.