# Data Compression

PGSS Computer Science Core Slides

Normally we store data using a fixed-length code (such as ASCII). This is easy, but assigning shorter codes to more frequent characters uses space more efficiently.

We examine a compression algorithm called Huffman encoding.

### Prefix codes

We can regard ASCII as a tree.

```
                      *
                   0 / \ 1
                    /   \
                   *     *
                  : :   : :        (eight levels of 0/1 branches)
                 /\         /\
               0/  \1     0/  \1
               /\  /\     /\  /\

          ... 0 1  2 3 ... @ A  B C ...

0 <=> 00110000
A <=> 01000001
```

Because every character sits at a leaf, no character's code is a prefix of another's; such a code is called a prefix code.

For a tree T, it takes depth[T](x) bits to represent the character x.
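As a small sketch (a hypothetical three-symbol code, not the ASCII tree above), prefix-freeness is exactly what makes left-to-right decoding unambiguous:

```python
# Toy prefix code (hypothetical symbols): every codeword ends at a
# leaf, so no codeword is a prefix of another.
code = {"a": "0", "b": "10", "c": "11"}

def decode(bits, code):
    """Greedily match codewords left to right; prefix-freeness
    guarantees exactly one codeword matches at each step."""
    inverse = {w: s for s, w in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:          # a leaf was reached
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

encoded = "".join(code[s] for s in "abcab")
print(decode(encoded, code))        # recovers "abcab"
```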

### Letter frequency

Given some text, compute the number of occurrences of each letter x. Call this freq(x).

```
PGSS is exhausting but exhilarating.
```
```
_ 4    s 2    r 1
i 4    n 2    l 1
t 3    h 2    b 1
a 3    g 2    P 1
x 2    e 2    G 1
u 2    S 2    . 1
```
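The table can be reproduced in a few lines; this sketch uses Python's collections.Counter:

```python
from collections import Counter

text = "PGSS is exhausting but exhilarating."
freq = Counter(text)

print(freq[" "], freq["i"], freq["t"])   # 4 4 3, matching the table
print(sum(freq.values()))                # 36 characters in total
```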

A prefix code tree T takes

```
bits[T] = sum over characters c of freq(c) * depth[T](c)
```

bits to represent the string.

The Huffman encoding finds the optimal T.

### Building the Tree

Start with one single-node tree per character, weighted by freq.

```
_ i t a x u s n h g e S r l b P G .
4 4 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1
```

Combine the two lightest trees.

```
                          /\
_ i t a x u s n h g e S  G  .  r l b P
4 4 3 3 2 2 2 2 2 2 2 2   2    1 1 1 1
```

Repeat until one tree is left.

```
                          /\    /\    /\
_ i t a x u s n h g e S  G  .  b  P  r  l
4 4 3 3 2 2 2 2 2 2 2 2   2     2     2
```

Several merges later:

```
           /\               /\
          /  \             /  \
         /\    /\         S    /\      /\      /\      /\
 _   i  b  P  r  l            G  .    g  e    n  h    u  s    t   a   x
 4   4       4              4           4       4       4     3   3   2
```
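The merging loop sketches naturally as a binary heap. This is a minimal version using Python's heapq; it tracks only the total cost, since each merge of weights p and q adds p + q bits (every symbol below the new node gets one bit deeper). The tiebreak counter just keeps heap entries comparable.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_bits(freq):
    """Total encoded length sum(freq(c) * depth(c)) for an optimal tree."""
    tiebreak = count()
    heap = [(f, next(tiebreak)) for f in freq.values()]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        p, _ = heapq.heappop(heap)          # two lightest trees
        q, _ = heapq.heappop(heap)
        total += p + q                      # every symbol below gains a bit
        heapq.heappush(heap, (p + q, next(tiebreak)))
    return total

freq = Counter("PGSS is exhausting but exhilarating.")
print(huffman_bits(freq))                   # 146 bits for this string
```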

### Finished Tree

```
                      __________________/\__________________
                     /                                      \
               _____/\_____                           ______/\______
              /            \                         /              \
           __/\__           /\                   ___/\___            /\
          /      \         /  \                 /        \          /  \
         /\      /\       _    i              /\         /\        /\    /\
        /  \    /  \                         /  \       /  \      g  e  n  h
       /\    t  a    x                      /\    /\   S    /\
      u  s                                 b  P  r  l      G  .
```

(Left edges are 0, right edges are 1; for example S = 1010.)
```
10001 10110 1010 1010 010 011 00001 010 1101 0011 1111 0010 00000
  P     G    S    S    _   i    s    _   e    x    h    a    u

00001 0001 011 1110 1100 010 10000 00000 0001 010 1101 0011 1111 011
  s    t    i   n    g    _    b     u     t    _   e    x    h    i

10011 0010 10010 0010 0001 011 1110 1100 10111
  l    a     r     a    t    i   n    g    .
```
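As a sanity check, here is the full code table read off the tree (with S transcribed as 1010, the one slot the other codewords leave open). Sorting the codewords lets us verify prefix-freeness by comparing only adjacent pairs:

```python
# Codewords transcribed from the finished tree above.
code = {
    "u": "00000", "s": "00001", "t": "0001", "a": "0010", "x": "0011",
    " ": "010",   "i": "011",   "b": "10000", "P": "10001", "r": "10010",
    "l": "10011", "S": "1010",  "G": "10110", ".": "10111",
    "g": "1100",  "e": "1101",  "n": "1110",  "h": "1111",
}

text = "PGSS is exhausting but exhilarating."

# No codeword is a prefix of another: in sorted order, any prefix
# relation would show up between lexicographic neighbors.
words = sorted(code.values())
assert not any(b.startswith(a) for a, b in zip(words, words[1:]))

encoded = "".join(code[c] for c in text)
print(len(encoded))        # 146, versus 36 * 8 = 288 bits of ASCII
```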

The string takes 288 bits in ASCII (36 characters at 8 bits each), but only 146 bits with the Huffman encoding: a savings of about 49%!

### Optimality, Part I

Lemma: If x and y occur least frequently, then some optimal tree has x and y as siblings.

Say a and b are the deepest siblings in an optimal tree T.

```
           *
         /   \
        x     y      (x, y somewhere in T)
           :
           *
          / \
         a   b       (a, b the deepest siblings)
```
Then freq(x) <= freq(a) and depth[T](x) <= depth[T](a). Let T' be T with a and x swapped. We show that T' is at least as good as T.

```
bits[T] - bits[T'] = freq(a) (depth[T](a) - depth[T](x))
                   + freq(x) (depth[T](x) - depth[T](a))
                   = (freq(a) - freq(x)) (depth[T](a) - depth[T](x))
                   >= 0
```
So bits[T'] <= bits[T].

By the same reasoning we can swap y and b to make x and y siblings.
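A toy numeric instance of the swap (hypothetical frequencies and depths, not taken from the example string) shows the inequality in action:

```python
# Hypothetical numbers: x is least frequent, a is deepest.
freq = {"x": 1, "a": 5}
depth = {"x": 2, "a": 4}

before = freq["a"] * depth["a"] + freq["x"] * depth["x"]   # 22
# Swap the positions of a and x:
after = freq["a"] * depth["x"] + freq["x"] * depth["a"]    # 14
assert after <= before   # the swap never increases bits[T]
```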

### Optimality, Part II

Lemma: Say T is optimal for alphabet A. Choose siblings x and y in T with parent z. Let T' be T with the leaves x and y removed (so z becomes a leaf), and let freq(z) = freq(x) + freq(y). Then T' is optimal for A-{x,y}+{z}.

```
bits[T'] = bits[T] - freq(x) depth[T](x) - freq(y) depth[T](y)
                   + freq(z) depth[T](z)
         = bits[T] - freq(z) depth[T](x) + freq(z) depth[T](z)
         = bits[T] - freq(z)
```

since depth[T](x) = depth[T](y) = depth[T](z) + 1.

Say U' is better than T' for A-{x,y}+{z}. Replace z in U' with

``` /\
x  y
```
to get U. We see that

```
bits[U] = bits[U'] + freq(z) < bits[T'] + freq(z) = bits[T]
```

(The first step follows from the same relationship as between bits[T] and bits[T'].) But bits[U] < bits[T] contradicts the optimality of T, so no such U' exists, and T' is optimal.