15-451 Algorithms 9/28/00
* Hashing
- dictionary problem
- hashing with random inputs
- Universal hashing
- Perfect hashing (next time)
-----------------------------------------------------------------------------
THE DICTIONARY PROBLEM:
- static version: Given set S of (key,object) pairs (e.g., key=word,
object=definition). Want to store S so that we can do lookups
efficiently.
- dynamic: sequence of insert(key,object), find(key), and, perhaps,
delete(key) requests. Want to do these all efficiently.
Pretty much all the algorithms and data structures we've been talking
about so far are motivated at least in part by these problems. E.g.,
one reason to do sorting is that a sorted array is a good way of
handling a static dictionary, since then can do binary search. Sorted
arrays not so good for dynamic - but can use B-trees or Splay Trees
or Tries.
Hashing is another way to do it - very practical and easy to
implement. You probably implemented a hash table in 211. But hashing
is a little mysterious too. Our focus will be on the theoretical
underpinnings, and we'll draw on the 2-player game way of thinking
from hwk2.
BASIC IDEA OF HASHING
- define a hash function h that maps keys into numbers from 0...M-1.
- to store our dictionary, we will allocate an array A of size M and use
the power of indirect addressing. We'll store (key, object)
in A[h(key)]. To handle collisions (h(key) = h(key')), we'll make
each entry of A be a linked list, and just put at top of list.
- searches are easy: compute index i=h(key) and then walk down list at
A[i] until we find it. Deletes are easy too.
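The scheme above can be sketched in a few lines of Python (a minimal
illustration I'm adding, not code from the lecture; the class and
method names are mine, and Python's built-in hash reduced mod M stands
in for whatever hash function h we end up choosing):

```python
class ChainedHashTable:
    """Dictionary via hashing with chaining: A[h(key)] holds a list of pairs."""

    def __init__(self, M):
        self.M = M
        self.A = [[] for _ in range(M)]   # M buckets, each a linked list (here: Python list)

    def h(self, key):
        # stand-in hash function; any h: U -> {0,...,M-1} works here
        return hash(key) % self.M

    def insert(self, key, obj):
        # constant-time insert into the bucket (order within a bucket doesn't matter)
        self.A[self.h(key)].append((key, obj))

    def find(self, key):
        # compute i = h(key), then walk down the list at A[i]
        for k, obj in self.A[self.h(key)]:
            if k == key:
                return obj
        return None

    def delete(self, key):
        bucket = self.A[self.h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
```

Each operation costs the time to compute h plus the length of one
bucket's list, which is the quantity the analysis below bounds.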
Ideal setting: M is O(size of dictionary), all buckets (lists) are of
constant size, and h takes constant amount of time to compute. In
that case, what is time for insert/search/delete?
To analyze hashing schemes, make a few definitions.
- Let U be the "universe" of possible keys (e.g., all strings of fewer
  than 50 characters)
- So, h is a function from U to {0,...,M-1}.
- Some much smaller subset S of U that we care about. (S may be
static or changing with time). Use N = |S|.
- Let T_h = time to compute h. Time to do a lookup for key k is
T_h + O(length of list at h(k)). insert just takes time T_h + O(1).
We'll be assuming T_h is O(1), but it's good to keep in the back of our
minds that the time to compute h is an issue.
- basic intuition: really bad hash function: h maps everything in S
to the same place. Really good hash function: S gets evenly spread
through range. Almost as good: h spreads elements randomly --- but,
we can't just use a random number generator to decide where the next
thing goes because need h to be able to do lookups: h(k) has to
be the same when we lookup as it was when we inserted.
WORST-CASE BEHAVIOR (I)
- Claim: for any hash function h, if |U| >= N*M, there exists a set
  S of size N whose elements all hash to the same location.
- Proof: by the pigeonhole principle -- h maps |U| >= N*M keys into
  only M locations, so some location receives at least N keys; let S
  be N of them.
- So, this is partly why hashing seems so mysterious -- how can you
claim hashing is good if for any hash function you can come up with ways of
foiling it? Any thoughts?
If we think in the 2-player game view, we can imagine different hash
functions as rows and different sets S as columns, and we are saying
that for any deterministic strategy of the row player, there is a
column that kills it. So, one way around is to assume the column is
picked randomly.
Another is to allow randomized strategy for row player. E.g., maybe
can pick hash function at random from some set and argue it has good
expected behavior for any column. Will get back to that in a bit.
Start with easier case of random inputs.
WHAT IF KEYS ARE RANDOM?
U = all u-bit strings, M = 2^m. Keys chosen at random in U. What
would be a good hash function?
-- can just use: h(x) = last m-bits of x. Typical
in systems applications where keys are random looking and maybe m=8.
Say have N keys total (chosen at random). If we fix one of them, what
is expected number of other elements that collide with it? Basically
throwing N balls at random into M boxes, and asking how many collide
with first ball. Expected number is (N-1)/M. So, if N=M, then for any
given element, expected length of list is < 2.
So, just need linear size table and average time is constant.
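The (N-1)/M calculation is easy to sanity-check by simulation (a
sketch I'm adding, not part of the notes; the function name is mine):

```python
import random

def avg_collisions_with_first(N, M, trials=2000):
    """Throw N balls at random into M boxes; return the average number
    of other balls landing in the first ball's box over many trials.
    The expectation is (N-1)/M."""
    total = 0
    for _ in range(trials):
        boxes = [random.randrange(M) for _ in range(N)]
        total += sum(1 for b in boxes[1:] if b == boxes[0])
    return total / trials

# With N = M, the expected number of collisions with a fixed element is
# (N-1)/M < 1, so the expected list length (counting the element) is < 2.
```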
WHAT IF S ISN'T RANDOM? Idea is to use some randomness in choice of
h. E.g., maybe we can find a set H of h's such that no matter what S
is, *most* h in H behave well. This is idea of universal hashing.
UNIVERSAL HASHING
-----------------
Definition: a set H is a universal family of hash functions if for all
x != y in U,
       Pr_{h<-H} [ h(x) = h(y) ] <= 1/M
Theorem: if H is universal, then for any set S in U of size N, and
any x in S, the *expected* number of collisions x has (where this is
over the choice of h) is <= (N-1)/M
Proof:
-- For each y!=x in S, let C_xy = 1 if x and y collide and 0 otherwise.
E[C_xy] = Pr(x and y collide) <= 1/M
-- C_x = total # collisions for x = sum_y C_xy.
-- E[C_x] = sum_y E[C_xy] <= (N-1)/M
Corollary: say N=M, and time to compute h is constant. Then for any
set S of size N, and any sequence of L lookups, the expected total
time for the lookups if we use a random h in H is O(L) -- so constant
time per operation.
Proof: say lookup x_1 then x_2, ..., then x_L (some of these could be
the same). Then the expected total time is just the sum of expected
time for each one, and each has constant expected time by our theorem.
So, this is giving us a good randomized strategy in the 2-player game
sense.
Example of universal family
|U| = 6, M=2
x : 0 1 2 3 4 5
------------------------
h_1(x): 0 1 0 1 0 1 (what's a bad S for h_1?)
h_2(x): 0 0 0 1 1 1 (what's a bad S for h_2?)
[is this universal yet? no. problems: (x=0, y=2), (x=3, y=5)]
h_3(x): 0 0 1 0 1 1
h_4(x): 1 0 0 1 1 0
[Now? Yes. Easiest way to see is notice that above pairs are
fine now. And problem pairs on the bottom are (0,3) and (2,5) which
   are fine on top. Some pairs are even better than 1/M, like (2,3),(1,4),(0,5)]
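Since U and H are tiny here, we can verify universality by brute
force, checking that every pair x != y collides under at most |H|/M of
the functions (a check I'm adding; the function name is mine):

```python
from itertools import combinations

# The four hash functions from the table above, as tuples indexed by x.
H = [
    (0, 1, 0, 1, 0, 1),  # h_1
    (0, 0, 0, 1, 1, 1),  # h_2
    (0, 0, 1, 0, 1, 1),  # h_3
    (1, 0, 0, 1, 1, 0),  # h_4
]
M = 2

def is_universal(H, U_size, M):
    # universal <=> each pair x != y collides under at most |H|/M functions,
    # i.e., Pr over a random h in H of a collision is at most 1/M
    for x, y in combinations(range(U_size), 2):
        collisions = sum(1 for h in H if h[x] == h[y])
        if collisions > len(H) / M:
            return False
    return True
```

Running the check with just h_1 and h_2 fails (e.g., x=0, y=2 collide
under both), while the full four-function family passes.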
HOW TO CONSTRUCT: Matrix method
-----------------
Say keys are u-bits long. Say table size is power of 2, so an index is
m-bits long with M = 2^m.
H = { m x u 0-1 matrices}. h(x) = hx (h is a matrix, x is a vector),
where we do addition mod 2.
(these matrices are short and fat)
Claim: For x!=y, Pr[h(x) = h(y)] = 1/M = 1/2^m.
Proof: First of all, what does it mean to multiply h by x? E.g.,
     h          x      h(x)
  ---------     -      ---
  [1 0 0 0]     1       1
  [0 1 1 1]     0   =   1
  [1 1 1 0]     1       0
                0
Can think of as adding some of the columns of h (mod 2) where the 1
bits in x tell you which ones to add. (e.g., we added the 1st and 3rd
columns of h above)
Now, take arbitrary x!=y. Must differ someplace, so say they differ
in ith coordinate and for concreteness say x_i=0 and y_i=1.
Imagine we first choose all of h except the ith column. Since x_i = 0,
the ith column is not used in computing h(x), so h(x) is already fixed.
But each of the 2^m different settings of the ith column gives a
different value of h(y) [every time we flip a bit in that column, we
flip the corresponding bit in h(y), since y_i = 1.]
So there is exactly a 1/2^m chance that h(x) = h(y).
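To make the construction concrete, here is a small Python sketch (the
function name is mine, not from the notes): pick a random m x u 0-1
matrix, representing each row as a u-bit integer, and compute
h(x) = hx with arithmetic mod 2.

```python
import random

def random_matrix_hash(u, m):
    """Pick h uniformly from H = {m x u 0-1 matrices}; return h as a
    function on u-bit integers, computing h(x) = hx over GF(2)."""
    rows = [random.getrandbits(u) for _ in range(m)]  # each row: u random bits
    def h(x):
        out = 0
        for row in rows:
            # output bit = (row . x) mod 2 = parity of the bits of (row AND x)
            out = (out << 1) | (bin(row & x).count("1") & 1)
        return out
    return h
```

One nice consequence of the matrix view: h is linear over GF(2), so
h(x XOR y) = h(x) XOR h(y), and h(0) = 0 for every h in the family.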
Here's another universal family that's easier to deal with computationally:
Let U = {0,...,u-1} and suppose the table has size M.
Pick a prime p >= u (or, think of just rounding u up to the nearest
prime). Define
   h_{a,b}(x) = ( (a*x + b) mod p ) mod M.
H = { h_{a,b} | a in {1,...,p-1}, b in {0,...,p-1} }
Claim: H is a universal family too.
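A Python sketch of drawing a random member of this family (the
function name is mine; p is assumed to be a prime >= u, which the code
does not check):

```python
import random

def random_affine_hash(p, M):
    """Pick h_{a,b}(x) = ((a*x + b) mod p) mod M at random, with
    a in {1,...,p-1} and b in {0,...,p-1}.  p must be a prime >= |U|."""
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % M
```

This family is cheaper to evaluate than the matrix method: one
multiplication, one addition, and two mods per key, versus m dot
products mod 2.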