15-451 Algorithms 10/03/00
* Universal hashing, cont
* Perfect hashing
----------------------------------------------------------------------------
Hashing recap from last time:
General picture: there's a large universe U of possible keys (like all
possible strings of at most 50 chars) and some set S contained in U of
keys we actually care about (which may be static or dynamic), with
N = |S|. We do inserts and lookups by having an array A of size M and
a hash function h:U -> {0,...,M-1}, resolving collisions by making
each entry of A a linked list. So, we want as few collisions as
possible.
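As a concrete picture, here is a minimal chained hash table sketched in
Python. The class name is mine, and Python's built-in hash stands in for
the function h; this is an illustration, not a serious implementation.

```python
class ChainedHashTable:
    """Array A of M buckets; each bucket is a list (our 'linked list')."""

    def __init__(self, M):
        self.M = M
        self.A = [[] for _ in range(M)]

    def h(self, key):
        # stand-in for h: U -> {0,...,M-1}
        return hash(key) % self.M

    def insert(self, key):
        bucket = self.A[self.h(key)]
        if key not in bucket:          # avoid storing duplicates
            bucket.append(key)

    def lookup(self, key):
        # cost = length of this bucket's list, hence we want few collisions
        return key in self.A[self.h(key)]
```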
Last time we saw:
1. For any given function h, there exist bad sets S (so long as U is
large) whose elements all hash to the same place.
2. If S is *random*, then simple h's turn this into the problem of
tossing N balls at random into M buckets.
-> for any given x, expected number that collide with x is (N-1)/M.
-> if N=M, this is < 1.
-> if N=M, w.h.p., the size of the LONGEST list is O(log N/loglogN)
Proof of this last fact:
Let's fix a bucket like bucket 1. Let's also fix k balls.
What's the chance those k all fall into bucket 1? Ans: 1/M^k.
Claim: the chance that bucket #1 gets >= k balls is at most
(N choose k)/M^k. This is a union bound: for >= k balls to fall
into bucket #1, at least one of these (N choose k) events must
occur. For N=M, this is <= 1/k! (since (N choose k) <= N^k/k!).
Suppose we set k so that this is 0.01/M. Then, by another union
bound over the M buckets, the chance that *any* bucket gets k or
more balls is at most 0.01. This happens at k = O(log N/log log
N). [And you can work out the constants using Stirling's formula].
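The balls-in-buckets picture is easy to check empirically. The sketch
below (names and constants are mine) tosses N balls into N buckets with
a seeded RNG and reports the longest list; for N = 10^4 the bound
log N/loglog N is about 4, and typical max loads land in the 5-7 range.

```python
import math
import random

def max_load(N, rng):
    """Toss N balls into N buckets at random; return the fullest bucket."""
    counts = [0] * N
    for _ in range(N):
        counts[rng.randrange(N)] += 1
    return max(counts)

rng = random.Random(0)
N = 10000
m = max_load(N, rng)
bound = math.log(N) / math.log(math.log(N))   # about 4.2 for N = 10^4
```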
3. Universal hashing:
A set H is a universal family of hash functions if for all x != y in U,
    Pr_{h<-H} [ h(x) = h(y) ] <= 1/M
Theorem: if H is universal, then for any S, any x in S,
expected # of collisions with x [for h random from H] is <= (N-1)/M.
So, this is like a good randomized strategy for the row-player.
[But, our argument that largest bucket has size O(log N/loglogN)
breaks down because we can't necessarily make the same claims about
k-tuples for k>2. E.g., can't say that Pr(h(x)=h(y)=h(z)) <= 1/M^2]
Question: if we fix the set S, can we find hash function h such that
*all* lookups are constant-time? Yes. This leads to...
PERFECT HASHING
---------------
Hash function is "perfect" for S if all lookups involve O(1) work.
(Usually defined as "no collisions" but that's a little confusing in
2-level scheme)
METHOD #1
---------
Suppose a table of size M = O(|S|^2) is OK, i.e., we are willing to
have a table whose size is quadratic in the size of our dictionary S.
Then here is an easy method: let H be universal and M = |S|^2, pick a
random h in H, and hash everything in S.
Claim: Pr(no collisions) >= 1/2.
[So, we just try it, and if we got any collisions, we just try again]
Proof:
- How many pairs (x,y) in S are there? (|S| choose 2).
- For each pair, the chance they collide is <= 1/M by defn of universal.
- So, Pr(exists a collision) <= (|S| choose 2)/M
                              = |S|(|S|-1)/(2|S|^2) < 1/2.
This is like the "other side" to the "birthday paradox". If the
number of days is *more* than (# people)^2, then there is a reasonable
chance *no* pair has the same birthday.
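A sketch of Method #1 in Python, using the family h_{a,b}(x) =
((a*x + b) mod p) mod M described at the end of these notes. Integer
keys are assumed, the prime p = 10^9 + 7 is an arbitrary choice of
prime >= U, and the function name is mine.

```python
import random

P = 10**9 + 7   # some prime >= universe size; keys assumed to be ints < P

def perfect_hash_quadratic(S, rng):
    """Pick h from {h_{a,b}} with M = |S|^2; retry until no collisions.
    Each try succeeds with probability >= 1/2, so E[# tries] <= 2."""
    M = len(S) ** 2
    while True:
        a = rng.randrange(1, P)
        b = rng.randrange(P)
        h = lambda x, a=a, b=b: ((a * x + b) % P) % M
        slots = {}
        # setdefault returns the previous occupant if the slot is taken,
        # so this is True iff no two distinct keys share a slot
        if all(slots.setdefault(h(x), x) == x for x in S):
            return h, M
```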
What if we want to use just O(|S|) space?
METHOD #2:
---------
History: This was a big open question -- posed as "should tables be
sorted?" -- for a fixed set, can you get constant lookup time with
only linear space? The question was followed by more and more
complicated attempts, and finally solved in the mid-80s by Fredman,
Komlos, and Szemeredi using the nice idea of universal hash functions
in a 2-level scheme. One thing to mention: a lot of the things we talk
about were big research results in the 70s and 80s, now digested into
course topics. This is not the sort of thing people have known for
ages -- a lot of smart people struggled over understanding how to do
these things...
Proposal: hash into table of size |S|. Will get some collisions.
Then, for each bin, rehash it using Method #1, squaring the size of
the bin to get zero collisions.
E.g., if 1st hash function maps {a,b,c,d,e,f} to
[--,{a}, {b,c}, {d,e,f}],
then final result would look something like:
[--,h_2, h_3, h_4]
h_2 would just be the function h_2(x) = 0. h_3(x) would be a function
to table of size 4, h_4 would be function to table of size 9. Point
is: using Method #1, we can find h_i with no collisions by picking
from universal H and trying it -- if it doesn't work we try again
(each time, Pr(success)>= 1/2).
So, total space used is O(|S|) for the 1st table (assume it takes a
constant amount of space to store each hash function), plus O(sum of
|B_i|^2) for the other tables. So, all we need to do is prove the
following theorem:
Theorem: if we pick the initial h from a universal family H, then
    Pr[sum_i |B_i|^2 > 4*|S|] < 1/2.
Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*|S|.
(Why does this give us the result we need? Markov's inequality:
Pr[X > 4|S|] <= E[X]/(4|S|) < 2|S|/(4|S|) = 1/2.)
Now, the neat trick is that one way to count this quantity is to
count the number of ordered pairs that collide, including element
colliding with itself. E.g., if a bucket has {d,e,f}, then d collides
with each of d,e,f; e collides with each of d,e,f; f collides with
each of d,e,f; so we get 9 = 3^2.
So,
sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
= sum_x sum_y C(x,y) [C(x,y)=1 if in same bucket, else 0]
So,
E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in same bucket)
<= |S| + |S|(|S|-1) * (1/M), where the "1/M" comes from
definition of universal.
< 2*|S|. (since we set M = |S|)
So, the final "hash function" is:
1. Compute i = h(x).
2. Compute h_i(x) and use it to look up x in table T_i.
Total space used is O(sum_i |B_i|^2) = O(|S|).
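The whole 2-level scheme can be sketched in Python. Again the family
h_{a,b}(x) = ((a*x + b) mod p) mod M from the end of these notes is
used at both levels, integer keys are assumed, and all names are mine;
empty buckets get a trivial size-1 table.

```python
import random

P = 10**9 + 7   # some prime >= universe size; keys assumed to be ints < P

def make_h(M, rng):
    """Random member of the universal family h_{a,b}, mapping into {0..M-1}."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % M

def fks_build(S, rng):
    n = len(S)
    while True:                                # level 1: retry until
        h = make_h(n, rng)                     # sum |B_i|^2 <= 4|S|
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    tables = []
    for b in buckets:                          # level 2: Method #1 per bucket
        m2 = max(len(b) ** 2, 1)
        while True:
            h2 = make_h(m2, rng)
            slots = [None] * m2
            ok = True
            for x in b:
                if slots[h2(x)] is not None:   # collision: retry this bucket
                    ok = False
                    break
                slots[h2(x)] = x
            if ok:
                tables.append((h2, slots))
                break
    return h, tables

def fks_lookup(key, h, tables):
    h2, slots = tables[h(key)]
    return slots[h2(key)] == key
```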
============================================================================
If there's time, here is another way of achieving universal hashing.
Let's assume keys are in {0,1,...,U-1} (so here U stands for the size
of the universe). We hash down to {0,1,...,M-1}.
Pick prime p >= U (or, think of just rounding U up to nearest prime). Define
h_{a,b}(x) = ( (a*x + b) mod p ) mod M.
H = {h_ab | a in {1,...,p-1}, b in {0,...,p-1}}
Claim: H is a universal family.
Proof idea: the inner part will spread things out nicely in the range
0..p-1. In particular, given x!= y, (1) x is equally likely to go
anywhere, and (2) given where x goes, y is equally likely to go to
each of the p-1 places x didn't go (there won't be any collisions at
the mod p level). Then, this allows us to show things will be nice
when we take the results mod M.
Actual Proof: Fixing x!=y and two locations r and s, let's look at:
Pr[[(ax + b) = r (mod p)] AND [(ay + b) = s (mod p)]].
-> this means that ax-ay = r-s (mod p), so a = (r-s)/(x-y) mod p, which
has exactly one solution (mod p) since Z_p* is a field.
[We needed p >= U to ensure that x!=y mod p]
-> If r=s, then this means we need a=0 which wasn't allowed. So Pr=0.
-> If r!=s, then there is a 1/(p-1) chance that a has the right
value. Given this value of a, we need b = r-ax (mod p), and there is
a 1/p chance b gets this value, so the overall probability is
1/[p(p-1)].
Now, the probability x and y collide is equal to 1/[p(p-1)] times the
number of pairs r!=s in {0,...,p-1} such that r = s (mod M).
We have p choices for r, and then at most ceiling(p/M)-1 choices for
s ("-1" since we disallow s=r). Since ceiling(p/M) <= (p+M-1)/M, the
product is at most p(p-1)/M.
Putting this all together,
Pr[(a*x + b mod p) mod M = (a*y + b mod p) mod M]
<= p(p-1)/M * [1/(p(p-1))] = 1/M.
QED
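For small p, the universality claim can even be checked exhaustively:
enumerate every (a,b) and count how often a fixed pair x != y collides.
The choices p = 11, M = 4, x = 3, y = 7 below are arbitrary.

```python
p, M = 11, 4
x, y = 3, 7

def h(a, b, v):
    """h_{a,b}(v) = ((a*v + b) mod p) mod M."""
    return ((a * v + b) % p) % M

# count colliding hash functions over the whole family H
collisions = sum(1 for a in range(1, p) for b in range(p)
                 if h(a, b, x) == h(a, b, y))
total = (p - 1) * p                  # |H| = p(p-1) = 110
# Universality says collisions/total <= 1/M, i.e. collisions * M <= total.
```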