15-451 Algorithms 10/04/01 * Universal hashing, cont * Perfect hashing ---------------------------------------------------------------------------- Hashing recap from last time: General picture: There's a large universe U of possible keys (like all possible strings of at most 50 chars). Some set S in U of keys we actually care about (which may be static or dynamic), N = |S|. Do inserts and lookups by having an array A of size M and a hash function h:U ->{0,...,M-1}. Resolve collisions by having each entry in A be a linked list. Last time we saw: For any given function h, there exist bad sets S (so long as U is large) that all hash to the same place. One idea to get around this: can we find a randomized procedure for constructing h so that for any set S, we have good expected performance (over our choice of h)? H is a universal set of hash functions if for all x != y in U, Pr [ h(x) = h(y) ] <= 1/M h<-H Theorem: if H is universal, then for any S, any x in S, expected # of collisions with x [for h random from H] is <= (N-1)/M. Matrix method for universal hashing ----------------------------------- Say keys are u-bits long. Say table size is power of 2, so an index is m-bits long with M = 2^m. H = { u x m 0-1 matrices}. h(x) = hx (h is a matrix, x is a vector), where we do addition mod 2. (these matrices are short and fat) Claim: For x!=y, Pr[h(x) = h(y)] = 1/M = 1/2^m. Proof: First of all, what does it mean to multiply h by x? E.g., h x h(x) --------- - --- [1 0 0 0] 1 1 [0 1 1 1] 0 = 1 [1 1 1 0] 1 0 0 Can think of as adding some of the columns of h (mod 2) where the 1 bits in x tell you which ones to add. (e.g., we added the 1st and 3rd columns of h above) Now, take arbitrary x!=y. Must differ someplace, so say they differ in ith coordinate and for concreteness say x_i=0 and y_i=1. Imagine we first choose all of h but the ith column. Over remaining choices of ith column, h(x) is fixed. But, each of the 2^m different settings of the ith column gives a different value of h(y) [every time we flip a bit in that column, we flip the corresponding bit in h(y).] So there is exactly a 1/2^m chance that h(x) = h(y). ===================================================================== Question: if we fix the set S, can we find hash function h such that *all* lookups are constant-time? Yes. This leads to... PERFECT HASHING --------------- Hash function is "perfect" for S if all lookups involve O(1) work. METHOD #1 --------- Say M = O(|S|^2) is OK. I.e., we are willing to have a table whose size is quadratic in the size of our dictionary S. Then, here is an easy method. Let H be universal and M = |S|^2. Pick random h in H and hash everything in S. Claim: Pr(no collisions) >= 1/2. [So, we just try it, and if we got any collisions, we just try again] Proof: - how many pairs (x,y) in S are there? {|S| \choose 2} - for each pair, the chance they collide is <= 1/M by defn of universal. - so, Pr(exists a collision) <= {|S| \choose 2}/M < 1/2. This is like the "other side" to the "birthday paradox". If the number of days is *more* than (# people)^2, then there is a reasonable chance *no* pair has the same birthday. What if we want to use just O(|S|) space? METHOD #2: --------- History: This was a big open question -- posed as "should tables be sorted" -- for a fixed set can you get constant lookup time with only linear space? Was followed by more and more complicated attempts, finally solved using nice idea of universal hash functions in 2-level scheme. Proposal: hash into table of size |S|. Will get some collisions. Then, for each bin, rehash it using Method #1, squaring the size of the bin to get zero collisions. E.g., if 1st hash function maps {a,b,c,d,e,f} to [--,{a}, {b,c}, {d,e,f}], then final result would look something like: [--,h_2, h_3, h_4] h_2 would just be the function h_2(x) = 0. h_3(x) would be a function to table of size 4, h_4 would be function to table of size 9. Point is: using Method #1, we can find h_i with no collisions by picking from universal H and trying it -- if it doesn't work we try again (each time, Pr(success)>= 1/2). So, total space used is O(|S|) for the 1st table (assume it takes a constant amount of space to store each hash function), plus O(sum of |B_i|^2) for the other tables. So, all we need to do is prove the following theorem: Theorem: if pick initial h from universal set H, then Pr[sum of |B_i|^2 > 4*|S|] < 1/2. Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*|S| (Why does this give us the result we need? <-- "Markov's inequality") Now, the neat trick is that one way to count this quantity is to count the number of ordered pairs that collide, including element colliding with itself. E.g, if bucket has {d,e,f}, then d collides with each of d,e,f; e collides with each of d,e,f; f collides with each of d,e,f; so get 9. So, sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y) = sum_x sum_y C(x,y) [C(x,y)=1 if in same bucket, else 0] So, E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in same bucket) <= |S| + |S|(|S|-1) * (1/M), where the "1/M" comes from definition of universal. < 2*|S|. (since we set M = |S|) So, final "hash function" is 1. compute i = h(x) 2. Now compute h_i(x) and use to lookup in table T_i. Total space used is O(sum of (B_i)^2) = O(|S|). ============================================================================ If there's time, here is a proof for the prime-number method of doing universal hashing. The hashing scheme: Let's assume keys in {0,1,...,U-1}. Hashing down to {0,1,...,M-1} Pick prime p >= U. Define h_{a,b}(x) = ( (a*x + b) mod p ) mod M. H = {h_ab | a in {1,...,p-1}, b in {0,...,p-1}} Claim: H is a 2-universal family. Proof idea: the inner part will spread things out nicely in the range 0..p-1. In particular, given x!= y, (1) x is equally likely to go anywhere, and (2) given where x goes, y is equally likely to go to each of the p-1 places x didn't go (there won't be any collisions at the mod p level). Then, this allows us to show things will be nice when we take the results mod M. Actual Proof: Fixing x!=y and two locations r and s, let's look at: Pr[[(ax + b) = r (mod p)] AND [(ay + b) = s (mod p)]]. -> this means that ax-ay = r-s (mod p), so a = (r-s)/(x-y) mod p, which has exactly one solution (mod p) since Z_p* is a field. [We needed p >= U to ensure that x!=y mod p] -> If r=s, then this means we need a=0 which wasn't allowed. So Pr=0. -> If r!=s, then there is a 1/(p-1) chance that a has the right value. Given this value of a, we need b = r-ax (mod p), and there is a 1/p chance b gets this value, so the overall probability is 1/[p(p-1)]. Now, the probability x and y collide is equal to 1/p(p-1) times the number of pairs r!=s in {0,...,p-1} such that r = s (mod M). We have p choices for r, and then at most ceiling(p/M)-1 choices for s ("-1" since we disallow s=r). The product is at most p(p-1)/M Putting this all together, Pr[(a*x + b mod p) mod M = (a*y + b mod p) mod M] <= p(p-1)/M * [1/(p(p-1))] = 1/M. QED =======================================================================