15-451 Algorithms                                              09/30/03
* Universal hashing, cont
* Perfect hashing                                       mini#3 due tonite
----------------------------------------------------------------------------
Hashing recap from last time:

General picture: There's a large universe U of possible keys (like all
possible strings of at most 50 chars).  Some set S in U of keys we actually
care about (which may be static or dynamic), N = |S|.  Do inserts and
lookups by having an array A of size M and a hash function
h: U -> {0,...,M-1}.  Resolve collisions by having each entry in A be a
linked list.

Last time we saw: For any given function h, there exist bad sets S (so long
as U is large) that all hash to the same place.  One idea to get around
this: can we find a randomized procedure for constructing h so that for any
set S, we have good expected performance (over our choice of h)?

H is a universal set of hash functions if for all x != y in U,

      Pr   [ h(x) = h(y) ]  <=  1/M
     h<-H

Theorem: if H is universal, then for any S and any x in S, the expected
number of collisions with x, if we pick a random h in H, is <= (N-1)/M.

One clarification: we're not actually going to explicitly construct all of
H and then pick one h in H at random.  Instead, we'll have a randomized
procedure for constructing h.  In fact, it probably would have been more
intuitive to define H as a *probability distribution* over hash functions,
and say H is a universal probability distribution over hash functions if
[rest of defn here].  But "universal sets" is the way it's done in the
literature.  The analogy is randomized quicksort: we can think of it as
picking an ordering on pivots at random from the set of all n! orderings,
but instead we find it more natural to just say: each time you need to pick
a pivot, pick one at random.

We ended last time by showing the matrix method for universal hashing, and
proving it is universal.  The hash function is a random b-by-u 0/1 matrix
(short and fat).  The key is u bits long, and h hashes it into something
only b bits long by h(x) = hx, where addition is mod 2.

            h          x      h(x)
        ---------     ---     ---
        [1 0 0 0]      1       1
        [0 1 1 1]      0   =   1
        [1 1 1 0]      1       0
                       0

This is then an index in the table (so M = 2^b).

Claim: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.

Proof: We can think of h(x) as adding some of the columns of h (mod 2),
where the 1 bits in x tell you which ones to add.  (E.g., we added the 1st
and 3rd columns of h above.)

Now, take arbitrary x != y.  They must differ someplace, so say they differ
in the ith coordinate, and for concreteness say x_i = 0 and y_i = 1.
Imagine we first choose all of h but the ith column.  Over the remaining
choices of the ith column, h(x) is fixed.  But each of the 2^b different
settings of the ith column gives a different value of h(y) [every time we
flip a bit in that column, we flip the corresponding bit in h(y)].  So
there is exactly a 1/2^b chance that h(x) = h(y).
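Here is a minimal sketch of the matrix method in code (the name
make_matrix_hash and the choice to store the matrix column by column as
integers are illustrative, not from the notes).  Each column of the random
matrix is a b-bit integer, and h(x) XORs together the columns selected by
the 1 bits of x, which is exactly multiplication by the matrix mod 2.

    import random

    def make_matrix_hash(u_bits, b_bits):
        """Pick a random b-by-u 0/1 matrix, stored as u random b-bit columns."""
        columns = [random.getrandbits(b_bits) for _ in range(u_bits)]

        def h(x):
            out = 0
            for i in range(u_bits):
                if (x >> i) & 1:        # if bit i of the key is 1,
                    out ^= columns[i]   # add column i (XOR = addition mod 2)
            return out                  # a b-bit index, i.e. a slot in a table of size M = 2^b
        return h

    # Usage: hash 32-bit keys into a table of size M = 2^10 = 1024.
    h = make_matrix_hash(u_bits=32, b_bits=10)
    print(h(12345), h(12346))           # two slots in range(1024)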
=====================================================================
Question: if we fix the set S, can we find a hash function h such that
*all* lookups are constant-time?  Yes.  This leads to...

PERFECT HASHING
---------------
A hash function is "perfect" for S if all lookups involve O(1) work.
N = |S|.

METHOD #1
---------
Say we are willing to have a table whose size is quadratic in the size of
our dictionary S.  Then here is an easy method.  Let H be universal and
M = N^2.  Pick a random h in H and hash everything in S.

Claim: Pr(no collisions) >= 1/2.  [So we just try it, and if we get any
collisions, we just try a new h.]

Proof:
 - How many pairs (x,y) in S are there?  {N \choose 2}.
 - For each pair, the chance they collide is <= 1/M by defn of universal.
 - So, Pr(exists a collision) <= {N \choose 2}/M < 1/2.

This is like the "other side" of the "birthday paradox".  If the number of
days is a lot *more* than (# people)^2, then there is a reasonable chance
that *no* pair has the same birthday.

What if we want to use just O(N) space?

METHOD #2
---------
History: This was a big open question -- posed as "should tables be
sorted?" -- for a fixed set, can you get constant lookup time with only
linear space?  It was followed by more and more complicated attempts, and
was finally solved using the nice idea of universal hash functions in a
2-level scheme.

Proposal: hash into a table of size N.  We will get some collisions.  Then,
for each bin, rehash it using Method #1, squaring the size of the bin to
get zero collisions.

E.g., if the 1st hash function maps {a,b,c,d,e,f} to
[--, {a}, {b,c}, {d,e,f}], then the final result would look something like
[--, h_2, h_3, h_4].  Here h_2 would just be the function h_2(x) = 0,
h_3(x) would be a function to a table of size 4, and h_4 would be a
function to a table of size 9.

The point is: using Method #1, we can find each h_i with no collisions by
picking from a universal H and trying it -- if it doesn't work, we try
again (each time, Pr(success) >= 1/2).

By *design* this is O(1) time for lookups, since we just compute two hash
functions.  The question is: how much space does it use?  The space used is
O(N) for the 1st table (assume it takes a constant amount of space to store
each hash function), plus O(sum of |B_i|^2) for the other tables (B_i is
the ith bucket).  So, all we need to do is prove:

Theorem: if we pick the initial h from a universal set H, then

   Pr[ sum of |B_i|^2  >  4*N ]  <  1/2.

Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*N.  (Why does
this give us the result we need?  <-- "Markov's inequality")

Now, the neat trick is that one way to count this quantity is to count the
number of ordered pairs that collide, including each element colliding with
itself.  E.g., if a bucket has {d,e,f}, then d collides with each of d,e,f;
e collides with each of d,e,f; f collides with each of d,e,f; so we get 9.
So,

   sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
                 = sum_x sum_y C(x,y)   [C(x,y)=1 if same bucket, else 0]

So,

   E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in the same bucket)
                    <= N + N(N-1)*(1/M)   (the "1/M" comes from the
                                           definition of universal)
                    < 2*N.                (since we set M = N)

So, the final "hash function" is:
 1. compute i = h(x)
 2. compute h_i(x) and use it to look up in table T_i.

Total space used is O(sum of |B_i|^2) = O(N).
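Below is a minimal sketch of the 2-level scheme in Python, assuming integer
keys.  As the universal family it uses the ((a*x + b) mod p) mod M family
described in the next section; the names (PerfectTable, rand_universal_hash,
the prime P) are illustrative choices, not part of the notes.

    import random

    P = (1 << 61) - 1   # a prime p >= U; assumes keys are ints in [0, P)

    def rand_universal_hash(m):
        """Pick h_{a,b}(x) = ((a*x + b) mod P) mod m at random from H."""
        a = random.randrange(1, P)
        b = random.randrange(0, P)
        return lambda x: ((a * x + b) % P) % m

    class PerfectTable:
        def __init__(self, keys):
            n = len(keys)
            # Level 1: hash S into n buckets, retrying until
            # sum_i |B_i|^2 <= 4n (each attempt succeeds w.p. > 1/2).
            while True:
                self.h = rand_universal_hash(n)
                buckets = [[] for _ in range(n)]
                for x in keys:
                    buckets[self.h(x)].append(x)
                if sum(len(b) ** 2 for b in buckets) <= 4 * n:
                    break
            # Level 2 (Method #1): rehash bucket B_i into a table of size
            # |B_i|^2, retrying until there are no collisions at all.
            self.tables = []
            for b in buckets:
                m = len(b) ** 2
                if m == 0:
                    self.tables.append((None, []))
                    continue
                while True:
                    h_i = rand_universal_hash(m)
                    slots = [None] * m
                    ok = True
                    for x in b:
                        j = h_i(x)
                        if slots[j] is not None:   # collision: try a fresh h_i
                            ok = False
                            break
                        slots[j] = x
                    if ok:
                        self.tables.append((h_i, slots))
                        break

        def contains(self, x):
            h_i, slots = self.tables[self.h(x)]    # two hash evaluations, O(1) work
            return h_i is not None and slots[h_i(x)] == x

    # Usage:
    t = PerfectTable([3, 10, 12, 451, 2003])
    print(t.contains(451), t.contains(7))          # True False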
============================================================================
Here's another method of doing universal hashing that requires less
computation.

The hashing scheme: Let's assume keys are in {0,1,...,U-1}, and we are
hashing down to {0,1,...,M-1}.  Pick a prime p >= U.  (Or, think of U as
prime.)  Define

   h_{a,b}(x) = ( (a*x + b) mod p ) mod M

   H = { h_{a,b} | a in {1,...,p-1}, b in {0,...,p-1} }

Claim: H is a 2-universal family.

Proof idea: the inner part will spread things out nicely in the range
0..p-1.  In particular, given x != y, (1) x is equally likely to go
anywhere, and (2) given where x goes, y is equally likely to go to each of
the p-1 places x didn't go (there won't be any collisions at the mod p
level).  This then allows us to show things will be nice when we take the
results mod M.

Actual proof: Fixing x != y and two locations r and s, let's look at

   Pr[ (ax + b = r mod p)  AND  (ay + b = s mod p) ].

 -> This means that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p,
    which has exactly one solution (mod p) since Z_p is a field.  [We
    needed p >= U to ensure that x != y mod p.]
 -> If r = s, then this means we need a = 0, which isn't allowed.  So Pr = 0.
 -> If r != s, then there is a 1/(p-1) chance that a has the right value.
    Given this value of a, we need b = r - ax (mod p), and there is a 1/p
    chance b gets this value, so the overall probability is 1/[p(p-1)].

Now, the probability that x and y collide is equal to 1/[p(p-1)] times the
number of pairs r != s in {0,...,p-1} such that r = s (mod M).  We have p
choices for r, and then at most ceiling(p/M)-1 choices for s ("-1" since we
disallow s = r).  The product is at most p(p-1)/M.

Putting this all together,

   Pr[ (a*x + b mod p) mod M  =  (a*y + b mod p) mod M ]
        <=  p(p-1)/M * 1/[p(p-1)]  =  1/M.                        QED

=======================================================================
Other uses of hashing
=====================

Suppose we have a long sequence of items and we want to see how many
*different* items are in the list.  What's a good way of doing that?

One way: create a hash table; do a lookup and then an insert if the item is
not in the table; count the number of inserts.

Now, what if the list is really huge, so we don't have space to store them
all, but we are OK with just an approximate answer?  E.g., imagine we're a
router watching a lot of packets go by, and we want to see (roughly) how
many different source IP addresses there are.

Neat idea: say we have a hash function h that behaves like a random
function, and let's think of h(x) as a real number between 0 and 1.  One
thing we can do is just keep track of the *minimum* hash value produced so
far (so we won't have a table at all).  E.g., if the keys are
3,10,3,3,12,10,12 and h(3)=0.4, h(10)=0.2, h(12)=0.7, then we get 0.2.

Here's why this might be useful: say there are N distinct items (e.g.,
N = 1 billion).  Then the chance the minimum is bigger than 1/N is
(1 - 1/N)^N = 1/e (approximately).  The chance the minimum is bigger than
2/N is about 1/e^2.  The chance it's less than 1/(4N) is at most 1/4.  So
if we do this with a few hash functions and see what the minima look like,
we can get a reasonable estimate of what N is, without having to store a
lot of information.

Question: why use a hash function rather than just picking a random number
each time?  That is because we care about the number of *different* items,
not just the total number of items (that problem is a lot easier: just keep
a counter...).
=======================================================
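Here is a minimal one-pass sketch of the minimum-hash-value estimator just
described.  SHA-256 stands in for an "ideal" random hash function, and
turning the minima into a guess for N via 1/(average minimum) - 1 is one
simple choice (since the expected minimum of N uniform [0,1) draws is
1/(N+1)); the function names are illustrative, not from the notes.

    import hashlib

    def unit_hash(item, salt):
        """Hash (salt, item) to a pseudo-random real in [0, 1)."""
        digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2 ** 64

    def estimate_distinct(stream, k=32):
        """One pass, tracking the minimum hash value for each of k hash functions."""
        minima = [1.0] * k
        for item in stream:
            for salt in range(k):
                v = unit_hash(item, salt)
                if v < minima[salt]:
                    minima[salt] = v
        avg_min = sum(minima) / k
        return 1.0 / avg_min - 1.0   # expected min of N uniform draws is 1/(N+1)

    # Usage: the stream 3,10,3,3,12,10,12 has 3 distinct items.
    print(round(estimate_distinct([3, 10, 3, 3, 12, 10, 12])))

Note that duplicates don't move any of the minima (h is deterministic),
which is exactly why a hash function works here where a fresh random number
per item would not.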