15-451 Algorithms 10/02/01 * Hashing - basic idea - hashing with random inputs - Universal hashing - Perfect hashing ----------------------------------------------------------------------------- Hashing: a great practical tool, with interesting & subtle theory too. Basic setting: have a bunch of items we want to store for efficient lookup. This is often called the "dictionary problem". Two versions: - static: Given set S of (key,object) pairs. Want to store S so that we can do find(key) operations quickly. E.g., fixed dictionary (e.g, key=word, object=definition), or any sort of static lookup table. - dynamic: sequence of insert(key,object), find(key), and, perhaps, delete(key) requests. Want to do these all efficiently. E.g., memoizing, where no simple array structure. E.g., in AI planning, might have key = state, and object = correct action to take in that state (or "value function" in MDPs). FORMAL SETUP ------------ - Say keys come from some large universe U. (e.g, all < 50-character strings) - Some set S in U of keys we actually care about (which may be static or dynamic). N = |S|. - do inserts and lookups b having an array A of size M, and a HASH FUNCTION h: U -> {0,...,M-1}. Use power of indirect addressing to Store key k (and its associated object) in A[h(k)]. - One thing to worry about are collisions: h(k1) = h(k2). Simplest way to resolve collisions is to have each entry in A be a linked list. To insert, just put at top of list. If h is good, then hopefully lists will be small. This is called "separate chaining". Nice properties: searches are easy: compute index i=h(key) and then walk down list at A[i] until we find it. Inserts and deletes are easy too. Properties we want: - M is O(N) - Elements are spread out. Ideally all buckets (lists) of constant size. - h is fast (constant time) to compute. Let T_h = time to compute h. Time to do a lookup for key k is T_h + O(length of list at h(k)). insert just takes time T_h + O(1). We'll be assuming T_h is O(1) but good to keep in back of our heads that the time to compute h is an issue. Basic intuition: one way to spread things out nicely is to spread them randomly. Unfortunately, we can't just use a random number generator to decide where the next thing goes because need h to be able to do lookups: h(k) has to be the same when we lookup as it was when we inserted. So, we want something "pseudorandom". First give good news. Then bad news. Then good news. GOOD NEWS (I): WHAT IF KEYS ARE RANDOM? -------------------------------------- Then, property we want of h is just that roughly same number of elements of U hash to each location. E.g., a really bad functions is "h(x) = 17, no matter what x is". Good hash function: h(x) = x mod M. Or, if M = 2^m, and U = all strings of some length u, then can use h(x) = last m-bits of x. Typical in systems applications where keys are random looking and maybe m=8. In this case, can think of having N balls being tossed at random into M boxes. If we fix some x, what is the expected number of other elements in S (other balls) that collide with it? Ans: (N-1)/M. This is great, because if M=N, it means the for any given x, the expected length of x's list is < 2. Another interesting fact: if M=N, then the expected size of the LONGEST list is O(log N / log log N). I.e., we toss N balls into N buckets. This is the expected size of the fullest bucket. (maybe we will prove this next time). BAD NEWS: WORST-CASE BEHAVIOR IS BAD ------------------------------------ - Claim: for any hash function h, if |U| >= N*M, there exists a set S of size N that all hash to the same location. Actually only need |U| >= (N-1)*M + 1. - Proof: by the pigeon-hole principle. - So, this is partly why hashing seems so mysterious -- how can you claim hashing is good if for any hash function you can come up with ways of foiling it? One nice answer: universal hashing. Idea is a lot like in randomized quicksort. We'll use some random numbers in our construction of h. What we'll show is that for *any* set S (don't need to assume S is random), if we then pick h in this random way, things will be good in expectation. Once we develop this idea, we'll use it for a really nice practical goal: "perfect hashing". Given a fixed set S, a perfect hash function for S is one such that *every* lookup is constant time. We'll use universal hashing as a tool for constructing perfect hash functions. We'll do universal hashing in this class, and then perfect hashing on Thurs. UNIVERSAL HASHING ----------------- A set H is a universal family of hash functions if for all x != y in U, Pr [ h(x) = h(y) ] <= 1/M h<-H Theorem: if H is universal, then for any set S in U, any x in S, the *expected* number of collisions x has [for a random h in H] is <= (N-1)/M Proof: -- For each y!=x in S, let C_xy = 1 if x and y collide and 0 otherwise. E[C_xy] = Pr(x and y collide) <= 1/M -- C_x = total # collisions for x = sum_y C_xy. -- E[C_x] = sum_y E[C_xy] <= (N-1)/M Corollary: say the time to compute h is constant. Then for any sequence of L lookups, the expected total time for the lookups if we use a random h in H is O(LN/M) -- so constant time per operation if M=N. Proof: say lookup x_1 then x_2, ..., then x_L (some of these could be the same). Then the expected total time is just the sum of expected time for each one, and each has constant expected time by our theorem. Question: can we actually construct a universal hash family? If not, this this is all pretty vacuous. Answer is yes. First, let's do a small concrete example: |U| = 6, M=2 x : 0 1 2 3 4 5 ------------------------ h_1(x): 0 1 0 1 0 1 (what's a bad S for h_1?) h_2(x): 0 0 0 1 1 1 (what's a bad S for h_2?) [is this universal yet? no. problems: (x=0, y=2), (x=3, y=5)] h_3(x): 0 0 1 0 1 1 h_4(x): 1 0 0 1 1 0 [Now? Yes. Easiest way to see is notice that above pairs are fine now. And problem pairs on the bottom are (0,3) and (2,5) which are fine on top. Some pairs are better than 1/m, like (2,3),(1,4),(0,5)] [[Note: since time was running short, we skipped the matrix method and just looked at the prime-number method, without proof. We'll look at the matrix method next time.]] HOW TO CONSTRUCT: Matrix method ----------------- Say keys are u-bits long. Say table size is power of 2, so an index is m-bits long with M = 2^m. H = { u x m 0-1 matrices}. h(x) = hx (h is a matrix, x is a vector), where we do addition mod 2. (these matrices are short and fat) Claim: For x!=y, Pr[h(x) = h(y)] = 1/M = 1/2^m. Proof: First of all, what does it mean to multiply h by x? E.g., h x h(x) --------- - --- [1 0 0 0] 1 1 [0 1 1 1] 0 = 1 [1 1 1 0] 1 0 0 Can think of as adding some of the columns of h (mod 2) where the 1 bits in x tell you which ones to add. (e.g., we added the 1st and 3rd columns of h above) Now, take arbitrary x!=y. Must differ someplace, so say they differ in ith coordinate and for concreteness say x_i=0 and y_i=1. Imagine we first choose all of h but the ith column. Over remaining choices of ith column, h(x) is fixed. But, each of the 2^m different settings of the ith column gives a different value of h(y) [every time we flip a bit in that column, we flip the corresponding bit in h(y).] So there is exactly a 1/2^m chance that h(x) = h(y). HOW TO CONSTRUCT: prime-number method ------------------------------------- Let's assume keys in {0,1,...,U-1}. Hashing down to {0,1,...,M-1} Pick prime p >= U (or, think of just rounding U up to nearest prime). Define h_{a,b}(x) = ( (a*x + b) mod p ) mod M. H = {h_ab | a in {1,...,p-1}, b in {0,...,p-1}} Claim: H is a universal family. Proof idea: the inner part will spread things out nicely in the range 0..p-1. In particular, given x!= y, (1) x is equally likely to go anywhere, and (2) given where x goes, y is equally likely to go to each of the p-1 places x didn't go (there won't be any collisions at the mod p level). Then, this allows us to show things will be nice when we take the results mod M. Actual Proof: Fixing x!=y and two locations r and s, let's look at: Pr[[(ax + b) = r (mod p)] AND [(ay + b) = s (mod p)]]. -> this means that ax-ay = r-s (mod p), so a = (r-s)/(x-y) mod p, which has exactly one solution (mod p) since Z_p* is a field. [We needed p >= U to ensure that x!=y mod p] -> If r=s, then this means we need a=0 which wasn't allowed. So Pr=0. -> If r!=s, then there is a 1/(p-1) chance that a has the right value. Given this value of a, we need b = r-ax (mod p), and there is a 1/p chance b gets this value, so the overall probability is 1/[p(p-1)]. Now, the probability x and y collide is equal to 1/p(p-1) times the number of pairs r!=s in {0,...,p-1} such that r = s (mod M). We have p choices for r, and then at most ceiling(p/M)-1 choices for s ("-1" since we disallow s=r). The product is at most p(p-1)/M Putting this all together, Pr[(a*x + b mod p) mod M = (a*y + b mod p) mod M] <= p(p-1)/M * [1/(p(p-1))] = 1/M. QED