15-451 Algorithms                                                   09/30/04

* Hashing
  - basic setup and idea
  - Universal hashing
  - Perfect hashing
-----------------------------------------------------------------------------
Hashing: a great practical tool, with interesting & subtle theory too.

Basic setting: the dictionary problem, just like we've been discussing so
far.  We have a bunch of items we want to store for efficient lookup.  Two
versions:
 - static: Given a set S of items, we want to store them so that we can do
   lookups quickly.  E.g., a fixed dictionary.
 - dynamic: a sequence of insert, lookup, and perhaps delete requests.  We
   want to do these all efficiently.

For the first problem we could use a sorted array with binary search for
lookups.  For the second we could use a balanced search tree.  But hashing
gives an alternative approach that is often the fastest and most convenient
way of solving these problems.

Example: say you are writing an AI-search program, and you want to store
situations that you've already solved (board positions or elements of
state-space) so that you don't redo the same computation when you encounter
them again.  Hashing also has a lot of other uses in cryptography, networks,
and complexity theory.

FORMAL SETUP
------------
- Keys come from some large universe U.  (E.g., all strings of at most 50
  characters.)
- There is some set S in U of keys we actually care about (which may be
  static or dynamic).  N = |S|.
- We do inserts and lookups by having an array A of size M, and a HASH
  FUNCTION h: U -> {0,...,M-1}.  Given an element x, we store it in A[h(x)].
  [If U were small (like 3-character strings), then we could just store x
  in A[x] as in bucketsort.  But the problem is that U is big.  That's why
  we need the hash function.]
- We resolve collisions by having each entry in A be a linked list.  A
  collision is when h(x) = h(y) for x != y.  There are other methods, but
  this one -- called "separate chaining" -- is cleanest.  To insert, just
  put the new item at the top of its list.  If h is good, then hopefully
  the lists will be small.
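As a concrete illustration of the setup above, here is a minimal sketch of a
separate-chaining table in Python.  The class and method names are made up
for the example, and Python's built-in hash() stands in for the hash
function h:

```python
# A minimal separate-chaining hash table.  Python's built-in hash()
# stands in for h; each slot of the array is a list (a "chain").

class ChainedHashTable:
    def __init__(self, m):
        self.m = m                          # table size M
        self.table = [[] for _ in range(m)]

    def _bucket(self, key):
        return self.table[hash(key) % self.m]   # the list at A[h(x)]

    def insert(self, key):
        bucket = self._bucket(key)
        if key not in bucket:               # avoid storing duplicates
            bucket.insert(0, key)           # put at top of list: O(1)

    def lookup(self, key):
        return key in self._bucket(key)     # walk down the list

    def delete(self, key):
        bucket = self._bucket(key)
        if key in bucket:
            bucket.remove(key)

t = ChainedHashTable(8)
for k in ["cat", "dog", "cat", "emu"]:
    t.insert(k)
print(t.lookup("dog"), t.lookup("fox"))     # True False
```

Lookup and delete cost is proportional to the chain length, which is why the
analysis below focuses on how long the chains get.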
Nice properties: searches are easy -- compute the index i = h(x) and then
walk down the list at A[i] until we find it.  Inserts and deletes are easy
too.

Properties we want:
 - M is O(N).
 - Elements are spread out.  Ideally, all buckets (lists) are of constant
   size.
 - h is fast to compute.  In today's analysis we'll be ignoring the time to
   compute h, viewing it as constant.  But it's worth remembering in the
   back of your head that h shouldn't be crazy.

Given this, the time to do a lookup for an item x is O(length of the list
at A[h(x)]).  Same for delete.  Insert takes just O(1) time.  So, the
important thing is to be able to analyze how big these lists get.

Basic intuition: one way to spread things out nicely is to spread them
randomly.  Unfortunately, we can't just use a random number generator to
decide where the next item goes, because h needs to support lookups: h(x)
has to be the same when we look x up as it was when we inserted it.  So, we
want something "pseudorandom".  First the bad news, then the good news.

BAD NEWS: WORST-CASE BEHAVIOR IS BAD
------------------------------------
- Claim: for any hash function h, if |U| >= (N-1)*M + 1, there exists a set
  S of size N whose elements all hash to the same location.
- Proof: by the pigeonhole principle, some location must have at least N
  elements of U mapped to it; let S be N of them.
- So, this is partly why hashing seems so mysterious -- how can you claim
  hashing is good if for any hash function you can come up with ways of
  foiling it?  One answer is that there are a lot of simple hash functions
  that work well in practice.  But what if we want to have a guarantee for
  *every* set S?

A nice idea: use randomization, in analogy to randomized quicksort.  We'll
use some random numbers in our construction of h (h itself will be a
deterministic function, of course).  What we'll show is that for *any* set
S (we don't need to assume S is random), if we then pick h in this random
way, things will be good in expectation.  This is the same kind of
guarantee as in randomized quicksort.  This is the idea of UNIVERSAL
hashing.
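The bad-news claim can be demonstrated concretely: for any fixed h, scanning
enough of U by brute force turns up a set S landing in one bucket.  In this
made-up example, h sums character codes mod M and U is all 2-letter
lowercase strings:

```python
# Pigeonhole in action: 26^2 = 676 keys into M = 10 buckets means some
# bucket receives at least 68 keys -- a bad set S for this fixed h.
# Both h and the universe here are toy choices for illustration.
from collections import defaultdict
from itertools import product

M = 10
def h(s):
    return sum(map(ord, s)) % M

buckets = defaultdict(list)
for a, b in product("abcdefghijklmnopqrstuvwxyz", repeat=2):
    buckets[h(a + b)].append(a + b)

S = max(buckets.values(), key=len)          # an adversarial set for this h
print(len(S), "keys all hash to bucket", h(S[0]))
assert all(h(x) == h(S[0]) for x in S)
```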
Once we develop this idea, we'll use it for a really nice practical
application called "perfect hashing".

UNIVERSAL HASHING
-----------------
A set of hash functions H is *universal* if for all x != y in U,

      Pr   [ h(x) = h(y) ] <= 1/M
     h<-H

Theorem: if H is universal, then for any set S in U of size N, for any x in
S, the *expected* number of collisions x has [for a random h in H] is
<= (N-1)/M.

Proof: Each y != x has a <= 1/M chance of colliding with x.  Adding up over
all y in S, we get expected number <= (N-1)/M.  Or, more formally:
 - Let C_xy = 1 if x and y collide and 0 otherwise.
   E[C_xy] = Pr(x and y collide) <= 1/M.
 - C_x = total # of collisions for x = sum_{y != x} C_xy.
 - E[C_x] = sum_{y != x} E[C_xy] <= (N-1)/M.

Corollary: For any *sequence* of L lookups, the expected *total* time for
the lookups if we use a random h in H is O(L(1 + N/M)) -- so constant time
per operation if M = N.

Proof: By linearity of expectation, the expected total time is just the sum
of the expected times for each lookup.  The lookup time is proportional to
the length of the list, which is 1 + # of collisions.

Question: can we actually construct a universal hash family?  If not, this
is all pretty vacuous.  The answer is yes.

One clarification: we're not actually going to explicitly construct all of
H and then pick one h in H at random.  Instead, we'll have a randomized
procedure for constructing h.  In fact, it probably would have been more
intuitive to define H as a *probability distribution* over hash functions,
and say H is a universal probability distribution over hash functions if
[rest of defn here].  But "universal sets" is the way it's done in the
literature.  The analogy is randomized quicksort: we can think of it as
picking an ordering on pivots at random from the set of all n! orderings,
but instead we find it more natural to just say: each time you need to pick
a pivot, pick one at random.

HOW TO CONSTRUCT: Matrix method
-------------------------------
Say keys are u bits long.  Say the table size is a power of 2, so an index
is b bits long with M = 2^b.
What we'll do is pick h to be a random b-by-u 0/1 matrix, and define
h(x) = hx, where we do addition mod 2.  These matrices are short and fat.
For instance:

        h        x    h(x)
    ---------   ---   ---
    [1 0 0 0]   [1]   [1]
    [0 1 1 1] * [0] = [1]
    [1 1 1 0]   [1]   [0]
                [0]

Claim: For x != y, Pr[h(x) = h(y)] = 1/M = 1/2^b.

Proof: First of all, what does it mean to multiply h by x?  We can think of
it as adding up some of the columns of h (mod 2), where the 1 bits in x
tell us which columns to add.  (E.g., we added the 1st and 3rd columns of h
above.)

Now, take arbitrary x, y with x != y.  They must differ someplace, so say
they differ in the ith coordinate, and for concreteness say x_i = 0 and
y_i = 1.  Imagine we first choose all of h except the ith column.  Over the
remaining choices of the ith column, h(x) is fixed.  But each of the 2^b
different settings of the ith column gives a different value of h(y) [every
time we flip a bit in that column, we flip the corresponding bit of h(y)].
So there is exactly a 1/2^b chance that h(x) = h(y).

There are other methods based on multiplication modulo primes too.

=====================================================================
Question: if we fix the set S, can we find a hash function h such that
*all* lookups are constant-time?  Yes.  This leads to...

PERFECT HASHING
---------------
A hash function is "perfect" for S if all lookups involve O(1) work.

METHOD #1
---------
Say we are willing to have a table whose size is quadratic in the size N of
our dictionary S.  Then here is an easy method.  Let H be universal and
M = N^2.  Pick a random h in H and hash everything in S.

Claim: Pr(no collisions) >= 1/2.  [So, we just try it, and if we get any
collisions, we just try a new h.]

Proof:
 - How many pairs (x,y) in S are there?  {N choose 2}.
 - For each pair, the chance they collide is <= 1/M by the definition of
   universal.
 - So, Pr(exists a collision) <= {N choose 2}/M < 1/2.

This is like the "other side" of the "birthday paradox".
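Method #1 can be sketched as follows.  As the hash family, this sketch uses
the standard mod-prime universal construction h(x) = ((a*x + b) mod p) mod M
with random a, b and a prime p larger than every key; the function and
parameter names are made up for the example:

```python
# Method #1 sketch: hash S into a table of size M = N^2 with a function
# drawn from a universal family, retrying until there are no collisions
# (each try succeeds with probability >= 1/2).  The family
# h(x) = ((a*x + b) mod p) mod M is one standard universal construction,
# valid for integer keys < p; p here is an arbitrary large prime.
import random

def perfect_hash_quadratic(S, p=10_000_019):
    N = len(S)
    M = N * N
    while True:
        a = random.randrange(1, p)
        b = random.randrange(p)
        h = lambda x: ((a * x + b) % p) % M
        if len({h(x) for x in S}) == N:     # no collisions: done
            return h, M

S = [3, 10, 12, 42, 99]
h, M = perfect_hash_quadratic(S)
table = [None] * M
for x in S:
    table[h(x)] = x
print(all(table[h(x)] == x for x in S))     # True
```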
If the number of days is a lot *more* than (# of people)^2, then there is a
reasonable chance that *no* pair has the same birthday.

What if we want to use just O(N) space?

METHOD #2
---------
History: This was a big open question -- posed as "should tables be
sorted?" -- for a fixed set, can you get constant lookup time with only
linear space?  It was followed by more and more complicated attempts, and
was finally solved using the nice idea of universal hash functions in a
2-level scheme.

Proposal: hash into a table of size N.  We will get some collisions.  Then,
for each bin, rehash it using Method #1, squaring the size of the bin to
get zero collisions.  E.g., if the 1st hash function maps {a,b,c,d,e,f} to

    [--, {a}, {b,c}, {d,e,f}],

then the final result would look something like

    [--, h_2, h_3, h_4],

where h_2 is just the function h_2(x) = 0, h_3 is a function to a table of
size 4, and h_4 is a function to a table of size 9.  The point is: using
Method #1, we can find each h_i with no collisions by picking from a
universal H and trying it -- if it doesn't work, we try again (each time,
Pr(success) >= 1/2).

By *design* this is O(1) time for lookups, since we just compute two hash
functions.  The question is: how much space does it use?  The space used is
O(N) for the 1st table (assume it takes a constant amount of space to store
each hash function), plus O(sum of |B_i|^2) for the other tables (B_i is
the ith bucket).  So, all we need to do is prove:

Theorem: if we pick the initial h from a universal set H, then

    Pr[sum_i |B_i|^2 > 4*N] < 1/2.

Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*N.  (Why does
this give us the result we need?  <-- "Markov's inequality")

Now, the neat trick is that one way to count this quantity is to count the
number of ordered pairs that collide, including each element colliding with
itself.  E.g., if a bucket has {d,e,f}, then d collides with each of d,e,f;
e collides with each of d,e,f; and f collides with each of d,e,f; so we get
9.
So,

    sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
                  = sum_x sum_y C(x,y)   [C(x,y)=1 if x,y in same bucket,
                                          else 0]
So,

    E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in the same bucket)
                     <= N + N(N-1)*(1/M)   [the "1/M" comes from the
                                            definition of universal]
                     < 2*N.                (since we set M = N)

So, the final "hash function" is:
 1. Compute i = h(x).
 2. Compute h_i(x) and use it to look up x in table T_i.
The total space used is O(sum_i |B_i|^2) = O(N).

============================================================================
Other uses of hashing
=====================
Suppose we have a long sequence of items and we want to see how many
*different* items are in the list.  What's a good way of doing that?

One way: create a hash table; do a lookup and then an insert if the item is
not in the table; count the number of inserts.

Now what if the list is really huge, so we don't have space to store them
all, but we are OK with just an approximate answer?  E.g., imagine we're a
router watching a lot of packets go by, and we want to see (roughly) how
many different source IP addresses there are.

Neat idea: say we have a hash function h that behaves like a random
function, and let's think of h(x) as a real number between 0 and 1.  One
thing we can do is just keep track of the *minimum* hash value produced so
far (so we won't have a table at all).  E.g., if the keys are
3,10,3,3,12,10,12 and h(3)=0.4, h(10)=0.2, h(12)=0.7, then we get 0.2.

The point is: if we pick N random numbers in [0,1], the expected value of
the minimum is 1/(N+1).  And there's a good chance it is fairly close (we
can improve the estimate by running several hash functions and taking the
median of the minimums).

Question: why use a hash function rather than just picking a random number
each time?  Because we care about the number of *different* items, not
just the total number of items (that problem is a lot easier: just keep a
counter...).
=======================================================
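The minimum-hash-value estimator can be sketched as follows.  Since
E[min] = 1/(N+1), we can estimate N as 1/min - 1.  Here MD5 from Python's
hashlib serves as a stand-in for a "random-looking" hash function, and the
helper names are made up for the example:

```python
# Estimate the number of *distinct* items in a stream by tracking only
# the minimum hash value.  h01 maps a key to a pseudo-uniform float in
# [0,1) using MD5 as a stand-in random-looking hash; since it is a
# function of the key, duplicates cannot change the minimum.
import hashlib

def h01(x):
    digest = hashlib.md5(str(x).encode()).hexdigest()  # 32 hex chars
    return int(digest, 16) / 16**32                    # scale to [0,1)

def estimate_distinct(stream):
    m = min(h01(x) for x in stream)
    return 1 / m - 1            # invert E[min] = 1/(N+1)

stream = [3, 10, 3, 3, 12, 10, 12]      # 3 distinct items
print(estimate_distinct(stream))
```

The estimate from a single hash function is noisy; as noted above, running
several independent hash functions and taking the median of the minimums
tightens it.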