Notes for 01/27 15-852 Randomized Algorithms

* Universal hashing                Sections 8.4, 8.5
* Perfect hashing
* Occupancy problems

==============================================================================
HASHING
-------
Setup: we have a domain/universe M, and some subset S of M of points we
care about (|S| << |M|). We want to map M into a much smaller set N
(e.g., |N| = O(|S|)) in a way that is easy to compute and results in few
collisions within S. A function h: M --> N is called a hash function.

There are lots of motivations for hashing. One is the Dictionary problem:

 - static: a set S of keys (e.g., a key is a word, or a name), and
   associated with each key is an object (e.g., a definition, or a phone
   number). Want to store S so that we can do efficient lookups.

 - dynamic: a sequence of insert(key,object), find(key), delete(key)
   requests.

Goal: support these operations in O(1) time each.

Idea of hashing: maintain a small table, and use a hash function h to map
keys into this table. If h behaves randomly, we shouldn't get too many
collisions.

This is interesting because it is a very basic technique in computer
science where the idea of randomness is critical. Even if in practice you
don't use randomness in selecting the function (as in the methods we will
prove things about), the motivation/inspiration is random behavior.

Simple fact: if M is sufficiently large (|M| > |S|*|N|), then for ANY
fixed h there exists an S such that all elements of S collide. (By the
pigeonhole principle, some bucket of h receives more than |S| elements of
M; let S be |S| of them.) So we need to either choose h based on S, make
randomness assumptions about S, or choose h using some randomness.

For convenience, let's abuse notation and let N = |N| and M = |M|.

Universal hash functions
------------------------
Definition: H is a 2-universal family (set) of hash functions from M to N
if for all x,y in M (x != y),

    Prob  [h(x) = h(y)] <= 1/N.
    h<-H

This is similar to pairwise independence. Even more similar is the notion
of "strongly 2-universal": H is strongly 2-universal if for all
x_1 != x_2 in M and all y_1, y_2 in N,

    Prob[h(x_1) = y_1 AND h(x_2) = y_2] = 1/N^2.

Why are universal hash functions useful?
----------------------------------------
Say we use a hash function to store a set S in a table of size N, where,
say, we handle collisions with linked lists (chaining). Then the cost of
find(x) is proportional to the length of the list at h(x). So, let
C_{x,y} be the indicator random variable for x and y having a collision,
and let C_{x,S} be the length of the list at h(x). Then

    E[C_{x,y}] <= 1/N,   so   E[C_{x,S}] < 1 + |S|/N.

So we need only a linear table size (N = O(|S|)) to get O(1) expected
time per operation.

How to construct such a family? We will give two ways.
-----------------
Way #1: Assume M and N are powers of 2: M = {0,1}^m and N = {0,1}^n.
Let H = { n x m 0-1 matrices }, with h(x) = hx over GF(2) (addition is
mod 2).

Note: h(0) = 0 for every h, so we could remove 0 from the domain if we
want strong 2-universality.

Claim: For x != y, Pr[h(x) = h(y)] = 1/N.

Proof: Say x and y differ in the ith coordinate, and say y_i = 1 (so
x_i = 0). Let z = y but with z_i = 0 (so it may be that z = x). Then
h(y) = h(z) + h_i, where h_i is the ith column of h. Notice that h(z)
and h(x) are both independent of h_i (neither z nor x has a 1 in
coordinate i), so

    Pr[h(y) = h(x)] = Pr[h_i = h(x) - h(z)] = 1/2^n = 1/N.

In fact, we've shown that for any r, Pr[h(y) = h(x) + r] = 1/N,
independent of h(x). So, as long as x != 0, we have strong
2-universality.

So this is a nice, easy family of hash functions. One problem is that it
requires n*m = log(N)*log(M) random bits.

Way #2: Here, let M = {0,...,m-1} and N = {0,...,n-1}. Pick a prime
p >= m (or think of just rounding m up to the nearest prime). Define

    h_{a,b}(x) = ((a*x + b) mod p) mod n,
    H = { h_{a,b} | a,b in Z_p and a != 0 }.

Claim: H is a 2-universal family.

Proof: Fix x != y, fix r != s in Z_p, and calculate

    Pr[ (ax + b = r mod p) AND (ay + b = s mod p) ].

 -> It must be that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p,
    which has exactly one solution mod p since Z_p is a field (and this
    a is nonzero because r != s).
 -> Given this value of a, we must have b = r - ax (mod p).

So this probability is exactly 1/[p(p-1)]. (What is the probability for
r = s? Ans: 0, since a != 0 and x != y.)

Now, how many pairs r != s in {0,...,p-1} are there such that
r = s (mod n)?
 -> We have p choices for r, and then at most ceiling(p/n) - 1 choices
    for s (the "-1" since we disallow s = r). Since
    ceiling(p/n) - 1 <= (p-1)/n, the number of such pairs is at most
    p(p-1)/n.

Putting this all together,

    Pr[ ((a*x + b) mod p) mod n = ((a*y + b) mod p) mod n ]
        <= [p(p-1)/n] * [1/(p(p-1))] = 1/n.                        QED

Perfect hashing
---------------
Say we want to hash a fairly stable set, e.g., a real dictionary, or a
lookup table. Then we can pick the hash function based on the set S.
Can we get O(1) time WORST-CASE? --> Perfect hashing. (h is perfect for
S if it causes no collisions on S.)

One way: Say O(|S|^2) space is OK. Then just use a universal hash
function into a table of size n = |S|^2. For any x != y, we have
Pr[h(x) = h(y)] <= 1/n. So,

    Pr[exists x != y in S that collide] <= |S|(|S|-1)/(2n) < 1/2.

So just pick a random h in H and try it: each attempt succeeds with
probability at least 1/2, and if it fails, try again. (The expected
number of attempts is less than 2.)

What if we only want to use O(|S|) space? Let n = |S| and try a
universal hash function. Let B_i = size of the ith bin. What is the
expected value of sum_i (B_i)^2? Summing over ordered pairs from S,

    sum_i (B_i)^2 = sum_{(x,y)} C_{x,y} = n + sum_{x != y} C_{x,y},

so E[sum_i (B_i)^2] <= n + n(n-1)/n < 2n.

So, any ideas...? By Markov's inequality, for at least half of the h's
in H, sum_i (B_i)^2 is at most 4n. Now, if B_j = t, we can hash that bin
perfectly into t^2 cells with a random h (as above). So the final hash
function is a tree of depth 2: a first-level table of size n whose jth
entry points to a second-level, collision-free table of size (B_j)^2.
The second level uses at most 4n space in total, and there are no
collisions.
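As a concrete illustration of Way #1, here is a minimal Python sketch of the random-matrix family over GF(2). Representing bit-vectors as integers, storing the matrix as a list of n-bit columns, and the name `random_matrix_hash` are all my illustrative choices, not from the notes.

```python
import random

def random_matrix_hash(m, n):
    """Draw a random n x m 0-1 matrix H over GF(2); return h(x) = Hx,
    where x is an m-bit int and the output is an n-bit int."""
    # One n-bit column per input bit; h(x) XORs the columns where x has a 1,
    # since matrix-vector multiplication mod 2 is exactly this XOR of columns.
    cols = [random.getrandbits(n) for _ in range(m)]

    def h(x):
        out = 0
        for i in range(m):
            if (x >> i) & 1:
                out ^= cols[i]  # addition in GF(2) is XOR
        return out

    return h
```

Note that, as in the proof above, h is linear: h(x XOR y) = h(x) XOR h(y), and in particular h(0) = 0, which is why 0 must be excluded from the domain for strong 2-universality.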
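The Way #2 family and the depth-2 perfect hashing scheme above can likewise be sketched in Python. This is only a sketch under assumptions of mine: the prime P = 2^31 - 1, the assumption that keys are distinct integers in {0,...,P-1}, and the names `random_hash`, `build_perfect_table`, `contains` are all illustrative.

```python
import random

P = 2_147_483_647  # a prime (2^31 - 1); assume all keys lie in {0, ..., P-1}

def random_hash(n):
    """Draw h_{a,b}(x) = ((a*x + b) mod p) mod n from the 2-universal
    family H = {h_{a,b} : a, b in Z_p, a != 0}."""
    a = random.randrange(1, P)  # a in {1, ..., p-1}
    b = random.randrange(0, P)  # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % P) % n

def build_perfect_table(S):
    """Two-level perfect hashing of a static, nonempty set S of distinct
    keys: level 1 of size n = |S|, then each bin of size t is rehashed
    collision-free into t^2 cells."""
    S = list(S)
    n = len(S)
    # Level 1: retry until sum_i B_i^2 <= 4n (true for >= half of all h).
    while True:
        h = random_hash(n)
        bins = [[] for _ in range(n)]
        for x in S:
            bins[h(x)].append(x)
        if sum(len(b) ** 2 for b in bins) <= 4 * n:
            break
    # Level 2: hash each bin of size t into t^2 cells; each attempt is
    # collision-free with probability > 1/2, so retry until it succeeds.
    tables = []
    for b in bins:
        size = max(1, len(b) ** 2)
        while True:
            g = random_hash(size)
            cells = [None] * size
            ok = True
            for x in b:
                if cells[g(x)] is not None:
                    ok = False
                    break
                cells[g(x)] = x
            if ok:
                tables.append((g, cells))
                break
    return h, tables

def contains(table, x):
    """O(1) worst-case lookup: two hash evaluations, no collision chains."""
    h, tables = table
    g, cells = tables[h(x)]
    return cells[g(x)] == x
```

Both retry loops terminate after an expected O(1) number of attempts, matching the analysis above, and the second level uses at most 4n cells in total.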