15-859(D) Randomized Algorithms                                  10/6/98
 * finish occupancy problems
 * Hashing
 * lower bounds for randomized algs

Occupancy problems, contd.    -> see slides
Coupon-collector's problem    -> see slides
Hashing intro                 -> see slides
Universal hash functions      -> see slides

Why are universal hash functions useful?
----------------------------------------
Say we use a hash function to store a set S in a table of size n,
where, say, we handle collisions with linked lists.  Then the cost of
find(x) is proportional to the length of the list at h(x).  So, let
C_{x,y} be the indicator random variable for x and y colliding, and
let C_{x,S} be the length of the list at h(x).  Then E[C_{x,y}] <= 1/n,
so

    E[C_{x,S}] <= 1 + (|S|-1)/n.

So, we just need linear table size to get O(1) expected time per
operation.

Perfect hashing -> see slides.

How to construct a universal hash function?  Will give 2 ways.
----------------------------------------
Way #1: Let's assume that M and N are powers of 2: M = {0,1}^m and
N = {0,1}^n.  Let H = { n x m 0-1 matrices }, with h(x) = hx over
GF(2) (i.e., addition is mod 2).  Note: h(0) = 0, so we could remove
0 from the domain if we want strong 2-universality.

Claim: For x != y, Pr[h(x) = h(y)] = 1/|N|.

Proof: Say x and y differ in the ith coordinate, and say y_i = 1 (so
x_i = 0).  Now notice that, fixing all of h except the ith column,
h(x) is fixed, but each of the 2^n settings of the ith column gives a
different value of h(y) [i.e., h(y) is still uniform random].  So
there is exactly a 1/2^n chance that h(x) = h(y).

In fact, we've shown that for any r, Pr[h(y) = h(x) + r] = 1/|N|,
independent of h(x).  So, as long as we remove x = 0 from the domain
(so that h(x) is uniform random), we have strong 2-universality.

So, this is a nice, easy family of hash functions.  One problem is
that it is a little inconvenient computationally.  Here's another
2-universal family.  This time, let M = {0,...,m-1} and
N = {0,...,n-1}.  Pick a prime p >= m (or, think of just rounding m
up to the nearest prime).  Define

    h_{a,b}(x) = ( (a*x + b) mod p ) mod n.
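Before detailing this second family, Way #1 is easy to sketch in code.
Here is a minimal Python demo (the function names are mine, not from
the lecture) that draws a random n x m 0-1 matrix and empirically
checks the 1/2^n collision claim for one fixed pair x != y:

```python
import random

random.seed(0)  # for reproducibility of the demo

def random_matrix(n, m):
    """Draw a uniformly random n x m 0-1 matrix (one hash function h)."""
    return [[random.randint(0, 1) for _ in range(m)] for _ in range(n)]

def matrix_hash(h, x):
    """h(x) = hx over GF(2): x is an m-bit tuple, output is an n-bit tuple."""
    return tuple(sum(hij * xj for hij, xj in zip(row, x)) % 2 for row in h)

# Empirical check of the claim: for fixed x != y, over a random h,
# Pr[h(x) = h(y)] should be 1/2^n (here n = 3, so 1/8).
m, n = 4, 3
x = (1, 0, 1, 1)
y = (0, 1, 1, 0)
trials = 20000
collisions = sum(
    matrix_hash(h, x) == matrix_hash(h, y)
    for h in (random_matrix(n, m) for _ in range(trials))
)
print(collisions / trials)  # should be close to 1/8 = 0.125
```

The empirical fraction concentrates around 1/8 because h(x) = h(y)
exactly when h maps the nonzero vector x XOR y to 0, which happens
with probability 1/2^n over a random matrix.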
    H = { h_{a,b} | a in {1,...,p-1}, b in {0,...,p-1} }.

Claim: H is a 2-universal family.

Proof: Let's fix r != s and calculate, for x != y,

    Pr[ (ax + b) = r (mod p)  AND  (ay + b) = s (mod p) ].

 -> It must be that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p,
    which has exactly one solution (mod p) since Z_p* is a field.
 -> Given this value of a, we must have b = r - ax (mod p).

So, this probability is 1/[p(p-1)].  (What is the probability for
r = s?  Ans: 0, since a != 0 and x != y force ax+b != ay+b mod p.)

Now, how many ordered pairs r != s are there in {0,...,p-1} such that
r = s (mod n)?

 -> We have p choices for r, and then at most ceiling(p/n) - 1 choices
    for s ("-1" since we disallow s = r).  Since
    ceiling(p/n) - 1 <= (p-1)/n, the product is at most p(p-1)/n.

Putting this all together,

    Pr[ (ax + b mod p) mod n = (ay + b mod p) mod n ]
        <= [p(p-1)/n] * [1/(p(p-1))] = 1/n.      QED

Lower bounds for randomized algorithms
======================================
Consider some problem we want to solve, like sorting.  What does it
mean to give a lower bound on performance?

    For all algs A, there exists an input I s.t. C(A,I) > b,

where C(A,I) = cost of alg A on input I.  Can think of this as a game
between the algorithm designer and an adversary, where the alg has to
go first.

What about randomized algorithms?  View a randomized alg as a
distribution Q on deterministic algs.  Then we want to say that for
any distribution on deterministic algs, there exists an input such
that the expected cost is high:

 1) for all dists Q, exists I such that E_Q[C(A_Q, I)] > b.

Often the easiest way to do this is to have the adversary go first,
and be randomized, and assume the alg is deterministic.  That is, to
show:

 2) exists a dist P on inputs such that for all A, E_P[C(A, I_P)] > b.

Can you see why this is sufficient?

 -> Look at E[C(A_Q, I_P)], for some Q and for the P of (2).  (2) says
    this must be larger than b, since it's larger for every
    deterministic A; and if for some Q this is larger than b, then
    clearly there exists an I s.t. E_Q[C(A_Q, I)] is larger than b.
    (Using the fact that if E[X] > y, then there exists a legal value
    of X that is > y.)

In fact, in the case where algs = circuits, so we have a finite
number of them, and any distribution is OK, the min-max theorem of
game theory says

    max_P min_A E[C(A, I_P)]  ==  min_Q max_I E[C(A_Q, I)]
      (adv goes first)              (alg goes first)

Apply to hashing
----------------
Theorem: For any family H of functions h: M --> N (|M| = m, |N| = n),
there exist x != y in M s.t.

    Pr_{h<-H}[ h(x) = h(y) ] >= 1/n - 1/m.

Proof: We show that for any fixed h, for a random ordered pair x != y,

    Pr_{x!=y}[ h(x) = h(y) ] >= 1/n - 1/m.

(This suffices: averaging over h <- H, some fixed pair x != y must
collide with at least this probability.)  Solve by counting the
number of ordered pairs that collide and then dividing by m(m-1).
Let B_j = number of points in M that map to j.  The number of ordered
pairs (x != y) that collide is

    Sum_j B_j(B_j - 1) = Sum_j (B_j)^2 - m.

The minimum occurs when all the B_j are equal.  Why?  There's no
randomness here, but consider the random variable X = B_j, with
j = 1,...,n chosen with equal probability.  Then

    Variance = Avg[(B_j)^2] - (Avg[B_j])^2,

and variance can't be negative.  Since Avg[B_j] = m/n (the B_j sum to
m), we get Avg[(B_j)^2] >= (m/n)^2, minimized when all the B_j are
equal (variance = 0).  So

    Sum_j B_j(B_j - 1) >= n(m/n)^2 - m = m^2(1/n - 1/m).

There are m(m-1) ordered pairs x != y, so the probability that a
random pair collides is at least

    [m^2/(m(m-1))] * (1/n - 1/m) >= 1/n - 1/m.
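The counting argument can be checked directly for small m and n, and
the (a*x + b) family from earlier shows the bound is nearly tight.
Here is a minimal Python sketch (helper names are mine) that verifies
both: any fixed h collides on at least a 1/n - 1/m fraction of
ordered pairs, while every pair's collision probability over the
h_{a,b} family stays at most 1/n.

```python
from itertools import permutations

def collision_frac(h, m):
    """Fraction of ordered pairs x != y in {0,...,m-1} with h(x) = h(y)."""
    pairs = list(permutations(range(m), 2))
    return sum(h(x) == h(y) for x, y in pairs) / len(pairs)

m, n = 12, 4
bound = 1 / n - 1 / m

# Lower bound: even a perfectly balanced h meets it, and a lopsided
# h exceeds it by a lot.
assert collision_frac(lambda x: x % n, m) >= bound   # balanced buckets
assert collision_frac(lambda x: 0, m) >= bound       # everything collides

# Near-tightness: for the family h_{a,b}(x) = ((a*x + b) % p) % n,
# every fixed pair x != y collides with probability <= 1/n over a
# uniform random (a, b).
p = 13  # prime >= m
family = [(a, b) for a in range(1, p) for b in range(p)]
worst = max(
    sum((a * x + b) % p % n == (a * y + b) % p % n for a, b in family)
    / len(family)
    for x, y in permutations(range(m), 2)
)
assert worst <= 1 / n
print(f"lower bound {bound:.4f}, family's worst pair {worst:.4f}")
```

For these parameters the gap between the lower bound (1/4 - 1/12)
and the family's worst pair is small, matching the theorem's message:
no family can do much better than 1/n, and 2-universal families
essentially achieve it.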