15-859(D) Randomized Algorithms
10/6/98
* finish occupancy problems
* Hashing
* lower bounds for randomized algs
Occupancy problems, contd. -> see slides
Coupon-collector's problem -> see slides
hashing intro -> see slides
Universal hash functions -> see slides
Why are universal hash functions useful?
----------------------------------------
Say we use a hash function h to store a set S in a table of size n, where,
say, we handle collisions with linked lists. Then, the cost of find(x) is
proportional to the length of the list at h(x). So, let
C_{x,y} be the indicator random variable for x and y colliding (h(x) = h(y)).
C_{x,S} be the length of the list at h(x).
Then, for x != y, E[C_{x,y}] <= 1/n by universality, so by linearity of expectation,
E[C_{x,S}] <= 1 + (|S|-1)/n.
So, just need linear table size (n = O(|S|)) to have O(1) expected time per operation.
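The chaining scheme above can be sketched in a few lines of Python. The class and the toy mod-8 hash below are illustrative names, not from the lecture; a real instantiation would draw h from a universal family:

```python
class ChainedHashTable:
    """Table of n linked lists; h maps keys to {0,...,n-1}."""
    def __init__(self, n, h):
        self.n = n
        self.h = h
        self.table = [[] for _ in range(n)]

    def insert(self, x):
        bucket = self.table[self.h(x)]
        if x not in bucket:
            bucket.append(x)

    def find(self, x):
        # cost is proportional to the length of the list at h(x)
        return x in self.table[self.h(x)]

# Toy usage with a simple (non-universal) modular hash:
t = ChainedHashTable(8, lambda x: x % 8)
for key in [3, 11, 20, 5]:
    t.insert(key)
assert t.find(11) and not t.find(7)
```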
Perfect hashing -> see slides.
How to construct Universal Hash Function? Will give 2 ways.
----------------------------------------
Way #1: Let's assume that |M|, |N| are powers of 2: M = {0,1}^m and N = {0,1}^n.
H = { n x m 0-1 matrices }. h(x) = hx over GF(2) (i.e., arithmetic mod 2).
Note: h(0) = 0 for every h. So, could remove 0 from domain if want strong 2-universal.
Claim: For x != y, Pr[h(x) = h(y)] = 1/|N| = 1/2^n.
Proof: Say x and y differ in the ith coordinate, and say that y_i = 1 (so x_i = 0).
Now notice that, fixing all of h but the ith column, h(x) is
fixed (it doesn't depend on the ith column), but each of the 2^n different
settings of the ith column gives a different value of h(y) [i.e., h(y) is
still uniform random]. So there is exactly a 1/2^n chance that h(x) = h(y).
In fact, we've shown that for any r, Pr[h(y) = h(x) + r] = 1/|N|,
independent of h(x). So, so long as we remove x=0 from the domain (so
h(x) is itself uniform random), we have strong 2-universality.
So, this is a nice, easy family of hash functions. One problem is
that it is a little inconvenient computationally.
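A sketch of Way #1 in Python: draw a random n x m 0-1 matrix and hash by matrix-vector multiplication mod 2. The function name and the empirical check are mine; the estimate should land near the 1/2^n from the Claim:

```python
import random

def random_matrix_hash(m, n, rng=random):
    """Draw h uniformly from H = {n x m 0-1 matrices}; h(x) = hx over GF(2).
    Keys are m-bit ints; hash values are n-bit ints."""
    H = [[rng.randrange(2) for _ in range(m)] for _ in range(n)]
    def h(x):
        xbits = [(x >> j) & 1 for j in range(m)]
        out = 0
        for i, row in enumerate(H):
            bit = sum(r * b for r, b in zip(row, xbits)) % 2  # mod-2 dot product
            out |= bit << i
        return out
    return h

# Estimate Pr over random h that h(x) = h(y), for one fixed pair x != y.
random.seed(0)
m, n, trials = 8, 4, 20000
x, y = 5, 9
collisions = 0
for _ in range(trials):
    h = random_matrix_hash(m, n)
    if h(x) == h(y):
        collisions += 1
frac = collisions / trials   # should be close to 1/2^4 = 0.0625
```

Note that any h from this family satisfies h(0) = 0, matching the remark above.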
Here's another 2-universal family:
Here, let's let M = {0,...,m-1} and N = {0,...,n-1}.
Pick prime p >= m (or, think of just rounding m up to nearest prime). Define
h_{a,b}(x) = ( (a*x + b) mod p ) mod n.
H = {h_ab | a in {1,...,p-1}, b in {0,...,p-1}}
Claim: H is a 2-universal family.
Proof: Let's fix r != s and calculate, for x!=y,
Pr[[(ax + b) = r (mod p)] AND [(ay+b) = s (mod p)]].
-> it must be that ax - ay = r - s (mod p), so a = (r-s)/(x-y) mod p, which
has exactly one solution (mod p) since Z_p is a field (x-y != 0 is invertible).
-> given this value of a, we must have b = r-ax (mod p).
So, this probability is 1/[p(p-1)].
(What is the probability for r=s? Ans: 0)
Now, how many ordered pairs r != s are there in {0,...,p-1} such that r = s (mod n)?
-> have p choices for r, and then at most ceiling(p/n)-1 <= (p-1)/n choices for
s ("-1" since we disallow s=r). So the product is at most p(p-1)/n.
Putting this all together,
Pr[(a*x + b mod p) mod n = (a*y + b mod p) mod n]
<= p(p-1)/n * [1/(p(p-1))] = 1/n.
QED
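For a small prime this family is easy to check exhaustively. The sketch below (names are illustrative) verifies the 1/n collision bound for every pair x != y:

```python
p, n = 11, 4   # prime p (playing the role of m rounded up) and table size n

def h(a, b, x):
    return ((a * x + b) % p) % n

# Exhaustively verify 2-universality: for every pair x != y in {0,...,p-1},
# the fraction of (a,b) in H causing a collision is at most 1/n.
for x in range(p):
    for y in range(p):
        if x != y:
            coll = sum(h(a, b, x) == h(a, b, y)
                       for a in range(1, p) for b in range(p))
            assert coll <= p * (p - 1) / n
```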
Lower bounds for randomized algorithms
======================================
Consider some problem we want to solve, like sorting. What does it
mean to give a lower bound on performance?
for all algs A, exists input I s.t. C(A,I) > b,
where C(A,I) = cost of alg A on input I.
Can think of as game between algorithm designer and adversary, where
alg has to go first.
What about randomized algorithms? View randomized alg as a
distribution on deterministic algs. Then we want to say that for any
distribution on deterministic algs, there exist an input such that
expected cost is high.
1) for all dist Q, exists I such that E_Q[C(A_Q, I)] > b
Often easiest way to do this is to have adversary go first, and be
randomized, and assume alg is deterministic. That is, to show:
2) exists dist P such that for all A, E_P[C(A,I_P)] > b
Can you see why this is sufficient?
-> Look at E[C(A_Q, I_P)] over both P and Q, where P is the distribution
of (2). For every deterministic A, (2) says E_P[C(A, I_P)] > b; averaging
over Q, we get E_{P,Q}[C(A_Q, I_P)] > b. But then there must exist a fixed
input I such that E_Q[C(A_Q, I)] > b. (Using the fact that if E[X] > b, then
there exists a legal value of X that is > b.) So (1) holds.
In fact, in the case where algs=circuits, so have finite number, and
any distrib is OK, then min-max theorem of game theory says
max_P min_A E[C(A, I_P)] == min_Q max_I E[C(A_Q, I)]
adv goes first alg goes first
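The direction used above, that the adversary-goes-first value lower-bounds the algorithm-goes-first value, can be sanity-checked on a toy cost matrix (the matrix and distributions below are made up purely for illustration):

```python
# C[a][i] = cost of deterministic alg a on input i (made-up toy values)
C = [[1, 3],
     [2, 0]]
P = [0.6, 0.4]   # adversary's distribution over inputs
Q = [0.5, 0.5]   # algorithm's distribution over deterministic algs

# Adversary goes first: best deterministic alg against P.
adv_first = min(sum(P[i] * C[a][i] for i in range(2)) for a in range(2))
# Algorithm goes first: worst input against Q.
alg_first = max(sum(Q[a] * C[a][i] for a in range(2)) for i in range(2))
assert adv_first <= alg_first   # holds for every choice of P and Q
```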
Apply to hashing
----------------
Theorem: for any family H of functions h: M-->N, (|M| = m, |N| = n),
there exist x != y in M s.t.
Pr_{h<-H}[ h(x) = h(y) ] >= 1/n - 1/m.
Proof: show that for any fixed h, for a random ordered pair x != y,
Pr_{x!=y}[ h(x) = h(y) ] >= 1/n - 1/m.
Solve by counting number of ordered pairs that collide
and then dividing by m(m-1).
B_j = number of points in M that map to j. Number of ordered
pairs (x != y) that collide is:
Sum_j B_j(B_j - 1) = Sum_j(B_j)^2 - m.
Minimum occurs when all B_j are equal. Why?
There's no randomness here, but consider the random variable X = B_j with
j drawn uniformly from {1,...,n}. Var(X) = Avg[(B_j)^2] - (Avg[B_j])^2,
and variance can't be negative. Since Avg[B_j] = m/n, we get
Avg[(B_j)^2] >= (m/n)^2, minimized when all B_j are equal (variance = 0).
So, Sum_j B_j(B_j - 1) >= n(m/n)^2 - m = m^2(1/n - 1/m).
There are m(m-1) ordered pairs x,y, so the probability that a
random pair collides is at least:
[m^2/(m(m-1))] * (1/n - 1/m) >= 1/n - 1/m.
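The counting in this proof is easy to replicate. The sketch below (the helper name is mine) computes the collision probability of a fixed h over random ordered pairs and checks the 1/n - 1/m bound:

```python
from collections import Counter

def collision_prob(h, M, n):
    """Pr over a random ordered pair x != y in M that h(x) = h(y)."""
    m = len(M)
    B = Counter(h(x) for x in M)                     # B[j] = |{x : h(x) = j}|
    ordered_collisions = sum(b * (b - 1) for b in B.values())
    return ordered_collisions / (m * (m - 1))

m, n = 12, 4
bound = 1/n - 1/m
# A perfectly balanced h is the minimizing case from the proof...
assert collision_prob(lambda x: x % n, range(m), n) >= bound
# ...and a terrible h (everything in one bucket) exceeds the bound by a lot.
assert collision_prob(lambda x: 0, range(m), n) >= bound
```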