03/30/11   15-859(M) Randomized Algorithms

* Random walks and Rapid mixing                           Chap 6.2, 6.7, 6.8
==============================================================================

Last time: random walks and cover time.  Along the way we showed that in the
stationary distribution, the probability on a node is proportional to its
degree.  A random walk on a graph is a special case of a random walk on a
Markov chain.

Markov chains:
 - Defn: P_ij = prob of going to state j given that you're in state i.
 - E.g., for a graph, if we put 1/2 prob of staying put on each step, then
   P_ij = 1/(2d(i)) if j is a neighbor of i, P_ii = 1/2, and the rest are 0.
 - Write a prob distribution as a row vector q.  Then one step of the walk
   takes q to qP.
 - If the underlying graph (including only edges with non-zero prob) is
   strongly connected, the chain is "irreducible".  For finite irreducible
   MCs, "aperiodic" means: for any starting distribution, after some finite
   number of steps (and at all steps thereafter) there is non-zero prob on
   every state.
 - Stationary distribution: a left eigenvector of P with eigenvalue 1
   (\pi P = \pi).  Note: this is the largest eigenvalue.

One of the things we'll want for some algorithms is to show that a random
walk is rapidly mixing, e.g., that it approaches the stationary distribution
in only polylog(n) steps.  What kinds of graphs / Markov chains give us this
property?

=============================================================================
When do we get rapid mixing?  [especially for random walks on constant-degree
graphs]

Ans #1: if there is a gap between the largest and 2nd-largest eigenvalue.
Ans #2: if the graph is an expander: for all sets S of <= n/2 nodes,
        |N(S) - S| > epsilon*|S|.  I.e., the graph doesn't look like a
        dumbbell.

Next time: a nice result of Noga Alon -- the equivalence of these two notions.
Today: keeping with the theme of random walks: why an eigenvalue gap gives you
rapid mixing, a randomized construction of expanders, and one interesting
application of deterministic constructions.

Eigenvalue gap -> Rapid Mixing
==============================

Theorem: Say M is a Markov chain with real eigenvalues and orthogonal
eigenvectors.  Then, for any starting distribution, the L_2 distance between
q^(t) (the distribution after t steps) and the stationary distribution \pi is
at most |lambda_2|^t, where lambda_2 is the eigenvalue of second-largest
absolute value.

So, if we have an eigenvalue gap of some constant epsilon (i.e.,
|lambda_2| <= 1 - epsilon), then it takes only O((log n)/epsilon) steps to get
the distance down to 1/n^c.

For instance, symmetric matrices have real eigenvalues and orthogonal
eigenvectors.  E.g., say M = I/2 + A/(2d) where A is the adjacency matrix of a
d-regular graph.  Then M is symmetric, all eigenvalues of M are nonnegative
[the eigenvalues of A lie in [-d,d], so the eigenvalues of M, which have the
form 1/2 + lambda_A/(2d), lie in [0,1]], and M has eigenvalue gap epsilon/(2d)
where epsilon is the eigenvalue gap of A.

In fact, this can be generalized to ``time-reversible'' Markov chains:
p_{ij}\pi_i = p_{ji}\pi_j for all i,j.  E.g., a random walk on a graph whose
nodes are not necessarily all of the same degree.

PROOF OF THEOREM: Say the orthogonal eigenvectors are e_1, ..., e_n, with
e_1 = \pi (just for easier intuition we normalize e_1 to have unit L_1
length).  Say we start at q^(0) = c_1 e_1 + ... + c_n e_n.  (Actually c_1 = 1,
since the entries of q^(0) and of \pi sum to 1 while the entries of every
other eigenvector sum to zero.)  After t steps we're at:

   q^(t) = c_1 e_1 + c_2 (lambda_2)^t e_2 + ... + c_n (lambda_n)^t e_n.

The L_2 length of the (e_2,...,e_n) part is at most |lambda_2|^t times the
length of q^(0), and ||q^(0)||_2 <= ||q^(0)||_1 = 1.  QED

Next: building expanders and some of their uses.
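As a quick numerical sanity check of the theorem (a sketch of my own, not part
of the notes), the following Python/numpy snippet builds the lazy walk
M = I/2 + A/(2d) for a small 4-regular circulant graph (an arbitrary example),
computes the second-largest |eigenvalue|, and checks that the L_2 distance to
the uniform stationary distribution stays below |lambda_2|^t:

    # Sketch: empirically check the mixing bound on a small d-regular graph.
    import numpy as np

    n, d = 20, 4
    A = np.zeros((n, n))
    for i in range(n):                  # 4-regular circulant graph: each node
        for off in (1, 2):              # is joined to its neighbors at
            A[i, (i + off) % n] = 1     # offsets +-1 and +-2 on a cycle
            A[i, (i - off) % n] = 1

    M = np.eye(n) / 2 + A / (2 * d)     # lazy walk: stay put w.p. 1/2
    eigs = np.sort(np.abs(np.linalg.eigvalsh(M)))[::-1]
    lam2 = eigs[1]                      # second-largest |eigenvalue| (eigs[0] ~ 1)

    pi = np.ones(n) / n                 # stationary distribution (uniform: graph is regular)
    q = np.zeros(n); q[0] = 1.0         # start with all mass on one node
    for t in range(1, 30):
        q = q @ M                       # one step of the walk: row vector times M
        assert np.linalg.norm(q - pi) <= lam2 ** t + 1e-12
    print("lambda_2 =", lam2)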
RANDOM LOW-DEGREE GRAPHS ARE EXPANDERS
--------------------------------------

Claim: Say we create an (n,n) bipartite graph as follows: for each node on the
left, we pick a random subset of d=3 nodes on the right to be its neighbors.
Then there exists a constant c such that w.h.p., for all sets S on the left
with |S| <= n/c, we have |N(S)| >= 1.9*|S|.  [Equivalently, we can view this
as creating an n-node directed graph by giving each node 3 random out-edges.]

Proof: For given values of s and r, what is the probability that there
*exists* a set S of s vertices on the left with <= r neighbors on the right?
Let's upper-bound this by fixing a set S of s nodes on the left and a set R of
r nodes on the right, calculating the prob that all neighbors of S lie inside
R, and then multiplying by (n choose s)*(n choose r):

  Pr(exists such an S)
     <= (n choose s)(n choose r)[(r choose d) / (n choose d)]^s

            [now, using (a/b)^b <= (a choose b) <= (ae/b)^b]

     <= (ne/s)^s (ne/r)^r [(re/d) / (n/d)]^{ds}

     <= e^{s + r + ds} * r^{ds - r} / ( n^{ds - s - r} s^s )

            [now, plugging in d=3, r=1.9s]

     <= e^{6s} r^{1.1s} / ( n^{0.1s} s^s )

      = (1.9 e^6 r^{0.1} / n^{0.1})^s.

We now need to sum this over all values of s.  For s=1 the bound looks like
const/n^{0.1}; for s=2 it looks like const/n^{0.2}.  By s=10 it is down to
const/n.  The quantity keeps dropping until s starts getting large.  We're OK
so long as the numerator 1.9 e^6 r^{0.1} is sufficiently smaller than the
denominator n^{0.1} (say, at most half of it), and this holds for all
s <= n/c once c is a sufficiently large constant.

Applications
------------

Lots of applications for things like routing, fault-tolerant networks, etc.
Today we'll look at amplifying the success probability of a randomized
algorithm without using many more random bits.  For this application we'll
need the fact that there are deterministic constructions of constant-degree
expanders with the property that even if the graph has 2^n nodes, we can
perform each step of a walk in polynomial time.  (Given the name of a node, we
can quickly compute who its neighbors are.)

Gabber-Galil construction
-------------------------

|X| = |Y| = n = m^2.  Each vertex in X is labeled by a pair (x,y) with x,y in
Z_m; same for the vertices in Y.  Degree 5.  The neighborhood of (x,y) is
(all arithmetic mod m):

   (x+y+1, y)   (x+y, y)   (x, y)   (x, x+y)   (x, x+y+1)
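Since the selling point is that a node's neighbors can be computed directly
from its name, here is a minimal Python sketch of that computation (the
function name gg_neighbors is mine; it simply encodes the degree-5 neighbor
list above, with all arithmetic mod m):

    def gg_neighbors(x, y, m):
        """Neighbors in Y of the left-vertex (x, y); all arithmetic mod m."""
        return [((x + y + 1) % m, y),
                ((x + y) % m, y),
                (x, y),
                (x, (x + y) % m),
                (x, (x + y + 1) % m)]

    # Even when m is huge (so the graph has m^2 nodes), one walk step is O(1):
    print(gg_neighbors(3, 5, 7))   # -> [(2, 5), (1, 5), (3, 5), (3, 1), (3, 2)]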
Specifically, here is the result of [Impagliazzo & Zuckerman]:
------------------------------------------------------------

* We have a BPP algorithm using r random bits, with failure prob 1/100.
  Claim: we can decrease the failure prob to (1/100)^k using only r + O(k)
  random bits (in contrast to r*k bits if we ran the algorithm k times
  independently).

* Idea: set up an implicit expander graph with one node for each string of
  length r, and imagine we color nodes ``good'' or ``bad'' depending on
  whether they cause the BPP algorithm to answer correctly or not (so 99% of
  the nodes are good).  Start at a random initial position and then do a
  random walk; we only need a constant number of random bits per step.
  Sample every \beta steps (i.e., run the BPP algorithm using the current
  node as its random input), where \beta is chosen so that the 2nd-largest
  eigenvalue of R = M^\beta is at most 1/10.  Take 42k samples and output the
  majority vote.  What we want is for it to be very unlikely that more than
  half of the samples are bad nodes.

* We'd like to say that no matter where you start, after running one step of
  R there's at most a 1/5 chance of being at a bad node.  We can't quite get
  this, but we get something similar by looking at L_2 length.  In
  particular, for any vector p,

     sqrt(sum of squares of the bad entries of pR) <= 1/5 * (L_2 length of p).

  Proof: Say the eigenvectors are e_1, e_2, ..., where e_1 = (1/n,...,1/n),
  all orthogonal.  Write p = x + y, where x = c_1 e_1 is the component of p
  along e_1 and y = c_2 e_2 + ... + c_n e_n.  For convenience, define Z as
  the matrix that zeroes out all the good entries; i.e., Z is the identity
  but with the entries for good nodes zeroed out.  So our goal is to show
  that ||pRZ|| <= 1/5 * ||p||.

  Look at x:  ||xRZ|| = ||xZ|| <= 1/10 * ||x||.
     [xR = x, and x is uniform with only a 1/100 fraction of its entries bad,
      so zeroing out the good entries shrinks the L_2 length by a factor of
      sqrt(100) = 10.]

  Look at y:  ||yRZ|| <= ||yR|| = ||c_2 lambda_2^beta e_2 + ... +
              c_n lambda_n^beta e_n|| <= 1/10 * ||y||.
     [since each component shrinks by at least a factor of 10]

  So,  ||pRZ|| <= ||xRZ|| + ||yRZ||      (triangle inequality)
              <= 1/10 ||x|| + 1/10 ||y||
              <= 1/5 * ||p||             (x and y are orthogonal components of
                                          p, so each has length <= ||p||).

  (Note: this also shows that ||pRRZ|| <= 1/5 * ||p||, etc.)

  Intuitively, if p was "spread out" already, then so is pR, and multiplying
  by Z zeroes out most of its weight.  On the other hand, if p is highly
  concentrated, then multiplying by R already decreases the L_2 length by
  spreading out the distribution.

* Now, to finish the proof: we want to say it's unlikely that more than half
  of the samples are bad.  Consider a fixed set of t samples, and ask: what's
  the probability that these are all bad?

  Claim: if q is our starting distribution, then this probability is the L_1
  length of

     q R R R Z R Z R R Z R Z

  where the t "Z"s mark the t samples we took (there's at least one "R"
  between any two "Z"s).

  We'll use the fact that L_1 length <= sqrt(n) * L_2 length.  And the L_2
  length of (q R R R Z R Z R R Z R Z) is at most (1/5)^t times the L_2 length
  of q.  And the L_2 length of q is 1/sqrt(n) [since we started at a *random*
  initial position -- this is where we use that fact].  So the probability
  that these t samples are all bad is at most (1/5)^t.

  For half of all 42k samples to be bad, some set of t = 21k samples must be
  all bad.  There are at most 2^(42k) such sets, so the failure probability is
  at most 2^(42k) * (1/5)^(21k) = (4/5)^(21k) <= (1/100)^k.   QED
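To make the sampling procedure concrete, here is a rough Python sketch of the
walk-and-majority-vote scheme (my own illustration, not code from the paper).
It assumes a function neighbors(v) returning the constant-size neighbor list
of node v in an explicit expander on the 2^r random strings (e.g., in the
Gabber-Galil style), and a stand-in bpp_alg(x, v) that runs the BPP algorithm
on input x with random string v; the constants match the discussion above:

    import random

    def amplify(x, bpp_alg, neighbors, start, beta, k):
        """Majority vote over 42k samples, one taken every beta walk steps."""
        v = start                      # the r truly random bits are spent here
        accepts = 0
        for _ in range(42 * k):
            for _ in range(beta):      # beta walk steps between samples;
                nbrs = neighbors(v)    # ...each step costs O(1) random bits
                v = nbrs[random.randrange(len(nbrs))]
            if bpp_alg(x, v):          # sample: run the BPP alg at the current node
                accepts += 1
        return accepts > 21 * k        # accept iff a majority of samples accept

The r truly random bits go into choosing start; each of the 42k*beta walk
steps then costs only O(1) additional bits (treating beta and the degree as
constants), for r + O(k) random bits in total.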