03/16/11  15-859(M) Randomized Algorithms

* Randomized oblivious routing on hypercubes and butterflies
  [Chapter 4.2, 4.4; also material taken from Satish Rao's notes at Berkeley]
* Martingales: defs, Azuma's inequality
* McDiarmid's inequality
=====================

In honor of Les Valiant winning the Turing award, I wanted to start with a
now-classic algorithm of his for oblivious routing on a parallel computer,
from "A scheme for fast parallel communication" and, with Brebner,
"Universal schemes for parallel communication". (The latter appeared in
STOC 1981, the former in SIAM J. Computing 1982. Even then, journals were
slow...)

We will be considering routing on a hypercube network. The setting is that
we have N = 2^n processors, each of which is a vertex of an n-dimensional
hypercube. The links in the network are the edges of the hypercube. E.g.,

    011------111
    /|        /|
   / |       / |
 010------110  |
  |  |      |  |
  | 001-----|-101
  | /       | /
  |/        |/
 000------100

Suppose that every vertex i has a packet it wants to send to some
destination d(i). Our model is that at most one packet can cross any given
edge in any given time step, so if multiple packets want to use the same
edge, the rest wait at that node. We'll get into trouble for sure if many
packets want to go to the same place, so let's assume the d(i)'s are all
distinct (we are routing a permutation).

The question is: can we do this without much delay (getting everyone to
their destination in O(n) steps) using an OBLIVIOUS routing scheme? This
means that the route taken by packet (i,d(i)) depends only on its start
and end locations, and not on what any of the other packets are doing.

To first get a feel for this, suppose we route packets by just correcting
bits from left to right. What happens if all packets of the form
i = (a,\vec{0}), where a \in {0,1}^{n/2}, want to go to destination
d(i) = (\vec{0},b) for b \in {0,1}^{n/2}? Any problems? We get 2^{n/2}
paths going through the single node (\vec{0},\vec{0}). Since this node can
send out at most n packets per time step, some packet will take time at
least 2^{n/2}/n. [A short code sketch at the end of this discussion makes
the congestion concrete.]

Before getting to Valiant's result, first a detour into the Butterfly and
Benes (pronounced "Benish") networks. A butterfly network corresponds to
this process of routing left to right, though it's more "lock-step" than
the hypercube. Here is a butterfly on 2 bits:

START      END
00---00---00
  \ /   X
01---01---01
  X X
10---10---10
  / \   X
11---11---11

This has the same problem we just saw. But what if we add a reversed copy
of the butterfly onto the end? This is called a Benes network:

START                END
00---00---00---00---00
  \ /   X    X   \ /
01---01---01---01---01
  X X            X X
10---10---10---10---10
  / \   X    X   / \
11---11---11---11---11

This network is actually an abstraction of the telephone switching
networks developed in the 1950s. A cool fact about Benes networks is that
now, for any permutation, there is always a way to route it with no
congestion at all. The key is to think about the first and last levels.
E.g., say 00 wants to go to 11. It can either go through the "upper inside
Benes network" or the "lower inside Benes network". Same if, say, 10 wants
to go to 00. But if the first goes up, then the second has to go down, and
vice-versa. And whatever index i wants to go to 01 must also make the
opposite choice from the 00-11 packet. If we draw a graph with a node for
each pair (i,d(i)) and these constraints as edges, we will see it has
degree 2. Also, no odd cycles [can anyone see why?]. This means we can
give a consistent set of up/down decisions so that we don't have any
collisions at the first and last levels. Then internally we just have
smaller Benes networks, so we are fine by induction.
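Before we move on to Valiant's fix, here is the code sketch promised above
[[my own Python illustration, not from the original notes; the names are
mine, and I take b = a for concreteness]]. It implements left-to-right
bit-fixing routing and confirms that every packet of the bad permutation
is routed through node (\vec{0},\vec{0}):

    def bit_fixing_path(src, dst, n):
        """Nodes visited when correcting mismatched bits left to right."""
        path, cur = [src], src
        for b in range(n - 1, -1, -1):        # bit n-1 = leftmost bit
            if (cur >> b) & 1 != (dst >> b) & 1:
                cur ^= 1 << b                 # fix bit b to match dst
                path.append(cur)
        return path

    n = 8                                     # N = 2^n processors
    half = n // 2
    bad = 0
    for a in range(1, 2 ** half):
        src = a << half                       # source (a, 0...0)
        dst = a                               # destination (0...0, a)
        if 0 in bit_fixing_path(src, dst, n):
            bad += 1
    print(f"{bad} of {2 ** half - 1} paths pass through node (0,...,0)")

Since bit fixing first zeroes out the left half (reaching all-zeros) and
only then writes the right half, all 2^{n/2} - 1 nontrivial paths funnel
through the single node (0,...,0), exactly as argued above.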
Now, back to Valiant's result. Note that the above construction was not
oblivious. Valiant's idea: how about, for each (i,d(i)), we pick a
*random* intermediate destination Delta(i), route from i to Delta(i) using
left-to-right routing, and then route from Delta(i) to d(i)? Equivalently,
in the Benes network, pick a random Delta(i) in the middle column. Note
that since the Delta(i) are picked iid, they may well not form a
permutation; that's fine.

To analyze this [let's just worry about the routing from i to Delta(i),
and furthermore go back to thinking about routing on the hypercube], let P
denote the path taken by packet i, and say it uses edges e_1, e_2, ...,
e_k. Let S denote the set of packets whose routes also use at least one of
the edges in P.

Claim 1: the delay of packet i is at most |S|.

Proof. First, what do the intersections look like: can some other packet
intersect P on e_1 and e_2, then go off, and then come back on e_5? No.
Why? (If packet j's path and P both contain a node v and, later, a node w,
then between v and w each follows the unique left-to-right bit-fixing path
from v to w; so the edges they share form one contiguous segment.)

OK, now let's imagine that every time packet i gets delayed (it's like a
train waiting at a station), it gives a coin to the packet that is
traveling on its desired edge. Now, let's say the coins furthermore have
the following behavior: if their packet gets delayed waiting on some edge
e in P, the coin jumps to the packet traveling on e. So, the coins always
move forward. This continues until their packet either leaves P or is
waiting on some edge outside P (that counts as leaving P -- coins don't
jump to a packet that isn't in S). The claim is that no packet ever has
more than one coin on it at any point in time (it might get a coin, give
it up, and get one again). That's because coins always move forward and
each new coin is created at least one step behind the previous one. So,
each packet in S leaves path P with at most one coin, and we can charge
the delays to |S|. [[Note: this is a different proof from the one in the
book]]

Claim 2: Whp, |S| = O(n). In particular, Pr[|S| > 3n] < 2^{-3n} = 1/N^3.

Proof. Define H_j = 1 if P and P_j (the path of packet j) share one or
more edges, else H_j = 0. So, |S| = sum_j H_j. Fixing P, these H_j are
independent {0,1} random variables, so first we will bound the expectation
of sum_j H_j and then we will use Chernoff.

For an edge e on path P, define T(e) to be the number of paths (besides P)
that use edge e. These quantities are *not* independent, but nonetheless
sum_j H_j <= sum_{e in P} T(e). Also, what is E[T(e)]? For example, "e"
might be the edge

    (1 1 0 1 1 0 1)
    (1 1 0 1 0 0 1)

What source-dest pairs use this edge? Say the bit being corrected is bit B
(here B = 5). A source can use e only if it agrees with e's top endpoint
in the not-yet-corrected bits B,...,n, so there are 2^{B-1} such sources.
And the probability that such a source's random destination agrees with
e's bottom endpoint in the first B bits is 1/2^B. Overall, we get
E[T(e)] = 2^{B-1} * 1/2^B = 1/2. Since P has at most n edges, this gives
E[sum_j H_j] <= n/2.

Now, by Chernoff (with mu = n/2 and 1+delta = 6, so that (1+delta)mu = 3n),
Pr[sum_j H_j > 3n] <= (e^5/6^6)^{n/2} <= (1/2)^{3n}.

So, we can just wait until n+3n = 4n steps have elapsed (by then, whp,
everyone has made it to their intermediate destination) and then run the
routes from the intermediate points to the final destinations, using the
same analysis. In the end, whp this takes time at most 8n.
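[[Another aside of mine, not from the original notes: a small Monte Carlo
sketch of Claim 2. It draws the iid intermediate destinations Delta(j),
computes packet 0's phase-1 path P, and counts |S|, the number of other
packets whose paths share an edge with P.]]

    import random

    def bit_fixing_edges(src, dst, n):
        """Set of directed edges (u, v) crossed by left-to-right fixing."""
        edges, cur = set(), src
        for b in range(n - 1, -1, -1):        # bit n-1 = leftmost bit
            if (cur >> b) & 1 != (dst >> b) & 1:
                nxt = cur ^ (1 << b)
                edges.add((cur, nxt))
                cur = nxt
        return edges

    random.seed(0)
    n = 10
    N = 2 ** n
    trials, worst = 200, 0
    for _ in range(trials):
        delta = [random.randrange(N) for _ in range(N)]   # iid Delta(j)
        P = bit_fixing_edges(0, delta[0], n)              # packet 0's path
        S = sum(1 for j in range(1, N)
                if bit_fixing_edges(j, delta[j], n) & P)
        worst = max(worst, S)
    print(f"n = {n}: worst |S| over {trials} trials = {worst}; 3n = {3*n}")

Per the analysis, E[|S|] <= n/2, and |S| > 3n should essentially never
show up.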
========================================================

A quick introduction to Martingales
-----------------------------------
Consider a gambler who goes into a fair casino and bets on various things
using some crazy strategy. Let X_i denote the amount of money he has at
time i; by "fair casino" we mean that no matter what he bets on at time i,
his expected winnings are 0. So, by definition of "fair casino",
E[X_i | X_0,...,X_{i-1}] = X_{i-1}. A sequence of random variables having
this property is called a Martingale.

Def: We say that RVs X_0, X_1, X_2, ... are a Martingale sequence if
E[X_i | X_0,...,X_{i-1}] = X_{i-1}.

Some other terms you might encounter:
 - if E[X_i | X_0,...,X_{i-1}] <= X_{i-1} then it's a Supermartingale
 - if E[X_i | X_0,...,X_{i-1}] >= X_{i-1} then it's a Submartingale
   (go figure!)

The fair casino is one prototypical example of a Martingale. We'll get to
another important one (called an "exposure Martingale" or "Doob
Martingale") in a minute.

BTW, one basic fact: for RVs X and Y, E[E[X|Y]] = E[X]. What's going on
here: [Draw picture of a sample space partitioned by Y.] E[X|Y] is really
an RV that, with probability Pr(Y=y), takes on the value E[X|Y=y]. Adding
this up gives us E[X]. Easiest to think of in reverse: e.g., X = running
time of Quicksort, Y = 1st pivot choice. Then

  E[running time] = sum_i Pr(Y=a_i) * E[running time | 1st pivot is a_i].

So, for a Martingale, E[X_i] = E[X_{i-1}] = ... = E[X_0], as we know.

Azuma's inequality (aka Hoeffding-Azuma):
==================
Let X_0, X_1, ... be a Martingale sequence such that for each i,
|X_i - X_{i-1}| <= c_i. Then

  Pr[|X_n - X_0| >= lambda] <= 2e^{-lambda^2 / (2 sum_i c_i^2)}.

E.g., if we have a gambler who can win or lose at most 1 per round in a
fair game (maybe betting $1 on a fair coin, or maybe doing something
crazy), we get c_i = 1. So if we set lambda = delta*n, we get
2e^{-delta^2 n / 2}, just like our basic Hoeffding bound for sums of iid
{-1,+1} random variables.
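[[A quick numerical sanity check, mine rather than the notes': the
simplest such gambler is a +-1 random walk, and we can compare its
empirical tail against the Azuma bound with c_i = 1.]]

    import math
    import random

    random.seed(1)
    n, trials = 100, 20000
    lam = 2.0 * math.sqrt(n)                    # threshold lambda

    hits = 0
    for _ in range(trials):
        X = sum(random.choice((-1, 1)) for _ in range(n))   # X_n - X_0
        if abs(X) >= lam:
            hits += 1

    azuma = 2 * math.exp(-lam ** 2 / (2 * n))   # sum_i c_i^2 = n
    print(f"empirical Pr[|X_n - X_0| >= {lam:g}] = {hits / trials:.4f}")
    print(f"Azuma bound:                          {azuma:.4f}")

As expected, the observed tail (about 0.05 here) sits below the bound of
2e^{-2} ~ 0.27; Azuma trades tightness for enormous generality.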
Here is a quite useful thing one can prove using Azuma (up to a constant
factor in the exponent):

McDiarmid's inequality
======================
Say z_1, ..., z_n are independent RVs, and phi(z_1,...,z_n) is some
real-valued function. Assume that phi satisfies the Lipschitz condition
that if z_i alone is changed, phi changes by at most c_i. Then McDiarmid's
inequality states that:

  Pr[|phi(z) - E[phi(z)]| > lambda] <= 2e^{-2 lambda^2 / sum_i c_i^2}.

For example, if phi is a sum and the z_i are {0,1}, then c_i = 1; with
lambda = delta*n we get 2e^{-2 delta^2 n}, just like our basic Hoeffding
bound. Or, phi could be the error rate of some complicated learning
algorithm that isn't too influenced by any one example. Or the running
time of some program that doesn't change much if you change just one of
its random bits.

To prove McDiarmid from Azuma (up to a constant factor in the exponent),
we define the Doob Martingale, a.k.a. the exposure Martingale. Think of
drawing the RVs z_1, z_2, ... in sequence, and consider the probability
tree of their values. In fact, think of phi as the running time of some
randomized algorithm where z_1, z_2, ... are its random coins. Define

  X_0 = E[phi(z_1,...,z_n)]   (an RV with just one value)
  X_1 = E[phi(z_1,...,z_n) | z_1]
  ...
  X_i = E[phi(z_1,...,z_n) | z_1,...,z_i]
        (a function of the current node z_1,...,z_i in the tree, where
        the expectation is over z_{i+1},...,z_n)
  ...
  X_n = phi(z_1,...,z_n).

Claim: this is a Martingale. The easiest way to see it is (as mentioned
above) to think of X_i as the "expectation to go" after we've already
chosen z_1,...,z_i. If you fix z_1,...,z_{i-1} (but not z_i), the expected
value of X_i over the different possible choices of z_i is just the
"expectation to go" given the fixed z_1,...,z_{i-1}, namely X_{i-1}. This
means E[X_i | z_1,...,z_{i-1}] = X_{i-1}. Since that's true for any
z_1,...,z_{i-1} that yield the same value of X_{i-1}, we get
E[X_i | X_0,...,X_{i-1}] = X_{i-1}. Moreover, the Lipschitz condition
gives |X_i - X_{i-1}| <= c_i, so we can just apply Azuma. (Actually, if
the z_i are fair coins, then X_i can differ from X_{i-1} by at most
c_i/2, so we don't lose anything in the exponent.)
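[[One last aside of mine, not from the original notes: a concrete exposure
Martingale. Here phi counts adjacent pairs of 1s among n fair coins (so
c_i = 2), X_i = E[phi | z_1,...,z_i] is computed exactly by brute force
over the unrevealed coins, and we verify both the Martingale property and
the c_i/2 bounded-difference claim from the parenthetical above.]]

    import itertools
    import random

    def phi(z):
        """Adjacent pairs of 1s; changing one coin moves phi by <= 2."""
        return sum(1 for a, b in zip(z, z[1:]) if a == b == 1)

    def X(prefix, n):
        """E[phi | z_1..z_i = prefix], enumerating the unrevealed coins."""
        k = n - len(prefix)
        total = sum(phi(prefix + rest)
                    for rest in itertools.product((0, 1), repeat=k))
        return total / 2 ** k

    random.seed(2)
    n = 12
    z = tuple(random.randint(0, 1) for _ in range(n))
    seq = [X(z[:i], n) for i in range(n + 1)]   # X_0, ..., X_n = phi(z)

    for i in range(1, n + 1):
        # Martingale property: averaging over the two choices of coin i
        # (given the revealed prefix) recovers X_{i-1} exactly.
        avg = (X(z[:i-1] + (0,), n) + X(z[:i-1] + (1,), n)) / 2
        assert abs(avg - seq[i - 1]) < 1e-9
        # Fair coins: |X_i - X_{i-1}| <= c_i / 2 = 1.
        assert abs(seq[i] - seq[i - 1]) <= 1 + 1e-9

    print("X_0 =", seq[0], " X_n = phi(z) =", seq[n])
    print("martingale and bounded-difference checks passed")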