Notes for 9/15/98  15-859(D) Randomized Algorithms

* Admin. Handout: course info
* Why randomized algorithms?
* Plan for the course
* Two interesting algorithms

=============================================================================
ADMIN
-----
Grades based on homeworks + class participation + take-home final.

Plan is to have some student class presentations. E.g., maybe in groups of
2 -- one person presents and the other writes up a handout to give to the
class. There is no TA, so students will also be asked to help with the
homework grading.

Web page reachable from my home page. Page will have rough notes for each
class, plus handouts.

Book is Motwani-Raghavan. Should be in the bookstore by now.

Pass around name sheet.

WHY A COURSE ON RANDOMIZED ALGORITHMS?
--------------------------------------
* Randomness is a useful algorithmic resource. We'll explore important
  techniques for taking advantage of it.

* The analysis can be delicate. We'll explore analysis techniques (e.g.,
  setting up the right random variable). The course is also a vehicle for
  exploring important probabilistic concepts (k-wise independence, tail
  inequalities), combinatorial objects/concepts (expander graphs,
  conductance), and so on.

* Randomized algorithms are often simpler than their deterministic
  counterparts. Good opportunity for talking about some neat algorithms.

The plan for the course is to follow this list, in a sense. The first 2/3
will give algorithms focusing on important techniques, and the last 1/3
will be more of a sampling of different areas where randomized algorithms
are useful. (The book tries to make this distinction too; it's not really
a clear distinction, though.) What I envision for class presentations is
that in the last part there are many more subjects than we can really go
through. So, I'll pick some that I like, and by doing presentations you
will each have a chance to learn one in depth.

HOW DOES RANDOMNESS HELP IN ALGORITHMS?
--------------------------------------
1.
Use a simple randomized algorithm to achieve the same worst-case
performance as a complicated deterministic algorithm. E.g., sometimes we
have a simple deterministic strategy with good performance "most of the
time" (on average over a random choice of the input) that we can transform
into an algorithm with good worst-case performance by adding randomness.
(Worst case over the input, on average over the internal randomness of the
algorithm.) E.g., median-finding, quicksort, many standard algorithmic
problems.

2. Sometimes we can do things we don't know how to do deterministically.
E.g., primality testing (the deterministic algorithm requires unproven
assumptions), linear-time minimum spanning tree.

3. Sometimes we can do things that are provably impossible
deterministically (protocol or oracle problems).

4. Sometimes we can do things that just seem impossible...

Today: some neat randomized algorithms, from categories 1 and (3,4).

SPLITTING A 3-COLORABLE GRAPH INTO TWO PIECES SO NEITHER HAS A TRIANGLE
-----------------------------------------------------------------------
Definition of problem: you have a 3-colorable graph and want to split it
into 3 independent sets. That is NP-hard. Easier problem: split it into
two pieces so that neither contains a triangle. (Could imagine this as a
first step in some heuristic or approximation algorithm. In fact, if you
could split into a constant number of pieces so that no piece had a
diamond or pentagon, then this would improve the current best
approximation ratio for 3-coloring.)

A solution: split arbitrarily (Red/Blue). Find some bad triangle (one
whose three nodes all lie on the same side). Then pick one of its 3 nodes
AT RANDOM, and move it to the other side. Continue until there are no
more bad triangles.

* Analysis trick: fix some arbitrary correct coloring. Look at the
  quantity T = # of nodes you've colored in the same way as this
  coloring. T is some integer between 0 and n. Let's look at how T
  changes with time.

* How to see this as a random walk: in each step, T goes left with prob
  1/3, right with prob 1/3, and stays put with prob 1/3.
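The recolouring procedure above can be sketched in a few lines. This is a
minimal sketch, not the paper's implementation: the graph representation
(edges as a set of frozenset pairs) and the function names are my own.

```python
import random
from itertools import combinations

def monochromatic_triangle(side, edges):
    # Return a "bad" triangle whose three nodes all lie in `side`, or None.
    for a, b, c in combinations(sorted(side), 3):
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edges:
            return (a, b, c)
    return None

def split_without_bad_triangles(nodes, edges):
    # Split arbitrarily (True = Red, False = Blue), then repeatedly find a
    # bad triangle and move one of its 3 nodes, chosen at random, to the
    # other side. Terminates with probability 1 if the graph is 3-colorable.
    color = {v: bool(random.getrandbits(1)) for v in nodes}
    while True:
        red = {v for v in nodes if color[v]}
        blue = {v for v in nodes if not color[v]}
        tri = (monochromatic_triangle(red, edges)
               or monochromatic_triangle(blue, edges))
        if tri is None:
            return color
        v = random.choice(tri)       # pick one of its 3 nodes AT RANDOM
        color[v] = not color[v]      # move it to the other side
```

Trying it on a small 3-colorable graph with triangles (e.g., the
octahedron K_{2,2,2}) gives a split in which neither side contains a
triangle, matching the termination argument in the walk analysis.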
In this walk, the expected time to hit boundary 0 or n is O(n^2) [*].
This means we expect to finish in at most that much time.

Proof of [*]:

Rough calculation: First, let's consider the 50/50 random walk ---
proving the bound for this case is enough, since the stay-put steps only
slow the walk down by a constant factor. Now, in a random walk with m
steps, the prob of making exactly A steps to the right is
2^{-m} * (m choose A). This is maximized at A = m/2, where it has value
about 1/sqrt(m). (Stirling's approximation gives 1/sqrt(pi*m/2).) So,
the prob that after m steps you are still between 0 and n is at most n
times this, which is < 1/2 for m = 4n^2. So, the chance we haven't
finished by 4n^2 steps is < 1/2, the chance we haven't finished by 8n^2
steps is < 1/4, etc. So, the expectation is O(n^2).

More exact calculation: Let E_x be the expected time to reach 0 or n
given that you start at x. So, E_0 = E_n = 0. What is E_x in terms of
its neighbors?

    E_x = 1 + (1/3)(E_x + E_{x+1} + E_{x-1}).

Note: this is using linearity of expectation. (E.g., the event that in
the next k steps you make at least x more steps to the left than to the
right is correlated with the event that you make at least x+1 more steps
to the left than to the right. Nonetheless, E(X+Y) = E(X) + E(Y) even if
X and Y are correlated.)

Rewrite as: E_{x+1} - E_x = E_x - E_{x-1} - 3. So the ``second
derivative'' of E with respect to x is -3. This suggests solving
E_x = -(3/2)x^2 + cx + c' with the boundary conditions. We get:
E_x = (3/2)x(n-x).

Lessons: setting up the "right" quantity/random variable (in this case,
T) is crucial. Also, if you don't know how to walk in the right
direction, sometimes it's enough to walk randomly.

(start this and continue next time)

FINDING THE AVERAGE GRADE, WITHOUT GIVING AWAY YOUR OWN SCORE
-------------------------------------------------------------
* The problem: a group of N friends wants to determine their average
  score on a test, but nobody wants to reveal their own score.

* Even if two people collaborate, they shouldn't be able to find your
  score.
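As a quick sanity check on the exact calculation, the closed form
E_x = (3/2)x(n-x) can be verified against the recurrence directly, and
compared against a simulation of the walk. This is a throwaway script of
my own; the parameters (n = 10, 20000 trials, the loose tolerance) are
arbitrary choices.

```python
import random

def E(x, n):
    # Claimed closed form: expected time to hit {0, n} starting from x,
    # for the walk that moves left w.p. 1/3, right w.p. 1/3, stays w.p. 1/3.
    return 1.5 * x * (n - x)

# 1) The closed form satisfies E_x = 1 + (1/3)(E_x + E_{x+1} + E_{x-1})
#    with boundary conditions E_0 = E_n = 0.
n = 10
assert E(0, n) == 0 and E(n, n) == 0
for x in range(1, n):
    assert abs(E(x, n) - (1 + (E(x, n) + E(x + 1, n) + E(x - 1, n)) / 3)) < 1e-9

# 2) Monte Carlo agreement (loose tolerance, since the estimate is noisy).
def hitting_time(x, n):
    t = 0
    while 0 < x < n:
        x += random.choice((-1, 0, 1))   # left / stay / right, each 1/3
        t += 1
    return t

random.seed(0)
avg = sum(hitting_time(n // 2, n) for _ in range(20000)) / 20000
assert abs(avg - E(n // 2, n)) < 2.0     # E(5, 10) = 1.5 * 5 * 5 = 37.5
```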
* More generally, if K of the N "friends" collaborate, then they should
  still learn no more than the sum of the remaining N-K honest players
  (which they can figure out anyway from the answer). [In a sec we'll get
  back to the question of what it means to say that the protocol leaks no
  additional information.]

* Possible protocols?

Here's a protocol: We'll compute the sum mod M, where M is some
sufficiently large number known to be greater than the sum. E.g.,
M = 101*N if each score is at most 100.

Say player i has secret S. Player i chooses N-1 numbers
X_1, X_2, ..., X_{N-1} independently at random in the range 0, ..., M-1
and writes them each on a piece of paper. Then he chooses X_N in this
range so that X_1 + ... + X_{N-1} + X_N = S (mod M), and writes X_N on a
piece of paper. Player i then distributes N-1 of these to the others (one
per person) and keeps one.

Then, player i takes the number he kept, plus all of the others he
received, adds them up (mod M), and announces the result. Finally, all
players add up the announced results (mod M), which is the desired sum.

Analysis: We can see that the sum is right by writing out the matrix of
values (row i holds player i's shares, which sum to his secret mod M; the
announced results are the column sums, so their total is the sum of all
entries, i.e., the sum of everyone's secrets mod M). But how do we show
that this is secure?

First some definitions: Two events A and B are INDEPENDENT if
Pr[A and B] = Pr[A] * Pr[B]. An easier way to think of it:
Pr[A | B] = Pr[A]. More generally, a collection of events is independent
if for any subset, the probability of their intersection is the product
of their probabilities. A collection of events is K-WISE INDEPENDENT if
any subset of K of them is independent.

Similarly, random variables X, Y are independent if for each x, y the
events X=x and Y=y are independent. K-wise independence is defined in the
same way as with events.

Some useful facts: Say X is a random variable uniformly distributed in
U = {0, 1, ..., m-1}. Then, clearly, for any fixed a, the variable
Y = X + a (mod m) is also uniform in U (though X and Y are dependent).
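The protocol itself is short enough to sketch directly. In this sketch
(function names and data layout are my own, and the "pieces of paper"
become lists of shares), all_shares[i][j] is the share player i hands to
player j, so the rows of the matrix are each player's shares and the
announced results are the column sums.

```python
import random

def make_shares(secret, n, M):
    # Player's N shares: N-1 chosen independently and uniformly in
    # 0..M-1, and a last one chosen so the shares sum to secret mod M.
    shares = [random.randrange(M) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % M)
    return shares

def average_without_revealing(scores, M=None):
    n = len(scores)
    if M is None:
        M = 101 * n                   # known bound: M > sum of the scores
    # all_shares[i][j] = share that player i distributes to player j.
    all_shares = [make_shares(s, n, M) for s in scores]
    # Player j announces the sum (mod M) of the shares he holds.
    announced = [sum(all_shares[i][j] for i in range(n)) % M
                 for j in range(n)]
    # Everyone adds up the announced results (mod M): the desired sum.
    total = sum(announced) % M
    return total / n
```

Each player's row sums to his secret mod M, so the grand total of the
matrix, and hence the sum of the column sums, is the sum of the secrets
mod M; since M exceeds the true sum, reducing mod M recovers it exactly.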
If X_1, ..., X_k are independent r.v.'s uniform in U, and Y is defined to
be a - (X_1 + ... + X_k) (mod m) for some fixed a, then these k+1 r.v.'s
are k-wise independent. Just check that
Pr[Y = y | X_2 = x_2, X_3 = x_3, ..., X_k = x_k] = 1/m.

Proof for the protocol: The claim is that it is secure in the following
sense: for any set of t honest players, the only information that the
rest of the players have, even if they cheat, is the sum of the t
players' values. What does this really mean? What is "information"?
What we mean is that if you modify the values of these t players while
keeping their sum the same, then the probability distribution on the
sequence of outputs produced by those players doesn't change.

Proof: Imagine the t honest players all sent their X_N value to the same
honest player. (This doesn't change the protocol.) For any fixed
sequence of coin tosses, all sets of t values with the same sum lead to
exactly the same information revealed to the other players. So, all sets
of values with the same sum induce the same distribution.

[Notes: The coloring algorithm is from: Colin McDiarmid, ``A random
recolouring method for graphs and hypergraphs.'' Combinatorics,
Probability and Computing, 2:363-365, 1993.]