Notes for 10/12/11 15-859(M) Randomized Algorithms * Probability basics * linearity of expectation Handout: homework 1 -- Gambler's odds problem -- analyzing randomized quicksort * game-theoretic view of randomized algorithms * randomized lower bounds Readings: Today and next time doing Ch1 and parts of Ch2. Then next thurs we'll be looking at probabilistic inequalities in Ch 3.2, 4.1, and then using them in analysis of some algorithms. ============================================================================= One thing: if in class you find me making fuzzy plausible-sounding but not completely clear probabilistic statements, feel free to call me on them. Especially important in this area to be precise and make sure every step is legitimate. E.g., in coloring alg, made statement that expected running time of alg is <= expected time for random walk to reach one of the endpoints, and it's worth at home sitting down and really concincing yourself mathematically about that, and if you're not convinced come talk with me. PROBABILITY BASICS ================== * A probabilistic setting is defined by a SAMPLE SPACE and a PROBABILITY MEASURE. E.g., if we roll 2 dice, the sample space consists of the 36 possible outcomes , and the prob measure assigns prob 1/36 to each. An EVENT is a subset of the sample space (like the event that the dice sum to 3), and the points are called ELEMENTARY EVENTS. In a discrete probability distribution (as opposed to continuous ones), these elementary events have probabilities and the probability of an event is the sum of the probabilities of the elementary events inside it. For example, Pr(dice sum to 3) = 2/36. The event "the first die is a 1" is formally the set of elementary events in which the first die is a 1, and has prob 1/6. The event "the dice sum to 7" also has prob 1/6. Pr(A|B) = Pr(A and B)/Pr(B). [renormalizing as if B is entire sample space]. - E.g., Pr(first die is 1 | dice sum to 3) = 1/2 Pr(A|B) = Pr(B|A)Pr(A)/Pr(B). Bayes rule. Two events A,B are INDEPENDENT if Pr(A and B) = Pr(A)*Pr(B). Equivalently Pr(A|B) = Pr(A). Or Pr(B|A) = Pr(B). (ignoring zeroes) * A RANDOM VARIABLE is a function from elementary events to integers or reals. E.g., another way to talk about these dice is let X_1 = result of first die, X_2 = result of second die, and X = X_1 + X_2. Then might ask: what is Pr(X = 3)? Two RVs X,Y are independent if events X=x, Y=y are indep for all x,y. EXPECTATION ----------- For discrete RVs, in terms of elementary events e, can write as: E[X] = SUM_e Pr(e) * X(e) or, often think of as: (Grouping by value of X(e)) E[X] = SUM_a (Pr(X=a) * a ) E.g., E[sum of two dice] = 1/36 * 2 + 2/36 * 3 + 3/36 * 4 + ... Def: E[X|A]. E.g., E[1st die | sum = 3] E[X|A] = avg_{e in A} X(e) = [1/pr(A)]*SUM_{e in A} Pr(e)X(e). For any partition of space into disjoint events A_1, A_2, ..., E[X] = SUM_i (Pr(A_i) * E[X | A_i]) In other words, to get the average of X overall can compute the average in each A_i, and then weight by their probabilities. In this class, a typical random variable we will be interested in is the running time of our randomized algorithm. When the algorithm makes a k-way decision, we can then say that the expected time is the sum over decisions of the prob of that decision times the expected time given that decision. E.g., like in the algs from last time. VIF (Very Important Fact): LINEARITY OF EXPECTATION --------------------------------------------------- For random variables X and Y, E[X+Y] = E[X] + E[Y] even if not independent. Proof: Use definition. Z=X+Y. E[Z] = sum_e Pr(e) Z(e) = sum_e Pr(e)*[X(e)+Y(e)]... Really useful since allows to decompose complicated RVs into simple ones. E.g., roll 10 dice - what is expected value of sum? CARD SHUFFLING -------------- Take a fresh deck of 52 cards and shuffle iuntil random. How many cards do we expect to be at the same position as when they started? Looking for expectation of an RV X. X = X_1 + .. X_52. X_i = 1 if ith card ended where it started and 0 if not. What is E[X_i]? What is E[X]? GAMBLER'S ODDS PROBLEM ---------------------- Say you go into casino with $k. Say has fair games: at any time t can bet some amount and for even money you have 50/50 chance of winning. Say you have some strategy - determines how much to bet and when and when to leave, but must leave by end of the day. What's best strategy for highest expected amount by the end of the day? * Let's define the random variable X_i to be the gain at time i. If X is the random variable denoting the amount we win by the end of the day, then X = X_1 + X_2 + ... + X_N, where N=#seconds in a day. Note that these X_i's could be horribly dependent (e.g., depending on whether you won or lost in your last bet, you might bet more or less). Nonetheless, if we just look at one of the X_i, it's easy to see that E[X_i] = 0. For instance: let p_ij be the overall probability that you will bet j dollars at time at time i (which might be painful to calculate, but that's ok, we don't need to). So, E[X_i] = SUM_j[p_ij * (0.5*j + 0.5*(-j))] = 0. Now, use linearity of expectation: E[X] = E[X_1 + ... + X_N] = E[X_1] + ... + E[X_N] = 0. (Lesson: simplify by setting up simple random variables and using linearity of expectation). RANDOMIZED QUICKSORT -------------------- Simple fast algorithm for sorting array of size n. 1) pick an element as pivot^* (or halt if array has size 0 or 1) 2) split array into LESS, EQUAL, and GREATER by comparing each element to the pivot. 3) recursively sort LESS and GREATER *version 1: use leftmost element as pivot. But if input is already sorted or reverse-sorted, can take Omega(n^2) time. *version 2: "median of 3" (left, middle, right). What can we say about worst-case time for that? *RANDOMIZED-QUICKSORT: pick a *random* element as pivot. Prove that for *any* input I of size n, expected time E[T(I)] = O(n log n). Key point is bound is worst-case over inputs, and in expectation just over internal randomization of algorithm. To measure running time will count number of comparisons. Key to analysis is to break our random variable X = total number of comparisons down into simpler RVs Define X_ij = 1 if we *did* directly compare the ith smallest and jth smallest elements in the sorting process = 0 if we *didn't*. So, E[X_ij] = Pr(compare ith and jth smallest) n n n n * So, how can we write X? X = SUM SUM X_ij. So, E[X] = SUM SUM E[X_ij] i=1 j=i+1 i=1 j=i+1 - denote ith smallest by e_i, jth smallest by e_j, and conceptually, imagine lining the elements up in sorted order. If pivot we choose is between e_i and e_j then these two end up in different buckets and we *dont* compare them. If pick e_i or e_j as pivot then we *do* compare them. If pick pivot < e_i or > e_j then both end up in same bucket and we repeat. So, can think of as a dart game: we throw a dart at random, and if outside the range we repeat, until we hit e_i or e_j or something in-between at which point we stop. Pr(we end up hitting e_i or e_j) = 2/(j-i+1). This is E[X_ij]. What does this mean? If j=i+1, prob is 2/2. If j=i+2, prob is 2/3. j=i+3, prob is 2/4... - So, have n SUM 2(1/2 + 1/3 + 1/4 + ... + 1/(n-i+1)). i=1 <= SUM 2(H_n - 1), where H_n = 1 + 1/2 + 1/3 + ... + 1/n is called the "nth harmonic number". H_n is in the range [ln(n), ln(n)+1]. Can see that by thinking of as discrete version of the integral of 1/x. - So, our total is < 2n(H_n - 1) <= 2n*ln(n). SORTING LOWER BOUNDS AND DECISION TREE VIEW OF ALGORITHMS ========================================================= A COMPARISON-BASED SORTING ALGORITHM takes as input an array A of n items, and can only gain information about the items by comparing them. Each comparison ``is A[i] > A[j]?'' returns YES or NO and counts as 1 time-step. The algorithm may for free reorder items based on the results of comparisons made. In the end, the algorithm must output a permutation of the input in which all items are in sorted order. THEOREM 1: Any deterministic comparison-based sorting algorithm must perform at least lg(n!) = Omega(n log n) comparisons to sort n elements in the worst case. PROOF #1: Fix some set of n distinct elements a_1,...,a_n and let S be the set of all possible orderings. Now, consider some algorithm ALG. It makes some first comparison Q. For some inputs, the answer will be YES and for some the answer will be NO. Build a tree with nodes corresponding to the comparisons made and edges for the answers. Q will be at the root, and if it gets the answer NO, then the algorithm asks some comparison Q_0, whereas if Q returned YES then the algorithm asks Q_1. The key point is the next query (comparison) is a function of the answers received so far, so there is a well-defined tree. Now, the second key point is that if more that one ordering in S reaches some node, that node can't be a leaf (because otherwise the permutation outputted there would be incorrect for at least one of them). So, we have n! leaves, which means the tree has depth at least lg(n!). PROOF #2: Another way looking at this is as follows. For a deterministic algorithm, the permutation it outputs is solely a function of the series of answers it receives (any two inputs producing the same series of answers will cause the same permutation to be output). So, if an algorithm always made at most $k < \lg(n!)$ comparisons, then there are at most $2^k < n!$ different permutations it can possibly output. In other words, there is some permutation it *can't* output. So, the algorithm will fail on any input for which that permutation is the only correct answer. THEOREM 2: For any deterministic comparison-based sorting algorithm ALG, the average-case number of comparisons (the number of comparisons on average on a *randomly* chosen permutation of n distinct elements) is at least lg(n!). PROOF: The proof is just that if you take a tree of N=n! leaves, not only is the maximum depth at least lg(N), but the *average* depth of leaves is at least lg(N). Can show like this: give each leaf at depth d a weight of 1/2^d. So, the total weight over all leaves is 1 (why?). So, the average weight of a leaf is 1/N. I.e., avg_{leaves L} 2^{-depth(L)} = 1/N. Now, the function 2^{-x} is convex. So, avg_x f(x) >= f(avg(x)). This means 2^{- avg_L depth(L)} <= 1/N, so avg_L (depth(L)) >= lg(N). LOWER BOUNDS FOR RANDOMIZED ALGORITHMS AND GAME-THEORETIC VIEW ============================================================== THEOREM 3: The above bound holds for randomized algorithms too. PROOF: The first step is to argue that with respect to counting comparisons, we can think of a randomized algorithm ALG as a probability distribution over deterministic algorithms. In particular, we can think of a randomized algorithm ALG as a deterministic algorithm with access to a special ``random bit tape'': every time ALG wants to flip a coin, it just pulls the next bit off that tape. In that case, for any *given* run of algorithm ALG, say reading bit-string s from that tape, there is an equivalent deterministic algorithm ALG_s with those bits hardwired in. Algorithm ALG is then a probability distribution over all those deterministic algorithms ALG_s. This means that the expected number of comparisons made by randomized algorithm ALG on some input I is just SUM_s Pr(s)*(Running time of ALG_s on I). So, the expected running time of the randomized algorithm is just an average over deterministic algorithms. Since each deterministic algorithm has average-case running time at least lg(n!), any average over them must too. Formally, the average-case running time of the randomized algorithm is AVG_I SUM_s Pr(s)*(Running time of ALG_s on I) = SUM_s AVG_I Pr(s)*(Running time of ALG_s on I) = SUM_s Pr(s) AVG_I (Running time of ALG_s on I) >= SUM_s Pr(s) lg(n!) = lg(n!) One way to think of this is to think of a matrix with one row for every possible deterministic comparison-based sorting algorithm (there could be a lot of rows!) and one column for every possible permutation of n given input elements (there are a lot of columns too). Entry (i,j) in this matrix contains the running time of algorithm i on input j. The worst-case deterministic lower bound tells us that for each row i there exists a column j_i such that the entry (i, j_i) is large. Can think of this as a game where the adversary gets to move after seeing the algorithm-player move. The average-case deterministic lower bound tells us that actually there is a randomized strategy for the adversary (namely, pick a column uniformly at random) such that even knowing this, the algorithm-player can't beat lg(n!). Moreover, since the adversary has committed, randomization by the algorithm player doesn't help. You can think of this as the "easy direction" of the minimax theorem. If there exists a randomized strategy for the adversary such that no matter what the algorithm player does, it pays more than X in expectation, then there can't be a randomized strategy for the algorithm that guarantees paying less than X in expectation.