15-451 Algorithms: Lecture 1/27/00 Announcements: Lecture Notes are on the course web page. (Provides full details of quicksort analysis sketched at the end of the last class.) Provide photo or url with photo when turn in Homework #1 on Tues. ********** Topics: I. Average case vs. Expected case (cont.) II. Linearity of expectation and quicksort III. Selection: randomized IV. Selection: deterministic ********** I. Average case vs. Expected case (cont.) Expected value of a random variable X: E[X] = sum_x (x * Pr{X=x}) E[f(X)] = sum_x (f(x) * Pr{X=x}) Average-case of deterministic algorithm vs. Expectation of randomized algorithm Recall: average case T(n) = AVG_{inputs I of size n} T(I) expected-time E[T(I)] = sum_o T(I,o) Pr{O=o} worst-case expected-time T(n) = MAX_{inputs I of size n} E[T(I)] E.g., Quicksort: 1) pick an element as pivot (or halt if array has size 0 or 1) 2) split array into LESS, EQUAL, and GREATER by comparing each element to the pivot. 3) recursively sort LESS and GREATER Consider n=3: Set of inputs is {123,132,213,231,312,321}. Count only the number of comparisons, C(n). Deterministic using leftmost as pivot: For I=123, compare (1,2),(1,3),(2,3): C(123) = 3 For I=132, compare (1,3),(1,2),(2,3): C(132) = 3 For I=213, compare (2,1),(2,3): C(213) = 2 For I=231, compare (2,3),(2,1): C(231) = 2 For I=312, compare (3,1),(3,2),(1,2): C(312) = 3 For I=321, compare (3,2),(3,1),(2,1): C(321) = 3 ASSUME all inputs equally likely: average-case C(n=3) = AVG_{all inputs of size 3} C(I) = (3+3+2+2+3+3)/6 = 16/6 = 2.666 Using random pivot: For I=123: With equal probability (1/3), select 1, 2, or 3 as pivot: If select 1 as pivot: compare (1,2),(1,3) With equal probability (1/2), select 2 or 3 as pivot: If select 2 as pivot: compare (2,3) => C = 3 If select 3 as pivot: compare (3,2) => C = 3 If select 2 as pivot: compare (2,1),(2,3) => C = 2 If select 3 as pivot: compare (3,1),(3,2) With equal probability (1/2), select 1 or 2 as pivot: If select 1 as pivot: compare (1,2) => C = 3 If select 2 as pivot: compare (2,1) => C = 3 Thus expected number of comparisons E[C(123)] = 3(1/3)(1/2) + 3(1/3)(1/2) + 2(1/3) + 3(1/3)(1/2) + 3(1/3)(1/2) = 2.666 Note: Order of the input does NOT matter in the above argument. E.g., there is always a 1/3 probability of selecting 1, 2, or 3 as the first pivot. Thus, E[C(I)] = 2.666 for I=123,132,213,231,312,321 as well. Thus, worst-case expected C(n=3) = MAX_{I of size 3} E[C(I)] = 2.666 Both expressions evaluated to 2.666. Is one better than the other? Average case analysis relies on randomness of the inputs: I.e., inputs are drawn according to an assumed probability distribution. But typically no reason to expect inputs are random. Expected case analysis relies on randomness of random variables (e.g. coin tosses): By definition, these are random. In practice, computer's random number generator is used to simulate random variables. Not perfectly random, but quite close to it. ********** II. Linearity of expectation and quicksort Probability axioms: 1. Pr{A} >= 0 for all events A. 2. Pr{S} = 1 3. Pr{A union B} = Pr{A} + Pr{B} if A and B are disjoint. Let X and Y be random variables. For a fixed y, what is sum_x Pr{X=x and Y=y}? = Pr{Y=y}. [Draw diagram] Likewise, for a fixed x, sum_y Pr{X=x and Y=y} = Pr{X=x}. ********** Linearity of Expectation: Thrm: For any 2 random variables X & Y, E[X+Y] = E[X] + E[Y]. Note: Holds even if X and Y are dependent. Proof: Let's assume X and Y take on discrete values so that can use summations instead of integrals. E[X+Y] = SUM_a [ a * Pr(X+Y = a) ] = SUM_b SUM_c [ (b+c) * Pr(X=b and Y=c) ] (just splitting the event that the sum is "a" into all the (disjoint) ways that this could occur) = SUM_b SUM_c [ b * Pr(X=b and Y=c)] + SUM_c SUM_b [ c * Pr(X=b and Y=c)] = SUM_b b * [SUM_c Pr(X=b and Y=c)] + SUM_c c * [SUM_b Pr(X=b and Y=c)] = SUM_b [b * Pr(X=b)] + SUM_c [c * Pr(Y=c)] = E[X] + E[Y] QED Theorem generalizes to the sum of any finite number of random variables. ********** A neat analysis of randomized quicksort: We analyze the expected number of comparisons made by the algorithm, since that determines the running time. We assume all elements are distinct; this is the worst case. We will use the linearity of expectation (tell me when we use it). THEOREM: E[# comparisons] <= 2n*ln(n) Proof: Let e_i = ith smallest element. Define X_ij = 1 if we *did* directly compare e_i and e_j during the algorithm and X_ij = 0 if we *didn't*. The number of comparisons X = SUM_{i=1..n} SUM_j={i+1..n} X_ij. So, E[X] = SUM_{i=1..n} SUM_j={i+1..n} E[X_ij]. [Used linearity here] Imagine lining the elements up in sorted order. Consider i < j. * Pick pivot < e_i or > e_j => both end up in same bucket, no comparisons between them yet, and we select another pivot. * Pick a pivot between e_i and e_j => end up in different buckets, and we *never* compare them. * Pick e_i or e_j as pivot => we *do* compare them (once). So, can think of as a dart game: we throw a dart at random, and if outside the range we repeat, until we hit e_i or e_j or something in-between at which point we stop. What is Pr{hit e_i or e_j} in this game? = 2/(j-i+1). Thus E[X_ij] = sum_x [x * Pr{X_ij=x}] = 0 * Pr{X_ij=0} + 1 * Pr{X_ij=1} = Pr{X_ij=1} = 2/(j-i+1). So E[X] = SUM_{i=1..n} SUM_j={i+1..n} 2/(j-i+1). Let d = j-i+1. Then E[X] = 2 * SUM_{i=1..n} SUM_d={2..n-i+1} 1/d Now, 1 + 1/2 + 1/3 + ... + 1/n is called the "nth harmonic number". Denoted H_n. H_n is in the range [ln(n), ln(n)+1], which you can see by thinking of it as discrete version of the integral of 1/x. So, E[X] = 2 * SUM_{i=1..n} (H_{n-i+1} - 1) < 2 * SUM_{i=1..n} (H_n - 1) = 2n * (H_n - 1) <= 2 n ln(n). QED. ********** III. Selection: randomized SELECTION: Find k-th smallest in unsorted array. E.g., find median score We assume all elements are distinct. Randomized algorithm: just like quicksort, except don't sorting the part we don't need. Called Randomized-Select or QuickSelect. QuickSelect(A, k): 1) pick pivot p at random 2) Split A into LESS and GREATER by comparing each element to p. Count # elements going into each part. Say L = |LESS| 3) If L = k-1, then output p. If L > k-1, output QuickSelect(LESS, k) If L < k-1, output QuickSelect(GREATER, k - (L+1)) Claim: Expected # of comparisons is O(n). Proof: It takes n-1 comparisons to split into two pieces. Random pivot implies equally likely to select the i'th ranked element in array. Thus all splits (0:n-1), (1:n-2),..., (n-1:0) occur with probability 1/n. Will recurse on one of the pieces. Upper bound, so assume always recurses on the larger piece. T(n,k) = expected time to find kth smallest out of n. T(n) = max_k T(n,k). Then T(n) <= (n-1) + (2/n) SUM_{i=n/2..n-1} T(i). = (n-1) + AVG_{i=n/2..n-1} T(i). [Assumes n is even. Doesn't really matter] Now, let's solve by the "guess and check" method. Let's assume inductively that T(i) <= 4i, for i Takes time c'n for some constant c' Step 2: Recursively, find the true median of the n/5 medians. Call this x. --> Takes time at most T(n/5). DANGER: median of medians must be median of everything since each one is the median, right? No. What's an example where it fails? Claim: At most 7/10 of the elements are < x and at most 7/10 are > x. (Believe this for now, will prove it shortly) Step 3: Use x as a pivot to split into LESS and GREATER. --> takes time c'n for some constant c' Step 4: Recurse on the appropriate piece. --> Takes time at most T(7n/10) We get a recurrence for running time of T(n) <= cn + T(n/5) + T(7n/10). which works out to T(n) = O(n). Details in the book, including handling issues of integrality. Finally, let's prove the claim. Question: how bad can median of medians be? Example: {1 2 3 10 11}, {4 5 6 12 13}, {7 8 9 14 15} Medians are 3, 6, 9 Median of medians is 6. Five of the fifteen are less, nine are greater. In general, what's the worst case in terms of having a lot be > x? < < < < < ? ? ? ? < < < < < ? ? ? ? < < < < = > > > > ? ? ? ? > > > > > ? ? ? ? > > > > > If there are g = n/5 groups, then worst case is that for (g-1)/2 groups [all groups whose median is larger than x], all 5 elements in that group are bigger than x, and for the other groups, 2 elements in that group are bigger than x. So, maximum number bigger than x is: 5*(g-1)/2 + 2*(g+1)/2 = (5/2)*(n/5-1) + (n/5 + 1) = n/2 - 5/2 + n/5 + 1 = 5n/10 + 2n/10 - 3/2 < 7n/10. Likewise, the maximum number < x is < 7n/10, and the claim is proved.