15-451 Algorithms: Lecture 1/27/00

Announcements:

Lecture Notes are on the course web page.
(Provides full details of quicksort analysis sketched at the end of the 
last class.)

Provide photo or url with photo when turn in Homework #1 on Tues.

**********

Topics:

  I. Average case vs. Expected case (cont.)
 II. Linearity of expectation and quicksort
III. Selection: randomized
 IV. Selection: deterministic

**********

I. Average case vs. Expected case (cont.)

Expected value of a random variable X:
    E[X] = sum_x (x * Pr{X=x})
    E[f(X)] = sum_x (f(x) * Pr{X=x})

Average-case of deterministic algorithm vs. Expectation of randomized algorithm
Recall:
   average case T(n) = AVG_{inputs I of size n} T(I)
   expected-time E[T(I)] = sum_o T(I,o) Pr{O=o}
   worst-case expected-time T(n) = MAX_{inputs I of size n} E[T(I)]

E.g., Quicksort:
        1) pick an element as pivot
                        (or halt if array has size 0 or 1)
        2) split array into LESS, EQUAL, and GREATER by comparing each
                        element to the pivot.
        3) recursively sort LESS and GREATER

   Consider n=3: Set of inputs is {123,132,213,231,312,321}.
   Count only the number of comparisons, C(n).

   Deterministic using leftmost as pivot: 
       For I=123, compare (1,2),(1,3),(2,3): C(123) = 3
       For I=132, compare (1,3),(1,2),(2,3): C(132) = 3
       For I=213, compare (2,1),(2,3): C(213) = 2
       For I=231, compare (2,3),(2,1): C(231) = 2
       For I=312, compare (3,1),(3,2),(1,2): C(312) = 3
       For I=321, compare (3,2),(3,1),(2,1): C(321) = 3

       ASSUME all inputs equally likely: 
          average-case C(n=3) = AVG_{all inputs of size 3} C(I)
                              = (3+3+2+2+3+3)/6 = 16/6 = 2.666

   Using random pivot:
       For I=123:  
         With equal probability (1/3), select 1, 2, or 3 as pivot:
           If select 1 as pivot: compare (1,2),(1,3)
             With equal probability (1/2), select 2 or 3 as pivot:
               If select 2 as pivot: compare (2,3) => C = 3
               If select 3 as pivot: compare (3,2) => C = 3
           If select 2 as pivot: compare (2,1),(2,3) => C = 2
           If select 3 as pivot: compare (3,1),(3,2)
             With equal probability (1/2), select 1 or 2 as pivot:
               If select 1 as pivot: compare (1,2) => C = 3
               If select 2 as pivot: compare (2,1) => C = 3

         Thus expected number of comparisons 
           E[C(123)] = 3(1/3)(1/2) + 3(1/3)(1/2) + 2(1/3) 
                                   + 3(1/3)(1/2) + 3(1/3)(1/2)
                     = 2.666

       Note: Order of the input does NOT matter in the above argument.
       E.g., there is always a 1/3 probability of selecting 1, 2, or 3 
             as the first pivot.

       Thus, E[C(I)] = 2.666 for I=123,132,213,231,312,321 as well.
       Thus, worst-case expected C(n=3) = MAX_{I of size 3} E[C(I)] = 2.666

   Both expressions evaluated to 2.666.
   Is one better than the other?

   Average case analysis relies on randomness of the inputs:
     I.e., inputs are drawn according to an assumed probability distribution.
     But typically no reason to expect inputs are random.
 
   Expected case analysis relies on randomness of random variables
   (e.g. coin tosses):
     By definition, these are random.
     In practice, computer's random number generator is used to simulate
       random variables.
     Not perfectly random, but quite close to it.

**********

II. Linearity of expectation and quicksort

Probability axioms:
1. Pr{A} >= 0 for all events A.
2. Pr{S} = 1
3. Pr{A union B} = Pr{A} + Pr{B} if A and B are disjoint.

Let X and Y be random variables.
  For a fixed y, what is sum_x Pr{X=x and Y=y}?  = Pr{Y=y}.  [Draw diagram]
  Likewise, for a fixed x, sum_y Pr{X=x and Y=y} = Pr{X=x}.

**********

Linearity of Expectation:

Thrm: For any 2 random variables X & Y,
        E[X+Y] = E[X] + E[Y].

Note: Holds even if X and Y are dependent.

Proof:
Let's assume X and Y take on discrete values so that can use
summations instead of integrals.

  E[X+Y] = SUM_a [ a * Pr(X+Y = a) ]
         = SUM_b SUM_c [ (b+c) * Pr(X=b and Y=c) ]

                (just splitting the event that the sum is "a" into all the
                (disjoint) ways that this could occur)

      = SUM_b SUM_c [ b * Pr(X=b and Y=c)] + SUM_c SUM_b [ c * Pr(X=b and Y=c)]
      = SUM_b b * [SUM_c Pr(X=b and Y=c)] + SUM_c c * [SUM_b Pr(X=b and Y=c)]
      = SUM_b [b * Pr(X=b)] + SUM_c [c * Pr(Y=c)]
      = E[X] + E[Y]    QED

Theorem generalizes to the sum of any finite number of random variables.

**********

A neat analysis of randomized quicksort:

We analyze the expected number of comparisons made by the algorithm,
since that determines the running time.  
We assume all elements are distinct; this is the worst case.  
We will use the linearity of expectation (tell me when we use it).

THEOREM: E[# comparisons] <= 2n*ln(n)

Proof: 
Let e_i = ith smallest element.
Define X_ij = 1 if we *did* directly compare e_i and e_j during the algorithm
   and X_ij = 0 if we *didn't*.

The number of comparisons X = SUM_{i=1..n} SUM_j={i+1..n} X_ij.  
So, E[X] = SUM_{i=1..n} SUM_j={i+1..n} E[X_ij].          [Used linearity here]

Imagine lining the elements up in sorted order. Consider i < j. 
* Pick pivot < e_i or > e_j => both end up in same bucket, 
  no comparisons between them yet, and we select another pivot.
* Pick a pivot between e_i and e_j => end up in different buckets,
  and we *never* compare them.  
* Pick e_i or e_j as pivot => we *do* compare them (once).
  
So, can think of as a dart game: 
we throw a dart at random, and if outside the range we repeat, 
until we hit e_i or e_j or something in-between at which point we stop.

What is Pr{hit e_i or e_j} in this game?
    = 2/(j-i+1).  

Thus E[X_ij] = sum_x [x * Pr{X_ij=x}] = 0 * Pr{X_ij=0} + 1 * Pr{X_ij=1}
             = Pr{X_ij=1} = 2/(j-i+1).

So E[X] = SUM_{i=1..n} SUM_j={i+1..n} 2/(j-i+1).

Let d = j-i+1.
Then E[X] = 2 * SUM_{i=1..n} SUM_d={2..n-i+1} 1/d

Now, 1 + 1/2 + 1/3 + ... + 1/n is called the "nth harmonic number".
Denoted H_n.  H_n is in the range [ln(n), ln(n)+1], which you can
  see by thinking of it as discrete version of the integral of 1/x.

So, E[X] = 2 * SUM_{i=1..n} (H_{n-i+1} - 1) 
         < 2 * SUM_{i=1..n} (H_n - 1)
         = 2n * (H_n - 1) <= 2 n ln(n).  QED.

**********

III. Selection: randomized

SELECTION:   Find k-th smallest in unsorted array.  E.g., find median score

We assume all elements are distinct.

Randomized algorithm: just like quicksort, except don't sorting
the part we don't need.  Called Randomized-Select or QuickSelect.

QuickSelect(A, k):
        1) pick pivot p at random

        2) Split A into LESS and GREATER by comparing each element to p.
                Count # elements going into each part.
                Say L = |LESS|
        3) If L = k-1, then output p.
           If L > k-1, output QuickSelect(LESS, k)
           If L < k-1, output QuickSelect(GREATER, k - (L+1))

Claim: Expected # of comparisons is O(n).

Proof: 
It takes n-1 comparisons to split into two pieces.
Random pivot implies equally likely to select the i'th ranked element in array.
Thus all splits (0:n-1), (1:n-2),..., (n-1:0) occur with probability 1/n.

Will recurse on one of the pieces.
Upper bound, so assume always recurses on the larger piece.

T(n,k) = expected time to find kth smallest out of n.
T(n)   = max_k T(n,k).

Then T(n) <= (n-1) + (2/n) SUM_{i=n/2..n-1} T(i).
           = (n-1) + AVG_{i=n/2..n-1} T(i).

[Assumes n is even. Doesn't really matter]

Now, let's solve by the "guess and check" method.   
Let's assume inductively that T(i) <= 4i, for i<n.

   T(n) <= (n-1) + AVG_{i=n/2..n-1} 4i < (n-1) + 4 * (3n/4) = (n-1)+3n < 4n

QED

**********

IV. Selection: deterministic

[In class, I did not complete this algorithm.  Will continue next class.]

What about a deterministic linear-time algorithm?  
For a long time it was thought this was impossible -- no way better than 
sorting. Turns out you can do it but it's tricker.

Idea is: want to pick a pivot deterministically.  Ideally, we'd like
the pivot to be the median element so that the two sides are the same
size.  But: this is the problem we're already trying to solve!!
Instead, how about we try to guarantee that the pivot will be
roughly in the middle: at most 7/10 above, and at most 7/10 below.

Let T(n) be upper bound on time to find kth smallest out of n.

Step 1: Divide n elements into groups of 5 and find the median of
        each group by brute force.  [ignore integrality issues]

        --> Takes time c'n for some constant c'

Step 2: Recursively, find the true median of the n/5 medians.  Call this x.

        --> Takes time at most T(n/5).

        DANGER: median of medians must be median of everything since
        each one is the median, right?  No.  What's an example where it fails?

        Claim: At most 7/10 of the elements are < x and at most 7/10
        are > x.  (Believe this for now, will prove it shortly)

Step 3: Use x as a pivot to split into LESS and GREATER.

        --> takes time c'n for some constant c'

Step 4: Recurse on the appropriate piece.

        --> Takes time at most T(7n/10)

We get a recurrence for running time of

        T(n) <= cn + T(n/5) + T(7n/10).

which works out to 
        T(n) = O(n).

Details in the book, including handling issues of integrality.

Finally, let's prove the claim.  
Question: how bad can median of medians be?

Example:  {1 2 3 10 11}, {4 5 6 12 13}, {7 8 9 14 15}

Medians are 3, 6, 9

Median of medians is 6.  Five of the fifteen are less, nine are greater.

In general, what's the worst case in terms of having a lot be > x?

   < < < < < ? ? ? ?
   < < < < < ? ? ? ?
   < < < < = > > > >
   ? ? ? ? > > > > >
   ? ? ? ? > > > > >

If there are g = n/5 groups, then worst case is that for (g-1)/2
groups [all groups whose median is larger than x], all 5 elements in
that group are bigger than x, and for the other groups, 2 elements in
that group are bigger than x.  So, maximum number bigger than x is:
        5*(g-1)/2 + 2*(g+1)/2 = (5/2)*(n/5-1) + (n/5 + 1) 
                              = n/2 - 5/2 + n/5 + 1 
                              = 5n/10 + 2n/10 - 3/2 < 7n/10.

Likewise, the maximum number < x is < 7n/10, and the claim is proved.