Today we'll review for the quiz tomorrow. The quiz will be thirty minutes long, and you'll be permitted an open book and a sheet of notes.
Other topics that are fair game for the quiz: mergesort, quicksort, and recurrences.
Radix sort is a good method for sorting strings. For n strings of length L each, it takes O(n*L) time.
Example (all students in class last year with first names beginning with "E"):
eb3z eg27 ek2u ek35 ec3h ep2u ey27 ec2q ef2g ef2q ek00 el28 es58
If we do most-significant-first (MSF) radix sort, then we put the ids into buckets based on the most significant character, then recursively sort each bucket that has more than one element, starting at the next-most-significant character. Finally, we concatenate the buckets. We start with one bucket (every name begins with "e"). After the second scatter (on the second character), we have:
eb3z
ec3h ec2q
ef2g ef2q
eg27
ek2u ek35 ek00
el28
ep2u
es58
ey27
And now we recursively sort the second, third, and fifth buckets.
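The MSF procedure above can be sketched in code. This is a rough Python sketch, not the official handout code; the function and variable names are mine. Note Python compares characters by ASCII value, so digits sort before letters, matching the convention used below.

```python
# Minimal MSF radix sort sketch for equal-length strings.
def msf_radix_sort(strings, pos=0):
    # A bucket of size <= 1 (or with no characters left) is already sorted.
    if len(strings) <= 1 or pos >= len(strings[0]):
        return strings
    buckets = {}
    for s in strings:                  # scatter on the character at `pos`
        buckets.setdefault(s[pos], []).append(s)
    result = []
    for ch in sorted(buckets):         # visit buckets in character order
        result.extend(msf_radix_sort(buckets[ch], pos + 1))  # recurse
    return result

ids = ["eb3z", "eg27", "ek2u", "ek35", "ec3h", "ep2u", "ey27",
       "ec2q", "ef2g", "ef2q", "ek00", "el28", "es58"]
print(msf_radix_sort(ids))
```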
For least-significant-first (LSF) radix sort, we sort stably (remember what stable sorting means?) on the rightmost character:
ek00 ek35 eg27 ey27 el28 es58 ef2g ec3h ec2q ef2q ek2u ep2u eb3z
(Let's say that digits are less than letters.) Then we move to the left. After step 2:
ek00 eg27 ey27 el28 ef2g ec2q ef2q ek2u ep2u ek35 ec3h eb3z es58
And after step 3:
eb3z ec2q ec3h ef2g ef2q eg27 ek00 ek2u ek35 el28 ep2u es58 ey27
(Trace ef2g and ef2q. Notice they get into the correct order in the 1st step, and then STAY in the correct order, since each pass is STABLE.)
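The LSF passes above can be sketched like this; again a rough Python sketch with my own names. Python's sorted() is stable, which is exactly what the correctness argument relies on (and ASCII order puts digits before letters, matching the convention above).

```python
# Minimal LSF radix sort sketch for equal-length strings.
def lsf_radix_sort(strings):
    for pos in range(len(strings[0]) - 1, -1, -1):  # rightmost character first
        # Each pass is a stable sort on one character position.
        strings = sorted(strings, key=lambda s: s[pos])
    return strings

ids = ["eb3z", "eg27", "ek2u", "ek35", "ec3h", "ep2u", "ey27",
       "ec2q", "ef2g", "ef2q", "ek00", "el28", "es58"]
print(lsf_radix_sort(ids))
```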
What are the advantages of LSF? There is no recursion, just one big ``for'' loop. Note: "radix sort" usually refers to LSF.
What are the advantages of MSF? Sometimes you get lucky and the buckets shrink to size 1 before you've read the entire string (just as strcmp sometimes gets lucky and doesn't have to read all the way down both strings to tell that they differ).
We can do an example if people want. Try this out: Start with the array
2 1 4 3 6 5 8 7 9
Now turn this into a heap, and do the first couple of steps of the second phase of heapsort. (You can try the graphical program in: /afs/andrew/scs/cs/15-211/handouts/heap_demo.dec)
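If you want to check your answer, here is a small Python sketch of the exercise (max-heap stored in an array, children of i at 2i+1 and 2i+2; the helper name is mine):

```python
# Sift a[i] down until the max-heap property holds in a[0:n].
def sift_down(a, i, n):
    while 2 * i + 1 < n:
        child = 2 * i + 1
        if child + 1 < n and a[child + 1] > a[child]:
            child += 1                      # pick the larger child
        if a[i] >= a[child]:
            break
        a[i], a[child] = a[child], a[i]
        i = child

a = [2, 1, 4, 3, 6, 5, 8, 7, 9]
n = len(a)
for i in range(n // 2 - 1, -1, -1):         # phase 1: build the heap
    sift_down(a, i, n)
print(a)                                    # -> [9, 7, 8, 3, 6, 5, 4, 1, 2]

for end in range(n - 1, n - 3, -1):         # first two steps of phase 2
    a[0], a[end] = a[end], a[0]             # move the max to the end
    sift_down(a, 0, end)                    # restore the heap in a[0:end]
    print(a)
```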
Quicksort and heapsort are both good ideas if, say, you wanted a really fast way to find the top 100 numbers in some huge array: with a heap, for instance, you can build the heap and then just do 100 delete-maxes.
Next topic: hashing with linear probing. To insert a key, we compute i = h(key). If A[i] is empty, store the key there. Otherwise set i = (i+1) % M and repeat.
To search: apply the hash function to get an index i, then examine A[i], A[i+1], etc., with wraparound, until you either find the key or reach an empty spot.
Here's an example when M is 7 and h is the following:
x A B C D E F
h(x) 4 4 5 0 6 6
Say we insert A through E in order. Then the hash table would read:
0 1 2 3 4 5 6
D E     A B C
Search for B: start at h(B) and go up until find it or empty spot.
Or to search for F, we look at 6,0,1,2. Since 2 is empty, we return
``not found.''
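The insert and search procedures, run on this exact example, can be sketched as follows; a rough Python sketch with the lecture's h values hard-coded (a real table would compute h from the key):

```python
M = 7
h = {"A": 4, "B": 4, "C": 5, "D": 0, "E": 6, "F": 6}
table = [None] * M

def insert(key):
    i = h[key]
    while table[i] is not None:   # linear probing: step by 1, wrap around
        i = (i + 1) % M           # (sketch: assumes the table isn't full)
    table[i] = key

def search(key):
    i = h[key]
    while table[i] is not None:
        if table[i] == key:
            return i
        i = (i + 1) % M           # keep scanning until an empty slot
    return None                   # hit an empty slot: not found

for k in "ABCDE":
    insert(k)
print(table)        # -> ['D', 'E', None, None, 'A', 'B', 'C']
print(search("F"))  # probes 6, 0, 1, 2; slot 2 is empty -> None
```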
Idea: have a 2nd hash function h2, where h2(key) tells us what the increment should be. Say h(key) = i and h2(key) = j. Look in i, i+j, i+2j, i+3j, ... (mod M).
The function h2 is usually really simple, like the value of the first letter of the string. But make sure it's never zero. (Why?) It should also be relatively prime to M. (Again, why?)
Say we have the example before, but h2 is as follows:
x A B C D E F
h2(x) 1 2 3 4 5 6
Now we insert A through E in order to get:
0 1 2 3 4 5 6
D   E   A C B
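Changing the sketch above to double hashing is one line: step by h2(key) instead of by 1. Again a rough Python sketch with the lecture's h and h2 values hard-coded:

```python
M = 7
h  = {"A": 4, "B": 4, "C": 5, "D": 0, "E": 6, "F": 6}
h2 = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
table = [None] * M

def insert(key):
    i, j = h[key], h2[key]
    while table[i] is not None:   # probe i, i+j, i+2j, ... (mod M)
        i = (i + j) % M
    table[i] = key

for k in "ABCDE":
    insert(k)
print(table)  # -> ['D', None, 'E', None, 'A', 'C', 'B']
```

Note how 'E' probes 6, then (6+5) % 7 = 4, then (4+5) % 7 = 2, where it lands.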
The business of being relatively prime to M is this: if the increment is relatively prime to M, you can show that the probe sequence will eventually try every spot. E.g., in the case of 'E', we first tried 6, 4, 2. If 2 had been filled, we would have gone on to try 0, 5, 3, 1. If the increment is not relatively prime to M, then the insertion will skip at least half of the table's positions.
Ensuring relative primality in general takes some care, but if we intelligently select a prime number as the table size, it becomes easy: every increment from 1 to M-1 is automatically relatively prime to a prime M.
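A quick check of this point (my own demo, not from the lecture): count how many distinct slots the probe sequence i, i+j, i+2j, ... (mod M) can reach. It covers all M slots exactly when gcd(j, M) = 1, which holds for every j in 1..M-1 when M is prime.

```python
from math import gcd

def slots_probed(i, j, M):
    # Distinct slots reachable by the probe sequence i + k*j (mod M).
    return len({(i + k * j) % M for k in range(M)})

print(slots_probed(6, 5, 7))   # prime M = 7: all 7 slots (the 'E' case)
print(slots_probed(0, 2, 6))   # M = 6, j = 2: only 3 of the 6 slots
assert all(gcd(j, 7) == 1 for j in range(1, 7))   # prime M: every j works
```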