15-451 Algorithms 09/27/05 * Recap on Treaps Quiz Digit-based sorting/data-structures: * Radix sort * Tries ========================================================================= Treap recap: When each new element comes in, we give it a random priority value. We maintain that our tree has the property that keys are in BST order and priorities are in min-heap order. Insert as in basic BST, but then rotate up to position necessary to maintain heap order. Main claim: same probability distribution as randomized quicksort. E.g., try with: (A:7) (B:3) (C:10) (D:5) (E:6). That's because the root is the element with lowest priority, which is equally likely to be any of them. Then the same holds recursively on each side of the root. Using this fact, can do analysis [which we did very quickly at the end of last class, and maybe we can go over in recitation] showing that whp, the maximum depth is < 8lg(n). ========================================================================= So far, have looked at sorting and storing items whose keys can be anything at all, so long as we have a notion of "<". Today will look at methods for the case that keys are natural things like strings or integers. BUCKET SORT ----------- Say you have n objects to sort, whose keys are all integers in a small range: 1,...,r (e.g., say I have a bunch of homeworks to sort by section). What's a fast way to sort? --> make array of size r, of "buckets" (maybe implement as linked lists) --> Make 1 pass through records. Insert object with key K into bucket K. --> Make pass through array of buckets, read off as you go. Total time: O(n+r). This is called "bucket sort". Notice one thing interesting about it is that by using the magic of indirect addressing, we are making an r-way decision at each step (something you can't do in the comparison model). Unfortunately, bucket-sort is only good if the range of keys is small. RADIX SORT ---------- Suppose our keys are numbers, or strings, that can be viewed as a *sequence* of digits (or characters) each in a range of size r. In that case, there is a natural sorting algorithm called Radix Sort you can use to sort them. ("radix" is the base r). Actually two versions: one that's conceptually easier called Most-Significant-First radix sort, where we go top-down digit by digit (or byte by byte), and another that's trickier to think of but easy to code called Least-Significant-First radix sort where we go in the other direction. Most-significant-First radix sort: --------------------------------- Sort by most significant character/digit first into buckets. Get bucket for 'a', 'b', ... ---- a's ---- | ----- b's ---- | ---- c's ---- | ... Now, recursively sort each bucket, starting with the next most significant digit/character, and so on. (sort of like quicksort). Keep doing this for any bucket that has more than one element. If we ignore time spent scanning empty buckets, then we just spend O(n) time at each level. So, if strings are length L, then total time is O(nL). Time spent scanning empty buckets could be a problem if r is large, but if we view r as a constant, then just goes into the O(). Least-significant-first radix sort: ----------------------------------- Another idea - easier to implement but trickier to see why it works. Let's switch from strings to numbers since it's a little more intuitive. First bucket-sort by LEAST significant digit. That is, sort them into buckets by the ones digit and then concatenate the buckets. Now bucketsort by the tens, and so on. This sounds weird, but the claim is that if we do this in a way that doesn't rearrange the orders of items that go into the same bucket (a sorting method is called "STABLE" if it doesn't rearrange equal keys) then this will end up correctly sorting the items. Let's see what happens on an example: Example: 28, 15, 24, 17, 13, 22 -> sort by least significant: 22, 13, 24, 15, 17, 28 -> sort by next-least: 13, 15, 17, 22, 24, 28 (since STABLE) Why does the algorithm work? Let's prove by induction that after the ith pass, the items are correctly sorted according to the least i digits. This is clearly true for the base case i=1. Now, let's do the general case. We know that after the ith pass the items that differ in the ith digit will be in the desired order with respect to each other (because we just sorted them by that digit!) but what about the items that are equal in this digit? Well, by induction, these were in the desired order *before* we began the ith pass, and since our bucketsort is STABLE, they remain in the correct order afterwards. So we are done. If numbers have L digits, then running time is O(L*(r+n)) = O(Ln) if r = O(n). Advantage: easy to implement since no need to keep buckets separate or even to do recursion: We just have have a loop that for j = L down to 1 calls bucketsort(A,j) which does a bucketsort using the jth character of each string for sorting. Relation to bounds on comparison-based sorting: If we have n different numbers, then length is at least log_r(n). In this case, running time is O(n*log_r(n)). Reason we get this instead of n*log_2(n) is we are using indirect-addressing to make an r-way decision in 1 time step. On the negative side, if some keys are much longer than others, then L could be a lot bigger. On the positive side, each operation is just an operation on a single digit. [technical point: if keys are strings, and they have different lengths, then to match the usual notion of what we mean by "sorted in alphabetical order", we should pad them to the right with blanks that are defined to be less than 'a'. E.g., {car, card} should be viewed as {car_, card}.] Tries ===== (TRIE comes from reTRIEval. also called digital search trees). Tries are to MSF radix sort like binary search trees are to quicksort. Can think of a trie as an adaptive version of MSF radix. In a trie, you put letters on the edges and you walk down the tree reading off the word. In particular, each node will have an array of size r (e.g., r=26 or r=256) of child pointers. To store a string, you just follow down the associated child for each letter from first to last. For instance, say we wanted to store the words: and, bat, add, bag, cat, car. When doing an insert, we end each string with "$" (a special end character) in case some strings are substrings of others. To do a lookup, we just walk down the tree letter-by-letter and then see if the node we get to at the end has a "$" in it. (If we ever get to a null pointer, the we know the key is not there -- e.g., if we look up "apple" then we will notice in the 2nd step that it can't possibly be in the trie). Time to do a search is only O(length of key). Same for doing an insert, if we view r as a constant. (If r is really big then we should worry about the time spent allocating the arrays of child pointers). So, this is really nice. Also, prefix searching is especially easy (e.g., if you wanted to make a text editor that did word-completion). The main drawback is that the overhead is high because you have so much pointer-following to do. E.g., what if we added "automobile" to the above trie? Also, you have a factor of r extra space since you need to have an array of size r in each node. Since the design is so nice conceptually, people have thought about ways of making it more practical. In particular, some things you can do are: --> compress paths that don't branch into single edge. --> only split when needed So, for instance, the set {add, automobile, and, another} would look like: * a| ****** dd$/ |n |utomobile$ * * * d$/ \other$ * * This is called a "Patricia tree" (practical algorithm to retrieve information coded in alphanumeric).