15-451 Algorithms                                                 09/23/03

Digit-based sorting/data-structures
  * Radix sort
  * Tries

Quiz
=========================================================================

So far, we have looked at sorting and storing objects whose keys can be
anything at all, so long as we have a notion of "<".  Today we will look
at methods for the case that the keys are natural things like strings or
integers.

Say you have n objects to sort, whose keys are all integers in a small
range: 1,...,r (e.g., say I have a list of movies and want to sort them
by how many stars they got).  What is a fast way to sort?

 --> Make an array of size r, of "buckets" (maybe implemented as linked
     lists).
 --> Make one pass through the records, inserting the object with key K
     into bucket K.
 --> Make a pass through the array of buckets, reading off the items as
     you go.

Total time: O(n+r).  This is called "bucket sort".  One interesting thing
about it is that by using the magic of indirect addressing, we are making
an r-way decision at each step.  Unfortunately, this is only good if the
range of the keys is small.
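To make this concrete, here is a minimal sketch of bucket sort in Python
(the function names and the movie-rating encoding are mine, not part of
the lecture):

    def bucket_sort(items, key, r):
        # Sort items whose key(item) is an integer in 0..r-1, in O(n+r)
        # time.  The sort is stable: appending never reorders items that
        # land in the same bucket, which is exactly what LSD radix sort
        # (below) relies on.
        buckets = [[] for _ in range(r)]  # one bucket per possible key
        for x in items:                   # one pass: drop each item in its bucket
            buckets[key(x)].append(x)
        result = []
        for b in buckets:                 # read the buckets off in order
            result.extend(b)
        return result

    # Example: sort movies by a 1-5 star rating (shifted to 0..4).
    movies = [("Alien", 4), ("Brazil", 5), ("Cube", 3), ("Dune", 4)]
    print(bucket_sort(movies, key=lambda m: m[1] - 1, r=5))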
RADIX SORT
----------

What if we have keys in a larger range?  In that case, one natural
extension is to do this digit-by-digit.  Or, if the keys are strings that
we want to sort alphabetically, then we can do it character-by-character.

Most-significant-character/digit (MSC/MSD) radix sort:
------------------------------------------------------

Sort by the most significant character/digit first into buckets:

  -> get a bucket for 'a', 'b', ...

     ---- a's ---- | ----- b's ---- | ---- c's ---- | ...

Now, recursively sort each bucket, starting with the next most significant
digit/character, and so on (sort of like quicksort).  Keep doing this for
any bucket that has more than one element.

If each string has length <= L, what is the time as a function of n and L
*if* we ignore the time spent scanning over empty buckets?  Ans: O(nL).

 --> In fact, the time is just the minimum number of characters needed to
     distinguish all the strings.  So, if the strings are long but you can
     distinguish them all by just looking at the first three characters,
     then you can use 3 instead of L.

How does this relate to our lower bound?  One thing to think about: how
small can L be if all n keys are distinct?  Ans: it has to be at least
log_r(n), since there are only r^L strings of length L.  So, in the case
that all strings are roughly the same length, our time is only
n*log_r(n) rather than n*log_2(n), because we are making an r-way split
rather than a 2-way split at each step.

What about the empty buckets?  They can actually be a problem if we get to
a case in the recursion where r is a lot bigger than the number of
elements in the bucket, so that most of the time is spent scanning over
the empties.  Because of this, we might want to switch to a different
algorithm once the buckets get small enough.  (Code sketches of both the
MSD and LSD variants appear at the end of this section.)

Least-significant-character/digit (LSC/LSD) radix sort:
-------------------------------------------------------

Another idea -- easier to implement, but trickier to see why it works:
first bucket-sort by the LEAST significant character/digit.  That is, sort
the items into buckets by this character and then concatenate the buckets.
Now bucket-sort by the next-least significant character, and so on.  This
sounds weird, but the claim is that if we use a STABLE bucket-sort in each
step ("stable" means that it doesn't rearrange the order of items that go
into the same bucket), then this will end up correctly sorting the items.

Let's see what happens on an example:

  Example: by, ax, bx, ay
    -> sort by least significant:  ax bx by ay
    -> sort by next-least:         ax ay bx by   (since STABLE)

A couple of technical points: what if the keys are different lengths?  If
the keys are *strings*, then to match the usual notion of what we mean by
"sorted in alphabetical order", we should pad them to the right with
blanks that are defined to be less than 'a'.  If the keys are *numbers*,
then we pad them to the left with 0s (which is what will happen
automatically anyway).  E.g., we view {12, 231, 5} as {012, 231, 005},
and we view {hi, hello, bye} as {hi---, hello, bye--}.

Why does the algorithm work?  Let's prove by induction that after the ith
pass, the items are correctly sorted according to their rightmost i
characters/digits.  This is clearly true for the base case i=1.  Now let's
do the general case.  We know that after the ith pass, the items that
differ in the ith-to-last character will be in the desired order with
respect to each other (because we just sorted them by that character!),
but what about the items that are equal in this character?  Well, by
induction, these were in the desired order *before* we began the ith pass,
and since our bucket-sort is STABLE, they remain in the correct order
afterwards.  So we are done.

Advantage: this is easy to implement, since there is no need to keep the
buckets separate as above.  We just have a loop that, for j = L down to 1,
calls bucketsort(A,j), which does a bucket-sort using the jth character of
each string as the key.  Also, there is no issue of "what to do with the
empty buckets once n gets small in the recursive calls", since there is no
recursion.  So long as r < n (where n = total number of strings), we get a
total time of O(nL).

Another way to see why this works: at the end, the items are sorted by
their most significant character.  Those with the same most significant
character are in the correct order by the next-to-most significant one
(since we used a stable bucket sort), and so on.
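Here is a minimal sketch of MSD radix sort for lowercase strings.  Bucket
0 plays the role of the blank padding character (a string that has run out
of characters sorts before any letter); the names are mine:

    def msd_radix_sort(strings, pos=0):
        # Bucket on character `pos`, then recursively sort each bucket on
        # the next position.  Bucket 0 holds strings that end before `pos`;
        # buckets 1..26 hold 'a'..'z'.
        if len(strings) <= 1:             # 0 or 1 items: already sorted
            return strings
        buckets = [[] for _ in range(27)]
        for s in strings:
            b = 0 if pos >= len(s) else ord(s[pos]) - ord('a') + 1
            buckets[b].append(s)
        result = buckets[0]               # exhausted strings come first
        for b in buckets[1:]:             # recurse on each letter bucket
            result += msd_radix_sort(b, pos + 1)
        return result

    print(msd_radix_sort(["bat", "add", "and", "bag", "cat", "car"]))
    # -> ['add', 'and', 'bag', 'bat', 'car', 'cat']

Note that the loop still touches all 26 buckets at every level, which is
exactly the empty-bucket overhead discussed above.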
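And the LSD variant, again for lowercase strings, with bucket 0 acting as
the right-padding blank; it makes one stable bucket-sort pass per
position, from last to first:

    def lsd_radix_sort(strings):
        # Total time O(L*(n+r)), i.e. O(nL) when r < n.
        L = max((len(s) for s in strings), default=0)
        for j in range(L - 1, -1, -1):    # least significant position first
            buckets = [[] for _ in range(27)]
            for s in strings:             # stable: appends keep relative order
                b = 0 if j >= len(s) else ord(s[j]) - ord('a') + 1
                buckets[b].append(s)      # bucket 0 = "past the end" (blank)
            strings = [s for b in buckets for s in b]  # concatenate buckets
        return strings

    print(lsd_radix_sort(["by", "ax", "bx", "ay"]))
    # -> ['ax', 'ay', 'bx', 'by']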
Tries
=====

(The name TRIE comes from reTRIEval; tries are also called digital search
trees.)  Tries are to radix sort as binary search trees are to quicksort.
You can think of a trie as an adaptive version of MSD radix sort.

In a trie, you put letters on the edges, and you walk down the tree
reading off the word.  In particular, each node has an array of size r
(e.g., r=26 or r=256) of child pointers.  To store a string, you just
follow down the associated child for each letter, from first to last.
For instance, say we wanted to store the words: and, bat, add, bag, cat,
car.  When doing an insert, we end each string with "$" (a special end
character) in case some strings are prefixes of others.  To do a lookup,
we just walk down the tree letter-by-letter and then see if the node we
end up at has a "$" in it.  (If we ever reach a null pointer, then we know
the key is not there -- e.g., if we look up "apple", then we will notice
in the 2nd step that it can't possibly be in the trie.)

The time to do a search is only O(length of the key).  The same is true
for an insert, if we view r as a constant.  (If r is really big, then we
should worry about the time spent allocating the arrays of child
pointers.)  So, this is really nice.  Also, prefix searching is especially
easy (e.g., if you wanted to make a text editor that did word-completion).

The main drawback is that the overhead is high, because you have so much
pointer-following to do.  E.g., what if we added "automobile" to the above
trie?  Also, you use a factor of r extra space, since you need an array of
size r in each node.  (Code sketches of both the plain trie and the
compressed version below appear at the end of these notes.)

Since the design is so nice conceptually, people have thought about ways
of making it more practical.  In particular, some things you can do are:

 --> compress non-branching paths into a single edge.
 --> only split when needed.

So, for instance, the set {add, automobile, and, another} would look like:

                 *
               a |
                 *
        dd$ /    | n    \ utomobile$
           *     *       *
              d$ / \ other$
                *   *

This is called a "Patricia tree" (Practical Algorithm To Retrieve
Information Coded In Alphanumeric).
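Here is a minimal sketch of the plain array-of-children trie, with r=26
and a boolean flag standing in for the "$" marker (the representation and
names are mine):

    R = 26  # alphabet size; each node holds R child slots plus an end flag

    class TrieNode:
        def __init__(self):
            self.children = [None] * R   # one slot per letter 'a'..'z'
            self.is_end = False          # plays the role of the "$" marker

    def insert(root, word):
        # Walk down letter by letter, allocating nodes as needed:
        # O(len(word)) if we treat r as a constant.
        node = root
        for ch in word:
            i = ord(ch) - ord('a')
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        node.is_end = True               # mark "$": a whole word ends here

    def lookup(root, word):
        # Follow child pointers; a null pointer means the key isn't there.
        node = root
        for ch in word:
            i = ord(ch) - ord('a')
            if node.children[i] is None: # fell off the trie: not stored
                return False
            node = node.children[i]
        return node.is_end               # reached a node: is "$" set?

    root = TrieNode()
    for w in ["and", "bat", "add", "bag", "cat", "car"]:
        insert(root, w)
    print(lookup(root, "bat"), lookup(root, "apple"), lookup(root, "ba"))
    # -> True False False ("ba" is a prefix of stored words, but unmarked)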
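And a sketch of the path-compression idea behind Patricia trees: edges
carry whole substrings, and an edge is split only when a new key forces
it.  For simplicity this version stores edge labels in a dictionary rather
than an r-way array, and keys are "$"-terminated strings; all names are
mine:

    class PTNode:
        def __init__(self):
            self.edges = {}    # edge label (a substring) -> child PTNode

    def pt_insert(node, s):
        # Insert a '$'-terminated string, splitting an edge only if needed.
        if not s:
            return
        for label in list(node.edges):
            k = 0              # length of common prefix of s and this label
            while k < len(label) and k < len(s) and label[k] == s[k]:
                k += 1
            if k == 0:
                continue                   # no shared prefix: try next edge
            if k == len(label):            # s runs past this edge: recurse
                pt_insert(node.edges[label], s[k:])
                return
            child = node.edges.pop(label)  # partial match: split the edge
            mid = PTNode()
            node.edges[s[:k]] = mid        # shared prefix becomes one edge
            mid.edges[label[k:]] = child   # rest of the old edge
            mid.edges[s[k:]] = PTNode()    # rest of the new key
            return
        node.edges[s] = PTNode()           # nothing shared: fresh leaf edge

    def pt_lookup(node, s):
        if not s:
            return True                    # consumed the whole key
        for label, child in node.edges.items():
            if s.startswith(label):
                return pt_lookup(child, s[len(label):])
        return False

    root = PTNode()
    for w in ["add", "automobile", "and", "another"]:
        pt_insert(root, w + "$")
    print(pt_lookup(root, "and$"), pt_lookup(root, "an$"))  # -> True False

Running the four inserts above builds exactly the tree drawn in the
Patricia example: an "a" edge to a node with edges "dd$", "utomobile$",
and "n", where the "n" node has edges "d$" and "other$".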