15-451 Algorithms 09/28/04
Digit-based sorting/data-structures
* Radix sort
* Tries
=========================================================================
So far, we have looked at sorting and storing items whose keys can be
anything at all, so long as we have a notion of "<". Today we will look
at methods for the case that keys are natural things like strings or integers.
Say you have n objects to sort, whose keys are all integers in a small
range: 1,...,r (e.g., say I have a list of movies and want to sort by
how many stars they got). What is a fast way to sort?
--> make array of size r, of "buckets" (maybe implement as linked lists)
--> Make 1 pass through records. Insert object with key K into bucket K.
--> Make pass through array of buckets, read off as you go.
Total time: O(n+r).
This is called "bucket sort". One interesting thing to notice is that
by using the magic of indirect addressing, we are making an r-way
decision at each step (something you can't do in the comparison
model). Unfortunately, bucket sort is only good if the range of keys is small.
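The two passes above can be sketched in Python (a minimal sketch; the `bucket_sort` name and the movie-rating example are illustrative, not from the notes):

```python
# Bucket sort sketch, assuming keys are integers in 1..r.
def bucket_sort(items, key, r):
    buckets = [[] for _ in range(r + 1)]   # index 0 unused; keys are 1..r
    for x in items:                        # pass 1: drop each item into bucket key(x)
        buckets[key(x)].append(x)
    out = []
    for b in buckets:                      # pass 2: read off the buckets in order
        out.extend(b)                      # appending preserves arrival order => stable
    return out

movies = [("A", 3), ("B", 1), ("C", 5), ("D", 3)]
print(bucket_sort(movies, key=lambda m: m[1], r=5))
# -> [('B', 1), ('A', 3), ('D', 3), ('C', 5)]
```

Note that ties ("A" and "D", both 3 stars) come out in their input order; this stability is exactly what the least-significant-first radix sort below will rely on.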
RADIX SORT
----------
What if we have keys in a larger range, but the keys are numbers (or
strings) that can be viewed as a sequence of digits (or characters).
In that case, there are two natural generalizations: one that's
conceptually easier called Most-Significant-First radix sort, where we
go top-down digit by digit (or byte by byte), and another that's
trickier to think of but easy to code called Least-Significant-First
radix sort, where we go in the other direction. The "radix" is the
base r; e.g., r might be 10 (decimal digits) or 256 (bytes).
Most-significant-first radix sort:
----------------------------------
Sort by most significant character/digit first into buckets. Get
bucket for 'a', 'b', ...
---- a's ---- | ----- b's ---- | ---- c's ---- | ...
Now, recursively sort each bucket, starting with the next most significant
digit/character, and so on. (sort of like quicksort). Keep doing
this for any bucket that has more than one element.
If we ignore the time spent scanning empty buckets (the "+r" part)
then we just spend O(n) time at each level. So, if strings are length
L, then the total time is O(nL). In fact, it may end a lot earlier
if, e.g., we can distinguish all the strings from just their first few
characters.
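The recursion above can be sketched in Python, assuming lowercase strings with r = 26 (function and variable names are illustrative):

```python
# MSF radix sort sketch for lowercase strings, r = 26.
def msf_radix_sort(strings, pos=0):
    if len(strings) <= 1:                  # may end early: bucket already sorted
        return strings
    done = [s for s in strings if len(s) == pos]   # strings exhausted at this depth
    buckets = [[] for _ in range(26)]
    for s in strings:
        if len(s) > pos:                   # bucket by character at position pos
            buckets[ord(s[pos]) - ord('a')].append(s)
    out = done                             # shorter strings sort first ("car" < "card")
    for b in buckets:                      # recurse on each bucket with the next character
        out.extend(msf_radix_sort(b, pos + 1))
    return out

print(msf_radix_sort(["bat", "add", "and", "cat", "car", "bag"]))
# -> ['add', 'and', 'bag', 'bat', 'car', 'cat']
```

Putting the exhausted strings first implements the padding convention discussed below (a blank that compares less than 'a').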
How does this relate to our lower bound? One thing to think about:
how small can L be, if all n keys are distinct? Ans: it has to be at
least log_r(n). So, in the case that all strings are roughly this
length, our time is only n*log_r(n) rather than n*log_2(n), because we
are making an r-way split rather than a 2-way split at each step.
What about the empty buckets? That can actually be a problem if we
get to a case in the recursion where the number of elements in a
bucket gets small compared to r (e.g., "algorithm" and "algorithmic")
so at that point you might want to switch to a different algorithm or
do something trickier. But if we view r as constant, then overall
it's just O(nL), which is how long it takes to read the input anyway.
Least-significant-first radix sort:
-----------------------------------
Another idea - easier to implement but trickier to see why it works:
First bucket-sort by LEAST significant digit. That is, sort them into
buckets by the ones digit and then concatenate the buckets. Now
bucketsort by the tens, and so on. This sounds weird, but the claim
is that if we do this using a STABLE bucket-sort in each step
("stable" means that it doesn't rearrange the orders of items that go
into the same bucket) then this will end up correctly sorting the
items. Let's see what happens on an example:
Example: 28, 15, 24, 17.
-> sort by least significant: 24, 15, 17, 28
-> sort by next-least: 15, 17, 24, 28 (since STABLE)
Why does the algorithm work? Let's prove by induction that after the
ith pass, the items are correctly sorted according to the least i
digits. This is clearly true for the base case i=1. Now,
let's do the general case. We know that after the ith pass the items that
differ in the ith digit will be in the desired order with
respect to each other (because we just sorted them by that digit!)
but what about the items that are equal in this digit? Well, by
induction, these were in the desired order *before* we began the ith
pass, and since our bucketsort is STABLE, they remain in the correct
order afterwards. So we are done.
Running time is O(L*(r+n)) = O(Ln) if r = O(n).
Advantage: easy to implement, since there is no need to keep buckets
separate or even to do recursion: we just have a loop that, for j = L
down to 1, calls bucketsort(A,j), which does a bucket-sort using the
jth character of each string as the key.
If we have a mix of big and small numbers, we could add a check that
identifies the part of the array that's already done so we don't keep
re-sorting it.
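The whole loop can be sketched for nonnegative integers with r = 10 (a minimal sketch; names are illustrative):

```python
# LSF radix sort sketch for nonnegative integers, base r = 10.
# Each pass is a stable bucket sort on one digit.
def lsf_radix_sort(nums, num_digits):
    for j in range(num_digits):            # least significant digit first
        buckets = [[] for _ in range(10)]
        for x in nums:                     # stable: arrival order kept within a bucket
            buckets[(x // 10**j) % 10].append(x)
        nums = [x for b in buckets for x in b]   # concatenate the buckets
    return nums

print(lsf_radix_sort([28, 15, 24, 17], num_digits=2))
# -> [15, 17, 24, 28]
```

Running it on the example from above reproduces the two passes shown there: [24, 15, 17, 28] after the ones digit, then [15, 17, 24, 28] after the tens.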
[technical point: if keys are strings, and they have different
lengths, then to match the usual notion of what we mean by "sorted in
alphabetical order", we should pad them to the right with blanks that
are defined to be less than 'a'. E.g., {car, card} should be viewed
as {car_, card}.]
Tries
=====
(TRIE comes from reTRIEval. Tries are also called digital search trees.)
Tries are to MSF radix sort as binary search trees are to quicksort.
You can think of a trie as an adaptive version of MSF radix sort.
In a trie, you put letters on the edges and you walk down
the tree reading off the word. In particular, each node will have an
array of size r (e.g., r=26 or r=256) of child pointers. To store a
string, you just follow down the associated child for each letter from
first to last. For instance, say we wanted to store the words:
and, bat, add, bag, cat, car.
When doing an insert, we end each string with "$" (a special end
character) in case some strings are substrings of others. To do a
lookup, we just walk down the tree letter-by-letter and then see if
the node we get to at the end has a "$" in it. (If we ever get to a
null pointer, then we know the key is not there -- e.g., if we look up
"apple" then we will notice in the 2nd step that it can't possibly be
in the trie.)
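Insert and lookup can be sketched as follows, using a dict of child pointers per node in place of the size-r array described above (class and function names are illustrative):

```python
# Trie sketch: one node per prefix, '$' marks end-of-word.
class TrieNode:
    def __init__(self):
        self.children = {}                 # maps character -> child TrieNode

def insert(root, word):
    node = root
    for ch in word + '$':                  # '$' handles keys that are prefixes of others
        node = node.children.setdefault(ch, TrieNode())

def lookup(root, word):
    node = root
    for ch in word + '$':
        if ch not in node.children:        # null pointer: key can't be present
            return False
        node = node.children[ch]
    return True

root = TrieNode()
for w in ["and", "bat", "add", "bag", "cat", "car"]:
    insert(root, w)
print(lookup(root, "bat"), lookup(root, "apple"))
# -> True False
```

The '$' check is what makes lookup("ca") fail even though "ca" is a prefix of stored words.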
Time to do a search is only O(length of key). Same for doing an
insert, if we view r as a constant. (If r is really big then we
should worry about the time spent allocating the arrays of child
pointers). So, this is really nice. Also, prefix searching is
especially easy (e.g., if you wanted to make a text editor that did
word-completion). The main drawback is that the overhead is high
because you have so much pointer-following to do. E.g., what if we
added "automobile" to the above trie? Also, you have a factor of r
extra space since you need to have an array of size r in each node.
Since the design is so nice conceptually, people have thought about
ways of making it more practical. In particular, some things you can
do are:
--> compress paths that don't branch into a single edge.
--> only split when needed
So, for instance, the set {add, automobile, and, another} would look like:
*
a|
******
dd$/ |n |utomobile$
* * *
d$/ \other$
* *
This is called a "Patricia tree" (Practical Algorithm To Retrieve
Information Coded In Alphanumeric).