15-451 Algorithms 9/26/2000
D. Sleator
* TRIEs
* DAWGs
* radix sort
[Quiz 1 for the first 1/2 hour]
----------------------------------------------------------------------------
Tries
-----
Consider the following task. We have a dictionary of words.
We want to make a representation of the dictionary that lets us solve
the following problem quickly:
Given a set of letters, enumerate all the words that can be made with
those letters.
(For what tasks might this be useful?)
We want the algorithm to run in time that is faster than simply
enumerating all the words of the dictionary and seeing if they satisfy
the constraint.
(Splay trees can be used to solve this problem, but the solution we
present here is much faster and more space-efficient.)
Solution: Tries
(TRIE comes from reTRIEval. also called digital search trees).
We build a tree where each edge of the tree is labeled with a letter,
and each node represents all the words that have as a prefix the path
from the root to this node. Here's an example of a trie storing the
following words:
.
car /|\
cars e/d| \c
cat / | \
cats . . .
do a/ o| a\
dog / | \
dogs . * .
done t/ \r n/ \g t/ \r
ear * * . * * *
ears s| s| e| s| s| s|
eat | | | | | |
eats * * * * * *
The nodes labeled with "*" are those such that the path from the root to
that node is a word of the dictionary, and the node labeled with "." are
nodes that don't represent a word of the dictionary.
One way to implement this is to have an array at each node of 26
pointers (assuming we're only working with letters). This wastes a lot
of space, because the typical branching factor is much less than 26.
One way to improve this is to store a pair of arrays at each node, where
the length of the array is the number of children of the node. One
array stores the labels of the edges to children from that node. The
other array stores points to those nodes. So to follow, for example the
letter "t" from a node, you do binary search in the label array for "t",
and if it's there, follow the pointer from the corresponding position in
the child pointer array.
Note that this data structure is just representing the set of dictionary
words (and it doesn't, for example, store any information about the
words themselves). In this case we can use a further trick to compress
this representation even more.
Dawg
----
The above diagram is really just a finite state machine. The arcs are
labeled with letters from the input, and the accepting states are simply
those labeled with a "*". By allowing ourselves to reuse portiions of
the tree (making it into a DAG), it can be substantially further
compressed.
DAG = Directed Acyclic Graph
Dawg = Directed Acyclic Word Graph
This is called a Dawg. The Dawg that results form the above set of
words has only 8 nodes. (You should be able to construct this
yourself). There is an algorithm for constructing the equivalent finite
state machine with the minimum number of states.
The Dawg uses the same representation as the Trie, and the same
algorithms can be run using it.
Some numbers:
A scrabble dictionary of 94,240 words can be represented as a
117,150 node trie (with 179,618 edges).
When converted to a Dawg, the same dictionary has 19,853 nodes
(and can be stored in 175K)
These representations can be used to write a scrabble program that
can find the optimal move in a scrabble position in a fraction of a
second. (See "The World's Fastest Scrabble Program", by Appel and
Jacobson, JACM, May 1988.)
Radix Sort
---------
(Read pages 128 and 129 of the Manber.)
Describe the old IBM card sorting machines and how they work
Here's an implementation of Radix Distribution
Sort. it sorts an array a[] of integers, and it
uses a temporary array b[] for buffering stuff.
#define bits(x,k,j) (((x)>>(k)) & (~(~0<<(j))))
#define w 32 /* The length of the word in bits */
#define m 8 /* numbrer of bits processed per pass */
#define M 256 /* two to the power m */
void dist_radix(int a[], int b[], int N) {
int i, j, pass, count[M];
for (pass=0; pass < w; pass +=m) {
for (j=0; j=0; i--) b[--count[bits(a[i],pass,m)]] = a[i];
for (i=0; i