15-451 Algorithms 9/23/04
* Search trees
  - simple binary search trees
  - B-trees
  - treaps
(Quiz next Tues. 30 min. 1 sheet of notes.)
=====================================================================
For the next few lectures we'll be looking at several important
data-structures. A data-structure is a method for storing data so
that operations you care about can be performed quickly.
Defn: a DICTIONARY data structure is a DS that lets you perform
operations insert and lookup (and maybe delete).
E.g., maybe we are the phone company and we have a database of people
and their phone numbers (plus addresses, billing information and so
on). As new people come in, we'd like to be able to insert them into
our database. And, given a name x we'd like to be able to do
lookup(x) to find out their phone number and other info.
Specifically, operations in a Dictionary data structure are:
- insert(key, object) // e.g., word and definition, name and telephone #,etc
- lookup(key) // returns object
- delete(key) // Might or might not care about this.
One option: sorted array. Then, lookup takes O(log n) time using
binary search. But insert may take Omega(n) time in the worst case
because we have to shift everything to the right in order to make
room for the new key. How about an unsorted array? Then insert is
O(1) but lookup may take Omega(n). Last time we saw a DS that
consisted of an unsorted *set* of sorted arrays where insert was O(log
n) amortized and search was O(log^2 n). Today: search trees.
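The sorted-array option can be sketched in Python (a minimal illustration; the class and names here are mine, not from the notes):

```python
# Sorted-array dictionary: lookup is binary search (O(log n)), but
# insert must shift everything to the right of the insertion point (O(n)).
import bisect

class SortedArrayDict:
    def __init__(self):
        self.keys = []
        self.vals = []

    def lookup(self, key):
        # O(log n) binary search
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.vals[i]
        return None

    def insert(self, key, val):
        # O(n) worst case: list.insert shifts the tail of the array
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.vals.insert(i, val)

d = SortedArrayDict()
d.insert("carol", "555-0102")
d.insert("alice", "555-0100")
print(d.lookup("alice"))   # -> 555-0100
```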
A binary search tree is a binary tree in which each node stores a
(key, object) pair such that all descendants to the left have smaller
keys and all descendants to the right have larger keys (let's not
worry about the case of multiple equal keys). To do a lookup operation
you simply walk down from the root, going left or right depending on
whether the query is smaller or larger than the key in the node, until
you get to the correct key or walk off the tree. We will also talk
about non-binary search trees that potentially have more than one key
in each node, and nodes may have more than two children.
For the rest of this discussion, we'll ignore the "object" part of
things. Just worry about the keys since that's all that matters as
far as understanding the data structures is concerned.
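The lookup walk just described, as a small Python sketch (the Node class and the hand-built tree are illustrative, not from the notes):

```python
# Walk down from the root, going left or right depending on whether the
# query is smaller or larger than the key in the current node.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root          # found the key
        root = root.left if key < root.key else root.right
    return None                  # walked off the tree

# left descendants smaller, right descendants larger
root = Node('N', Node('C', Node('A'), Node('E')), Node('R'))
print(lookup(root, 'E') is not None)   # -> True
print(lookup(root, 'Q'))               # -> None
```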
SIMPLE BINARY SEARCH TREES
==========================
The simplest way to maintain a binary search tree is to implement the
insert operations as follows.
insert(x): walk down the tree as if doing a lookup. Insert x
into a new leaf at the end.
E.g., how about inserting: C A R N E G I E' (where E' > E).
pluses: easy to do (deletes are slightly painful)
minuses: bad worst-case behavior. In fact, behaves exactly like
quicksort using leftmost element as pivot. If elements are in sorted
order, will get very unbalanced tree where operations take time Omega(n).
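A sketch of this insert rule, showing the degenerate case on sorted input (names are mine; equal keys ignored, as above):

```python
# Walk down as if doing a lookup; hang the new key off a leaf.
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    if root is None:
        return Node(key)           # new leaf at the end of the search path
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def depth(root):
    return -1 if root is None else 1 + max(depth(root.left), depth(root.right))

root = None
for k in "ABCDEFGH":               # sorted order: tree degenerates to a path
    root = insert(root, k)
print(depth(root))                 # -> 7, i.e. n-1 for n = 8
```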
Today: two ways to fix this problem. First is a particularly nice
method used in database applications called B-trees. We'll then look
at another called treaps, which is a lot like randomized qsort, but
trickier since keys are coming in one at a time.
IMPORTANT IDEA: the problem with the basic bst was that we weren't
maintaining balance in the tree. On the other hand, if we try to
maintain a perfectly balanced bst, it will be too expensive
rearranging things. So, we want to give ourselves some slack. In AVL
trees, you give yourself slack by allowing the tree to be "almost
balanced". In the median-finding algorithm, we gave ourselves slack
by allowing the pivot to be "near" the middle. For B-trees, we will
make the tree perfectly balanced, but give ourselves slack by allowing
some nodes to have a higher degree (number of children) than others.
B-trees and 2-3-4 trees
=======================
B-tree is a search tree where for some pre-specified t>=2,
* each node has between t-1 and 2t-1 keys in it (except root can
have fewer). Keys are stored in a sorted array.
* each non-leaf has degree = (# keys)+1. So, in range [t, 2t], except
root is [2,2t]. E.g., if array is [a1, a2, ..., a10],
then there is one child for things less than a1, one child for things
between a1 and a2, ..., one child for things bigger than a10.
* All leaves at same depth.
(Case of t=2 is called 2-3-4 tree since degrees are 2,3,or 4)
E.g., here is a tree for t=3. (3-6 tree) (max size of node is 5)
                   [H M R]
              /    |     |    \
     [A B C D]  [K L]  [N O]  [T Y Z]
Idea is that by using flexibility in sizes and degrees of nodes, we'll
be able to keep trees perfectly balanced.
Rules for search and insert turn out to be pretty easy:
- Search: just do binary search in array at root, then go to
appropriate child and recurse.
- Insert:
1) act like you're doing a search, but if you ever encounter a FULL
node (FULL means it has the maximum 2t-1 keys in it), pull the median
element up to parent and split the node. This takes O(t) time since
need to insert into parent and move half of the keys into a new array.
2) then insert into the leaf.
Example:
insert "E" above, then insert "F":
- when doing part (1) for F we will pull the "C" up to root
and split the node. End result:
[ C H M R ]
/ | | \ \
[A B] [D E F] [K L] [N O] [T Y Z]
Question: Do we maintain requirement of at least t-1 keys per non-root
node? Yes, because we split the node at the median.
Question: Can the parent get *over-full*? Answer: no, since if it was
full we would have already split it on the way down.
insert "S", "U", "V":
[ C H M R U ]
/ | | | \ \
[A B] [D E F] [K L] [N O] [S T] [V Y Z]
now, insert "P". Doing this will bring M up to new root.
                       [ M ]
                     /       \
              [ C H ]         [ R U ]
             /   |   \       /   |   \
   [A B] [D E F] [K L]  [N O P] [S T] [V Y Z]
Question: is tree always height-balanced? (All leaves at same depth).
Answer: yes, since only grow tree "up".
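The search-and-split insert rule can be sketched as follows (a compact illustration, not production code; the class and method names are my own):

```python
import random

class BNode:
    def __init__(self, leaf=True):
        self.keys = []          # sorted array of keys
        self.kids = []          # len(kids) == len(keys) + 1 for non-leaves
        self.leaf = leaf

class BTree:
    def __init__(self, t):
        self.t = t              # nodes hold t-1 .. 2t-1 keys (root: fewer ok)
        self.root = BNode()

    def _split_child(self, parent, i):
        # pull the median of full child i up into parent; child splits in two
        t, left = self.t, parent.kids[i]
        right = BNode(left.leaf)
        median = left.keys[t - 1]
        right.keys, left.keys = left.keys[t:], left.keys[:t - 1]
        if not left.leaf:
            right.kids, left.kids = left.kids[t:], left.kids[:t]
        parent.keys.insert(i, median)
        parent.kids.insert(i + 1, right)

    def insert(self, key):
        full = 2 * self.t - 1
        if len(self.root.keys) == full:       # full root: grow the tree "up"
            old, self.root = self.root, BNode(leaf=False)
            self.root.kids.append(old)
            self._split_child(self.root, 0)
        node = self.root
        while not node.leaf:                  # split any FULL node on the way
            i = sum(k < key for k in node.keys)
            if len(node.kids[i].keys) == full:
                self._split_child(node, i)
                if key > node.keys[i]:        # median moved up: pick a side
                    i += 1
            node = node.kids[i]
        node.keys.insert(sum(k < key for k in node.keys), key)

def leaf_depths(node, d=0):
    return [d] if node.leaf else [x for c in node.kids
                                  for x in leaf_depths(c, d + 1)]

random.seed(451)
tree = BTree(t=3)
keys = list(range(40))
random.shuffle(keys)
for k in keys:
    tree.insert(k)
print(set(leaf_depths(tree.root)))    # a single depth: all leaves level
```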
So, we have maintained properties. What about running time? Say we
do binary search at each node to find position:
Time for search is O(depth * log(t))
What is maximum depth? O(log_t(n)).
E.g., if t = 1000, n = 1 billion, then depth is at most 3 (or 2 if you
count starting at 0)
Interesting thing: the t's cancel in expression for search time:
Time for search is: O(log_t(n) * log(t)) = O(log n)
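Checking these numbers (the computation itself is mine; math.log(n, t) is log_t(n)):

```python
# The depth bound for t = 1000, n = 1 billion, and the cancellation:
# log_t(n) * log2(t) equals log2(n), independent of t.
import math

n, t = 10**9, 1000
print(round(math.log(n, t)))             # -> 3: depth is at most 3
print(math.log(n, t) * math.log2(t))     # ~29.9 ...
print(math.log2(n))                      # ... the same as log2(n)
```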
Inserts: same as search but we also need to split nodes which takes
O(t) time each. In the worst case, we need to do this every step of
the way down.
Worst-case time for insert = search-time + splitting-time +
time-to-insert-into-leaf.
where
splitting-time = O(depth * t) and leaf-insert is O(t).
If t is constant, this is O(log n). Interesting thing is that even if
t is large, amortized analysis comes to our rescue. In particular, if
we create a tree from n inserts, I claim we can have made at most
O(n/t) splits total in the process. Why? -- Because each split
creates a new node, and there are O(n/t) nodes total. So *total* time
spent on splits over all n inserts is O(n), which means that we are
only spending O(1) time on splitting, on average, per insert.
So, the amortized time per insert is:
O(log n) + O(1) + O(t) = O(log n + t)
More facts about B-trees:
* used a lot in databases in practice because it fits in nicely with the
memory hierarchy. Basically, define t so that a node fits into a page. Keep
root in fast memory (and maybe even its children), so in above
example, each insert or lookup would use only 1 or 2 disk accesses.
* if use t=2, have 2-3-4 tree. What's special about 2-3-4 trees is
that they can be implemented efficiently as binary trees using
"red-black-trees".
=======================================================
Treaps
======
Going back to binary search trees: we saw how a standard BST is like
quicksort using the leftmost element as a pivot, with all of its
worst-case problems. Question: can we come up with a method that's
instead like *randomized* quicksort? The problem is that we don't
have all the elements at the start, so it's not so obvious. But it
turns out we *can*, and the idea is called "treaps".
Treaps: the idea for a treap is that when an element x is inserted, we
also give it a random *priority* value. Think of the priorities as
giving the order they are supposed to be chosen as pivots. (And think
of them as real numbers so we don't get any ties). In particular, the
property we will require is that if v is a child of u, then
priority(v) > priority(u). For example:
N:5
/ \
C:10 R:7
/
A:12
So, the keys are search-tree ordered and the priorities are
heap-ordered, which is why this is called a "treap".
Might ask: is it always possible to have keys in bst order and
priorities in heap order? Answer: yes, just think of standard BST if
nodes came in order of priority.
Question: how do you do an insert? Answer: just do usual BST insert,
putting the element at the leaf, and then rotate up the tree if parent
has a larger priority value. Remember how tree rotations work:
x y
/ \ / \
T1 y <------> x T3
/ \ / \
T2 T3 T1 T2
For instance, if doing CARNEGIE', and so far we have the tree above,
and we now insert E with priority 15 (so no rotations) and then G with
8, we would do:
    N:5                N:5               N:5
   /   \              /   \             /   \
 C:10   R:7  --->   C:10   R:7  --->  G:8   R:7
 /  \               /  \              /
A:12 E:15         A:12  G:8        C:10
       \                /          /  \
       G:8           E:15       A:12  E:15
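The insert-then-rotate-up procedure can be written recursively, which makes the rotations easy to express. The sketch below replays the example with its fixed priorities (the class and function names are illustrative; in practice priorities come from random()):

```python
# Treap insert: ordinary BST insert at a leaf, then rotate the new node
# up while its priority is smaller than its parent's.
import random

class TNode:
    def __init__(self, key, pri=None):
        self.key = key
        self.pri = random.random() if pri is None else pri
        self.left = self.right = None

def rotate_right(y):          # left child x becomes the subtree root
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):           # right child y becomes the subtree root
    y = x.right
    x.right, y.left = y.left, x
    return y

def insert(root, key, pri=None):
    if root is None:
        return TNode(key, pri)
    if key < root.key:
        root.left = insert(root.left, key, pri)
        if root.left.pri < root.pri:      # heap order violated: rotate up
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, pri)
        if root.right.pri < root.pri:
            root = rotate_left(root)
    return root

# replay the example above with its fixed priorities
root = None
for key, pri in [('N', 5), ('C', 10), ('R', 7), ('A', 12),
                 ('E', 15), ('G', 8)]:
    root = insert(root, key, pri)
print(root.key, root.left.key, root.right.key)   # -> N G R
```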
Need to prove this maintains treap property. First, search-tree
property on keys is maintained since it's not affected by rotations.
What about heap property? Proof is that initially, all descendant
relations are OK (if v is descendant of u then priority(v) >
priority(u)) except for case that v is the new node (note: heap is
defined in terms of children but implies for descendants by transitivity).
Now, when we do a rotation (e.g., see generic picture above), the only
new descendant relation we add is that x and T1 become descendants of
y. But since priority(x) > priority(y), and priority(T1) >
priority(x) by assumption, these are all OK. So, we maintain
invariant. Finally when new node has priority > its parent, all
descendant relations are OK and we're done.
So, insert can be done in time O(time for search) since at worst the
number of rotations = the number of steps on the way down. Can
actually show that the *expected* number of rotations per insert is
O(1).
Now, we're almost done. Inserts and searches are both O(depth of
tree). What about the depth of the tree? When we analyzed randomized qsort,
we showed that in expectation, the total number of comparisons is O(n
log n). This means that in expectation, the sum of all node depths is
O(n log n).
But you can actually show something a lot stronger: in fact, with high
probability, the maximum node depth is O(log n). [which also implies
that the qsort bound is "with high probability"]. Here is a sketch of
one way to prove it:
Proof: let's go back to our "dart-throwing" argument for qsort. Let's
line up the elements in sorted order, and pick one X that we care
about. We can think about the depth of this node like this: we throw
a dart at random into this sorted list, representing whatever element
is at the root of the tree, and wherever it lands, it
cuts the list and the part that X isn't on disappears. We do it
again, representing the next node on the path to X, and keep going
until only X is left. Now, if you think about it, whether the dart
lands to the left or right of X, it has a 50/50 chance of deleting
half of that side of the list. This can happen at most lg(n) times
to the left of X, and at most lg(n) times to the right.
So, we can think of it like this: each dart/pivot is like a coin toss
with a 50% chance of heads. We have to stop sometime before getting
2*lg(n) heads. There's a nice bound called "Hoeffding's inequality"
that says that if you flip a coin t times, the chance of getting < t/4
heads is at most e^{-t/8}. So, if we flip 8*lg(n) times, the chance
of getting fewer than 2*lg(n) heads is at most e^{-lg(n)} = e^{-ln(n)/ln(2)}
= 1/n^{1.44}. Even if you multiply this by n to account for the fact
that we want this to be true for *every* node X, you still get that
with high probability, each element has depth < 8*lg(n).
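A quick empirical check of this claim (the simulation is mine, not from the notes): random priorities are equivalent to inserting the keys into a plain BST in a uniformly random order, so we can measure max depth that way.

```python
# Insert a random permutation into an ordinary BST (equivalent to a treap
# with random priorities) and compare the max depth to the 8*lg(n) bound.
import math, random

class Node:
    __slots__ = ('key', 'left', 'right')
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def random_bst_max_depth(n):
    keys = list(range(n))
    random.shuffle(keys)          # random insertion order ~ random priorities
    root, deepest = Node(keys[0]), 0
    for k in keys[1:]:
        node, d = root, 0
        while True:               # walk down, tracking depth of the new leaf
            d += 1
            if k < node.key:
                if node.left is None:
                    node.left = Node(k); break
                node = node.left
            else:
                if node.right is None:
                    node.right = Node(k); break
                node = node.right
        deepest = max(deepest, d)
    return deepest

random.seed(451)
n = 4096
print(random_bst_max_depth(n), "vs bound", 8 * math.log2(n))
```

In practice the observed max depth is far below the 8*lg(n) bound; the constant 8 comes from the coin-flipping argument, not from a tight analysis.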