15-451 Algorithms 9/14/00
* Search trees
- simple binary search trees
- connection to quicksort
- B-trees
=====================================================================
Imagine we are the phone company and we have a database of people and
their phone numbers (plus addresses, billing information and so on).
We'd like to store this information in a way that makes it easy to
look someone up. One thing we could do is to store our database in a
sorted array. Then given a lookup(x) request we could use binary
search and find what we want in O(log n) time, where n is the number
of entries. This is basically a phonebook.
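For concreteness, here is a minimal Python sketch of that binary-search
lookup (the function name and array representation are just for
illustration, not from the lecture):

    def lookup(A, x):
        """Binary search the sorted array A for key x; index or None."""
        lo, hi = 0, len(A) - 1
        while lo <= hi:
            mid = (lo + hi) // 2           # probe the middle element
            if A[mid] == x:
                return mid
            elif A[mid] < x:
                lo = mid + 1               # x can only be to the right
            else:
                hi = mid - 1               # x can only be to the left
        return None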
The problem with using a sorted array is that inserting a new element
takes Omega(n) time because we have to shift everything to the
right in order to make room. Deletes also take Omega(n) time. If
we expect to have to do these often, then a better approach is to use
a search tree.
A binary search tree is a binary tree in which each node has a KEY
(in our example, a key is a person's name), such that all
descendants to the left have smaller (or equal) keys and all
descendants to the right have larger (or equal) keys. To do a lookup
operation in a binary search tree you simply walk down from the root,
going left or right or stopping depending on whether the query is
smaller, larger, or equal to the key in the node. We will also talk
about general (not necessarily binary) search trees, which can have
more than one key per node and more than two children per node.
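A minimal Python sketch of a node and the lookup walk (the class and
field names are mine, purely for illustration):

    class Node:
        def __init__(self, key):
            self.key = key       # e.g., a person's name
            self.left = None     # descendants with smaller (or equal) keys
            self.right = None    # descendants with larger (or equal) keys

    def lookup(root, x):
        """Walk down from the root, branching on comparisons with x."""
        while root is not None:
            if x == root.key:
                return root                  # stop: found it
            root = root.left if x < root.key else root.right
        return None                          # fell off the tree: not present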
SIMPLE BINARY SEARCH TREES
--------------------------
The simplest way to maintain a binary search tree is to implement the
insert operation as follows.
insert(x): If the tree is empty then we put x in the root.
Otherwise, we compare it to the root: if x is smaller then we
recursively insert on the left; otherwise we recursively insert on the
right.
E.g., short andrew ids of students in this class (alphabetically by
last name):
{joa, hhd, ebd, dd2, mcd, etf, ljf, rfu, mdh}
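A sketch of that insert rule, reusing the Node class from the sketch
above, together with the example sequence (illustrative code, not from
the notes):

    def insert(root, x):
        """Insert x into the (sub)tree at root; returns the subtree root."""
        if root is None:
            return Node(x)                   # empty tree: x becomes the root
        if x < root.key:
            root.left = insert(root.left, x)
        else:                                # equal keys go right
            root.right = insert(root.right, x)
        return root

    root = None
    for key in ["joa", "hhd", "ebd", "dd2", "mcd", "etf", "ljf", "rfu", "mdh"]:
        root = insert(root, key)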
plusses: easy to do (deletes are painful but we won't talk about them)
minuses: bad worst-case behavior. worst-case time to insert n
elements? Theta(n^2). Worst-case lookup: Theta(n).
plusses: has good average-case behavior (i.e., if inputs are in a
random order). Claim: if the inputs arrive in a random order, then the
expected time
to do n inserts is O(n log n). Why? It's not obvious from first
principles, but I claim we have already proved this fact. The reason
is that we are doing EXACTLY THE SAME SET OF COMPARISONS as we would
with quicksort using the leftmost element as the pivot.
This also means average depth is O(log n). So, lookups are fast
on average.
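One way to convince yourself: count the comparisons directly and check
that the two totals agree. A quick experiment (not a proof; the helper
names are mine) might look like:

    import random

    def bst_comparisons(keys):
        """# of comparisons made inserting keys, in order, into a plain BST."""
        root, count = None, 0
        for x in keys:
            if root is None:
                root = {'key': x, 'l': None, 'r': None}
                continue
            node = root
            while True:
                count += 1                       # x vs. this node's key
                side = 'l' if x < node['key'] else 'r'
                if node[side] is None:
                    node[side] = {'key': x, 'l': None, 'r': None}
                    break
                node = node[side]
        return count

    def qsort_comparisons(keys):
        """# of comparisons for quicksort with the leftmost element as pivot."""
        if len(keys) <= 1:
            return 0
        pivot = keys[0]
        left = [x for x in keys[1:] if x < pivot]    # input order preserved
        right = [x for x in keys[1:] if x > pivot]
        return len(keys) - 1 + qsort_comparisons(left) + qsort_comparisons(right)

    keys = random.sample(range(10**6), 1000)         # distinct keys, random order
    assert bst_comparisons(keys) == qsort_comparisons(keys)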
Can we fix this problematic worst-case behavior? Answer: yes. Here
is a particularly nice method used in database applications called
B-trees.
B-trees and 2-3-4 trees
------------------------
A B-tree is a search tree where, for some pre-specified t>=2,
* each node has between t-1 and 2t-1 keys in it (special case: root has
between 1 and 2t-1 keys in it), in a sorted array.
* each non-leaf has degree = (# keys)+1. So, in range [t, 2t], except
root in range [2,2t]. E.g., if array is A[1]...A[10], then there
is one child for things less than A[1], one child for things
between A[1] and A[2], ..., one child for things bigger than
A[10].
* Flexibility in the # of keys per node lets us keep the tree perfectly
  height-balanced.
(Case of t=2 is called a 2-3-4 tree since degrees are 2, 3, or 4.)
E.g., here is a tree for t=3. (3-6 tree)
[H M R]
/ | | \
[A B C D] [K L] [N O] [T Y Z]
Idea is that by using flexibility in sizes and degrees of nodes, we'll
be able to keep trees perfectly balanced.
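A sketch of the node structure, plus a check of the invariants above
(class and field names are mine):

    class BNode:
        def __init__(self):
            self.keys = []      # sorted array: t-1..2t-1 keys (root: 1..2t-1)
            self.children = []  # empty at a leaf, else exactly len(keys)+1

    def check(node, t, is_root=True):
        """Verify the B-tree invariants stated above."""
        low = 1 if is_root else t - 1
        assert low <= len(node.keys) <= 2*t - 1
        assert node.keys == sorted(node.keys)
        if node.children:       # non-leaf: degree = (# keys) + 1
            assert len(node.children) == len(node.keys) + 1
            for child in node.children:
                check(child, t, is_root=False)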
Rules for search and insert turn out to be pretty easy:
- Search: just do binary search in array at root, then go to
appropriate child and recurse.
- Insert:
1) act like you're doing a search, but if you ever encounter a FULL
node (FULL means it has the maximum 2t-1 keys in it), pull the median
element up to the parent and split the node. This takes O(t) time
since we need to insert into the parent and move half of the keys into
a new array.
2) then insert into the leaf.
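In Python, reusing the BNode sketch from above, the two operations
might look like this (a sketch assuming distinct keys; bisect does the
binary search within a node):

    import bisect

    def search(node, x):
        """Return (node, index) holding x, or None if x is absent."""
        i = bisect.bisect_left(node.keys, x)  # binary search in node's array
        if i < len(node.keys) and node.keys[i] == x:
            return (node, i)
        if not node.children:                 # reached a leaf: x not present
            return None
        return search(node.children[i], x)

    def split_child(parent, i, t):
        """Split parent's FULL child i, pulling its median up into parent."""
        full = parent.children[i]
        median = full.keys[t - 1]             # median of the 2t-1 keys
        right = BNode()
        right.keys, right.children = full.keys[t:], full.children[t:]
        full.keys, full.children = full.keys[:t - 1], full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def insert(root, x, t):
        """Insert x, splitting full nodes on the way down; returns the root."""
        if len(root.keys) == 2*t - 1:         # full root: grow the tree "up"
            new_root = BNode()
            new_root.children = [root]
            split_child(new_root, 0, t)
            root = new_root
        node = root
        while node.children:                  # descend toward a leaf
            i = bisect.bisect_right(node.keys, x)
            if len(node.children[i].keys) == 2*t - 1:
                split_child(node, i, t)
                i = bisect.bisect_right(node.keys, x)  # median moved up; re-aim
            node = node.children[i]
        bisect.insort(node.keys, x)           # finally insert into the leaf
        return root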
Example:
insert "E" above, then insert "F":
- when doing part (1) for F we will pull the "C" up to root
and split the node. End result:
[ C H M R ]
/ | | \ \
[A B] [D E F] [K L] [N O] [T Y Z]
Question: Do we maintain the requirement of at least t-1 keys per
non-root node? Yes: the full node had 2t-1 keys, the median moves up
to the parent, and each of the two halves gets exactly t-1 keys.
Question: Can the parent get *over-full*? Answer: no, since if it was
full we would have already split it on the way down.
insert "S", "U", "V":
[ C H M R U ]
/ | | | \ \
[A B] [D E F] [K L] [N O] [S T] [V Y Z]
now, insert "P". Doing this will bring M up to new root.
[ M ]
/ \
[ C H ] [ R U ]
/ | | | | \
[A B] [D E F] [K L] [N O P] [S T] [V Y Z]
Question: is the tree always height-balanced? (All leaves at the same
depth.) Answer: yes, since we only grow the tree "up".
So, we have maintained the properties. What about running time? Say
we do linear search at each node to find our position:
Time for search is O(depth * t)
Say we use binary search:
Time for search is O(depth * log(t))
What is maximum depth? O(log_t(n)).
E.g., if t = 1000, n = 1 billion, then depth is at most 3 (or 2 if you
count starting at 0)
What does this say about search time if we use binary search?
Time for search is: O(log_t(n) * log(t)) = O(log n)
(since log_t(n) = log(n)/log(t), the factors of log(t) cancel).
Inserts: same as search but we also need to split nodes which takes
O(t) time each. In the worst case, we need to do this every step of
the way down.
Worst-case time for insert is O(depth * t)
This is fine if t is small. Not so good if t is large. *BUT* notice
the following: if we create the tree via n inserts, I claim we can have
made at most O(n/t) splits total in the process. Why? -- Because each
split creates a new node, and there are only O(n/t) nodes total (every
non-root node holds at least t-1 keys). So the *total* time spent on
splits over all n inserts is O(n/t) * O(t) = O(n), which means we are
only spending O(1) time on splits, on average, per insert.
This is a very different kind of "on average" than we've seen before.
Not assuming input is random. Not flipping coins in our algorithm.
What we're saying is that for any sequence of n inserts, even though
some might be more expensive than others, the total time spent on
splits is O(n) and so our *amortized time per insert* spent on splits
is just O(1).
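To see this bound concretely, one can count nodes after n inserts (each
split creates exactly one new node, so the number of splits is bounded
by the node count). A quick check, reusing BNode and insert from the
sketch above:

    import random

    def count_nodes(node):
        return 1 + sum(count_nodes(c) for c in node.children)

    t, n = 50, 100_000
    root = BNode()
    for x in random.sample(range(10**9), n):
        root = insert(root, x, t)
    # Every non-root node holds >= t-1 keys, so #nodes <= n/(t-1) + O(depth):
    print(count_nodes(root), "nodes;  n/(t-1) =", n // (t - 1))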
Actually, the most costly thing in the insert is putting
the element into the leaf, which takes time O(t).
So, the amortized worst-case time per insert is:
O(log n) + O(1) + O(t) = O(log n + t)
[which is better than the O(depth * t) worst-case time]
(We'll talk more about using amortization to analyze algorithms next week)
More facts about B-trees:
* used a lot in databases in practice because they fit in nicely with
  the memory hierarchy. Basically, define t so that a node fits into a
  page. Keep the root in fast memory (and maybe even its children), so
  in the above example each insert or lookup would use only 1 or 2 disk
  accesses.
* if we use t=2, we have a 2-3-4 tree. What's special about 2-3-4
  trees is that they can be implemented efficiently as binary trees
  using "red-black trees".