Binary Search Trees
15-451 February 10, 2004
Danny Sleator
---------------------------------------
Summary:
The Dictionary ADT problem
One approach: Binary search trees
Problems with the approach (unbalance)
Ways of dealing with this:
- Balanced Search Trees (AVL)
- Randomized Search Trees (treaps)
- Splay Trees (next time)
Things trees can do that a hash table can't
- successor, predecessor
- Maintaining and searching by rank
- splitting and joining
- prefix matching
---------------------------------------
Dictionary ADT (abstract data type): Represent a set of keys, each with
an associated piece of data, allowing three kinds of operations:
find(k) Return the data associated with key k, or null if k is not in the set.
insert(k,d) Insert a new key k (not already in the set) into
the set with data d.
remove(k) Delete the key k from the set (assuming it's in the set).
Consider a sequence of M operations, where the maximum size of the
set is N. We'd like to study ways of solving this problem, and
analyze their running time as a function of M and N.
Consider solving this problem with a linked list, and with a sorted
array.
linked list Sorted array
--------------------------------------------------------------------
find() O(N) O(log N)
insert() O(1) O(N)
remove() O(1) O(N)
The O(log N) bound is obtained using binary search. Notice that
insert() is fast for lists but slow for an array, while find() is fast
for an array and slow for a list. Our goal is a data structure that
is efficient for all of these operations.
One approach: Hashing
O(M) time total (O(1) per operation)
(Later in the course)
Another approach: Binary Search Trees
O(M log N) time
(we'll see later why binary search trees are useful
despite a worse bound than hashing)
A binary search tree is a class of data structures where
(1) Each node stores a piece of data
(2) Each node has two pointers to two other binary search trees
(3) The overall structure of the pointers is a tree
(there's a root, and exactly one way to reach every node
from the root)
We assume that the keys are from an ordered universe.
This means that for any two keys A and B exactly one of the following
holds:
A < B A = B A > B
Furthermore the relation is transitive, so that if A > B and B > C
then A > C. In Java's terminology we say that this set of keys
obeys the "Comparable" interface: A.compareTo(B) < 0 iff
A < B, etc.
The tree is ordered according to the keys: At every node n all the
keys in the left subtree of n are less than n.key, and all those in
the right subtree are greater than n.key. For example, here's a tree
of 10 nodes.
5
/ \
3 6
/ \ \
1 4 9
/ \ /
0 2 8
/
7
This ordering is called SYMMETRIC ORDER or INORDER.
Algorithm for find()
Simply walk down the tree doing comparisons: at each node,
compare k to the node's key and go left or right accordingly.
(Draw pictures.)
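As a concrete sketch in Java (a hypothetical minimal Node class with int keys; in these notes each node would also carry its associated data, and the keys could be any Comparable type):

```java
// A minimal BST node; a real dictionary node would also hold its data.
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class BST {
    // Walk down from the root, branching by comparison at each node.
    static Node find(Node t, int k) {
        while (t != null) {
            if (k < t.key) t = t.left;
            else if (k > t.key) t = t.right;
            else return t;           // found it
        }
        return null;                 // ran off the bottom: k is not in the tree
    }
}
```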
Insertion algorithm
Same as a search, but put the new node in at
the place where you ran off the bottom of the tree.
for example if we insert 8.5 into the above tree
we get:
5
/ \
3 6
/ \ \
1 4 9
/ \ /
0 2 8
/ \
7 8.5
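A sketch of the insertion in Java (hypothetical class names; the recursion hangs the new node at the null link where the search falls off the tree):

```java
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class BSTInsert {
    // Insert k into the subtree rooted at t; returns the subtree's
    // (possibly new) root, so callers write t = insert(t, k).
    static Node insert(Node t, int k) {
        if (t == null) return new Node(k);     // ran off the bottom: new leaf
        if (k < t.key) t.left = insert(t.left, k);
        else if (k > t.key) t.right = insert(t.right, k);
        // k already present: nothing to do (the ADT assumes distinct keys)
        return t;
    }
}
```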
Removal (Deletion) algorithm
First find the node you want to delete, call it t. If t's
right subtree is empty, replace t by its left subtree, and stop.
Otherwise let L be the leftmost node in t's right subtree. Swap
t and L. Now replace t by t's right subtree.
(Draw pictures)
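The removal algorithm can be sketched in Java as follows (hypothetical class names; a common way to implement the "swap" is to copy L's key into t and then delete L, which works because L has no left child):

```java
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class BSTRemove {
    // Remove k from the subtree rooted at t; returns the new subtree root.
    static Node remove(Node t, int k) {
        if (t == null) return null;               // k not present
        if (k < t.key) t.left = remove(t.left, k);
        else if (k > t.key) t.right = remove(t.right, k);
        else if (t.right == null) return t.left;  // right subtree empty
        else {
            // Find the leftmost node L in t's right subtree, copy its key
            // into t (the "swap"), then delete L from the right subtree.
            // L has no left child, so that deletion just splices in
            // L's right subtree.
            Node l = t.right;
            while (l.left != null) l = l.left;
            t.key = l.key;
            t.right = remove(t.right, l.key);
        }
        return t;
    }
}
```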
THE PROBLEM OF UNBALANCE
If you keep inserting into a tree, it can become "unbalanced", so that
searches (and insertions) become very slow. O(N) in an N-node tree.
This blows our chance to obtain O(log N) running time per operation,
and defeats our purpose, which is efficiency.
Example of how this can happen: Inserting the keys into the tree
in sorted order.
Two approaches to solving this:
- balance conditions: AVL trees, Red-Black trees, AA trees, etc
- self-adjusting trees.
BALANCED TREES
The idea of balance: Maintain a structural invariant of the tree so that
the depth of the tree is bounded by O(log N). Requiring perfect balance
is neither necessary nor desirable. (Maintaining perfect balance would
require massive work to restructure the tree in the event of an
insertion. Illustrate with a picture.)
There are many forms of balanced trees: height balanced trees (AVL
trees), Red-Black trees, AA trees, weight balanced trees, etc. You
should understand the basic idea of height balanced trees.
AVL TREES
Define the height of a binary tree to be the number of nodes on the
longest path down the tree. This is the maximum number of nodes you'd
have to visit in a search of the tree.
The idea of height balance: The height of the left and right subtrees
from any node differ by at most 1.
Let h(T) be the height of a tree (or a node in a tree).
The balance condition states that at every node n
the following holds:
n
/ \
/ \
L R
| h(L) - h(R) | <= 1
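Under this height convention (the number of nodes on the longest path down, so an empty tree has height 0), the balance condition can be checked with a small sketch (hypothetical class names; a real AVL tree would store heights or balance bits rather than recompute them):

```java
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class AVLCheck {
    // Height = number of nodes on the longest path down; h(empty) = 0.
    static int height(Node t) {
        if (t == null) return 0;
        return 1 + Math.max(height(t.left), height(t.right));
    }

    // True iff |h(L) - h(R)| <= 1 holds at every node.
    static boolean isBalanced(Node t) {
        if (t == null) return true;
        return Math.abs(height(t.left) - height(t.right)) <= 1
            && isBalanced(t.left) && isBalanced(t.right);
    }
}
```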
Theorem: Consider an AVL tree of N nodes and height h. Then
      N >= Phi^h - 1
where Phi = (1+sqrt(5))/2 = 1.618... is the golden ratio.
Proof: The proof is by induction. There are two base cases, the case
of N=0 and the case of N=1. If N=0 then h=0 and the formula holds
with equality. If N=1 then h=1 and the formula holds (1 >= Phi - 1).
When N>1, let N_L be the number of nodes in the left subtree and
let N_R be the number of nodes in the right subtree
let h_L be the height of the left subtree and
let h_R be the height of the right subtree
By induction we know that:
      N_L >= Phi^h_L - 1    and    N_R >= Phi^h_R - 1
Note that N = N_L + N_R + 1. So adding these two inequalities together
gives:
      N = N_L + N_R + 1 >= Phi^h_L + Phi^h_R - 1        (1)
Now there are three cases, depending on the relationship between
h and h_L and h_R.
Case 1: h_L = h-1, h_R = h-2.
Inequality (1) becomes:
      N >= Phi^(h-1) + Phi^(h-2) - 1
        =  Phi^(h-2) (1 + Phi) - 1
But 1 + Phi = Phi^2 (the defining equation of Phi), so this becomes:
      N >= Phi^(h-2) Phi^2 - 1 = Phi^h - 1
which is what we wanted.
Case 2: h_L = h-2, h_R = h-1.
This is the same as case 1 by symmetry.
Case 3: h_L = h-1, h_R = h-1.
This makes N even bigger than in case 1, because
      Phi^(h-1) > Phi^(h-2)
This completes the proof of the theorem. QED.
Why is this theorem useful? Let's think about h as a function of N.
We know that
      N >= Phi^h - 1
so N + 1 >= Phi^h, and therefore:
      log_Phi(N+1) >= h
In other words:
      h = O(log N)
So even though these trees are not perfectly balanced, they're
still balanced enough to give O(log N).
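Plugging in numbers makes the bound concrete (a small sketch; the class name is hypothetical): even for a million keys, log_Phi(N+1) is under 29, so an AVL search never visits more than about 29 nodes.

```java
public class HeightBound {
    public static void main(String[] args) {
        double phi = (1 + Math.sqrt(5)) / 2;     // the golden ratio
        for (int n : new int[]{1000, 1000000}) {
            // h <= log_Phi(N+1), computed via change of base
            double bound = Math.log(n + 1) / Math.log(phi);
            System.out.printf("N = %d: height <= %.1f%n", n, bound);
        }
    }
}
```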
Here are some examples of the "most unbalanced" AVL trees
for a given height:
T(1) = o
T(2) = o
/
o
T(3) = o
/ \
o o
/
o
T(4) =        o
            /   \
           o     o
          / \   /
         o   o o
        /
       o
MAINTAINING HEIGHT BALANCE in AVL TREES
We keep extra information ("balance bits") in each node that tells which
of the three balance conditions hold: left higher, right higher, left
and right equal.
Whenever an update (like an insertion or deletion) occurs, an
algorithm is applied to the tree to rebalance it. This algorithm uses
the balance bits on the nodes along the path from the root to the
change, applies ROTATIONS to rebalance the tree, and updates the
balance bits. It turns out that after an insertion or deletion it is
possible to rebalance the tree in O(log N) time.
ROTATIONS
A rotation in a binary search tree is a restructuring operation that
preserves the order of the nodes but changes the depths of some of the
nodes.
Here's the general form of rotation:
z z
/ /
y x
/ \ right rotation about y-x / \
x C ======================> A y
/ \ / \
A B B C
Here are the pointer adjustments:
y.left = x.right;
x.right = y;
z.left = x;
(If y was a right child of z instead of a left child, then the last of
these would be z.right = x. If there were parent pointers, then the
parents of x, y, and the root of B would have to be adjusted also.)
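Packaged as a function in Java (a hypothetical sketch; here the rotation returns the new subtree root instead of fixing z's child pointer in place, which also handles the cases where y is a right child or the root):

```java
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class Rotate {
    // Right rotation about the edge y-x; returns x, the new subtree
    // root. The caller stores the returned node where y used to be.
    static Node rotateRight(Node y) {
        Node x = y.left;
        y.left = x.right;   // subtree B moves from x's right to y's left
        x.right = y;
        return x;
    }
}
```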
If we think of a binary tree as a representation of a fully parenthesized
expression, then a rotation is nothing more than an application of the
associative law.
+ +
/ \ / \
+ C ======================> A +
/ \ / \
A B B C
(A+B)+C A+(B+C)
Here's an example of how rotations can restore balance in an AVL tree
under an insertion.
Say we have this tree:
4
/ \
3 5
/
2
Now we insert 1 into this tree, getting:
4
/ \
3 5
/ (A)
2
/
1
Now nodes 3 and 4 are unbalanced. If we rotate right at node 3, we
get this tree:
4
/ \
2 5 (B)
/ \
1 3
Another case. Suppose the tree looked like this and we inserted 2:
4 4
/ \ / \
3 5 3 5
/ /
1 1 (C)
\
2
Now 3 and 4 are unbalanced again. Doing a rotation about 3 does this:
4
/ \
1 5
\
3
/
2
And no progress toward balance has been made. The answer in
this case is a "double rotation", which is two rotations: first
we left rotate about 1 to transform (C) into (A), then we
do what we did to fix (A), namely rotate right about 3.
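The two steps can be packaged as one operation, built from the two single rotations (a hypothetical sketch with made-up names; each rotation returns the new root of the subtree it restructures):

```java
class Node {
    int key;
    Node left, right;
    Node(int key) { this.key = key; }
}

class DoubleRotate {
    static Node rotateRight(Node y) {
        Node x = y.left;
        y.left = x.right;
        x.right = y;
        return x;
    }

    static Node rotateLeft(Node x) {
        Node y = x.right;
        x.right = y.left;
        y.left = x;
        return y;
    }

    // Fix the zig-zag case (C): first left rotate at y's left child,
    // turning (C) into the zig-zig case (A), then right rotate at y.
    static Node doubleRotateRight(Node y) {
        y.left = rotateLeft(y.left);
        return rotateRight(y);
    }
}
```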
It's possible, but rather complicated, to maintain the AVL balance
condition when it's violated after an insertion or deletion of a
node.