Binary Search Trees
15-451 February 10, 2004
Danny Sleator
---------------------------------------------------------------------
Random Search Trees
Various schemes have been proposed that use randomization to "balance"
search trees. They are based on the fact that a "random" tree is
balanced, in the sense that its expected depth is O(log n).
Two approaches:
Skip Lists (Pugh)
Treaps (Aragon and Seidel)
We'll present treaps.
Say we're storing a set of keys 1...n in a binary tree. The keys are
stored in symmetric order by their value. So the left subtree of a key
k contains only keys less than k, and the right subtree contains only
keys greater than k.
Say that each key also has a unique priority p(k), drawn from a
totally ordered set. Suppose we assert that the tree is not only in
symmetric order by key, but also heap ordered by priority, with the
highest priority at the root.
Lemma: for any assignment of the priorities, there's a unique tree
satisfying the symmetric and heap order constraints.
Proof: You can build the required tree simply by inserting the elements
in decreasing order of priority into an initially empty tree. We can
show it's unique by induction. Say you have two different trees
supposedly satisfying these constraints. If they have different nodes
at their roots, then the tree whose root has the lower priority must
contain a heap-order violation (the higher-priority node lies below
its root). If the roots are the same, then the two left subtrees
must have the same sets of nodes in them, and similarly for the right
subtrees, and they therefore must be identical by induction. QED.
We will assume that we can generate a random priority without
collisions.
Here are the algorithms for the standard operations using Treaps:
Search: Normal binary search
Insertion: Search down from the root till we come to the null pointer
where the new node ought to go. Insert the new node there
with a random priority. Now rotate that node up the tree
until heap order is restored.
Deletion: Say x is the node we want to delete. Rotate x down the
tree till it has no children. Then simply remove x. At
each step, rotate x with its higher-priority child.
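The three operations can be sketched in Python. This is a hedged sketch of the scheme described above, not code from the lecture; the names (`Node`, `insert`, `delete`, `search`) are mine.

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.prio = random.random()   # random priority; ties are vanishingly unlikely
        self.left = None
        self.right = None

def rotate_right(y):
    x = y.left                        # x moves up, y moves down to the right
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    y = x.right                       # y moves up, x moves down to the left
    x.right, y.left = y.left, x
    return y

def insert(root, key):
    # Search down to the null pointer where key belongs, put a new node
    # there, then rotate it up until heap order is restored.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.prio > root.prio:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key)
        if root.right.prio > root.prio:
            root = rotate_left(root)
    return root

def delete(root, key):
    # Find key, then rotate it down, each time with its higher-priority
    # child, until it has at most one child; then splice it out.
    if root is None:
        return None
    if key < root.key:
        root.left = delete(root.left, key)
    elif key > root.key:
        root.right = delete(root.right, key)
    elif root.left is None:
        root = root.right
    elif root.right is None:
        root = root.left
    elif root.left.prio > root.right.prio:
        root = rotate_right(root)
        root.right = delete(root.right, key)
    else:
        root = rotate_left(root)
        root.left = delete(root.left, key)
    return root

def search(root, key):
    # Normal binary search; priorities play no role here.
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

def inorder(root):
    # Symmetric-order traversal; always yields the keys in sorted order.
    return [] if root is None else inorder(root.left) + [root.key] + inorder(root.right)
```

Note that the recursive insert rotates the new node up "on the way back out" of the recursion, which is the same as the rotate-up loop described above.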
Given a tree, we'll refer to the _ancestor relation_ among the nodes in
the tree. A node a is an ancestor of a node b if the path from b to the
root includes a. A tree is heap ordered if and only if for every pair
(a,b) where a is an ancestor of b, we have that p(a) >= p(b).
Insertion works because as we rotate x (the inserted node) up the tree
heap order (among nodes other than x) is preserved. To see this,
consider moving x up the tree by doing a left rotation in the following
picture:
      b
     / \
    A   x
       / \
      C   D
becomes:
      x
     / \
    b   D
   / \
  A   C
The ancestor relation (ignoring node x) is preserved with the exception
that node b was an ancestor of all nodes in tree D before, but not
after. Therefore if the nodes (excluding x) were heap ordered before
the rotation then they are heap ordered after the rotation.
Insertion terminates when x has a priority less than its parent. At
this point heap order has been restored to all nodes including x.
The above picture also helps in understanding deletion. In this case x
moves down the tree (the bottom picture is converted by a right
rotation into the top picture). Again the ancestor relation (ignoring
x) is preserved, except that b is now an ancestor of all the nodes in
subtree D. We chose to rotate b up because its priority is greater
than that of the root of D. Thus heap order is preserved.
(Do some examples)
Running-time Analysis.
Theorem 1: The expected number of nodes on the path from the root to
node m in a treap of n nodes (including m itself) is:
H[m] + H[n-m+1] - 1
(Where H[k] = 1/1 + 1/2 + 1/3 + ... + 1/k, the kth harmonic number.)
To analyze insertions and deletions we'll prove this:
Theorem 2: The expected number of rotations in a deletion (or insertion)
is less than 2.
Note that H[m] <= 1+ln(m)
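Since only the priority order matters (see the proof below), Theorem 1 can be checked exhaustively for small n by averaging the path length over all n! insertion orders. This is a verification sketch of mine, not part of the notes; `build`, `path_len`, and `expected_path_len` are made-up names.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def H(k):
    # k-th harmonic number, computed exactly
    return sum(Fraction(1, i) for i in range(1, k + 1))

def build(perm):
    # Plain BST insertion in the order of perm. Since the permutation is
    # the keys sorted by decreasing priority, this builds exactly the treap.
    left, right, root = {}, {}, perm[0]
    for k in perm[1:]:
        cur = root
        while True:
            if k < cur:
                if cur in left:
                    cur = left[cur]
                else:
                    left[cur] = k
                    break
            else:
                if cur in right:
                    cur = right[cur]
                else:
                    right[cur] = k
                    break
    return left, right, root

def path_len(perm, m):
    # Number of nodes on the path from the root to m, inclusive.
    left, right, cur = build(perm)
    n = 1
    while cur != m:
        cur = left[cur] if m < cur else right[cur]
        n += 1
    return n

def expected_path_len(n, m):
    # Exact average over all n! equally likely priority orders.
    total = sum(path_len(p, m) for p in permutations(range(1, n + 1)))
    return Fraction(total, factorial(n))
```

Using exact rationals (`Fraction`) makes the comparison with H[m] + H[n-m+1] - 1 an equality test rather than a floating-point approximation.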
Before we prove these theorems we'll need the following lemma:
Ancestor Lemma: The ancestors of m whose keys are <= m are the same in
the whole treap as in the treap built from only the keys <= m (with
the same priorities); in particular, they depend only on the
priorities of the keys <= m. The symmetric statement holds for the
keys >= m. For example, with m = 6:
        4                 4                9
       / \               / \              / \
      3   5             3   5            7   10
     /     \           /     \          / \
    1       9         1       6        6   8
     \     / \         \
      2   7   10        2
         / \
        6   8        keys <= 6        keys >= 6
Proof of Theorem 1:
The actual priority values are not important, only the order. So we can
replace the notion of priority with a permutation of all of the keys.
The permutation is simply the keys sorted by decreasing order of
priority.
So if we have such a permutation, the tree generated is simply the one
obtained by inserting the keys in the order they appear in the
permutation.
For a given key m, we want to determine the expected depth of m in the
tree, averaged over all permutations of the keys.
The ancestors of key m consist of keys <= m and keys >= m. The
ancestor lemma shows that the set of ancestors of m that are <= m
depends only on the priorities of the keys <= m, and not on the
priorities of the other keys. (In terms of the permutation, this
means that the ancestors of m that are <= m are determined by the
relative order of the keys <= m within the permutation.) So we can
count the expected number of ancestors that are <= m separately from
the number that are >= m. (Then subtract 1 because m was counted
twice.)
So consider a treap built with keys 1...m. m is the rightmost node in
this tree, and the number of ancestors of m is precisely the number of
nodes on the rightmost path of this tree.
How does a node get onto the rightmost path of this tree? This
happens for a key k precisely if at the moment it's inserted, it is
greater than all the keys inserted so far. (If it's less than some
key already there, it will be shunted off on a left branch.)
In terms of the permutation of keys (order of insertion) such a key k
is called a "left-right maximum". So our problem becomes precisely
the problem of computing the expected number of left-right maxima in a
permutation of 1...m.
Given a random permutation, we can write a check mark next to each
element that is a left-right maximum. Let X[m] be the expected number
of check marks in a permutation of 1...m. We'll write a recurrence for
X[]. We know that X[1] = 1.
We can write a recurrence for X[] as follows. Generate a random
permutation of 1...m by first generating a random permutation of 2...m
then inserting 1 randomly in it. After generating the permutation of
2...m we label the keys with check marks. Then we insert 1. This
operation does not change any of the already existing check marks
because 1 is less than all the other elements. The number of check
marks before inserting 1 is just X[m-1]. By linearity of expectation,
X[m] is just X[m-1] plus the expected number of check marks on 1. This
is just the probability that 1 is marked. This probability is just 1/m
because 1 is only marked if it's the first element of the permutation.
This gives:
X[1] = 1.
X[m] = X[m-1] + 1/m.
The solution of this recurrence is just X[m] = H[m].
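The identity X[m] = H[m] can be checked by brute force over all permutations of 1...m. This is an illustrative sketch of mine; `lr_maxima` and `X` are made-up names.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def H(k):
    # k-th harmonic number, computed exactly
    return sum(Fraction(1, i) for i in range(1, k + 1))

def lr_maxima(perm):
    # Count the left-right maxima: elements greater than everything
    # before them. These are exactly the check-marked elements.
    count, best = 0, 0
    for x in perm:
        if x > best:
            count += 1
            best = x
    return count

def X(m):
    # Exact expected number of left-right maxima over all m! permutations.
    total = sum(lr_maxima(p) for p in permutations(range(1, m + 1)))
    return Fraction(total, factorial(m))
```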
What happens for the keys >=m? The analysis is exactly the same, but
the number of keys in that tree is n-m+1 instead of m.
So the expected number of ancestors of m, counting m just once is:
X[m] + X[n-m+1] - 1
Which is
H[m] + H[n-m+1] - 1.
QED.
----------------------- OPTIONAL MATERIAL --------------------------
Now we can prove Theorem 2. Say we're deleting m. Consider the sum of
the number of nodes on the right path from the left child of m and the
number of nodes on the left path from the right child of m. This
quantity decreases by exactly one on each rotation. (Draw pictures).
So our goal is to evaluate the expected value of this sum, which is the
sum of the expected values of each part.
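This rotation count can be checked by brute force for small n: build the treap for each insertion order, add the lengths of the two paths just described, and average. Combining the two sides of the analysis below, the exact expectation is (m-1)/m + (n-m)/(n-m+1), which is always less than 2. This is a verification sketch of mine; `build` and `rotations_to_delete` are made-up names.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def build(perm):
    # Plain BST insertion in the order of perm (keys sorted by
    # decreasing priority), which builds exactly the treap.
    left, right, root = {}, {}, perm[0]
    for k in perm[1:]:
        cur = root
        while True:
            if k < cur:
                if cur in left:
                    cur = left[cur]
                else:
                    left[cur] = k
                    break
            else:
                if cur in right:
                    cur = right[cur]
                else:
                    right[cur] = k
                    break
    return left, right, root

def rotations_to_delete(perm, m):
    # Number of rotations to delete m = (nodes on the right path from
    # m's left child) + (nodes on the left path from m's right child).
    left, right, _ = build(perm)
    count = 0
    cur = left.get(m)             # walk the right path of the left subtree
    while cur is not None:
        count += 1
        cur = right.get(cur)
    cur = right.get(m)            # walk the left path of the right subtree
    while cur is not None:
        count += 1
        cur = left.get(cur)
    return count
```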
Let the two subtrees of m be A and B. Consider the treap that would be
built by using only elements 1...m. Let A' be the left subtree of m in
this treap. By the ancestor lemma, the set of nodes in A is the same as
the set of nodes in A'. In fact A and A' are equal, because the same
nodes are being inserted in the same order.
Therefore we can do just what we did in the proof of theorem 1. We
simply restrict ourselves to what happens to the stuff in 1...m,
compute the expected length of the path, and add it to the expectation
for the other side.
So the question is, which nodes end up being on the right path of A?
For a node k to end up there, it has to be inserted after m. And it has
to be greater than any node inserted so far (except m). Here's the
picture.
    x
     \
      x
       \
        x
         \
          m
         /
        x
         \
          x
           \
            k
If k were less than any of the xs, it would not end up there. If k is
greater than all of the xs, it does end up there. And once on the path
of interest, it stays there.
So suppose we take a random permutation of 1...m, and we check any item k
that occurs after m but is greater than any element to its left (except
m). What is the expected number of checks?
Let Y[m] be this quantity. Y[1] = 0. One way of generating these marks
is to first generate a random permutation of 2...m, install the marks,
then insert 1 in a random position. The insertion of 1 does not change
any of the marks installed before. The expected number of marks on 1 is
the probability that 1 is marked. This only happens if the permutation
starts out [m, 1, ...]. The probability of this happening is
1/(m(m-1)).
Thus we get the following recurrence.
Y[1] = 0
Y[m] = Y[m-1] + 1/(m(m-1))
The solution of this is just Y[m] = (m-1)/m, since 1/(m(m-1)) =
1/(m-1) - 1/m and the sum telescopes.
QED.
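The recurrence and its closed form Y[m] = (m-1)/m can be checked by brute force over all permutations of 1...m. This is a sketch of mine; `marks` and `Y` are made-up names.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def marks(perm, m):
    # Count elements that occur after m and are greater than every
    # element to their left other than m.
    seen_m, best, count = False, 0, 0
    for x in perm:
        if x == m:
            seen_m = True
        else:
            if seen_m and x > best:
                count += 1
            if x > best:
                best = x          # running max, excluding m itself
    return count

def Y(m):
    # Exact expected number of marks over all m! permutations.
    total = sum(marks(p, m) for p in permutations(range(1, m + 1)))
    return Fraction(total, factorial(m))
```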