Binary Search Trees
15-451  February 10, 2004
Danny Sleator
---------------------------------------------------------------------

Random Search Trees

Various schemes have been proposed for using randomization to
"balance" search trees.  These schemes are based on the fact that a
"random" tree is balanced.  Two approaches:

    Skip Lists (Pugh)
    Treaps (Aragon and Seidel)

We'll present treaps.

Say we're storing a set of keys 1...n in a binary tree.  The keys are
stored in symmetric order by their value.  So the left subtree of a
key k contains only keys less than k, and the right subtree contains
only keys greater than k.

Say that each key also has a unique priority p(k).  (Drawn from a
totally ordered set.)  Suppose we assert that the tree is not only
symmetrically ordered by key, but also heap ordered by priority, with
the highest priority at the root.

Lemma: For any assignment of the priorities, there's a unique tree
satisfying the symmetric and heap order constraints.

Proof: You can build the required tree simply by inserting the
elements in decreasing order of priority into an initially empty
tree.

We can show it's unique by induction.  Say you have two different
trees supposedly satisfying these constraints.  If they have
different nodes at the roots, then the tree whose root has the lower
priority must have a violation.  If the roots are the same, then the
two left subtrees must have the same sets of nodes in them, and
similarly for the right subtrees, and they therefore must be
identical by induction.  QED.

We will assume that we can generate a random priority without
collisions.  Here are the algorithms for the standard operations
using treaps:

Search: Normal binary search.

Insertion: Search down from the root till we come to the null pointer
where the new node ought to go.  Insert the new node there with a
random priority.  Now rotate that node up the tree until heap order
is restored.

Deletion: Say x is the node we want to delete.  Rotate x down the
tree till it has no children.
Then simply remove x.  At each step, rotate x with its child of
higher priority.

Given a tree, we'll refer to the _ancestor relation_ among the nodes
in the tree.  A node a is an ancestor of a node b if the path from b
to the root includes a.  A tree is heap ordered if and only if for
every pair (a,b) where a is an ancestor of b, we have that
p(a) >= p(b).

Insertion works because as we rotate x (the inserted node) up the
tree, heap order (among nodes other than x) is preserved.  To see
this, consider moving x up the tree by doing a left rotation in the
following picture:

          b
         / \
        A   x
           / \
          C   D

becomes:

          x
         / \
        b   D
       / \
      A   C

The ancestor relation (ignoring node x) is preserved, with the
exception that node b was an ancestor of all nodes in subtree D
before, but not after.  Therefore if the nodes (excluding x) were
heap ordered before the rotation, then they are heap ordered after
the rotation.  Insertion terminates when x has a priority less than
that of its parent.  At this point heap order has been restored to
all nodes including x.

The above picture also helps in understanding deletion.  In this case
x moves down the tree (the bottom picture is converted by a right
rotation into the top picture).  Again the ancestor relation
(ignoring x) is preserved, except that b is now an ancestor of all
the nodes in subtree D.  We chose to rotate b up because its priority
is greater than that of the root of D.  Thus heap order is preserved.

(Do some examples.)

Running-time Analysis

Theorem 1: The expected length of the path to a node m in a treap of
n nodes is:

    H[m] + H[n-m+1] - 1

(Where H[k] = 1/1 + 1/2 + 1/3 + ... + 1/k, the kth harmonic number.)

To analyze insertions and deletions we'll prove this:

Theorem 2: The expected number of rotations in a deletion (or
insertion) is less than 2.

Note that H[m] <= 1 + ln(m).

Before we prove these theorems we'll need the following lemma:

Ancestor Lemma: Let T be a treap on the keys 1...n, and let T' be the
treap built from just the keys <= m, using the same priorities.  Then
the ancestors of m that are <= m are the same in T and T'.  (The
symmetric statement holds for the keys >= m.)
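The insertion algorithm described above, and a check of the lemma, can
be sketched in Python.  (A minimal sketch: the names `Node`, `insert`,
`ancestors`, and `build` are my own choices; the notes don't fix an
implementation.)

```python
import random

class Node:
    """One treap node: BST-ordered by key, heap-ordered by priority."""
    def __init__(self, key, pri):
        self.key, self.pri = key, pri
        self.left = self.right = None

def rotate_right(y):
    # y's left child x moves up; y becomes x's right child.
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    # x's right child y moves up; x becomes y's left child.
    y = x.right
    x.right, y.left = y.left, x
    return y

def insert(root, key, pri):
    # Ordinary BST insertion, then rotate the new node up until
    # heap order (parent priority >= child priority) is restored.
    if root is None:
        return Node(key, pri)
    if key < root.key:
        root.left = insert(root.left, key, pri)
        if root.left.pri > root.pri:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key, pri)
        if root.right.pri > root.pri:
            root = rotate_left(root)
    return root

def ancestors(root, key):
    # Keys on the search path from the root down to `key` (excluding key).
    out = []
    while root.key != key:
        out.append(root.key)
        root = root.left if key < root.key else root.right
    return out

def build(keys, pri):
    t = None
    for k in keys:
        t = insert(t, k, pri[k])
    return t

# Check the lemma on a random instance: the ancestors of m that are
# <= m agree between the full treap and the treap built from just the
# keys <= m (same priorities).
n, m = 50, 20
random.seed(1)
pri = {k: random.random() for k in range(1, n + 1)}
full = build(range(1, n + 1), pri)
small = build(range(1, m + 1), pri)
low_anc = {a for a in ancestors(full, m) if a <= m}
assert low_anc == set(ancestors(small, m))
```

The equality holds for any priority assignment, not just this seed:
whether a key a <= m is an ancestor of m depends only on the relative
priorities of the keys between a and m, all of which are <= m.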
        4                  4              9
       / \                / \            / \
      3   5              3   5          7   10
     /     \            /     \        / \
    1       9          1       6      6   8
     \     / \          \
      2   7   10         2
         / \
        6   8
                      keys <= 6      keys >= 6

Proof of Theorem 1: The actual priority values are not important,
only their order.  So we can replace the notion of priority with a
permutation of all of the keys.  The permutation is simply the keys
sorted by decreasing order of priority.  So if we have such a
permutation, the tree generated is simply the one obtained by
inserting the keys in the order they appear in the permutation.

For a given key m, we want to determine the expected depth of m in
the tree, averaged over all permutations of the keys.  The ancestors
of key m consist of keys <= m and keys >= m.  The ancestor lemma
shows that the set of ancestors of m that are <= m depends only on
the priorities of the keys <= m, and not on the priorities of the
other keys.  (In terms of the permutation, this means that the
ancestors of m that are <= m are determined by the subsequence of the
permutation consisting of the keys <= m.)  So we can count the
expected number of ancestors of m among the keys <= m, then do the
same for the keys >= m.  (Then subtract 1 because m was counted
twice.)

So consider a treap built with keys 1...m.  m is the rightmost node
in this tree, and the number of ancestors of m is precisely the
number of nodes on the rightmost path of this tree.  How does a node
get onto the rightmost path of this tree?  This happens for a key k
precisely if at the moment it's inserted, it is greater than all the
keys inserted so far.  (If it's less than some key already there, it
will be shunted off on a left branch.)  In terms of the permutation
of keys (the order of insertion) such a key k is called a
"left-right maximum".

So our problem becomes precisely the problem of computing the
expected number of left-right maxima in a permutation of 1...m.
Given a random permutation, we can write a check mark next to each
element that has this property.  Let X[m] be the expected number of
check marks in a permutation of 1...m.  We know that X[1] = 1.  We
can write a recurrence for X[] as follows.
Generate a random permutation of 1...m by first generating a random
permutation of 2...m, then inserting 1 randomly into it.  After
generating the permutation of 2...m we label the keys with check
marks.  Then we insert 1.  This operation does not change any of the
already existing check marks, because 1 is less than all the other
elements.  The number of check marks before inserting 1 is just
X[m-1].  By linearity of expectation, X[m] is just X[m-1] plus the
expected number of check marks on 1.  This is just the probability
that 1 is marked.  This probability is 1/m, because 1 is marked only
if it's the first element of the permutation.  This gives:

    X[1] = 1
    X[m] = X[m-1] + 1/m

The solution of this recurrence is just X[m] = H[m].

What happens for the keys >= m?  The analysis is exactly the same,
but the number of keys in that tree is n-m+1 instead of m.  So the
expected number of ancestors of m, counting m just once, is:

    X[m] + X[n-m+1] - 1

which is H[m] + H[n-m+1] - 1.  QED.

----------------------- OPTIONAL MATERIAL --------------------------

Now we can prove Theorem 2.  Say we're deleting m.  Consider the sum
of the number of nodes on the right path from the left child of m and
the number of nodes on the left path from the right child of m.  This
quantity decreases by exactly one on each rotation.  (Draw pictures.)
So our goal is to evaluate the expected value of this sum, which is
the sum of the expected values of each part.

Let the two subtrees of m be A and B.  Consider the treap that would
be built by using only the elements 1...m.  Let A' be the left
subtree of m in this treap.  By the ancestor lemma, the set of nodes
in A is the same as the set of nodes in A'.  In fact A and A' are
equal, because the same nodes are being inserted in the same order.

Therefore we can do just what we did in the proof of Theorem 1.  We
simply restrict ourselves to what happens to the stuff in 1...m,
compute the expected length of the path, and add it to the
expectation for the other side.
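The counting step from the proof of Theorem 1 can be checked directly:
the average number of left-right maxima over all permutations of 1...m
equals H[m], as does the solution of the recurrence.  (A small sketch;
`lr_maxima` and `H` are my names for the quantities in the notes.)

```python
from itertools import permutations

def lr_maxima(perm):
    # Count "check marks": elements greater than everything to their left.
    count, best = 0, 0
    for v in perm:
        if v > best:
            count += 1
            best = v
    return count

def H(m):
    # The m-th harmonic number: 1/1 + 1/2 + ... + 1/m.
    return sum(1.0 / k for k in range(1, m + 1))

# The exact average over all m! permutations matches H[m] ...
for m in range(1, 8):
    perms = list(permutations(range(1, m + 1)))
    avg = sum(lr_maxima(p) for p in perms) / len(perms)
    assert abs(avg - H(m)) < 1e-9

# ... and so does the recurrence X[m] = X[m-1] + 1/m with X[1] = 1.
X = 1.0
for m in range(2, 50):
    X += 1.0 / m
    assert abs(X - H(m)) < 1e-9
```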
So the question is: which nodes end up being on the right path of A?
For a node k to end up there, it has to be inserted after m.  And it
has to be greater than any node inserted so far (except m).  Here's
the picture:

    x
     \
      x
       \
        x
         \
          m
         /
        x
         \
          x
           \
            k

If k were less than any of the x's, it would not end up there.  If k
is greater than all of the x's, it does end up there.  And once on
the path of interest, it stays there.

So suppose we take a random permutation of 1...m, and we check any
item k that occurs after m but is greater than any element to its
left (except m).  What is the expected number of checks?  Let Y[m] be
this quantity.  Y[1] = 0.

One way of generating these marks is to first generate a random
permutation of 2...m, install the marks, then insert 1 in a random
position.  The insertion of 1 does not change any of the marks
installed before.  The expected number of marks on 1 is the
probability that 1 is marked.  This happens only if the permutation
starts out [m, 1, ...].  The probability of this happening is
1/(m(m-1)).  Thus we get the following recurrence:

    Y[1] = 0
    Y[m] = Y[m-1] + 1/(m(m-1))

The solution of this is just Y[m] = (m-1)/m.  QED.
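The closed form for Y[m], and the resulting bound of Theorem 2, can be
verified with a short computation.  (A sketch; `Y` is just the
notation from the notes, and the telescoping sum 1/(k(k-1)) =
1/(k-1) - 1/k gives the closed form.)

```python
def Y(m):
    # Y[1] = 0; Y[m] = Y[m-1] + 1/(m(m-1)).
    y = 0.0
    for k in range(2, m + 1):
        y += 1.0 / (k * (k - 1))
    return y

# The recurrence telescopes: sum of 1/(k(k-1)) = 1 - 1/m = (m-1)/m.
for m in range(1, 200):
    assert abs(Y(m) - (m - 1) / m) < 1e-9

# Theorem 2: the expected number of rotations when deleting key m from
# a treap on 1...n is Y[m] + Y[n-m+1], which is always below 2.
n = 100
assert all(Y(m) + Y(n - m + 1) < 2 for m in range(1, n + 1))
```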