Binary Search Trees        15-451        February 2, 2006        Danny Sleator
---------------------------------------
Summary:
   The Dictionary ADT problem
   One approach: Binary search trees
   Problems with the approach (unbalance)
   Ways of dealing with this:
     - Balanced Search Trees (Red-Black/AVL)
     - Randomized Search Trees (treaps)
     - Splay Trees (next time)
   Things trees can do that a hash table can't:
     - successor, predecessor
     - maintaining and searching by rank
     - splitting and joining
     - prefix matching
---------------------------------------

Dictionary ADT (abstract data type):

Represent a set of keys, each with an associated piece of data, allowing
three kinds of operations:

    find(k)       Return the data associated with key k if k is in the
                  set, and null otherwise.
    insert(k,d)   Insert a new key k (not already in the set) into the
                  set with data d.
    remove(k)     Delete the key k from the set (assuming it's in the set).

Consider a sequence of M operations, where the maximum size of the set is
N.  We'd like to study ways of solving this problem, and analyze their
running times as functions of M and N.

Consider solving this problem with a linked list, and with a sorted array.

                   linked list      sorted array
    ---------------------------------------------
    find()            O(N)            O(log N)
    insert()          O(1)            O(N)
    remove()          O(1)            O(N)

The O(log N) bound is obtained using binary search.  Notice that insert()
is fast for a list but slow for an array, while find() is fast for an
array but slow for a list.  Our goal is a data structure that is
efficient for all of these operations.

One approach: Hashing
    O(M) time total (O(1) per operation)
    (Later in the course)

Another approach: Binary Search Trees
    O(M log N) time
    (We'll see later why binary search trees are useful despite a worse
    bound than hashing.)

A binary search tree is a class of data structures where

    (1) Each node stores a piece of data.
    (2) Each node has two pointers to two other binary search trees.
    (3) The overall structure of the pointers is a tree (there's a root,
        and exactly one way to reach every node from the root).

We assume that the keys are from an ordered universe.  This means that
for any two keys A and B exactly one of the following holds:

    A < B        A = B        A > B

Furthermore the relation is transitive, so that if A > B and B > C then
A > C.  In Java's terminology we say that this set of keys obeys the
"Comparable" interface: A.compareTo(B) < 0 iff A < B, etc.

The tree is ordered according to the keys: at every node n, all the keys
in the left subtree of n are less than n.key, and all those in the right
subtree are greater than n.key.  For example, here's a tree of 10 nodes:

          5
         / \
        3   6
       / \   \
      1   4   9
     / \     /
    0   2   8
           /
          7

This ordering is called SYMMETRIC ORDER or INORDER.

Algorithm for find()

    Simply walk down the tree doing comparisons.  (Draw pictures.)

Insertion algorithm

    Same as a search, but put the new node in at the place where you ran
    off the bottom of the tree.  For example, if we insert 8.5 into the
    above tree we get:

          5
         / \
        3   6
       / \   \
      1   4   9
     / \     /
    0   2   8
           / \
          7   8.5

Removal (Deletion) algorithm

    First find the node you want to delete; call it t.  If t's right
    subtree is empty, replace t by its left subtree, and stop.  Otherwise
    let L be the leftmost node in t's right subtree.  Swap the contents
    of t and L.  Now L holds the key to be deleted and has no left child,
    so replace L by its right subtree.  (Draw pictures.)
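Here is a minimal Java sketch of the three operations just described.
(This is illustrative only: the class name, the method names, and the
use of int keys with no attached data are our own choices, and no
rebalancing of any kind is done.)

    // A minimal, unbalanced binary search tree over int keys.
    class BST {
        private static class Node {
            int key;
            Node left, right;
            Node(int key) { this.key = key; }
        }

        private Node root;

        // find(k): walk down the tree doing comparisons.
        public boolean find(int k) {
            Node t = root;
            while (t != null) {
                if (k < t.key)      t = t.left;
                else if (k > t.key) t = t.right;
                else                return true;   // found it
            }
            return false;                          // ran off the bottom
        }

        // insert(k): same walk as find, but attach the new node at the
        // place where we ran off the bottom of the tree.
        public void insert(int k) { root = insert(root, k); }

        private Node insert(Node t, int k) {
            if (t == null)      return new Node(k);
            if (k < t.key)      t.left  = insert(t.left, k);
            else if (k > t.key) t.right = insert(t.right, k);
            return t;                              // key already present
        }

        // remove(k): if the node's right subtree is empty, replace it by
        // its left subtree; otherwise swap contents with the leftmost
        // node of the right subtree and splice that node out.
        public void remove(int k) { root = remove(root, k); }

        private Node remove(Node t, int k) {
            if (t == null) return null;            // key not present
            if (k < t.key)      t.left  = remove(t.left, k);
            else if (k > t.key) t.right = remove(t.right, k);
            else if (t.right == null) {
                return t.left;                     // replace t by its left subtree
            } else {
                Node l = t.right;                  // L = leftmost node of right subtree
                while (l.left != null) l = l.left;
                t.key = l.key;                     // swap the contents of t and L
                t.right = remove(t.right, l.key);  // now delete L
            }
            return t;
        }
    }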
THE PROBLEM OF UNBALANCE

If you keep inserting into a tree, it can become "unbalanced", so that
searches (and insertions) become very slow: O(N) in an N-node tree.
This blows our chance to obtain O(log N) running time per operation, and
defeats our purpose, which is efficiency.  Example of how this can
happen: inserting the keys into the tree in sorted order.

Three approaches to solving this:

    - balance conditions: AVL trees, Red-Black trees, AA trees, etc.
    - self-adjusting trees
    - randomized search trees (treaps)

BALANCED TREES

The idea of balance: maintain a structural invariant of the tree so that
the depth of the tree is bounded by O(log N).  Requiring perfect balance
is neither necessary nor desirable.  (Maintaining perfect balance would
require massive work to restructure the tree in the event of an
insertion.  Illustrate with a picture.)  Some "balance information" is
also always required to be stored in the nodes.

There are many forms of balanced trees:

    height balanced trees (AVL trees)
    red-black trees, AA trees
    weight balanced trees, etc.

We'll talk about red-black trees in detail below.  In all these forms of
binary search trees, rotations are used to rebalance the tree when
necessary.  A rotation is a local restructuring operation.

ROTATIONS

A rotation in a binary search tree is a restructuring operation that
preserves the order of the nodes but changes the depths of some of the
nodes.  Here's the general form of a rotation:

      z                                            z
     /                                            /
    y                                            x
   / \       right rotation about y-x           / \
  x   C     ======================>            A   y
 / \                                               / \
A   B                                             B   C

Here are the pointer adjustments:

    y.left = x.right;
    x.right = y;
    z.left = x;

(If y was a right child of z instead of a left child, then the last of
these would be z.right = x.  If there were parent pointers, then the
parents of x, y, and the root of B would have to be adjusted also.)

Thinking of a binary tree as a representation of a fully parenthesized
expression, the rotation is nothing more than an application of the
associative law:

      +                                            +
     / \                                          / \
    +   C   ======================>              A   +
   / \                                               / \
  A   B                                             B   C

  (A+B)+C                                          A+(B+C)

RED-BLACK TREES

In red-black trees, each node has a color, either red or black, subject
to the following constraints:

    (i)   All missing (external) nodes are regarded as black.
    (ii)  All paths from the root to a missing node contain the same
          number of black nodes.
    (iii) The parent of each red node (if it has one) is black.

Theorem: In an n-node red-black tree, the number of nodes on any path
from the root to a missing node (not counting the missing node) is at
most floor(2 log(n+1)).  (Binary log, as always.)

Proof: Transform the tree by deleting all the red nodes: simply replace
each red node by its right child.  (The order in which you do this does
not matter, since the children of each red node are black.)  Now we have
a tree consisting only of black nodes, and all the paths are exactly the
same length by property (ii).  The tree is perfectly balanced, as in
this picture:

            o
           / \
          /   \
         /     \
        o       o
       / \     / \
      o   o   o   o
     / \ / \ / \ / \

There are 7 nodes, and 8 missing nodes.  In general the number of nodes
here is (2^b)-1, where the paths have b nodes on them.  The number of
nodes in the original tree, n, is at least this, i.e.

    n >= (2^b)-1

In the original tree, the path lengths can be at most 2b, because above
each black node on a path there could be a red node.  Let d be the
length of the longest path of nodes in the original tree.  Then we have

    d <= 2b,    that is,    b >= d/2

Therefore

    n >= (2^(d/2))-1
    log(n+1) >= d/2
    d <= 2 log(n+1)

Q.E.D.
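The insertion and deletion procedures below are built out of the
rotation primitive from the ROTATIONS section above.  For reference,
here is that pointer surgery written out as a Java sketch; the Node type
and the method name are our own illustrative choices, not code from the
lecture.

    // Rotation as local pointer surgery, following the adjustments
    // given above (y.left = x.right; x.right = y; z.left = x).
    class Rotations {
        static class Node {
            int key;
            Node left, right;
            Node(int key) { this.key = key; }
        }

        // Right rotation about the edge y--x, where x = y.left and z is
        // y's parent (null if y is the root).  Returns x, the new root
        // of this subtree.
        static Node rotateRight(Node z, Node y) {
            Node x = y.left;
            y.left  = x.right;                  // subtree B moves from x to y
            x.right = y;                        // y becomes x's right child
            if (z != null) {
                if (z.left == y) z.left  = x;   // reattach x under z
                else             z.right = x;   // (y was a right child of z)
            }
            return x;
        }
    }

A left rotation is the mirror image of this.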
Now we need to show how to do insertion and deletion in red-black trees.

INSERTION in Red-Black trees

Insert the new node in the ordinary fashion, and color it red.  Call the
new node v.  There may be a color violation between v and its parent.
Here are the cases.  (In these diagrams o represents a red node, *
represents a black node, and v is a red node with a possible violation
with its parent.  Symmetrical variants are not shown.)

(1) v = the root.  In this case there is no violation.  Terminate.

(2) v's parent is black.  Again, no violation.  Terminate.

(3) v's parent is red.  Now there is a violation that must be fixed.

    (3.1) v's parent is the root.

            o                *
           /      ----->    /         and terminate.
          v      recolor   v

    (3.2) v's grandparent must be black.  Here's what we do.

    (a)     *                        v
           / \                      / \
          o   o      ----->        *   *      continue
         /          recolor       /
        v                        o

            ---OR---

            *                        v
           / \                      / \
          o   o      ----->        *   *      continue
           \        recolor         \
            v                        o

    (b)     *                        *
           / \                      / \
          o   *      ----->        o   o      terminate
         /           rotate             \
        v                                 *

    (c)     *                        *
           / \                      / \
          o   *   ----------->     o   o      terminate
           \       2 rotations           \
            v                              *

DELETION in Red-Black trees

In this discussion, a node is said to be short if the number of black
nodes on the paths from it to its empty (missing) nodes is one too small
compared to its sibling.  In this analysis there may at any time be one
short node.

Let x be the node we want to delete.

If x is missing both of its children, then simply delete it, and the
empty node that replaces it becomes short (if x was black).  This is
corrected as shown below.

If x is missing one of its children, then replace x by its non-missing
child z.  If either x or z was red, color z black, and stop.  If both
were black, then z becomes short, and this must be corrected as
explained below.

If x has both of its children, then let y be the rightmost node in the
left subtree of x.  Now we swap the contents of the nodes y and x.  The
key to be deleted now sits at y, which has no right child, and we delete
that node using the method above.

It remains to handle the possible short subtree that might exist as a
result of doing the procedure described above.  Let the short tree be
rooted at a node called s, which is always black.  Now there are some
cases.

(1) s's parent is red

        o                         *
       / \                       / \
      s   *       ------->      *   o       terminate
         / \       recolor         / \
        *   *                     *   *

        o                         o
       / \                       / \
      s   *       ------->      *   *       terminate
           \        rotate     /
            o                 *

        o                         o
       / \                       / \
      s   *     ----------->    *   *       terminate
         / \     2 rotations   /     \
        o   *                 *       *

(2) s's parent is black

        *                       *
       / \      -------->      /            go to case (1)
      s   o      rotation     o
                             /
                            s

        *                         s
       / \                       / \
      s   *       ------->      *   o       continue
         / \       recolor         / \
        *   *                     *   *

        *                         *
       / \                       / \
      s   *       ------->      *   *       terminate
           \        rotate     /
            o                 *

        *                         *
       / \                       / \
      s   *     ----------->    *   *       terminate
         / \     2 rotations   /     \
        o   *                 *       *

This completes the algorithm for deletion.

AVL TREES (not done in this lecture)

Define the height of a binary tree to be the number of nodes on the
longest path down the tree.  This is the maximum number of nodes you'd
have to visit in a search of the tree.

The idea of height balance: the heights of the left and right subtrees
of any node differ by at most 1.  Let h(T) be the height of a tree (or
of a node in a tree).  The balance condition states that at every node n
the following holds:

        n
       / \
      /   \
     L     R           | h(L) - h(R) | <= 1
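To make this balance condition concrete, here is a small recursive
checker, with height counted in nodes as defined above.  This is an
illustrative sketch of our own (a bare Node type and our own method
names); a real AVL tree stores balance bits instead of recomputing
heights, as described under MAINTAINING HEIGHT BALANCE below.

    // Checks the AVL condition | h(L) - h(R) | <= 1 at every node.
    class AvlCheck {
        static class Node { Node left, right; }

        // Height = number of nodes on the longest path down the tree;
        // the empty tree has height 0.
        static int height(Node t) {
            if (t == null) return 0;
            return 1 + Math.max(height(t.left), height(t.right));
        }

        static boolean isHeightBalanced(Node t) {
            if (t == null) return true;
            return Math.abs(height(t.left) - height(t.right)) <= 1
                    && isHeightBalanced(t.left)
                    && isHeightBalanced(t.right);
        }
    }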
Theorem: Consider an AVL tree of N nodes and height h.  Then

    N >= Phi^h - 1

where Phi = 1.618... = the golden ratio = (1+sqrt(5))/2.

OPTIONAL MATERIAL
-----------------

Proof: The proof is by induction.  There are two base cases, N=0 and
N=1.  If N=0 then h=0 and the formula holds with equality.  If N=1 then
h=1 and the formula holds.

When N>1, let N_L and N_R be the numbers of nodes in the left and right
subtrees, and let h_L and h_R be the heights of the left and right
subtrees.  By induction we know that

    N_L >= Phi^(h_L) - 1     and     N_R >= Phi^(h_R) - 1

Note that N = N_L + N_R + 1, so adding these two inequalities together
gives

    N = N_L + N_R + 1 >= Phi^(h_L) + Phi^(h_R) - 1             (1)

Now there are three cases, depending on the relationship between h and
h_L and h_R.

Case 1:  h_L = h-1,  h_R = h-2.

    Inequality (1) becomes

        N >= Phi^(h-1) + Phi^(h-2) - 1
          >= Phi^(h-2) (1 + Phi) - 1

    But 1 + Phi = Phi^2 (definition of Phi), so this becomes

        N >= Phi^(h-2) Phi^2 - 1 = Phi^h - 1

    which is what we wanted.

Case 2:  h_L = h-2,  h_R = h-1.

    This is the same as case 1 by symmetry.

Case 3:  h_L = h-1,  h_R = h-1.

    This makes N even bigger than in case 1, because
    Phi^(h-1) > Phi^(h-2).

This completes the proof of the theorem.  QED.

Why is this theorem useful?  Let's think about h as a function of N.  We
know that

    N >= Phi^h - 1

Therefore

    log_Phi(N+1) >= h

In other words, h = O(log N).  So even though these trees are not
perfectly balanced, they're still balanced enough to give O(log N).

Here are some examples of the "most unbalanced" AVL trees for a given
height:

    T(1) =  o

    T(2) =  o
           /
          o

    T(3) =    o
             / \
            o   o
           /
          o

                   o
    T(4) =        / \
                 /   \
                o     o
               / \   /
              o   o o
             /
            o

MAINTAINING HEIGHT BALANCE in AVL TREES

We keep extra information ("balance bits") in each node that tells which
of the three balance conditions holds: left higher, right higher, or
left and right equal.

Whenever an update (like an insertion or deletion) occurs, an algorithm
is applied to the tree to rebalance it.  This algorithm uses the balance
bits on the nodes along the path from the root to the change, and
applies ROTATIONS to rebalance the tree and update the balance bits.  It
turns out that under insertion or deletion it is possible to rebalance
the tree in O(log N) time.

Here's an example of how rotations can restore balance in an AVL tree
under an insertion.  Say we have this tree:

        4
       / \
      3   5
     /
    2

Now we insert 1 into this tree, getting:

          4
         / \
        3   5
       /
(A)   2
     /
    1

Now nodes 3 and 4 are unbalanced.  If we rotate right at node 3, we get
this tree:

        4
       / \
      2   5
(B)  / \
    1   3

Another case.  Suppose the tree looked like this and we inserted 2:

      4              4
     / \            / \
    3   5          3   5
   /              /
  1              1
          (C)     \
                   2

Now 3 and 4 are unbalanced again.  Doing a rotation about 3 does this:

      4
     / \
    1   5
     \
      3
     /
    2

And no progress toward balance has been made.  The answer in this case
is a "double rotation", which is 2 rotations.  First we rotate left
about 1 to transform (C) into (A).  Then we do what we did to fix (A),
namely rotate right about 3.

The detailed algorithm for maintaining AVL trees is a bit more
complicated than that for red-black trees.  Also, they don't have the
property that only a constant number of rotations need to be done on an
insertion or deletion.
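To make the double rotation concrete, here is a Java sketch that builds
tree (C) from the example above, rotates left about 1 to get tree (A),
and then rotates right about 3 to get the balanced tree (B).  The Node
type and the helper names are our own illustrative choices.

    // Double rotation on the example above: tree (C) -> (A) -> (B).
    class DoubleRotationDemo {
        static class Node {
            int key;
            Node left, right;
            Node(int key, Node left, Node right) {
                this.key = key; this.left = left; this.right = right;
            }
        }

        // Rotate right about t and its left child; returns the new root
        // of this subtree.
        static Node rotateRight(Node t) {
            Node x = t.left;
            t.left = x.right;
            x.right = t;
            return x;
        }

        // Rotate left about t and its right child; returns the new root
        // of this subtree.
        static Node rotateLeft(Node t) {
            Node x = t.right;
            t.right = x.left;
            x.left = t;
            return x;
        }

        public static void main(String[] args) {
            // Build tree (C): 4 with children 3 and 5; 3's left child is
            // 1; 1's right child is 2.
            Node one   = new Node(1, null, new Node(2, null, null));
            Node three = new Node(3, one, null);
            Node root  = new Node(4, three, new Node(5, null, null));

            // A single rotation about 3 makes no progress (as shown
            // above), so first rotate left about 1, turning (C) into (A)...
            root.left.left = rotateLeft(root.left.left);
            // ...then rotate right about 3, giving the balanced tree (B).
            root.left = rotateRight(root.left);

            // root.left is now 2 with children 1 and 3, i.e. tree (B).
            System.out.println(root.left.key + " "          // 2
                    + root.left.left.key + " "              // 1
                    + root.left.right.key);                 // 3
        }
    }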