Binary Search Trees 15-451 February 10, 2004 Danny Sleator --------------------------------------- Summary: The Dictionary ADT problem One approach: Binary search trees Problems with the approach (unbalance) Ways of dealing with this: - Balanced Search Trees (AVL) - Randomized Search Trees (treeps) - Splay Trees (next time) Things trees can do that a hash table can't - successor, predecessor - Maintaining and searching by rank - splitting and joining - prefix matching --------------------------------------- Dictionary ADT (abstract date type): Represent a set of keys each with an associated piece of data allowing three kinds of operations: find(k) Return the data associated with key k, null otherwise. insert(k,d) Insert a new key k (not already in the set) into the set with data d. remove(k) Delete the key k from the set (assuming its in the set). Consider a sequence of M operations, where the maximum size of the set is N. We'd like to study ways of solving this problem, and analyze their running time as a function of m and n. Consider solving this problem with a linked list, and with a sorted array. linked list Sorted array -------------------------------------------------------------------- find() O(N) O(log N) insert() O(1) O(N) remove() O(1) O(N) The log(N) bound is obtained using binary search. Notice that insert() is fast for lists but slow for an array. But find is fast for an array and slow for a list. Our goal is a data structure that efficient for all of these operations. One approach: Hashing O(M) time total (O(1) per operation) (Later in the course) Another approach: Binary Search Trees O(M log N) time (we'll see later why binary search trees are useful despite a worse bound than hashing) A binary search tree is a class of data structures where (1) Each node stores a piece of data (2) Each node has two pointers to two other binary search trees (3) The overall structure of the pointers is a tree (there's a root, exactly one way to reach every node from the root.) We assume that the keys are from an ordered universe. This means that for any two keys A and B exactly one of the following holds: A < B A = B A > B Furthermore the relation is transitive, so that if A > B and B > C then A > C. In java's terminology we say that this set of keys obeys the "Comparable" interface. A.compareTo(B) < 0 iff A < B, etc. The tree is ordered according to the keys: At every node n all the keys in the left subtree of n are less than n.key, and all those in the right subtree are greater than n.key. For example, here's a tree of 10 nodes. 5 / \ 3 6 / \ \ 1 4 9 / \ / 0 2 8 / 7 This ordering is called SYMMETRIC ORDER or INORDER. Algorithm for find() Simply walk down the tree doing comparisions. (Draw pictures.) Insertion algorithm Same as a search, but put the new node in at the place where you ran off the bottom of the tree. for example if we insert 8.5 into the above tree we get: 5 / \ 3 6 / \ \ 1 4 9 / \ / 0 2 8 / \ 7 8.5 Removal (Deletion) algorithm First find the node you want to delete, call it t. If t's right subtree is empty, replace t by its left subtree, and stop. Otherwise let L be the leftmost node in t's right subtree. Swap t and L. Now replace t by t's right subtree. (Draw pictures) THE PROBLEM OF UNBLALANCE If you keep inserting into a tree, it can become "unbalanced", so that searches (and insertions) become very slow. O(N) in an N-node tree. This blows our chance to obtain O(log N) running time per operation, and defeats our purpose, which is efficiency. Example of how this can happen: Inserting the keys into the tree in sorted order. Two approaches to solving this: - balance conditions: AVL trees, Red-Black trees, AA trees, etc - self-adjusting trees. BALANCED TREES The idea of balance: Maintain a structural invariant of the tree so that the depth of the tree is bounded by O(log N). Requiring perfect balance is neither necessary desirable. (Maintaining perfect balance would require massive work to restructure the tree in the event of an insertion. Illustrate with a picture.) There are many forms of balanced trees: Height balanced trees (AVL Trees) Red-Black trees, AA trees weight balanced trees, etc. You should understand the basic idea of height balanced trees. AVL TREES Define the height of a binary tree to be the number of nodes on the longest path down the tree. This is the maximum number of nodes you'd have to visit in a search of the tree. The idea of height balance: The height of the left and right subtrees from any node differ by at most 1. Let h(T) be the height of a tree (or a node in a tree). The balance condition states that at every node n the following holds: n / \ / \ L R | h(L) - h(R) | <= 1 Theorem: Consider an AVL tree of N nodes and height h. Then h N >= Phi -1 Where Phi = 1.618... = the golden ratio = (1+sqrt(5))/2 Proof: The proof is by induction. There are two base cases, the case of N=0 and the case of N=1. If N=0 than h=0 and the formula holds with equality. If N=1 then h=1 and the formula holds. When N>1, let N_L be the number of nodes in the left subtree and let N_R be the number of nodes in the right subtree let h_L be the height of the left subtree and let h_R be the height of the right subtree By induction we know that: h_L N_L >= Phi -1 and h_R N_R >= Phi -1 Note that N=N_L+N_R + 1 So adding these two inequalities together gives: h_L h_R N = N_L + N_R + 1 >= Phi + Phi -1 (1) Now there are three cases, depending on the relationship between h and h_L and h_R. Case 1: h_L = h-1 h_R = h-2 Inequality (1) becomes: h-1 h-2 N >= Phi + Phi -1 h-2 >= Phi (1 + Phi) -1 2 But 1+Phi = Phi (definition of Phi) So this becomes: h-2 2 h N >= Phi Phi -1 = Phi -1 Which is what we wanted. Case 2: h_L = h-2 h_R = h-1 This is the same as case 1 by symmetry. Case 3: h_L = h-1 h_R = h-1 This shows N is even bigger than case 1, because h-1 h-2 Phi > Phi This completes the proof of the theorem. QED. Why is this theorem useful? Let's think about h as a function of N. We know that h N >= Phi -1 Therefore: Log_Phi(N+1) >= h In other words: h = O(log N) So even though these trees are not perfectly balanced, they're still balanced enough to give O(log N). Here are some examples of the "most unbalanced" AVL trees for a given height: T(1) = o T(2) = o / o T(3) = o / \ o o / o o T(4) = / \ / \ o o / \ / o o o / o MAINTAINING HEIGHT BALANCE in AVL TREES We keep extra information ("balance bits") in each node that tells which of the three balance conditions hold: left higher, right higher, left and right equal. Whenever an update (like an insertion of deletion) occurs, an algorithm is applied to the tree to rebalance it. This algorithm uses the balance bits on the nodes along the path from the root to the change, and applies ROTATIONS to rebalance the tree, and update the balance bits. It turns out that under insertion or deletion it is possible to rebalance the tree in O(log N) time. ROTATIONS A rotation in a binary search tree is a restructuring operation that preserves the order of the nodes but changes the depths of some of the nodes. Here's the general form of rotation: z z / / y x / \ right rotation about y-x / \ x C ======================> A y / \ / \ A B B C Here are the pointer adjustments; y.left = x.right; x.right = y; z.left = x; (If y was a right child of z instead of a left, then the last of these would be z->right = x. If there were parent pointers, then the parents of x, y, and the root of B would have to be adjusted also.) Thinking of a binary tree as a representation of a fully parenthesized expression, then the rotation is nothing more than an application of the associative law. + + / \ / \ + C ======================> A + / \ / \ A B B C (A+B)+C A+(B+C) Here's an example of how rotations can restore balance in an AVL tree under an insertion. Say we have this tree: 4 / \ 3 5 / 2 Now we insert 1 into this tree, getting: 4 / \ 3 5 / (A) 2 / 1 Now nodes 3 and 4 are unbalanced. If we rotate right at node 3, we get this tree: 4 / \ 2 5 (B) / \ 1 3 Another case. Suppose the tree looked like this and we inserted 2: 4 4 / \ / \ 3 5 3 5 / / 1 1 (C) \ 2 Now 3 and 4 are unbalance again. Doing a rotation about 3 does this: 4 / \ 1 5 \ 3 / 2 And no progress toward balance has been made. The answer in this case is a "double rotation", which is 2 rotations. First we left rotate about 1 to transform (C) into (A). Then we do what we did to fix (A), namely rotate about 3. It's possible, but rather complicated, to maintain the AVL balance condition when it's violated after an insertion or deletion of a node.