15-451 Algorithms                                              9/18/03

* Search trees
   - simple binary search trees            Quiz Tues. 30 min.
   - B-trees                               Open book + 1 sheet notes
   - treaps
=====================================================================

For the next few lectures we'll be looking at several important
data structures.  A data structure is a method for storing data in such
a way that the operations you care about can be performed quickly.

Today: search trees.

One natural set of operations we might want to perform is insert and
lookup (and maybe delete).  E.g., maybe we are the phone company and we
have a database of people and their phone numbers (plus addresses,
billing information, and so on).  As new people come in, we'd like to
be able to insert them into our database.  And, given a name x, we'd
like to be able to do lookup(x) to find out their phone number and
other info.

One option: a sorted array.  Then lookup(x) takes O(log n) time using
binary search.  But insert(x) may take Omega(n) time in the worst case,
because we have to shift everything to the right in order to make room.
Same in reverse for deletes.  If we expect to have to do these
operations often, then a better approach is to use a search tree.

A binary search tree is a binary tree in which each node has a KEY (in
our example, a key is a person's name), such that all descendants to
the left have smaller (or equal) keys and all descendants to the right
have larger (or equal) keys.  To do a lookup operation in a binary
search tree you simply walk down from the root, going left, going
right, or stopping, depending on whether the query is smaller than,
larger than, or equal to the key in the node.

We will also talk about general (not necessarily binary) search trees
that potentially have more than one key in each node, and whose nodes
may have more than two children.

SIMPLE BINARY SEARCH TREES
==========================
The simplest way to maintain a binary search tree is to implement the
insert operation as follows.

   insert(x): walk down the tree as if doing a lookup, and insert x
              into a new leaf at the end.

E.g., how about inserting: C A R N E G I E'  (where E' > E).

   pluses:  easy to do (deletes are slightly painful).
   minuses: bad worst-case behavior.
            Worst-case time to insert n elements?  Theta(n^2).
            Worst-case lookup: Theta(n).

Can we fix this problematic worst-case behavior?  Answer: yes.  Here is
a particularly nice method used in database applications, called
B-trees.  We'll then look at another, called treaps.

IMPORTANT IDEA: the problem with the basic BST was that we weren't
maintaining balance in the tree.  On the other hand, if we try to
maintain a perfectly balanced BST, rearranging things becomes too
expensive.  So, we want to give ourselves some slack.  In AVL trees,
you give yourself slack by allowing the tree to be "almost balanced".
In the median-finding algorithm, we gave ourselves slack by allowing
the pivot to be "near" the middle.  For B-trees, we will give ourselves
slack by allowing some nodes to have a higher degree (number of
children) than others.

B-trees and 2-3-4 trees
=======================
A B-tree is a search tree in which, for some pre-specified t >= 2:

 * Each node has at most 2t-1 keys in it, and every non-root node has
   at least t-1 keys.  Keys are stored in a sorted array.

 * Each non-leaf node has degree = (# keys) + 1.  So, the degree is in
   the range [t, 2t], except for the root, whose degree is in the range
   [2, 2t].  E.g., if the array of keys is [a1, a2, ..., a10], then
   there is one child for keys less than a1, one child for keys between
   a1 and a2, ..., and one child for keys bigger than a10.
 * This flexibility in the number of keys per node allows us to keep
   the tree perfectly height-balanced.

(The case t=2 is called a 2-3-4 tree, since the degrees are 2, 3, or 4.)

E.g., here is a tree for t=3 (a "3-6 tree"; max keys per node is 5):

                          [H M R]
            /         |        |        \
     [A B C D]      [K L]    [N O]    [T Y Z]

The idea is that by using this flexibility in the sizes and degrees of
nodes, we'll be able to keep trees perfectly balanced.

The rules for search and insert turn out to be pretty easy:

 - Search: just do binary search in the array at the root, then go to
   the appropriate child and recurse.

 - Insert:
     1) Act like you're doing a search, but if you ever encounter a
        FULL node (FULL means it has the maximum 2t-1 keys in it), pull
        the median element up to the parent and split the node.  This
        takes O(t) time, since we need to insert into the parent and
        move half of the keys into a new array.
     2) Then insert into the leaf.

Example: insert "E" above, then insert "F":
 - When doing part (1) for F, we will pull the "C" up to the root and
   split the node.  End result:

                     [ C H M R ]
         /        |       |       \        \
     [A B]    [D E F]   [K L]   [N O]   [T Y Z]

Question: do we maintain the requirement of at least t-1 keys per
non-root node?  Yes: after the median is pulled up, each of the two
halves is left with exactly t-1 keys.

Question: can the parent get *over-full*?  Answer: no, since if it had
been full we would already have split it on the way down.

Insert "S", "U", "V":

                     [ C H M R U ]
        /        |       |      |      \       \
    [A B]    [D E F]   [K L]  [N O]   [S T]   [V Y Z]

Now insert "P".  Doing this will bring M up to a new root:

                          [ M ]
                  /                  \
             [ C H ]              [ R U ]
        /       |     \         /    |      \
    [A B]   [D E F]  [K L]  [N O P] [S T] [V Y Z]

Question: is the tree always height-balanced?  (All leaves at the same
depth.)  Answer: yes, since we only grow the tree "up".

So, we have maintained the properties.  What about running time?  Say
we do binary search at each node to find the position:

   Time for search is O(depth * log(t)).

What is the maximum depth?  O(log_t(n)).  E.g., if t = 1000 and n = 1
billion, then the depth is at most 3 (or 2 if you count starting at 0).

Interesting thing: the t's cancel in the expression for the search time:

   Time for search is O(log_t(n) * log(t)) = O(log n).

Inserts: same as search, but we also need to split nodes, which takes
O(t) time per split.  In the worst case, we need to do this every step
of the way down:

   Worst-case time for insert
      = search-time + splitting-time + time-to-insert-into-leaf,

where splitting-time = O(depth * t) and the leaf insert is O(t).

Interesting thing about the splitting time: this is fine if t is small,
but not so good if t is large.  *BUT* amortized analysis comes to our
rescue.  In particular, if we create a tree from n inserts, I claim we
can have made at most O(n/t) splits total in the process.  Why?
Because each split creates a new node, and there are O(n/t) nodes
total.  Since each split costs O(t), the *total* time spent on splits
over all n inserts is O(n), which means that on average we spend only
O(1) time per insert on splitting.  So, the amortized time per insert
is:

   O(log n) + O(1) + O(t) = O(log n + t)

[which is better than the O(depth * t) worst-case time].

More facts about B-trees:

 * B-trees are used a lot in databases in practice because they fit in
   nicely with the memory hierarchy.  Basically, choose t so that a
   node fits into a page.  Keep the root in fast memory (and maybe even
   its children), so that in the above example each insert or lookup
   would use only 1 or 2 disk accesses.

 * If we use t=2, we have a 2-3-4 tree.  What's special about 2-3-4
   trees is that they can be implemented efficiently as binary trees
   using "red-black trees".
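To make the search and insert rules concrete, here is a rough Python
sketch of a B-tree with preemptive splitting on the way down, as
described above.  The names (BTreeNode, BTree, _split_child), the use
of Python's bisect module for the in-node binary search, and the small
driver at the end are illustrative choices of mine, not from the notes;
deletion and duplicate handling are ignored.

import bisect

class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []       # sorted list of at most 2t-1 keys
        self.children = []   # len(children) == len(keys) + 1 for non-leaves
        self.leaf = leaf

class BTree:
    def __init__(self, t=3):
        self.t = t                        # max keys per node is 2t-1
        self.root = BTreeNode(leaf=True)

    def search(self, x, node=None):
        node = node or self.root
        i = bisect.bisect_left(node.keys, x)   # binary search within the node
        if i < len(node.keys) and node.keys[i] == x:
            return True
        return False if node.leaf else self.search(x, node.children[i])

    def _split_child(self, parent, i):
        # Split the full child parent.children[i]; the median key moves up.
        t, full = self.t, parent.children[i]
        right = BTreeNode(leaf=full.leaf)
        median = full.keys[t - 1]
        right.keys = full.keys[t:]        # t-1 keys go to the new node
        full.keys = full.keys[:t - 1]     # t-1 keys stay behind
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, right)

    def insert(self, x):
        if len(self.root.keys) == 2 * self.t - 1:   # full root: grow tree "up"
            new_root = BTreeNode(leaf=False)
            new_root.children = [self.root]
            self.root = new_root
            self._split_child(new_root, 0)
        node = self.root
        while not node.leaf:
            i = bisect.bisect_left(node.keys, x)
            child = node.children[i]
            if len(child.keys) == 2 * self.t - 1:   # split full nodes on the way down
                self._split_child(node, i)
                if x > node.keys[i]:
                    i += 1
            node = node.children[i]
        bisect.insort(node.keys, x)                 # finally insert into the leaf

# Example: insert C A R N E G I E' as in the notes ("E2" stands in for E').
tree = BTree(t=3)
for k in ["C", "A", "R", "N", "E", "G", "I", "E2"]:
    tree.insert(k)
print(tree.search("G"), tree.search("Q"))   # True False

Note that _split_child leaves each half with exactly t-1 keys, matching
the invariant discussed above.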
=====================================================================

Treaps
======
Going back to binary search trees: we saw how a standard BST is like
quicksort using the leftmost element as the pivot, with all of its
worst-case problems.  Question: can we come up with a method that's
instead like *randomized* quicksort?  The problem is that we don't have
all the elements at the start, so it's not so obvious how to pick a
random pivot.  But it turns out we *can*, and the idea is called
"treaps".

Treaps: the idea is that when an element x is inserted, we also give it
a random *priority* value.  Think of the priorities as giving the order
in which the elements are supposed to be chosen as pivots.  (And think
of them as real numbers, so we don't get any ties.)  In particular, the
property we will require is that if v is a descendant of u, then
priority(v) > priority(u).  For example:

              N:5
            /     \
        C:10       R:7
        /
     A:12

So, the keys are in search-tree order and the priorities are in heap
order, which is why this is called a "treap".  Also notice that it's
enough to require that if v is a *child* of u then
priority(v) > priority(u), since by transitivity this implies it holds
for all descendants.

You might ask: is it always possible to have the keys in BST order and
the priorities in heap order?  Answer: yes -- just think of sorting the
nodes by priority and then inserting them in that order into a BST.
Or, think of randomized quicksort.

Question: how do you do an insert?  Answer: just put the element at a
new leaf (as in the simple BST insert), and then rotate it up the tree
while its parent has a larger priority value.  Remember how tree
rotations work:

         x                      y
        / \                    / \
      T1   y     <------>     x   T3
          / \                / \
        T2   T3            T1   T2

For instance, if we are inserting C A R N E G I E' and so far we have
the tree above, and we now insert E with priority 15 (so no rotations)
and then G with priority 8, we would do:

        N:5                 N:5                  N:5
       /   \               /   \                /   \
    C:10    R:7   --->  C:10    R:7   --->    G:8    R:7
    /   \               /   \                 /
 A:12   E:15         A:12    G:8           C:10
           \                 /              /  \
           G:8            E:15          A:12   E:15

General picture: everything else was already in heap order, and as you
rotate the new node up you don't affect any of the previous descendant
relations; you just fix things up for the new node.  Also, in the
generic rotation picture above, if priority(y) < priority(x), then
since everything in T1 had bigger priority than priority(x), it must
also be bigger than priority(y), so making T1 a descendant of y doesn't
cause any problem.

So, an insert can be done in time O(time for search), since in the
worst case the number of rotations equals the number of steps on the
way down.  (One can actually show that the *expected* number of
rotations per insert is O(1).)

What about the depth of the tree?  When we analyzed randomized
quicksort, we showed that in expectation the total number of
comparisons is O(n log n).  This means that in expectation, the sum of
all node depths is O(n log n).  But you can actually show something a
lot stronger: with high probability, the maximum node depth is
O(log n).  [This also implies that the quicksort bound holds "with high
probability".]  Here is a sketch of one way to prove it.

Proof sketch: let's go back to our "dart-throwing" argument for
quicksort.  Line up the elements in sorted order, and pick one element
X that we care about.  We can think about the depth of this node like
this: we throw a dart at random into the sorted list, and wherever it
lands, it cuts the list, and the part that X isn't on disappears.  We
keep going until only X is left.  Now, if you think about it, whether
the dart lands to the left or to the right of X, it has at least a
50/50 chance of deleting at least half of that side of the list.  This
can happen at most log(n) times to the left of X, and at most log(n)
times to the right.
So, we can think of it like this: each dart/pivot is like a coin toss
with a 50% chance of heads, and the process must stop before we get
2*log(n) heads.  If we flip k times, the expected number of heads is
k/2, and in fact the chance that we get fewer than k/4 heads is really
small: a nice bound called "Hoeffding's inequality" says that the
chance of getting fewer than k/4 heads is at most e^{-k/8}.  So, if we
flip 8*log(n) times, the chance of getting fewer than 2*log(n) heads is
at most e^{-log(n)} = (1/n)^{log_2(e)} = 1/n^{1.44}.  This means that
with high probability, any given element has depth < 8*log(n); and
since this failure probability is much smaller than 1/n, a union bound
over all n elements shows that whp *all* elements have depth
< 8*log(n).
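Written out as one chain (this is just the calculation above in
symbols, not an extra result from the notes; each dart is treated as a
fair coin, log is base 2, and a "head" is a pivot that removes at least
half of one side of X):

   \Pr[\mathrm{depth}(X) \ge 8\log n]
      \;\le\; \Pr[\text{fewer than } 2\log n \text{ heads in } 8\log n \text{ flips}]
      \;\le\; e^{-\log n} \;=\; n^{-\log_2 e} \;\approx\; 1/n^{1.44},

   \Pr[\text{some element has depth} \ge 8\log n]
      \;\le\; n \cdot 1/n^{1.44} \;=\; 1/n^{0.44} \;\longrightarrow\; 0.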
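Going back to the treap insert rule from earlier in these notes, here
is a rough Python sketch of it: place the new key at a leaf by the
usual BST walk, then rotate it up while it has smaller priority than
its parent (smaller priority here means closer to the root, matching
the notes' convention that descendants have larger priorities).  The
names (TreapNode, rotate_left, rotate_right) and the use of
random.random() for priorities are illustrative choices of mine;
deletion is omitted.

import random

class TreapNode:
    def __init__(self, key):
        self.key = key
        self.prio = random.random()   # smaller priority = closer to the root
        self.left = None
        self.right = None

def rotate_right(y):
    # y's left child x comes up and y moves down to its right; x's old
    # right subtree (T2 in the picture above) becomes y's new left subtree.
    x = y.left
    y.left, x.right = x.right, y
    return x

def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y

def insert(root, key):
    # Insert key and return the (possibly new) root of this subtree.
    if root is None:
        return TreapNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.prio < root.prio:    # heap order violated: rotate child up
            root = rotate_right(root)
    else:
        root.right = insert(root.right, key)
        if root.right.prio < root.prio:
            root = rotate_left(root)
    return root

def depth(root):
    return 0 if root is None else 1 + max(depth(root.left), depth(root.right))

# Example: insert C A R N E G I E' ("E2" stands in for E') and print the
# depth; with random priorities the depth is O(log n) in expectation
# regardless of the insertion order.
root = None
for k in ["C", "A", "R", "N", "E", "G", "I", "E2"]:
    root = insert(root, k)
print(depth(root))

Because the recursion performs rotations on the way back up, this is
the same bottom-up "rotate up while the parent has larger priority"
procedure described above, just written recursively.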