15-451 Algorithms                                             9/14/00

* Search trees
   - simple binary search trees
   - connection to quicksort
   - B-trees
=====================================================================

Imagine we are the phone company and we have a database of people and
their phone numbers (plus addresses, billing information, and so on).
We'd like to store this information in a way that makes it easy to
look someone up.

One thing we could do is store our database in a sorted array. Then,
given a lookup(x) request, we could use binary search and find what
we want in O(log n) time, where n is the number of entries. This is
basically a phonebook. The problem with a sorted array is that
inserting a new element takes Omega(n) time, because we have to shift
everything to the right in order to make room. Deletes also take
Omega(n) time. If we expect to do these operations often, a better
approach is to use a search tree.

A binary search tree is a binary tree in which each node has a KEY
(in our example, a key is a person's name), such that all descendants
to the left have smaller (or equal) keys and all descendants to the
right have larger (or equal) keys. To do a lookup operation in a
binary search tree, you simply walk down from the root, going left or
right or stopping depending on whether the query is smaller, larger,
or equal to the key in the node. We will also talk about general (not
necessarily binary) search trees, which potentially have more than
one key in each node, and whose nodes may have more than two
children.

SIMPLE BINARY SEARCH TREES
--------------------------
The simplest way to maintain a binary search tree is to implement the
insert operation as follows.

insert(x): If the tree is empty, put x in the root. Otherwise,
compare x to the root: if x is smaller, recursively insert on the
left; otherwise recursively insert on the right.

E.g., short andrew ids of students in this class (alphabetically by
last name): {joa, hhd, ebd, dd2, mcd, etf, ljf, rfu, mdh}

plusses: easy to do (deletes are painful but we won't talk about
them)

minuses: bad worst-case behavior. Worst-case time to insert n
elements? Theta(n^2) -- e.g., if the elements arrive in sorted order,
the tree degenerates into a path. Worst-case lookup: Theta(n).

plusses: good average-case behavior (i.e., if the inputs arrive in a
random order).

Claim: if the inputs arrive in a random order, then the expected time
to do n inserts is O(n log n). Why? It's not obvious from first
principles, but I claim we have already proved this fact. The reason
is that we are doing EXACTLY THE SAME SET OF COMPARISONS as we would
with quicksort using the leftmost element as the pivot: the root
plays the role of the pivot (every later element gets compared to it
once), and the left and right subtrees are built recursively from the
two sub-arrays, just like quicksort's recursive calls. This also
means the average depth is O(log n). So, lookups are fast on average.
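Here is a minimal Python sketch of the simple binary search tree just
described (the class and function names are illustrative choices, not
anything fixed by the lecture):

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None    # subtree of smaller keys
            self.right = None   # subtree of larger-or-equal keys

    def insert(root, x):
        """Insert key x; return the (possibly new) root."""
        if root is None:
            return Node(x)
        if x < root.key:
            root.left = insert(root.left, x)
        else:
            root.right = insert(root.right, x)
        return root

    def lookup(root, x):
        """Return True iff key x appears in the tree."""
        while root is not None:
            if x == root.key:
                return True
            root = root.left if x < root.key else root.right
        return False

    # the andrew-id example from above, inserted in the given order
    root = None
    for key in ["joa", "hhd", "ebd", "dd2", "mcd",
                "etf", "ljf", "rfu", "mdh"]:
        root = insert(root, key)
    assert lookup(root, "mcd") and not lookup(root, "zzz")

Note that "joa" becomes the root and plays exactly the role of
quicksort's leftmost pivot on this input.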
Can we fix this problematic worst-case behavior? Answer: yes. Here is
a particularly nice method used in database applications, called
B-trees.

B-trees and 2-3-4 trees
-----------------------
A B-tree is a search tree where, for some pre-specified t >= 2,

 * each node has between t-1 and 2t-1 keys in it (special case: the
   root has between 1 and 2t-1 keys in it), stored in a sorted array.

 * each non-leaf has degree = (# keys)+1. So, degrees are in the
   range [t, 2t], except the root, whose degree is in [2, 2t]. E.g.,
   if the array is A[1]...A[10], then there is one child for keys
   less than A[1], one child for keys between A[1] and A[2], ..., and
   one child for keys bigger than A[10].

 * the flexibility in the # of keys per node allows us to keep the
   tree perfectly height-balanced. (The case t=2 is called a 2-3-4
   tree, since degrees are 2, 3, or 4.)

E.g., here is a tree for t=3 (a 3-6 tree):

                       [H M R]
                   /     |    |    \
          [A B C D]   [K L] [N O] [T Y Z]

The idea is that by using the flexibility in the sizes and degrees of
nodes, we'll be able to keep trees perfectly balanced. The rules for
search and insert turn out to be pretty easy:

 - Search: just do binary search in the array at the root, then go to
   the appropriate child and recurse.

 - Insert: 1) act like you're doing a search, but if you ever
   encounter a FULL node (FULL means it has the maximum 2t-1 keys in
   it), pull the median element up to the parent and split the node.
   This takes O(t) time, since we need to insert into the parent's
   array and move half of the keys into a new array. 2) then insert
   into the leaf.

Example: insert "E" above, then insert "F". When doing part (1) for
F, we will pull the "C" up to the root and split the node. End
result:

                   [ C H M R ]
                 /   |   |   \    \
        [A B] [D E F] [K L] [N O] [T Y Z]

Question: do we maintain the requirement of at least t-1 keys per
non-root node? Yes, because after the median is pulled up, each of
the two halves is left with exactly t-1 of the remaining 2t-2 keys.

Question: can the parent get *over-full*? Answer: no, since if it was
full we would have already split it on the way down.

Insert "S", "U", "V":

                  [ C H M R U ]
                /   |   |   |   \    \
       [A B] [D E F] [K L] [N O] [S T] [V Y Z]

Now, insert "P". Doing this will bring M up to a new root:

                          [ M ]
                       /         \
                [ C H ]           [ R U ]
               /   |    \        /    |    \
         [A B] [D E F] [K L]  [N O P] [S T] [V Y Z]

Question: is the tree always height-balanced (all leaves at the same
depth)? Answer: yes, since we only grow the tree "up", by splitting
the root.

So, we have maintained the properties. What about running time? Say
we do linear search at each node to find our position:

    Time for search is O(depth * t).

Say we use binary search instead:

    Time for search is O(depth * log(t)).

What is the maximum depth? O(log_t(n)). E.g., if t = 1000 and n = 1
billion, then the depth is at most 3 (or 2 if you count starting at
0). What does this say about search time if we use binary search
within each node?

    Time for search is O(log_t(n) * log(t)) = O(log n).

Inserts: same as search, but we also need to split nodes, which takes
O(t) time each. In the worst case, we need to do this every step of
the way down:

    Worst-case time for insert is O(depth * t).

This is fine if t is small. Not so good if t is large. *BUT* notice
the following: if we create the tree from n inserts, I claim we can
have made at most O(n/t) splits total in the process. Why? Because
each split creates a new node, and there are O(n/t) nodes total
(every non-root node holds at least t-1 keys). So the *total* time
spent on splits over all n inserts is O(n/t) splits * O(t) per split
= O(n), which means we are only spending O(1) time on splits per
insert, on average.

This is a very different kind of "on average" than we've seen before.
We are not assuming the input is random. We are not flipping coins in
our algorithm. What we're saying is that for ANY sequence of n
inserts, even though some might be more expensive than others, the
total time spent on splits is O(n), and so our *amortized time per
insert* spent on splits is just O(1).

Actually, the most costly thing in the insert is putting the element
into the leaf, which takes time O(t). So, the amortized worst-case
time per insert is:

    O(log n) + O(1) + O(t) = O(log n + t)

[which is better than the O(depth * t) worst-case time]. (We'll talk
more about using amortization to analyze algorithms next week.)
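Here is a compact Python sketch of the search and insert procedures
described above, splitting full nodes preemptively on the way down.
The names (BTreeNode, split_child, etc.) and the use of Python lists
for the sorted key arrays are illustrative assumptions, not anything
fixed by the lecture:

    import bisect

    class BTreeNode:
        def __init__(self, leaf=True):
            self.keys = []      # sorted; at most 2t-1 keys
            self.children = []  # non-leaf: len(children) == len(keys)+1
            self.leaf = leaf

    def split_child(parent, i, t):
        # Split parent's full child i (it has 2t-1 keys): pull the
        # median up into the parent; each half keeps exactly t-1 keys.
        full = parent.children[i]
        right = BTreeNode(leaf=full.leaf)
        median = full.keys[t-1]
        right.keys, full.keys = full.keys[t:], full.keys[:t-1]
        if not full.leaf:
            right.children = full.children[t:]
            full.children = full.children[:t]
        parent.keys.insert(i, median)
        parent.children.insert(i+1, right)

    def search(node, x):
        # Binary search in the node's array, then recurse to a child.
        while True:
            i = bisect.bisect_left(node.keys, x)
            if i < len(node.keys) and node.keys[i] == x:
                return True
            if node.leaf:
                return False
            node = node.children[i]

    def insert(root, x, t):
        """Insert key x; return the (possibly new) root."""
        if len(root.keys) == 2*t - 1:    # full root: grow the tree "up"
            new_root = BTreeNode(leaf=False)
            new_root.children = [root]
            split_child(new_root, 0, t)
            root = new_root
        node = root
        while not node.leaf:             # split full nodes on the way down
            i = bisect.bisect_left(node.keys, x)
            if len(node.children[i].keys) == 2*t - 1:
                split_child(node, i, t)
                if x > node.keys[i]:     # median moved up; pick a half
                    i += 1
            node = node.children[i]
        bisect.insort(node.keys, x)      # O(t) insert into the leaf
        return root

    # quick check with t=3, using the letters from the example above
    root = BTreeNode()
    for k in "HMRABCDKLNOTYZEF":
        root = insert(root, k, t=3)
    assert search(root, "C") and not search(root, "Q")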
More facts about B-trees:

 * B-trees are used a lot in databases in practice because they fit
   in nicely with the memory hierarchy. Basically, choose t so that a
   node fits into a disk page. Keep the root in fast memory (and
   maybe even its children), so that in the example above, each
   insert or lookup would use only 1 or 2 disk accesses.

 * if we use t=2, we have a 2-3-4 tree. What's special about 2-3-4
   trees is that they can be implemented efficiently as binary trees,
   using "red-black trees".