Lecture 10/28/96 o growing a hash table dynamically * (growing the table smoothly) * rehashing when the load factor gets high o search trees GROWING THE HASH TABLE In the hashing methods we've talked about, you have to decide the size of the hash table BEFORE you start to use it. What happens if you underestimate the size needed? First of all, the performance will degrade. With separate chaining, it will degrade gracefully. With probed hashing, it will degrade catistrophically as the load factor approaches 1. What do you do when this happens? Well, what's the simplest thing to do? The simplest thing is to create a new hash table from scratch, and insert every element from the old table into the new -bigger- table. But this is slow. If you have S elements stored in the table, It's going to take time Omega(S) to rehash them. Let's work out an example. Suppose we're doing separate chaining, and we want an average of just 1 item in each bucket. Let N be the current size of the hash table. Suppose that whenever S>N, we increase N by 1 and rehash everything. Then each insertion into S will require us to do Omega(S) work, instead of O(1) work that hashing is supposed to take. But what happens if we DOUBLE the hash table size when S>N after an insertion. This will also take Omega(S) time, but it will not happen very often: N more insertions are required till it will happen again. Over a long sequence of operations, how long is this going to take? Analyzing a sequence of operations like this (rather than worrying about the cost of an individual operation) is called AMORTIZED analysis of algorithms. Let's work out the amortized cost of a sequence of insert, and search operations in a hash table where doubling is used, and the load factor is kept below 1. - Consider a sequence consisting of s searches, and i insertions. - Let's assume the searches take O(1) time (if they start taking too long, then you are using a bad hash function, and it should be fixed). So, the searches contribute O(s) time. - What about the insertions? - Let N be the current size of the table and - S be the number of elements in it. If (S+1) <= N, then the insertion takes O(1) time. If (S+1) > N, then the hash table is doubled N-->2N, and everything is rehashed. This costs O(S) time. - How long does this take? Let's analyze this according to the BANKER'S view. Every element of the hash table is going to have some number of CHIPS associated with it. For each insertion, we allot THREE chips to work with. Initially, there are no chips on anything. Every time we do any work, we're going to have to spend a chip. Bounding the number of chips spent bounds the work done. Whenever we insert an element: - we spend 1 chip to insert it into the table, and - we also put a chip on the element for future use, and - we put a chip on one other element in the table that does not have a chip. E.g., h(key) = key % M 0 1 2 3 NULL NULL NULL NULL N=4, S=0, insert 10, 5, 3, 7 0 1 2 3 + + + | | | 5 10 3 + | 7 N=4, S=4, insert 9 -> grow table and rehash 0 1 2 3 4 5 6 7 + + + + | | | | 10 3 5 7 Then, insert 9 0 1 2 3 4 5 6 7 + + + + + | | | | | 9 10 3 5 7 When we are done inserting 9, THen 9 has a chip and 10 has a chip. We won't grow the table again until 3, 5, and 7 have chips. Thus, we only grow the table when every element has a chip. By the time we need to rehash everything, every element must have chip to pay for this. This proves that the work is O(i) for the insertions. ----- Now we start looking at search trees. o Search trees and why they are useful (even if you know about hashing) - successor, predecessor - Maintaining and searching by rank - splitting and joining - prefix matching o The problem of unbalance o Approaches to the problem of unbalance - the rotation operation - balance conditions -- the red-black tree condition - self-adjusting (splay) trees (more next time) MOTIVATION Question: If you've got hashing that runs in O(1) time for lookups, inserts and deletes, what is the use of search trees that can be O(log n) at best? Answer: Search trees support operations that deal with the ORDER of the elements. Examples: - Find the successor (predecessor) of a given element in the set. - searching and computing the "rank" in that ordering. The RANK of an element in an ordered list is the number of elements before it in the list. Suppose I want to maintain a dynamic list of items, and I want to * compute the rank of an element. * given the rank, find the element with that rank. * awk example: a[1] = x, a[10] = y, a[2] = z, a[2] = ? - prefix matching Suppose I want to find all the words that begin with a given set of letters. Example from the link parser. Hashing cannot support any of these efficiently. Let's explore search trees in more detail and see how these things can be done with them. REVIEW (from previous lectures) You have already seen the definition of search trees: a set of nodes, each with a left child pointer and a right child pointer and a place to store the key of that node. class Node { private: Node * left, * right; int key; /* could be a char * or anything that can be ordered */ public: Node(k); Node* search(int ); Node* insert(int ); } The tree is ordered according to the keys: At every n, all the keys in the left subtree of n are less than n.key, and all those in the right subtree are greater than n.key. For example, here's a tree of 10 nodes. 5 / \ 3 6 / \ \ 1 4 9 / \ / 0 2 8 / 7 Here's a program that can search for a given key: Node* Node::search(int k) { if (this == NULL) return NULL; if (k < key) return search(left, k); if (k > key) return search(right, k); return this; } This can be written iteratively (instead of recursively) as: node * Node::search(int k) { Node* n = this; while (n != NULL) { if (k < n->key) n = n->left; else if (k > n->key) n = n->right; else return n; } return NULL; } Say you want to insert a new key into a tree, and that the tree does not already contain that key. Say have constructor Node(k) makes new node and puts key k in. It also sets left and right to NULL node *Node::insert(int k) { if (this == NULL) return Node(k); if (k < key) { left = insert(left, k); } else { right = insert(right, k); } return this; } [Aside: Iterative version: node *Node::insert(int k) { node * n=Node(k); node * c; c = this; if (c == NULL) return n; while(1) { if (k < c->key) { if (c->left == NULL) { c->left = n; return this; } c = c->left; } else { if (c->right == NULL) { c->right = n; return this; } c = c->right; } } } ] THE PROBLEM OF UNBLALANCE As you know if you keep inserting into a tree, it can become unbalanced, so that searches (and insertions) become very slow. O(N) in an N-node tree. This defeats the purpose, which is efficiency. Three approaches to solving this: - Variable branching factor (2-3 trees, or B-trees) - balance conditions: AVL trees or Red-Black trees - self-adjusting trees. 2-3 TREES We modify things slightly so that each node can store either 1 or 2 items, and it has either 2 or 3 children. +-+ |H| +-+ / \ +-+ <------- -----> +-----+ |D| |J - N| +-+ +-----+ /\ / | \ / \ / | \ +-+ +-----+ +-+ +-----+ +-----+ |A| |E - F| |I| |K - L| |O - P| +-+ +-----+ +-+ +-----+ +-----+ (You can think of a 3-node to be two 2-nodes together) Searching gets a little more complicated, cause you have to do several comparisons at each node to find out which way you have to go. Amazing fact: it is easy to insert elements into a 2-3 tree in such a way that the distance from the root to a NULL is constant. Here's how it works: Insertion in 2-3 tree: (1) Find which NULL to put the item into. (2) Put the item into the node containing the NULL. (a) If the node is now a 3-node, you're done. (b) If the node is now a 4-node, then split it into two 2-nodes, and send the middle key to its parent. Continue with step (2) at this new node with this new item. This splitting process continues up the tree as long as there is a 4-node. If the root splits, the new root is a 2-node. In this case the depth of the tree increases by 1. An insertion can propogate all the way to the root of the tree. - insert M into tree above: Tries to go into K-L node. So split into K, M and send L up +-+ |H| +-+ / \ +-+ <------- -----> +-----+ |D| |J - N| <- insert L into this +-+ +-----+ /\ / | \ / \ / | \ +-+ +-----+ +-+ +-+ +-+ +-----+ |A| |E - F| |I| |K| |M| |O - P| +-+ +-----+ +-+ +-+ +-+ +-----+ Now, J-N node needs to be split since J - L - N is too many keys for it. +-----+ |H - L|-----\ +-----+ \ / \ \ +-+ <------- ----->+-+ \+-+ |D| |J| |N| +-+ +-+ +-+ /\ / | \ / \ / | \ +-+ +-----+ +-+ +-+ +-+ +-----+ |A| |E - F| |I| |K| |M| |O - P| +-+ +-----+ +-+ +-+ +-+ +-----+ add: B, C -> +-----+ |H - L|-----\ +-----+ \ / \ \ +-----+ <------- ----->+-+ \+-+ |B - D| |J| |N| +-----+ +-+ +-+ / /\ / | \ / / \ / | \ +-+ +-+ +-----+ +-+ +-+ +-+ +-----+ |A| |C| |E - F| |I| |K| |M| |O - P| +-+ +-+ +-----+ +-+ +-+ +-+ +-----+ Add G: causes E-F node to split, then B-D node, then H-L node, then new root -> H +-+ |H| +-+ / \ +-+ +-+ |D| |L|-----\ +-+ +-+ \ / \ \ +-+ +-+ <------- ----->+-+ \+-+ |B| |F| |J| |N| +-+ +-+ +-+ +-+ / /\ / | \ / / \ / | \ +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-----+ |A| |C| |E| |G| |I| |K| |M| |O - P| +-+ +-+ +-+ +-+ +-+ +-+ +-+ +-----+ 2-3-trees are bad in practice because (1) they're complex to implement (2) they waste space 2-3 trees are very simple conceptually, and they generalize to B-trees, which are very useful for disk storage of large lists. In a B-tree, each node has between ceil(m/2) and m children.