Lecture 10/22/97 o Remember Quiz in one week. 10/30 o hashing contd - probed hashing (a.k.a. open addressing) for handling collisions * linear probing * clustering phenomena (define load factor) * how to ameliorate clustering: with double hashing - evaluating your hash table * making measurements on performance [Next lecture - growing a hash table dynamically * (growing the table smoothly) * rehashing when the load factor gets high ] PROBABILITY OF A COLLISION Last time we did separate chaining, to day we do open addressing. You might think that having different machinary to avoid collisons is not neccesary, just make the table really big, but alas even with big tables collisons are very likely. For instance, it is extremely likely that two people in this class have the same birthday. Why? Q(n) = prob that no-one has same B'day P(n) = 1 - Q(n), prob at least 2 Q(1) = 1. Q(2) = 1 * (364/365) since there are only 364 open days Q(3) = 1 * (364/365) * Q(363/365) etc. recurrence relation Q(1) = 1 Q(n) = Q(n-1)*(365-n+1)/365 Q(n) = 365! ------------------- 365^n * (365-n)! P(n) = 1 - Q(n) people prob at least 2 have same B'day ------ -------- 23 50% 35 80% 50 97% 60 99% Birthdays are hash table entries in a hash table of size 365. So with 50% prob we see a collison after inserting only 23 elements. or when table is 6.3% full. This is similar with large tables. Roughly, if your table is size M, you expect collisions to happen after you've hashed about square-root(M) elements. PROBED HASHING (A.K.A. OPEN ADDRESSING) FOR HANDLING COLLISIONS This is the original form of hashing. The data structure is just an array, no pointers. This means it is easy to write in Fortran. A comparison between this and separate chaining will be given later in this lecture. There's an array A[] of length N. Each item in the array stores a (key,value) pair, or an EMPTY mark if this slot in the array is empty. The array starts out EMPTY and at least one bigger than the number of things you'll ever store there. Here are the algorithms: search(key): Compute p=hash(key) (the hash function) while (A[p]==EMPTY AND A[p].key != key) p= (p+1) % N; (each step here is called a probe) if (A[p]==EMPTY) return NULL; if (A[p] == key) return A[p].value; insert(key, value): if (search(key) != NULL) return; compute p as search(key) does. A[p] = (key,value); delete(): not supported Think about why is delete hard. Example: hash function (not so good, but illustrative): key hash(key) h2(key) --- --------- ------- a 3 2 b 0 4 c 2 3 d 0 1 e 1 1 f 0 6 The initial array of 7 elements: 0 1 2 3 4 5 6 EMPTY EMPTY EMPTY EMPTY EMPTY EMPTY EMPTY Say we insert a,b,c,d,e into this hash table (ignore the data items for a moment). Say hash(a)=3, hash(b)=0, hash(c)=2, hash(d)=0, hash(e)=1. After inserting a,b,c we have: 0 1 2 3 4 5 6 b EMPTY c a EMPTY EMPTY EMPTY Now, when we insert d, where does it go? It tries to hash to position 0, but there's already something there, so it advances p to 1. This is empty, so it puts d there. The result is: 0 1 2 3 4 5 6 b d c a EMPTY EMPTY EMPTY Now, suppose we search for b,c, or a. Then we find what we're looking for immediately. However, if we search for d, we first try position 0. We find that b is there, so we continue in the next position, which is 1, where we find d. Now, let's insert e. hash(e)=1, so we try to put it into positions 1, 2, 3, before finally landing on 4, which is EMPTY. The resulting table is: 0 1 2 3 4 5 6 b d c a e EMPTY EMPTY What happens when we search(f)? Well, it tries positions 0, 1, 2, 3, 4, and then stops on EMPTY position 5, and returns NULL. Which is correct, cause f is not in the table. Claim 1: This algorithm is correct. That is, when using the above search and insert algorithms, search always returns the correct answer. Proof: The algorithm always maintains the following invariant: If x is in the table at position p, then all of the positions hash(x)%N, (hash(x)+1)%N.....(p-1)%N are non-empty. It's easy to verify this invariant is maintained by the insertion algorithm. Because adding a new key y to the table cannot break the invariant for any x already in the table (it only makes a non-empty cell empty). Furthermore the invariant holds for y, because the place p you end up putting y into was reached by a sequence of probes through non-empty cells. --QED. Claim 2: The search and insert algorithms always terminate when there's at least one EMPTY cell in the table. Proof: The probe sequence hash(x), (hash(x)+1) % N, (hash(x)+2) % N, .... will hit every element of the table, and the search MUST stop when it finds an empty cell. --QED What happens to this algorithm if we use all the empty cells in the table? What happens to unsuccessful search if we let it get to the point where there is only one EMPTY cell in the table? What happens if you implement deletion in the obvious way: Just search for a key, and mark its cell empty? Does this work? ANALYZING THIS ALGORITHM As you can see, the efficiency of this approach (for unsuccessful search) depends on having lots of EMPTY cells to stop the search. Let's do a rough analysis of this. How many probes will an unsuccessful search take? Let's suppose, optimistically, that the EMPTY cells are distributed perfectly uniformly in the table. Let N be the size of the table, and S be the number of elements stored in the table, and E=N-S, the number of empty cells in the table. By the uniformity assumption, the table is circular and consists periodic blocks that look like .....NE NE NE NE EMPTY NE NE NE NE EMPTY NE NE NE NE EMPTY..... Each block in the period has N/E cells in it. An unsuccessful search is going to start in one of these blocks, at a random place. Therefore, on average, it will be in the middle of such a block. Thus, the number of probes will be the (block size)/2 = (1/2)(N/E). The LOAD FACTOR of a hash table is the ratio of the number of elements in the table to the size of the table. This is S/N in the notation above. (So, in the above example, after inserting e, the load factor is 5/7 = .714.) N N 1 1 average unsuccessful search cost = --- = ------ = - --- 2E 2(N-S) 2 1-L Where L is the load factor. So as the load factor approaches 1, the cost of unsuccessful search goes to infinity. It therefore behooves us to prevent the load factor from getting near 1. PRIMARY CLUSTERING The above analysis is very optimistic, because of a phenomenon called PRIMARY CLUSTERING. The algorithm tends to be much worse. Suppose that by chance there forms a "cluster" of K non-empty cells in a row in the table. Then anything hashed to ANY of these K cells gets put into the same empty sell at the end of this cluster, which makes the cluster one bigger. And as the cluster grows, the probability of using that empty cell at the end of the cluster grows! So in this sense, the algorithm is unstable: Once a big cluster starts to form, it tends to grow faster. A detailed analysis of this phenomenon is beyond the scope of this course. A solution to this problem is provided by DOUBLE HASHING There are other probe sequences besides the linear one: h%N, (h+1)%N, (h+2)%N, (h+3)%N.... where h=hash(x). Suppose that we use another hash function hash2(x). Consider the following new probe sequence (d=hash2(x)); h%N, (h+d)%N, (h+2d)%N, (h+3d)%N... Now let's assume that d and N are relative prime so that this probe sequence is guaranteed to hit every element of the table. The idea is that different elements have different probe sequences. The algorithm still works. The proof has a different invariant. Namely If x is in the table at position p, then all positions of the probe sequence leading from hash(x)%N to p are non-empty. It is easy to see why this is true. It's clearly true when you insert x, and adding other new elements to the table does not break it. Double hashing removes the problem of primary clustering. EXAMPLE OF DOUBLE HASHING key hash(key) h2(key) --- --------- ------- a 3 2 b 0 4 c 2 1 d 0 2 e 1 1 f 0 6 The initial array of 7 elements: 0 1 2 3 4 5 6 EMPTY EMPTY EMPTY EMPTY EMPTY EMPTY EMPTY insert a, b, c and since empty same as before 0 1 2 3 4 5 6 b EMPTY c a EMPTY EMPTY EMPTY Now, when we insert d, where does it go? It tries to hash to position 0, but there's already something there, so it advances p to 1. This is empty, so it puts d there. The result is: 0 1 2 3 4 5 6 b EMPTY c a d EMPTY EMPTY searching for a,b,c still as before and d is ok to: try 0, then try 2, then 4. Notice that we didn't create a big cluster as before. Now, let's insert e. since empty, done. 0 1 2 3 4 5 6 b e c a d EMPTY EMPTY What happens when we search(f)? Well, it tries positions 0, 6 then stops on EMPTY and returns NULL. Which is correct, cause f is not in the table. This example is a bit contrived. No matter what you do you will get bad behavior. Why? The load factor is so high. When it gets this high things just cost too much. PERFORMANCE MEASUREMENTS The performance of all hashing algorithms (both probed hashing and separate chaining) will be bad if you use a bad hash function, or if the load factor gets too high. (In separate chaining, the load factor is defined in the same way: S/N. It's the average number of elements in each hash bucket.) The way to deal with this uncertainty in practice is to install code that monitors the behavior of your hashing algorithm. This can be done very easily. Simply keep a counter that measures the total work done (number of probes, or list elements traversed), and a counter that measures the number of hashing operations done (inserts, deletes or searches). If the ratio of the work to the number of operations gets high, then something is wrong with your hashing method. Your program can recognize when this is happening, and you can take action by expanding the hash table (more on this next time) or by warning you that perhaps you are not using a good enough hash function.