Lecture 10/21/97 o Finish up sorting o Dictionary/Table ADT o hashing - separate chaining. Handout: assignment 5 - hash function - example ------------------------------------------------------------------ Remember from last time: - bucket sort O(n+r) - radix sort O(n*L), n = # of keys, L = # of digits, radix is small so can ignore r MSC: sort into buckets from most to least sig character LSC: sort into buckets from least to most. need to use stable version of bucket sort. Algorithm is simple: foreach digit, d, starting with LEAST significant digit bucket sort by digit d into r buckets concatinate the r buckets Why does this work? Consider two strings like gb1y and gb2a What do we mean that gb1y should be first alphabetically? We mean that if you start at most significant and go down, at the first place they're different gb1y is less. See what happens in LSC radix sort: when you sort according to that place, they will be in correct order. .... gb1y ..... gb2a .... From then on, will stay in correct order---because we're using STABLE method. Another example: amit asaf asaf cato amir cato cato asat amir amit asat ---> amir ---> amir ---> amit ---> asaf amir amit amit asaf asat asaf asat cato asat cato (MSC version makes fewer comparisons since can stop when buckets are size 1, but more complicated to code. LSC has faster inner loop but makes more comparisions.) ---------------------------------------------------------------------- Sorting recap. Why: - Important problem. - Lots of times you want to do something that's like sorting but a little different. - Illustrates ideas avg case worst case features Insertion sort n^2 n^2 easy to code, good if "almost sorted" Merge sort n*log n n*log n good worst-case. Conceputally easy Divide-and-conquer Quick sort n*log n n^2 fast inner loop. Divide-and-Conquer linear on average selection Radix sort n*L n*L Least-Significant-Character version just L iterations of bucket-sort Good for sorting strings of same length Heap sort n*log n n*log n Good worst-case + no extra space heaps useful in priority queues (talk about PQs later) -------------------------------- NOW MOVE ON to broader issue: one of main uses of computing is for storing and retrieving databases of information. E.g., library -- want to pull up record for a book, or find entries that match keywords. Hospital --- want to pull up patient record given the name, or maybe want to find all patients that have a certain disease, be able to cross-ref with some other condition, etc. --> Just lots and lots of times want good database. Just going to talk about simple version here. Data abstraction for supporting some of these operations called DICTIONARY or TABLE ADT. DICTIONARY supports insert: insert(key, value) (Think of "key" as word. "value" as definition.) search: search(key) returns value (or maybe NULL if not there) usually: delete: delete(key) optional: change_value(key, newvalue) Question: suppose we have a fixed database: only need to do search (or much more common than insert or delete). E.g., say in charge of a real dictionary. What's a good implementation? --> Well, it depends on the database size. E.g., small company with 100 employees. EID# is 1-100. Whats a good implementation? -> an array But, what about the same company which uses SS# as EID -> SORTED ARRAY. Why? Sorted array: insert is O(n), delete is O(n), but search is O(log n) using BINARY SEARCH. -------------------------------------------------- Sorted array good if only care about search. What if care about Search, Insert, Delete. SEARCH INSERT DELETE SORTED ARRAY O(log n) O(n) O(n) UNSORTED ARRAY O(n) O(1) O(n) B.S.T.On Average: O(log n) O(log n) O(log n) Worst case O(n) O(n) O(n) -- Binary search trees. M / \ C R / \ / A G P Search, insert: O(height of tree) "deletes" tricky - saw in recitation a while back. But still can do in time O(height of tree). Later on: balanced search trees. -> guaranteed to keep balanced Balanced Search trees: O(log n) O(log n) O(log n) worst-case. Now, something else. Might think O(log n) is best can get. Turns out can actually do better. Probably most useful of more "advanced techniques" we will talk about, called HASHING. HASHING on avg: O(1) O(1) O(1) HASHING ------- Hashing is great alg. technique. Useful all over. Reason: pretty simple to do, AND, on average, all ops are O(1). IDEA: map the large key space into a small index set so a table much smaller then the key range will hold all values. called a HASH TABLE. Function h called HASH FUNCTION. Hash function takes keys (words or whatever) and outputs index in array. Get back to what good functions h are. Point of h: to insert (key, value): apply h to key. Gives index in hash table --> place to insert. To search, apply h(key). Give place to look. One problem: might get COLLISIONS: different keys map to same location. Need method for COLLISION RESOLUTION. - Chaining - Open Addressing - Buckets Today, talk about one called SEPARATE CHAINING. Idea: Each element of hash table A is really a list. Have array of lists (lot like adjacency lists you used in Hwk 2). **Insert(key,value): h(key) --> index i. Put (key, value) at top of list A[i]. **Search(key): apply h(key) --> index i. search list at A[i] to see if key is there. Say wanted to make a restaurant guide. 0 1 2 3 4 5 6 Start with array of lists, all NULL. +---------------------------+ | / | / | / | / | / | / | / | +---------------------------+ - insert some names. -> What's insert time? Just time to compute h. --->doesn't depend on # restaurants ==> O(1) time. Say want to search. Search (...). Apply h. Get 4. Now search through list to compare keys. Find key, return description. If don't can return "not found." To delete: do delete from list. That's the basic idea of hashing, with SEPARATE CHAINING method. Reason called "Separate chaining" is have separate lists, or chains, for each index. Let's analyze this. Inserts just took time O(1) -> just put onto top of lists. But: what do we need to be true if we want searching to be fast also? --> A: want lists to be short. Eg, say have 1000 restaurants. Make array of size 2000, say. Don't want all to hash to same place. Want h to be UNIFORM and RANDOM SO SPREADS THINGS OUT. If M places in array. If h was UNIFORM and RANDOM, every OTHER key has 1/M chance of hashing to same place as your key. If N objects, each has 1/M chance, so if RANDOM, then expect about N/M collisions. So, if N = O(M) - say have only twice as many elements as size of array, search is constant time on average. Delete too. MAIN PROPERTIES WE WANT IN HASH FUNCTION: 1. Spreads keys out well between 0,1,..., M-1. 2. easy to compute. POSSIBLE HASH FUNCTIONS FOR STRINGS. 1. Use first character. Number from 0-25. Is that good? No: bad range. A. Treat each letter as number from 1,...,26. Add them up. Mod by N. --> turns out also not so good. Anyone think of why? Hint: say you wanted to hash, say, 1000 5-letter words into array of size 2000. 2 problems. 1. not very big range -- all between 5 and 130. 2. Distribution turns out to often have big hump. Not very many zzzzz. Mostly near middle. Here's a generally good and pretty easy method. GENERALLY GOOD HASH FUNCTION: --> treat word as a large binary number, and MOD by size of hash-table. --> make size M be prime, or at least odd. (like a simple version of the universal hash function above) Eg, "word". In ascii, w = 119, o = 111, r = 114, d = 100. So could think of "word" as 119*256^3 + 111*256^2 + 114*256 + 100. Using base 256. Problem is: IF more than 4 characters, value is too big to convert to integer and then mod. Can't just make into number, then mod. Trick: What you can do is "walk" down the characters, moding as you go. Eg, can think of "word" as ( (w * 256 + o) * 256 + r)*256 + d. "Horner's rule" Just like in decimal: taking 173 and making it 1736 is multiplying original by 10 and then adding 6. At each step, can take mod. Eg: int hash(char *key, int size) { int h = 0; for(; *key != '\0'; key++) h = (h*256 + (*key)) % size; return h; } "256" not particularly important. Could replace by another number like 32 or whatever. Just nice interpretation if use 256. Why use prime, or odd table sizes? Not an extremely scientific reason, but ... why would table size 256 be bad? A: only using last letter. What about 1024? -> only using last 2 letters. Basically, odd, or even more, prime, makes sure nothing funny like this happens. ****ENDED HERE**** Questions? Applications. E.g., c-shell: when you type command, how does unix know where executable is? -- uses "hash table" Let's look at another example. Remember on hwk 2, we gave you a list of words, plus a maze created from them. 2 words were adjacent if they differed in 1 letter. Question: how can you create a maze from the words?? One way: for each word, look up all others and see if any differ in just one letter. If so, output the edge. Works, but slow What's time???? Theta(n^2) Pretty slow with 2000 words, would be really bad with 10,000. Here's a way to do it fast with hashing. Idea: make 5 hash tables. Table i corresponds to ignoring ith letter. For each word, hash once into each table. Eg, shave. Hash into 1st table ignoring 's'. Into 2nd ignoring 'h',..) Now, read in next word and hash it too. When you hash a word to an index, if there are other words there already, check to see if they are neighbors. If two words differ in just letter i (shave, share differ in letter 4) when hash into ith table, will land in same "bucket". (both times hashing "shae"). If they're neighbors, output the edge. Point: if hash function is good --> not a lot of "collisions" --> most of others in bucket are real neighbors (collisions correspond to others there that are not really neighbors in maze). Eg: 2000 words, tables of size 3000. If hash function is good, average number of collisions < 1. ( O(1) in any case). So, reduce from O(n^2) time to LINEAR in number of edges. Most of time is spent in printing routine.