18 Mar 1996

Outline

Hashing

We have three operations - insert, delete, member - that we want to perform. These are called dictionary operations.

Our goal is to implement these operations so that they require constant time on average.

Terminology:

  U     universe of possible elements (example: strings of characters)
  N     set of elements actually in the dictionary (eg, English words, or
        student ids); usually |N| << |U|
  M     hash table size
  h(x)  maps an element x in U to one of the M buckets 0...M-1

What properties do we want h(x) to have?

It should have the following properties:

  1. h(x) should spread keys out well between 0,1,..., M-1.
  2. h(x) should be easy to compute.
For example, if U is the set of integers from 0 to 1 billion, h(x) = x mod M might be a good hash function.
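
As a quick illustration of that example (M = 7 and the keys below are arbitrary choices for the demo), different keys can land in the same bucket, which is why we need a collision strategy like the chaining described next:

#include <stdio.h>

int main(void) {
    int M = 7;                         /* table size (arbitrary for the demo) */
    printf("h(15)  = %d\n", 15 % M);   /* bucket 1 */
    printf("h(22)  = %d\n", 22 % M);   /* bucket 1 again -- a collision */
    printf("h(100) = %d\n", 100 % M);  /* bucket 2 */
    return 0;
}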

In separate chaining, our hash table is an array of lists. Say M = 7, U = {A, B, C, D, E, F}, and h is the following:

  x    A B C D E F
 h(x)  3 5 4 3 4 4

Try hashing the elements of U in order.
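
Doing that, A lands in bucket 3, B in 5, and C in 4; then D collides with A in bucket 3, and E and F collide with C in bucket 4, so those buckets hold lists of two and three elements. Here is a minimal sketch of separate chaining in C; the names (Node, table, insert, member) and the assumptions that keys are strings and that some hash function h() exists are just for illustration.

#include <stdlib.h>
#include <string.h>

#define M 7                     /* table size from the example */

int h(char *key);               /* assume some hash function mapping keys to 0..M-1 */

typedef struct node {
    char *key;
    struct node *next;
} Node;

Node *table[M];                 /* each bucket is a linked list; all start out NULL */

/* insert: put key at the front of its bucket's list */
void insert(char *key) {
    Node *n = malloc(sizeof(Node));
    n->key = key;
    n->next = table[h(key)];
    table[h(key)] = n;
}

/* member: search only the one bucket that key hashes to */
int member(char *key) {
    Node *n;
    for (n = table[h(key)]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

Insert is constant time; member (and delete) only have to walk one bucket's list, which stays short if h spreads the keys out well.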

Choosing a Hash Function

At the end of the last lecture, we went really quickly through an example of a good hash function. Let's take a closer look.

Here's a generally good hash function: treat the word as a large binary number, and MOD it by the size of the hash table. Make the size M prime, or at least odd.

Eg, ``word''. In ASCII, w = 119, o = 111, r = 114, d = 100. So we could think of ``word'' as a base-256 number:

  119*256^3 + 111*256^2 + 114*256 + 100

There's a problem, though: if a word is more than 4 characters long, its value is too big to fit in an integer, so we can't just turn the whole word into a number and then mod. Trick: ``walk'' down the characters, modding as you go. Eg, you can think of ``word'' as
  ( (w * 256 + o) * 256  + r)*256 + d
This transformation is called Horner's rule.

Just like in decimal: turning 173 into 1736 means multiplying the original by 10 and then adding 6. And at each step, we can take the mod.

To implement this:

/* Hash a null-terminated string into a bucket in 0..size-1,
   using Horner's rule and taking the mod at every step. */
int
hash(char *key, int size) {
  int h = 0;
  for (; *key != '\0'; key++)
    h = (h * 256 + *key) % size;
  return h;
}

The constant 256 isn't particularly important; we could replace it by another number like 32. The interpretation is just nicer if we use 256, since then each character is exactly one base-256 digit.

Why use prime, or at least odd, table sizes? There isn't an extremely scientific reason, but ... why would a table size of 256 be bad? A: we would only be using the last letter. What about 1024? A: only the last two letters.

Basically, h(x) should depend on all of the bits in x. Making M odd or, even better, prime, makes sure nothing funny happens, like certain bits being ignored.
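
To see this concretely, here is a small (hypothetical) test of the hash() above: with size 256, every term except the last character is a multiple of 256, so only the last letter survives the mod; with a prime size like 101, the earlier letters matter too.

#include <stdio.h>

int hash(char *key, int size);   /* the hash function above */

int main(void) {
    /* With size 256, both words hash to 100, the code for their shared
       last letter 'd'; everything else is thrown away by the mod. */
    printf("%d %d\n", hash("word", 256), hash("herd", 256));

    /* With a prime size like 101, the two words land in different buckets. */
    printf("%d %d\n", hash("word", 101), hash("herd", 101));
    return 0;
}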

A real-life example of hashing: the C shell. When you type a command, how does Unix know where the executable is? It uses a hash table for faster searching.

Building a Maze

Remember that on hwk 2, we gave you a list of words, plus a maze created from them. Two words were adjacent if they differed in one letter. Question: how can you create a maze from the words?

One way: for each word, look at all the others and see if any differ in just one letter. If so, output the edge. This works, but it's slow. What's the running time?

Well, it's Theta(n^2). That's pretty slow with 2000 words, and it would be really bad with 10,000. Here's a way to do it fast with hashing.

Idea: make five hash tables. Table i corresponds to ignoring the ith letter. For each word, hash once into each table. Eg, ``shave'': hash into the first table ignoring 's', into the second ignoring 'h', and so on.

Now, read in the next word and hash it too. When you hash a word to an index, if there are other words there already, check whether they are neighbors. If two words differ in just letter i (``shave'' and ``share'' differ in letter 4), then when we hash into the ith table, we will land in the same bucket (both hash to ``shae''). If they're neighbors, output the edge.
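
Here is a rough sketch of that loop in C. Everything in it is illustrative and assumed: five-letter words, tables of size 3000, the hash() defined earlier, and made-up names (add_word, drop_letter). It is one way to realize the idea, not the homework's actual code.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define WORDLEN 5
#define M 3000

int hash(char *key, int size);      /* Horner's-rule hash from above */

typedef struct node {
    char word[WORDLEN + 1];
    struct node *next;
} Node;

Node *table[WORDLEN][M];            /* table i: words hashed with letter i ignored */

/* Copy word into key, leaving out letter i. */
void drop_letter(char *word, int i, char *key) {
    strncpy(key, word, i);
    strcpy(key + i, word + i + 1);
}

/* Hash word into all five tables, reporting an edge for every earlier
   word that agrees with it everywhere except letter i. */
void add_word(char *word) {
    char key[WORDLEN], other[WORDLEN];
    int i;
    for (i = 0; i < WORDLEN; i++) {
        int b;
        Node *n, *fresh;
        drop_letter(word, i, key);
        b = hash(key, M);
        for (n = table[i][b]; n != NULL; n = n->next) {
            drop_letter(n->word, i, other);
            if (strcmp(other, key) == 0)          /* real neighbor, not just a collision */
                printf("%s %s\n", n->word, word);
        }
        fresh = malloc(sizeof(Node));             /* then add word to bucket b of table i */
        strcpy(fresh->word, word);
        fresh->next = table[i][b];
        table[i][b] = fresh;
    }
}

Each pair of neighbors gets reported exactly once, in the one table whose ignored position is where the two words differ.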

Point: if the hash function is good, then there are not a lot of ``collisions'', so most of the other words in a bucket are real neighbors (collisions correspond to words in the bucket that are not actually neighbors in the maze).

For example, with 2000 words and tables of size 3000: if the hash function is good, the average number of collisions is less than 1 (and O(1) in any case). So we reduce the time from Theta(n^2) to LINEAR in the number of edges. Most of the time now is spent printing the edges.