15-451 Algorithms 09/30/04
* Hashing
- basic setup and idea
- Universal hashing
- Perfect hashing
-----------------------------------------------------------------------------
Hashing: a great practical tool, with interesting & subtle theory too.
Basic setting: dictionary problem just like we've been discussing so
far. We have a bunch of items we want to store for efficient lookup.
Two versions:
- static: Given set S of items. Want to store so that we can do
lookups quickly. E.g., fixed dictionary.
- dynamic: sequence of inserts, lookups, and perhaps delete requests.
Want to do these all efficiently.
For the first problem we could use a sorted array with binary search
for lookups. For the second we could use a balanced search tree. But
hashing gives an alternative approach that is often the fastest and most
convenient way of solving these problems. Example: writing an
AI-search program, and you want to store situations that you've
already solved (board positions or elements of state-space) so that
you don't redo the same computation when you encounter them again.
Also a lot of other uses in cryptography, networks, complexity theory.
FORMAL SETUP
------------
- Keys come from some large universe U. (e.g., all < 50-character strings)
- Some set S in U of keys we actually care about (which may be static
or dynamic). N = |S|.
- do inserts and lookups by having an array A of size M, and a
HASH FUNCTION h: U -> {0,...,M-1}. Given element x, store in A[h(x)].
[if U was small (like 3-character strings) then you could just store
x in A[x] like in bucketsort. But the problem is that U is big.
That's why we need the hash function.]
- Will resolve collisions by having each entry in A be a linked list.
Collision is when h(x) = h(y). There are other methods but this
is cleanest -- called "separate chaining". To insert, just put at top
of list. If h is good, then hopefully lists will be small.
Nice properties: searches are easy: compute index i=h(x) and then
walk down list at A[i] until we find it. Inserts and deletes are easy too.
Properties we want:
- M is O(N)
- Elements are spread out. Ideally all buckets (lists) of constant size.
- h is fast to compute. In analysis today we'll be ignoring time to
compute h, viewing it as constant. But worth remembering in the back
of your head that h shouldn't be crazy.
Given this, time to do a lookup for item x is O(length of list at
h(x)). Same for delete. Insert just takes time O(1). So, important
thing is to be able to analyze how big these lists get.
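The setup above can be sketched in a few lines. This is a minimal
illustration, not code from the lecture: the class name is made up, we
store bare keys rather than key/value pairs, and Python lists stand in
for linked lists (so "insert at top" is conceptually O(1) even though
list.insert(0, ...) isn't).

```python
class ChainedHashTable:
    """Separate chaining: each slot of A holds a list of keys."""
    def __init__(self, m, h):
        self.m = m          # table size M
        self.h = h          # hash function h: U -> {0, ..., M-1}
        self.table = [[] for _ in range(m)]

    def insert(self, key):
        # put the new item at the top of its list
        self.table[self.h(key)].insert(0, key)

    def lookup(self, key):
        # O(length of list at A[h(key)])
        return key in self.table[self.h(key)]

    def delete(self, key):
        bucket = self.table[self.h(key)]
        if key in bucket:
            bucket.remove(key)
```

Usage would look like t = ChainedHashTable(8, lambda x: hash(x) % 8);
t.insert("apple"); t.lookup("apple").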
Basic intuition: one way to spread things out nicely is to spread them
randomly. Unfortunately, we can't just use a random number generator
to decide where each item goes, because we need h to support lookups:
h(x) has to give the same answer at lookup time as it did at insert
time. So, we want something "pseudorandom".
First give bad news. Then good news.
BAD NEWS: WORST-CASE BEHAVIOR IS BAD
------------------------------------
- Claim: for any hash function h, if |U| >= (N-1)*M + 1, there exists
a set S of size N that all hash to the same location.
- Proof: by the pigeon-hole principle. h maps U into only M locations,
  so some location receives at least ceil(|U|/M) >= N elements of U;
  let S be N of those elements.
- So, this is partly why hashing seems so mysterious -- how can you
claim hashing is good if for any hash function you can come up with
ways of foiling it? One answer is that there are a lot of simple hash
functions that work well in practice. But what if we want to have a
guarantee for *every* set S?
A nice idea: use randomization in analogy to randomized quicksort.
We'll use some random numbers in our construction of h. (h itself
will be a deterministic function, of course). What we'll show is that
for *any* set S (don't need to assume S is random), if we then pick h
in this random way, things will be good in expectation. Same kind of
guarantee as in randomized quicksort. This is idea of UNIVERSAL hashing.
Once we develop this idea, we'll use it for a really nice practical
application called "perfect hashing".
UNIVERSAL HASHING
-----------------
A set of hash functions H is *universal* if for all x != y in U,
Pr [ h(x) = h(y) ] <= 1/M
h<-H
Theorem: if H is universal, then for any set S in U, for any x in S, the
*expected* number of collisions x has [for a random h in H] is <= (N-1)/M
Proof: Each y!=x has <= 1/M chance of colliding with x. Adding up
over all y in S you get expected number <= (N-1)/M. Or, more formally:
- Let C_xy = 1 if x and y collide and 0 otherwise.
E[C_xy] = Pr(x and y collide) <= 1/M
- C_x = total # collisions for x = sum_y C_xy.
- E[C_x] = sum_y E[C_xy] <= (N-1)/M
Corollary: For any *sequence* of L lookups, the expected *total* time
for the lookups if we use a random h in H is O(L(1 + N/M)) -- so constant
time per operation if M=N.
Proof: By linearity of expectation, the expected total time is
just the sum of expected times for each lookup. Lookup time is
proportional to length of list = 1 + # of collisions.
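A quick empirical check of the theorem. Here a uniformly random
function plays the role of the universal family (for a truly random h,
Pr[h(x)=h(y)] is exactly 1/M, so it is universal); the function name
and parameters are just for illustration.

```python
import random

def avg_collisions(n, m, trials=2000, seed=1):
    """Estimate E[C_x]: average # of keys colliding with a fixed key x
    over many draws of a uniformly random function h into m slots.
    The theorem says this should be <= (n-1)/m."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # h assigns each of the n keys an independent random slot
        slots = [rng.randrange(m) for _ in range(n)]
        x_slot = slots[0]          # fix x = key 0
        total += sum(1 for s in slots[1:] if s == x_slot)
    return total / trials
```

With n = m = 50 the bound (n-1)/m is 0.98, and the empirical average
comes out close to that.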
Question: can we actually construct a universal hash family? If not,
then all of this is pretty vacuous. Answer is yes.
One clarification: we're not actually going to explicitly construct
all of H and then pick one h in H at random. Instead, we'll have a
randomized procedure for constructing h. In fact, it probably would
have been more intuitive to define H as a *probability distribution*
over hash functions, and say H is a universal probability distrib over
hash function if [rest of defn here]. But ``universal sets'' is the
way it's done in the literature. Analogy is randomized quicksort: can
think of it as picking an ordering on pivots at random from the set of
all n! orderings, but instead we find it more natural to just say:
each time you need to pick a pivot, pick one at random.
HOW TO CONSTRUCT: Matrix method
-----------------
Say keys are u-bits long. Say table size is power of 2, so an index is
b-bits long with M = 2^b.
What we'll do is pick h to be a random b-by-u 0/1 matrix, and define
h(x) = hx, where we do addition mod 2. These matrices are short and
fat. For instance:
        h         x       h(x)
    ---------    ---      ---
    [1 0 0 0]    [1]      [1]
    [0 1 1 1]    [0]   =  [1]
    [1 1 1 0]    [1]      [0]
                 [0]
Claim: For x!=y, Pr[h(x) = h(y)] = 1/M = 1/2^b.
Proof: First of all, what does it mean to multiply h by x? We can
think of it as adding some of the columns of h (mod 2) where the 1
bits in x tell you which ones to add. (e.g., we added the 1st and 3rd
columns of h above)
Now, take arbitrary x,y, x!=y. Must differ someplace, so say they differ
in ith coordinate and for concreteness say x_i=0 and y_i=1.
Imagine we first choose all of h but the ith column. Over remaining
choices of ith column, h(x) is fixed. But, each of the 2^b different
settings of the ith column gives a different value of h(y) [every time
we flip a bit in that column, we flip the corresponding bit in h(y).]
So there is exactly a 1/2^b chance that h(x) = h(y).
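The matrix method is easy to implement. In this sketch each column of
h is stored as a b-bit integer, so multiplying h by x is XOR-ing
together the columns selected by the 1 bits of x (the representation
and function name are implementation choices, not from the notes):

```python
import random

def make_matrix_hash(u, b, seed=None):
    """Pick a random b-by-u 0/1 matrix h; return h(x) = hx mod 2,
    where x is a u-bit key and the output is a b-bit index."""
    rng = random.Random(seed)
    # column i of the matrix, packed into a b-bit integer
    cols = [rng.getrandbits(b) for _ in range(u)]

    def h(x):
        out = 0
        for i in range(u):
            if (x >> i) & 1:       # bit i of x selects column i
                out ^= cols[i]     # addition mod 2 is XOR
        return out                 # index in {0, ..., 2^b - 1}

    return h
```

Note that h is linear over GF(2): h(x XOR y) = h(x) XOR h(y), which is
exactly the "adding columns" view used in the proof.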
There are other methods based on multiplication modulo primes too.
=====================================================================
Question: if we fix the set S, can we find hash function h such that
*all* lookups are constant-time? Yes. This leads to...
PERFECT HASHING
---------------
Hash function is "perfect" for S if all lookups involve O(1) work.
METHOD #1
---------
Say we are willing to have a table whose size is quadratic in the size
N of our dictionary S. Then, here is an easy method. Let H be
universal and M = N^2. Pick random h in H and hash everything in S.
Claim: Pr(no collisions) >= 1/2.
[So, we just try it, and if we got any collisions, we just try a new h]
Proof:
- how many pairs (x,y) in S are there? {N \choose 2}
- for each pair, the chance they collide is <= 1/M by defn of universal.
- so, Pr(exists a collision) <= {N \choose 2}/M < 1/2.
This is like the "other side" to the "birthday paradox". If the
number of days is a lot *more* than (# people)^2, then there is a
reasonable chance *no* pair has the same birthday.
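Method #1 in code. As a stand-in for drawing h from a universal
family, this sketch salts Python's built-in hash with a random value;
that is an assumption for illustration, not the construction from the
notes.

```python
import random

def perfect_hash_quadratic(S, max_tries=100):
    """Method #1: table of size M = N^2. Pick a random hash and check
    for collisions among S; each try succeeds with prob >= 1/2, so we
    rarely need more than a few tries."""
    n = len(S)
    m = max(1, n * n)
    for _ in range(max_tries):
        salt = random.getrandbits(64)
        # salted built-in hash: a stand-in for a universal family
        h = lambda x, s=salt: hash((s, x)) % m
        if len({h(x) for x in S}) == n:   # no collisions
            return h, m
    raise RuntimeError("kept colliding; unexpected for a good family")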
What if we want to use just O(N) space?
METHOD #2:
---------
History: This was a big open question -- posed as "should tables be
sorted" -- for a fixed set can you get constant lookup time with only
linear space? Was followed by more and more complicated attempts,
finally solved using nice idea of universal hash functions in 2-level
scheme.
Proposal: hash into table of size N. Will get some collisions.
Then, for each bin, rehash it using Method #1, squaring the size of
the bin to get zero collisions.
E.g., if 1st hash function maps {a,b,c,d,e,f} to
[--,{a}, {b,c}, {d,e,f}],
then final result would look something like:
[--,h_2, h_3, h_4]
h_2 would just be the function h_2(x) = 0. h_3(x) would be a function
to table of size 4, h_4 would be function to table of size 9. Point
is: using Method #1, we can find h_i with no collisions by picking
from universal H and trying it -- if it doesn't work we try again
(each time, Pr(success)>= 1/2).
By *design* this is O(1) time for lookups, since just compute two hash
functions. Question is: how much space does it use?
Space used is O(N) for the 1st table (assume it takes a constant
amount of space to store each hash function), plus O(sum of |B_i|^2)
for the other tables (B_i is ith bucket). So, all we need to do is prove:
Theorem: if pick initial h from universal set H, then
Pr[sum of |B_i|^2 > 4*N] < 1/2.
Proof: We'll prove this by showing that E[sum_i |B_i|^2] < 2*N
(Why does this give us the result we need? <-- "Markov's inequality")
Now, the neat trick is that one way to count this quantity is to
count the number of ordered pairs that collide, including element
colliding with itself. E.g, if bucket has {d,e,f}, then d collides
with each of d,e,f; e collides with each of d,e,f; f collides with
each of d,e,f; so get 9.
So,
sum_i |B_i|^2 = # of ordered pairs that collide (allowing x=y)
= sum_x sum_y C(x,y) [C(x,y)=1 if in same bucket, else 0]
So,
E[sum_i |B_i|^2] = sum_x sum_y Pr(x and y are in same bucket)
<= N + N(N-1) * (1/M), where the "1/M" comes from
definition of universal.
< 2*N. (since we set M = N)
So, final "hash function" is
1. compute i = h(x)
2. Now compute h_i(x) and use to lookup in table T_i.
Total space used is O(sum of |B_i|^2) = O(N).
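The whole two-level scheme can be sketched as below. Again a salted
built-in hash stands in for a universal family (an assumption for
illustration), and for brevity this sketch does not retry h1 when
sum |B_i|^2 exceeds 4N, though the theorem says doing so would
terminate quickly.

```python
import random

def _pick_hash(items, m, max_tries=200):
    """Draw salted hashes (stand-in for a universal family) until the
    items land in m slots with no collisions (Method #1)."""
    for _ in range(max_tries):
        salt = random.getrandbits(64)
        h = lambda x, s=salt: hash((s, x)) % m
        if len({h(x) for x in items}) == len(items):
            return h
    raise RuntimeError("kept colliding")

def build_two_level(S):
    """First hash into N buckets; then give each bucket B_i its own
    collision-free second-level table of size |B_i|^2."""
    n = len(S)
    salt = random.getrandbits(64)
    h1 = lambda x, s=salt: hash((s, x)) % n
    buckets = [[] for _ in range(n)]
    for x in S:
        buckets[h1(x)].append(x)
    second = []                       # (h_i, table T_i) per bucket
    for b in buckets:
        if not b:
            second.append((None, []))
            continue
        m_i = len(b) * len(b)
        h_i = _pick_hash(b, m_i)
        table = [None] * m_i
        for x in b:
            table[h_i(x)] = x
        second.append((h_i, table))
    return h1, second

def lookup(h1, second, x):
    """O(1): compute i = h1(x), then check slot h_i(x) of table T_i."""
    h_i, table = second[h1(x)]
    return h_i is not None and table[h_i(x)] == x
```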
============================================================================
Other uses of hashing
=====================
Suppose we have a long sequence of items and we want to see how many
*different* items are in the list. What's a good way of doing that?
One way: create hash table; do lookup and then insert if not in
table; count number of inserts.
Now what if the list is really huge, so we don't have space to store
them all, but we are OK with just an approximate answer. E.g.,
imagine we're a router and watching a lot of packets go by, and we
want to see (roughly) how many different source IP addresses there
are.
Neat idea: say we have a hash function h that behaves like a random
function, and let's think of h(x) as a real number between 0 and 1.
One thing we can do is just keep track of the *minimum* hash value
produced so far (so we won't have a table at all). E.g., if keys are
3,10,3,3,12,10,12 and h(3)=0.4, h(10)=0.2, h(12)=0.7, then we get 0.2.
Point is: if we pick N random numbers in [0,1], the expected value of
the minimum is 1/(N+1). And, there's a good chance it is fairly close
(can improve estimate by running several hash functions and taking
median of mins).
Question: why use a hash function rather than just picking a random
number each time? That is because we care about the number of
*different* items, not just the number of items (that problem is a lot
easier: just keep a counter...)
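The minimum-hash-value idea in code. A salted built-in hash stands in
for an ideal random function into [0,1] (an assumption for
illustration; the function name and parameters are made up here):

```python
import random

def approx_distinct(stream, trials=31, seed=0):
    """Estimate # of *distinct* items: for each of several hash
    functions, track the minimum hash value seen (scaled to [0,1]).
    Since E[min of N uniforms] = 1/(N+1), each min suggests the
    estimate 1/min - 1; take the median over trials to reduce
    variance."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(trials)]
    mins = [1.0] * trials
    for x in stream:
        for j, s in enumerate(salts):
            # salted hash scaled into (0,1]; the +1 avoids a zero min
            v = ((hash((s, x)) % 2**32) + 1) / 2**32
            mins[j] = min(mins[j], v)
    mins.sort()
    return 1.0 / mins[trials // 2] - 1.0
```

Note the estimate uses no table at all: only `trials` floats of state,
no matter how long the stream is. (The median of minima is biased a
bit above N, but it is within a small constant factor.)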
=======================================================