RECITATION NOTES 15-451 Algorithms 10/06/04
Topics:
* search trees vs hashing,
* augmenting search tree data structures (if didn't do it last time)
* another way of getting universal hashing [if time]
==========================================================================
Search trees versus hashing: hashing is very convenient if you just
need to do inserts and lookups, but what are some operations that are
better suited for search trees?
- lookup-by-prefix. E.g., type in first few letters of someone's
name and have it output all names in the database that begin with
those letters.
- find-in-range[key1, key2] (output all keys between key1 and key2)
- a version of lookup(x) that returns the closest key to x when x is
not in the database.
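To see why a search tree handles these cleanly, here is a minimal Python
sketch of find-in-range on a plain (unbalanced) BST — the Node class and
the sample keys are made up for illustration. The key point is that a whole
subtree can be pruned whenever the interval lies entirely to one side of
the current node:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Standard BST insertion (duplicates ignored)."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def find_in_range(root, key1, key2, out):
    """Append all keys in [key1, key2] to out, in sorted order.
    Subtrees that cannot contain keys in the interval are never visited."""
    if root is None:
        return
    if key1 < root.key:                 # left subtree may hold keys >= key1
        find_in_range(root.left, key1, key2, out)
    if key1 <= root.key <= key2:
        out.append(root.key)
    if key2 > root.key:                 # right subtree may hold keys <= key2
        find_in_range(root.right, key1, key2, out)
```

Hashing offers nothing comparable: the keys in [key1, key2] are scattered
over the table, so you would have to scan everything.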
#############if didn't do this last time#############
Now, what if we want to do the following operations:
- output the kth smallest element (let's assume all keys
are distinct)
- do a version of lookup(x) that tells us the rank of x (how many
keys are smaller than x).
If we had a sorted array, these would be easy. To output the kth
smallest element, we just output A[k]. To get the rank of x, we just
do binary search to find it, and then output which location it's in.
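In Python the sorted-array versions are one-liners via the standard bisect
module (the array contents below are made up):

```python
import bisect

A = [3, 7, 11, 19, 23]   # sorted array of distinct keys

def kth_smallest(A, k):
    """kth smallest element, with k 1-indexed."""
    return A[k - 1]

def rank(A, x):
    """Number of keys smaller than x (binary search)."""
    return bisect.bisect_left(A, x)
```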
So this brings us to the question: how could we do this with search
trees?
Auxiliary information in search trees:
=====================================
By adding extra information to the nodes of a binary search tree, it's
often possible to dramatically expand what can be done. However,
the data you add has to have two properties:
(1) it's sufficient to solve the problem you want to solve
(2) the values of the data in a node can be maintained efficiently.
So, here would be a *bad* way of solving the problem: have each node
store the rank of its key. Why is this bad? Because if you insert a
new key (that, say, is smaller than everything else) you may have to
update *everyone's* rank. So we don't have property (2).
What's a better way? One thing we can do is store the size of the
subtree rooted at each node in that node.
- how do we use this to solve the problems?
- how do we maintain these when we do an insertion or tree rotation?
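Both questions can be answered concretely. The following Python sketch
augments an unbalanced BST with subtree sizes (a real implementation would
layer the same idea onto a balanced tree): select(k) and rank(x) walk down
comparing k against the size of the left subtree, insertion fixes up the
size field of each node on the search path, and a rotation only needs to
recompute the sizes of the two rotated nodes.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1                       # number of keys in this subtree

def size(n):
    return n.size if n else 0

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    root.size = 1 + size(root.left) + size(root.right)  # local O(1) fix-up
    return root

def select(root, k):
    """kth smallest key, 1-indexed (assumes 1 <= k <= root.size)."""
    r = size(root.left) + 1                 # rank of root within its subtree
    if k == r:
        return root.key
    if k < r:
        return select(root.left, k)
    return select(root.right, k - r)

def rank(root, x):
    """Number of keys strictly smaller than x."""
    if root is None:
        return 0
    if x <= root.key:
        return rank(root.left, x)
    return 1 + size(root.left) + rank(root.right, x)

def rotate_left(x):
    """Left rotation; only x and its old right child need size updates."""
    y = x.right
    x.right = y.left
    y.left = x
    x.size = 1 + size(x.left) + size(x.right)
    y.size = 1 + size(y.left) + size(y.right)
    return y
```

Contrast with the rank-in-every-node idea: here an insertion or rotation
touches only O(depth) size fields, so property (2) holds.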
===================================================
Here's another method of doing universal hashing that requires less
computation than the one in lecture, and is closer to what you likely
coded up in 15-211.
The hashing scheme:
Let's assume keys are in {0,1,...,U-1}, and we hash down to {0,1,...,M-1}.
Pick a prime p >= U. (Or, think of U itself as prime.) Define
       h_{a,b}(x) = ( (a*x + b) mod p ) mod M.
The family H: pick a at random from {1,...,p-1} and b at random
from {0,...,p-1}.
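In code, drawing one function from H looks like this (a Python sketch with
made-up values of U, M, and p):

```python
import random

U = 100      # keys are in {0,...,U-1}
M = 10       # table size
p = 101      # a prime >= U

def draw_hash():
    """Draw h_{a,b} uniformly from the family H."""
    a = random.randint(1, p - 1)   # a != 0
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % M
```

Note how little work a lookup does: one multiply, one add, two mods —
this is the sense in which it is cheaper than the matrix scheme from
lecture.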
Claim: H is a 2-universal family.
Proof idea: the inner part will spread things out nicely in the range
0..p-1. In particular, given x!= y, (1) x is equally likely to go
anywhere, and (2) given where x goes, y is equally likely to go to
each of the p-1 places x didn't go (there won't be any collisions at
the mod p level). Then, this allows us to show things will be nice
when we take the results mod M.
Actual Proof: Fixing x!=y and two locations r and s, let's look at:
Pr[[(ax + b) = r (mod p)] AND [(ay + b) = s (mod p)]].
-> Subtracting the two congruences gives a(x-y) = r-s (mod p), so
   a = (r-s)/(x-y) mod p, which has exactly one solution (mod p) since
   Z_p* is a field. [We needed p >= U to ensure that x != y mod p.]
-> If r=s, then this means we need a=0 which wasn't allowed. So Pr=0.
-> If r!=s, then there is a 1/(p-1) chance that a has the right
value. Given this value of a, we need b = r-ax (mod p), and there is
a 1/p chance b gets this value, so the overall probability is
1/[p(p-1)].
Now, the probability that x and y collide is 1/[p(p-1)] times the
number of pairs r != s in {0,...,p-1} such that r = s (mod M).
We have p choices for r, and then at most ceiling(p/M)-1 choices for
s (the "-1" because we disallow s=r). Since ceiling(p/M)-1 <= (p-1)/M,
the product is at most p(p-1)/M.
Putting this all together,
Pr[(a*x + b mod p) mod M = (a*y + b mod p) mod M]
<= p(p-1)/M * [1/(p(p-1))] = 1/M.
QED
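For small parameters the claim can be checked exhaustively, since each
pair (r,s) with r != s corresponds to exactly one (a,b). A Python sketch
(p, M, and the keys x, y are arbitrary small choices):

```python
# For fixed x != y, count the (a,b) pairs that cause a collision and
# check that the fraction is at most 1/M.
p, M = 11, 3

def h(a, b, x):
    return ((a * x + b) % p) % M

x, y = 4, 9                        # any fixed distinct keys
collisions = sum(1 for a in range(1, p)
                   for b in range(p)
                   if h(a, b, x) == h(a, b, y))
total = (p - 1) * p                # number of hash functions in H
assert collisions / total <= 1 / M
```

Here the count comes out strictly below p(p-1)/M, since the residue
classes mod M do not split {0,...,p-1} evenly when M does not divide p.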