15-451 Algorithms 10/24/00
* Union-find
- list-based method
- forest-based method
midterm results: median 61/75, middle half 52/75 to 68/75.
=============================================================================
Today's topic is a data structure for the union-find problem. Here's
the problem:
We want to maintain a collection of disjoint sets with operations:
MakeSet(x) - creates set containing just x. Returns ptr p_x
Union(p_x,p_y) - union together the set x is in with the set
y is in (replacing those two sets with their union).
Find(p_x) - return the unique ID for the set containing x (this
is just some representative element of this set).
[From now on, will just say "x" instead of "p_x", etc]
E.g., Kruskal's algorithm:
- Do MakeSet(v) for each node v.
- for each edge (x,y) in order from shortest to longest:
if find(x) != find(y), put (x,y) into tree and do Union(x,y).
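The loop above can be sketched in Python. The find/union here are
deliberately naive (none of today's speedups yet), and all names are
mine, just to show where the three operations slot in:

```python
def kruskal(vertices, edges):
    # edges: list of (weight, x, y) tuples
    parent = {v: v for v in vertices}        # MakeSet(v) for each node v

    def find(x):                             # walk up to the representative
        while parent[x] != x:
            x = parent[x]
        return x

    tree = []
    for w, x, y in sorted(edges):            # shortest to longest
        rx, ry = find(x), find(y)
        if rx != ry:                         # different sets: take the edge
            tree.append((x, y))
            parent[rx] = ry                  # Union(x, y)
    return tree
```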
Notation: n = # of make-set operations (so at most n-1 unions)
          m = # of total operations.
This matches n = # of vertices, m = # of edges pretty well, since we do
2 finds for each edge.
Algorithm 1 (list-based): total cost will be O(m + n log n)
-----------------------------------------------------------
Each set will be just a linked list: each element has pointer to
next element in list. But, augment so that each element also has
a pointer directly to head of list. Head of list is the
representative element.
Makeset(x) - constant time: just create list of 1 element and return ptr.
Find(x) - also constant time: just follow pointer to head of list.
Union(x,y) - to union the two lists (let's call them A and B with
   lengths L_A and L_B), we append B onto the end of A. What is
   the time needed? Answer: we first have to walk down A to get to the
   bottom element. Then hook up the head of B to this. Then walk down B
   updating the pointers to the head. So total time is O(L_A + L_B).
Q: Can we get just O(L_B)?
   Method 1: splice B into the middle of A, at x.
   Method 2: if we store a pointer to the tail in the head, this also
   avoids walking down A.
Q: Can we reduce this to O(min(L_A,L_B))?
A: Yes. Just store length of list in the head. Then compare and
append shorter one onto end of longer one. Then update the
length count.
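Here's a sketch of this list-based structure in Python (the class and
field names are my own): each element points to the head of its list,
and the head also stores the tail and length, so union appends the
shorter list onto the longer in O(min(L_A, L_B)).

```python
class Elem:
    def __init__(self, val):
        self.val = val
        self.next = None
        self.head = self          # representative = head of list
        self.tail = self          # only meaningful at the head
        self.length = 1           # only meaningful at the head

def makeset(x):
    return Elem(x)

def find(x):
    return x.head                 # constant time: follow head pointer

def union(x, y):
    a, b = find(x), find(y)
    if a is b:
        return
    if a.length < b.length:       # append the shorter onto the longer
        a, b = b, a
    a.tail.next = b               # hook B onto A's tail
    a.tail = b.tail
    a.length += b.length
    node = b                      # walk down B updating head pointers
    while node is not None:
        node.head = a
        node = node.next
```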
Claim: this gives us the O(m + n log n) total running time.
Proof: The only part we need to worry about is the "walking down the
shorter list" part of the union operation. All the rest is constant
time per op, so goes into the O(m).
So, let's look at this from the point of view of some lowly element x
sitting in a list somewhere. Over time, how many times does x get stepped
on as we walk down the list and update pointers? Answer: at most
O(log n) times. Reason: because every time x gets stepped on, the
size of the list he is in doubles (at least).
So, if we look at the total time for m operations, we spend O(m) in
the constant-time portion, and then we can charge off the walking-down
portion to the elements walked over. There are n elements, and for
each one the total cost is at most log(n). So, overall, the total is
O(m + n log n)
...This is a neat proof I think. A good time for questions...
Notice that in Kruskal's algorithm, it's the sorting time that
dominates since that was O(m log m).
Algorithm 2: total cost will be O(m lg^*(n))
--------------------------------------------
-> lg^*(n) is the number of times you need to take the lg until you
get down to 1. So,
lg^*(2) = 1
lg^*(2^2 = 4) = 2
lg^*(2^2^2 = 16) = 3
lg^*(2^2^2^2 = 65536) = 4
lg^*(2^65536) = 5
-> So, basically, lg^*(n) is never bigger than 5. The actual running
time of this algorithm is even better: O(m alpha(m,n)) where alpha is
the inverse-Ackermann function I mentioned last time. This grows even
slower than lg^*. We'll prove the lg^*(n) bound because it's a little
easier. But, it's still pretty tricky.
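A quick sketch, in Python, of computing lg^*(n) as defined above:
count how many times you take lg (base 2) before the value drops to 1.

```python
import math

def lg_star(n):
    # number of times we take lg until we get down to 1 (or below)
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```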
Algorithm:
- we'll represent each set by a tree with pointers going *up* the
tree. Root is the "representative element".
- MakeSet(x) takes constant time: just set x->parent = x.
- To do a find(x), we walk up tree to the root. So, time taken is
proportional to depth of x.
To speed things up, we'd like trees to be bushy. To make trees
bushier, we'll do what's called PATH COMPRESSION. When we do a
find(x), we update all the nodes we pass over to point to the root.
We can do this nicely by making find(x) be a recursive procedure:
If x is root, return x.
Else, x->parent = find(x->parent). Return x->parent.
- To do union(x,y) we do a find on each one, and then make one of the
roots point to the other. Second speedup is called
UNION-BY-RANK, where tree whose root is of lower rank is made
the "child tree".
"rank" defined by:
- When you do a makeset(x), x gets rank = 0.
- When you union a tree whose root has rank r1 with a
tree whose rank is r2, then if r1 == r2, you add 1 to the rank of the
new root. Otherwise you don't.
   (This speedup is kind of like our list-based algorithm)
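Putting the two speedups together, here's a Python sketch of the
forest-based algorithm (the names and the dict-based representation
are mine): find does recursive path compression, union does
union-by-rank.

```python
parent = {}
rank = {}

def makeset(x):
    parent[x] = x                 # x starts as its own root
    rank[x] = 0

def find(x):
    if parent[x] == x:
        return x
    parent[x] = find(parent[x])   # path compression: point at the root
    return parent[x]

def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:       # lower-rank root becomes the child
        rx, ry = ry, rx
    parent[ry] = rx
    if rank[rx] == rank[ry]:      # equal ranks: new root's rank goes up by 1
        rank[rx] += 1
```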
Note: properties of ranks
1. if x isn't root, then rank(x) is strictly less than rank(x->parent).
2. Rank(x) can only change if x is a root. Once x becomes a non-root,
then its rank never changes again. [Note: once you become a
non-root, you're never a root again]
3. At most n/2^r nodes of rank r. Reason is that when you create a
(root) node of rank r, the tree it's in must have at least 2^r nodes
in it (can see by induction). The nodes in that tree (except for the
root) must all have rank < r. So, if we imagine for each node of rank
r, circling all the nodes "responsible" for it getting its rank (i.e.,
the ones in its tree at the time), then these are all disjoint. Since
only n nodes total, this means there are at most n/2^r nodes of rank r.
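Property 3 can be sanity-checked empirically. This little harness (my
own, not from the notes) does a bunch of arbitrary unions with
union-by-rank and path compression, then verifies that at most n/2^r
nodes ended up with rank r:

```python
import random
from collections import Counter

def check_rank_property(n, seed=0):
    parent = list(range(n))
    rank = [0] * n

    def find(x):
        if parent[x] != x:
            parent[x] = find(parent[x])   # path compression
        return parent[x]

    rng = random.Random(seed)
    for _ in range(2 * n):                # arbitrary union sequence
        rx, ry = find(rng.randrange(n)), find(rng.randrange(n))
        if rx == ry:
            continue
        if rank[rx] < rank[ry]:           # union-by-rank
            rx, ry = ry, rx
        parent[ry] = rx
        if rank[rx] == rank[ry]:
            rank[rx] += 1

    counts = Counter(rank)                # non-roots keep their frozen rank
    return all(counts[r] <= n / 2**r for r in counts)
```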
Analysis:
* First of all, the union does 2 find operations plus a constant
amount of extra work. So, we only need to worry about the time for
the (at most 2m) find operations.
* When we do a find(x), if x was a root then pay $1, so that's ok. If
x was a child of a root we pay $2 so that's ok. If x is lower, then
the very rough idea is that (except for the last $2) every $ we spend
is shrinking the tree because of our path compression, so we'll be
able to amortize it somehow. So, for remaining part of proof, we're
only going to worry about the steps taken in a Find(x) operation up
until we reach the child of the root, since the remainder is just
constant time per operation.
We'll analyze using the ranks, and the properties we figured out
before.
Step 1: let's imagine putting ranks into buckets. Bucket 0 has all of
rank 0, bucket 1 has all of rank 1, bucket 2 has ranks 2 to
2^2 - 1, bucket 3 has ranks 2^2 to 2^2^2 - 1, bucket 4 has ranks 2^2^2
to 2^2^2^2 - 1, etc. So about log^*(n) buckets.
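The bucket boundaries can be generated programmatically (a sketch of
my own, just to make the ranges concrete): after buckets 0 and 1, each
bucket runs from B up to 2^B - 1.

```python
def buckets(max_rank):
    # returns (low, high) rank ranges covering 0..max_rank and beyond
    out = [(0, 0), (1, 1)]        # bucket 0: rank 0; bucket 1: rank 1
    lo = 2
    while lo <= max_rank:
        hi = 2**lo - 1            # bucket covers ranks lo .. 2^lo - 1
        out.append((lo, hi))
        lo = hi + 1
    return out
```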
Point of this definition is that (summing the n/2^r bounds over the
ranks in a bucket, a geometric series) the number of nodes with rank
in any given bucket is at most 2n/(upper bound of bucket + 1).
Step 2: When we walk up the tree in the Find(x) operation, we need to
charge out steps to somebody. Here's the rule: if the step we take
moves us from a node u to a node v such that rank(v) is in the same
bucket as rank(u), then we charge it to the node u we stepped on (the
step-ee). But, if the rank of v is in the next higher bucket, we
charge that to x (the step-er).
The easy part of this is the charge to x. We can move up in buckets
at most log^*(n) times, since there are at most log^*(n) different
buckets ever. So total cost for this (adding up over the m Find()
operations) is at most m log^*(n)
The harder part is the charge to the guy stepped on. The point here
is that first of all, the guy u we charged isn't a root, so his rank
isn't ever going to change. Secondly, every time we charge him, the
rank of his new parent is at least 1 larger than the rank of his old
parent. Finally, once his parent has a rank that's in a different
bucket than he is, we never charge him again for being stepped on.
... SO...
the maximum charge that u gets is the range of his bucket. But, we
just said that a bucket of upper bound B-1 has at most 2n/B elements in
it. And, the range is <= B. So, the total charge of this kind, adding
up over all nodes u in that bucket, is at most 2n. (Remember that the
only elements being charged are non-roots, so once they start getting
charged, their rank is fixed so they can't jump to some other bucket
and start getting charged there too or something weird like that.) So
the *total* charge of this kind is at most 2n*(# buckets) = O(n log^*(n)).
QED (ta-da)