15-451 Algorithms 10/14/04
* Minimum Spanning Trees (Prim, Kruskal) MIDTERM on TUES.
* Union-find
=========================================================================
A SPANNING TREE is a tree that touches all the nodes of a graph. (So,
it only makes sense in a connected graph.) A Minimum Spanning Tree
is the shortest spanning tree (the size of a tree is the sum of its
edge lengths). For instance, you can imagine you're setting up a
communication network and want to connect all the sites as cheaply
as possible. For example:
5 8
A-----B-----C
| | |
1| 1| |6
| 2 | 4 |
D-----E-----F
What is the MST here?
(Note: our defn is only for undirected graphs).
Prim's algorithm
================
Prim's algorithm is the following simple MST algorithm:
- pick some arbitrary start node.
- add shortest edge to current tree.
- continue until tree spans all nodes.
--> what does this do on above graph?
[So, a lot like Dijkstra, except we just put in the shortest edge
rather than adding the edge's length to the distance of its endpoint
from the start node. But even though the alg is simpler than
Dijkstra's, the analysis is a bit trickier...]
Now we need to prove that this works. Will prove by contradiction.
First: useful fact about spanning trees: If you take any spanning tree
and add an edge to it, you get a cycle. That's because there was
already one path between the endpoints, and now there are two. If you
then remove any edge in the cycle, you get back a spanning tree.
Proof of correctness of Prim's alg:
Proof by contradiction. Say it fails. Consider the first edge "e"
chosen by the algorithm that is not consistent with an MST. Let
T be the tree we have built just before adding in e, and let M be
an MST consistent with T (i.e., one containing all of T's edges).
Now, suppose we add e to M. This creates a cycle. Since e has
one endpoint in T and one outside T, if we trace around this cycle
we must eventually get to an edge e' that goes back into T. Notice
that e' was also available when the algorithm chose e, so we know
length(e') >= length(e) by definition of the algorithm. So, if
we add e to M and remove e', we get a new spanning tree that is no
larger than M was, contradicting our defn of "e".
Running time: To make this efficient, we need a good way of storing
the edges incident to our current tree so we can quickly pull off the
shortest one. If we use a heap, each operation takes O(log n) time.
So, the total time is O(m log n), where m is the number of edges and
n is the number of vertices.
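To make this concrete, here is a sketch of Prim's algorithm in Python
using a binary heap. It assumes the graph is given as an adjacency
list mapping each vertex to a list of (neighbor, length) pairs; stale
heap entries (edges whose far endpoint already joined the tree) are
simply skipped when popped.

```python
import heapq

def prim_mst(adj, start):
    """Total length of an MST of a connected undirected graph.
    adj maps each vertex to a list of (neighbor, length) pairs."""
    in_tree = {start}
    # Heap of (length, vertex): candidate edges leaving the current tree.
    heap = [(length, v) for v, length in adj[start]]
    heapq.heapify(heap)
    total = 0
    while len(in_tree) < len(adj):
        length, v = heapq.heappop(heap)
        if v in in_tree:
            continue                  # stale entry: v already joined the tree
        in_tree.add(v)                # add shortest edge out of current tree
        total += length
        for w, l in adj[v]:
            if w not in in_tree:
                heapq.heappush(heap, (l, w))
    return total
```

On the example graph above, this returns 1+1+2+4+6 = 14.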
Kruskal's algorithm
===================
Here's another nice algorithm, called Kruskal's algorithm.
Algorithm: Sort edges by length and examine them from shortest to
longest. Put edge into current forest (forest is set of trees) if it
doesn't form a cycle.
E.g., look at in this graph:
5 8
A-----B-----C
| | |
1| 1| |6
| 2 | 4 |
D-----E-----F
Proof of correctness: Can basically use same argument as before.
Let e be the first edge we add that is inconsistent with any MST, and
let F be the forest just before adding e. Let M be an MST consistent
with F. Put e into M. This makes a cycle. Go around until you take
some edge e' jumping between trees in F. By definition of our algorithm,
len(e) <= len(e'). So, putting in e and removing e' can only make M
better, which is a contradiction to our defn of e.
What about running time? It takes O(m log m) time to sort. Then, for
each edge we need to test if it connects two different components.
This seems like it should be a real pain: how can we tell if an edge
has both endpoints in the same component? Turns out there's a nice
data structure called the "union-find" data structure for doing this.
So fast that it actually will be low-order compared to the sorting.
UNION-FIND
==========
General picture is we maintain a collection {S_1, S_2, ..., S_k} of
disjoint sets over some universe. Operations are:
MakeSet(x) - creates set containing just x
Union(x,y) - replace the set S_x containing x and the set S_y
containing y with the single set S_x U S_y.
Find(x) - return the unique ID for the set containing x (this
is just some representative element of this set).
Given this, we begin with MakeSet(v) for all vertices v. When we
consider some edge (v,w) we just see if Find(v)==Find(w). To put in
the edge(v,w) we do a union.
Notation: n = # of make-set operations (so at most n-1 unions),
          m = # of total operations.
This matches n = # of vertices, m = # of edges pretty well, since we
do 2 finds for each edge.
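Here is a sketch of Kruskal's algorithm in Python. It uses a minimal
tree-based union-find (just parent pointers with path halving) as a
stand-in; the real union-find structures are developed below. It
assumes vertices are numbered 0..n-1 and edges come as (length, u, v)
triples.

```python
def kruskal_mst(n_vertices, edges):
    """edges: list of (length, u, v) triples; vertices are 0..n_vertices-1.
    Returns (total MST length, list of MST edges)."""
    parent = list(range(n_vertices))        # MakeSet for every vertex

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    mst, total = [], 0
    for length, u, v in sorted(edges):      # examine shortest to longest
        ru, rv = find(u), find(v)
        if ru != rv:                        # different components: no cycle
            parent[ru] = rv                 # union the two components
            mst.append((u, v))
            total += length
    return total, mst
```

On the example graph (A..F as 0..5) this also gives total length 14,
with n-1 = 5 edges in the resulting tree.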
Algorithm 1 (list-based): total cost will be O(m + n log n)
-----------------------------------------------------------
Each set will be just a linked list: each element has pointer to
next element in list. But, augment so that each element also has
a pointer directly to head of list. Head of list is the
representative element.
Makeset(x) - constant time: x->head = x
Find(x) - also constant time: return(x->head)
Union(x,y) - to union the two lists (let's call them A and B, with
  lengths L_A and L_B), we append B onto the end of A. What is
  the time needed? Answer: we first have to walk down A to get to the
  bottom element. Then hook up the head of B to this. Then walk down B
  updating the pointers to the head. So the total time is O(L_A + L_B).
Q: Can we get just O(L_B)?
Method 1: splice B into the middle of A, at x.
Method 2: if store pointer to tail in the head, this also avoids
walking down A.
Q: Can we reduce this to O(min(L_A,L_B))?
A: Yes. Just store length of list in the head. Then compare and
append shorter one onto end of longer one. Then update the
length count.
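The list-based structure can be sketched in Python like this. It's a
sketch, not the only way to do it: rather than literal linked lists,
each set's members are kept in a Python list hanging off the head, and
a dict plays the role of the head pointers. The shorter-onto-longer
rule from above is what gives the O(m + n log n) bound.

```python
class ListUnionFind:
    """Union-find where each element stores a pointer to the head of its
    list; union walks down the shorter list updating head pointers."""
    def __init__(self):
        self.head = {}       # element -> representative (head of its list)
        self.members = {}    # head -> list of all elements in its set

    def make_set(self, x):
        self.head[x] = x               # constant time
        self.members[x] = [x]

    def find(self, x):
        return self.head[x]            # constant time

    def union(self, x, y):
        a, b = self.find(x), self.find(y)
        if a == b:
            return
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a                # make a the longer list
        for elem in self.members[b]:   # walk down the SHORTER list,
            self.head[elem] = a        # updating each head pointer
        self.members[a] += self.members[b]
        del self.members[b]
```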
Claim: this gives us the O(m + n log n) total running time.
NICE PROOF!: Finds and Makesets are constant time so that's covered by the
O(m). Each union costs O(length of list we walk down). Key idea:
let's pay for this by charging O(1) to each element that we walked
over. Think of as a charge for updating its head pointer.
Now, let's look at this from the point of view of some lowly element x
sitting in a list somewhere. Over time, how many times does x get
stepped on and have its head pointer updated? Answer: at most O(log n)
times. Reason: x only gets stepped on when his list is the shorter of
the two, so every time x gets stepped on, the size of the list he is
in doubles (at least).
So, we were able to pay for unions by charging elements stepped on,
and no element gets charged more than O(log n), so total cost is O(n
log n) for unions, or O(m + n log n) overall.
...This is a neat proof I think. A good time for questions...
Notice that in Kruskal's algorithm, it's the sorting time that
dominates since that was O(m log m).
Algorithm 2: total cost will be O(m lg^*(n))
============================================
-> lg^*(n) is the number of times you need to take the lg until you
get down to 1. So,
lg^*(2) = 1
lg^*(2^2 = 4) = 2
lg^*(2^2^2 = 16) = 3
lg^*(2^2^2^2 = 65536) = 4
lg^*(2^65536) = 5
-> So, basically, lg^*(n) is never bigger than 5. Actually running
time of this algorithm is even better: O(m alpha(m,n)) where alpha is
the inverse-Ackermann function. This grows even slower than lg^*.
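The values above can be checked with a tiny sketch (using bit_length
to compute floor(lg n), which is exact for the power-of-2 towers
listed, and works fine on Python's big integers):

```python
def lg_star(n):
    """Number of times you take lg of n before getting down to 1."""
    count = 0
    while n > 1:
        n = n.bit_length() - 1   # floor(lg n)
        count += 1
    return count
```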
We'll prove the lg^*(n) bound because it's a little easier. But, it's
still pretty tricky.
Idea of the algorithm is this: Say we did the above in a lazy way.
Just point head of B to head of A. Now, a find might become more
expensive because we have to take multiple hops. So, as we do the
find, let's point every element we traverse directly to the root
(basically doing Algorithm 1 lazily). We'll also need to make one
technical change: instead of deciding which of two roots should be the
new one based on the SIZE of the set, we'll do it based on which has
larger RANK, which we'll define in a minute. Also, notice that we no
longer need the downward pointers: what we have in general is a
collection of trees, with all links pointing UP.
[NOT SURE HOW MUCH WE WILL GET TO. THAT'S OK, SINCE THE POINT OF THIS
IS JUST TO SEE A REALLY INTRICATE AMORTIZED ANALYSIS (like climbing
Everest). WE'RE NOT GOING TO BE USING THIS LATER...]
- MakeSet(x) takes constant time: just set x->parent = x.
- To do a find(x), we walk up tree to the root, updating x and all
the nodes we pass over to point to the root. This is called PATH
COMPRESSION.
Time is proportional to (original) distance of x to root.
- union(x,y) = link(find(x), find(y)). Link(root1,root2) makes the root
of smaller RANK point to the other (this is called UNION-BY-RANK).
RANK defined by:
- When you do a makeset(x), x gets rank = 0.
- When you link a root of rank r1 to a root of rank r2, then
if r1 == r2, you add 1 to the rank of the new root. Otherwise
you don't.
[E.g., rank = height if we never did path compression]
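A sketch of this lazy, tree-based version in Python: MakeSet and
union-by-rank exactly as just defined, and find doing path compression
by pointing every traversed node directly at the root.

```python
class RankUnionFind:
    """Forest of up-pointing trees with union-by-rank + path compression."""
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x            # x is its own root
        self.rank[x] = 0

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]  # walk up to the root
        while self.parent[x] != root:
            # PATH COMPRESSION: point each node on the path at the root
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        r1, r2 = self.find(x), self.find(y)
        if r1 == r2:
            return
        if self.rank[r1] < self.rank[r2]:
            r1, r2 = r2, r1           # r1 is the root of larger (or equal) rank
        self.parent[r2] = r1          # link smaller-rank root under r1
        if self.rank[r1] == self.rank[r2]:
            self.rank[r1] += 1        # equal ranks: new root's rank goes up
```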
Note: properties of ranks
1. if x isn't root, then rank(x) is strictly less than rank(x->parent).
2. Rank(x) can only change if x is a root. Once x becomes a non-root,
then its rank never changes again. [Note: once you become a
non-root, you're never a root again]
3. At most n/2^r nodes of rank >= r. Reason is that when you create a
(root) node of rank r, the tree it's in must have at least 2^r nodes
in it (can see by induction). The nodes in that tree (except for the
root) must all have rank < r. And, their rank is never going to
change. So, for each node of rank >= r, we have at least 2^r nodes of
smaller rank, which means there can be at most n/2^r nodes of rank >= r.
Analysis:
* First of all, the union does 2 find operations plus a constant
amount of extra work. So, we only need to worry about the time for
the (at most 2m) find operations.
* When we do a find(x), if x was a root then pay $1, so that's ok. If
x was a child of a root we pay $2 so that's ok. If x is lower, then
the very rough idea is that (except for the last $2) every $ we spend
is shrinking the tree because of our path compression, so we'll be
able to amortize it somehow. So, for remaining part of proof, we're
only going to worry about the steps taken in a Find(x) operation up
until we reach the child of the root, since the remainder is just
constant time per operation.
We'll analyze using the ranks, and the properties we figured out
before.
Step 1: let's imagine putting ranks into buckets. Bucket 0 has all
of rank 0, bucket 1 has all of rank 1, bucket 2 has ranks 2 to
2^2 - 1, bucket 3 has ranks 2^2 to 2^2^2 - 1, bucket 4 has ranks
2^2^2 to 2^2^2^2 - 1, etc. In general, a bucket has ranks r to
2^r - 1. In total, there are about log^*(n) buckets.
Point of this definition is that the number of nodes with rank in any
given bucket is at most n/(upper bound of bucket + 1).
Step 2: When we walk up the tree in the Find(x) operation, we need to
charge out steps to somebody. Here's the rule: if the step we take
moves us from a node u to a node v such that rank(v) is in the same
bucket as rank(u), then we charge it to the node u we stepped on (the
step-ee). But, if the rank of v is in the next higher bucket, we
charge that to x (the step-er).
The easy part of this is the charge to x. We can move up in buckets
at most log^*(n) times, since there are at most log^*(n) different
buckets ever. So total cost for this (adding up over the m Find()
operations) is at most m log^*(n)
The harder part is the charge to the guy stepped on. The point here
is that first of all, the guy u we charged isn't a root, so his rank
isn't ever going to change. Secondly, every time we charge him, the
rank of his new parent is at least 1 larger than the rank of his old
parent. That's by property #1. Now, increasing the rank of its
parent could conceivably happen log(n) times, which to us is a "big"
number. *But*, once the parent's rank becomes large enough that it is
in the next bucket, we never charge him again for being stepped on.
So, the maximum charge that u gets is the range of his bucket.
To finish the proof: we said above that a bucket with upper bound B-1
has at most n/B elements in it. And, the range of the bucket (the
maximum number of times any one element in it can be charged) is <= B.
So, the total charge of this kind, adding up over all nodes u in that
bucket, is at most n. (Remember that the only elements being charged
are non-roots, so once they start getting charged, their rank is fixed
and they can't jump to some other bucket and start getting charged
there too, or something weird like that.) So the *total* charge of
this kind is at most n*(# buckets) = n log^*(n). QED (Whew!)