15-750 Graduate Algorithms 2/8/01
* A randomized linear-time MST algorithm [KKT]
[Note: extensive MST tutorial at
http://www.cis.upenn.edu/~jeisner/papers/eisner.mst-tutorial.ps]
=================================
Preliminaries:
- Assume all weights are distinct (o.w., number the edges in an
initial pass, and use these to break ties). Note that then the MST
is unique.
- Cycle property: for any cycle in the graph, the MST doesn't have the
heaviest edge in the cycle. [Book calls this the Red rule]
Proof: if it did, remove that edge, which splits the tree into two
components. Walk around the cycle until you find two neighboring
vertices in different components; putting that edge in gives a
lighter tree, a contradiction.
Alternate way of looking at it: watch the behavior of Kruskal on
this cycle, and the number of different components represented. By
the time you get to the last (heaviest) edge, everybody is already in
the same component, so Kruskal rejects it.
- Cut property: for any nonempty set X of vertices, the lightest edge
with exactly one endpt in X is in the MST. [Book calls this the Blue
rule] Proof: Kruskal puts it in.
Boruvka's algorithm (1926): For each vertex, select the lightest
edge incident to it. We know all these edges will be in the tree by
the cut property.
These selected edges form a collection of components (at most n/2 of
them). Now, contract all components, getting rid of multiple edges
and self-loops, and repeat. [When there are multiple edges between
two components, just keep the shortest of them.]
Running time of Boruvka: selecting the edges takes time SUM_v d(v) =
O(m). Contracting can also be done in time O(m). Probably easiest is
a 2-pass DFS: in pass 1, number the vertices based on which component
they are in; in pass 2, construct the new graph. The number of phases
is O(log n), since each phase at least halves the number of vertices.
So, total time is O(m log n).
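To make one phase concrete, here is a minimal Python sketch (my own
illustration: the name boruvka_step and the (weight, u, v) edge-list
representation are assumptions, not from the lecture):

```python
def boruvka_step(n, edges):
    """One Boruvka phase.  edges is a list of (weight, u, v) tuples with
    distinct weights over vertices 0..n-1.  Returns (chosen, n2, edges2):
    the MST edges selected this phase, plus the contracted graph."""
    # Cut property: the lightest edge incident to each vertex is in the MST.
    best = [None] * n
    for w, u, v in edges:
        for x in (u, v):
            if best[x] is None or w < best[x][0]:
                best[x] = (w, u, v)
    chosen = {e for e in best if e is not None}

    # Union-find to label the components formed by the chosen edges.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, u, v in chosen:
        parent[find(u)] = find(v)

    # Renumber components 0..n2-1, then contract: drop self-loops and
    # keep only the lightest edge between each pair of components.
    label = {}
    for x in range(n):
        r = find(x)
        if r not in label:
            label[r] = len(label)
    lightest = {}
    for w, u, v in edges:
        a, b = label[find(u)], label[find(v)]
        if a == b:
            continue                      # self-loop after contraction
        key = (min(a, b), max(a, b))
        if key not in lightest or w < lightest[key][0]:
            lightest[key] = (w, a, b)
    return chosen, len(label), list(lightest.values())
```

Running this repeatedly until no edges remain yields the whole MST;
each call is the O(m) phase analyzed above.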
Side note: remember how Prim with Fibonacci heaps was O(m + n log n).
So if you do loglog(n) phases of Boruvka, the number of vertices drops
to n/log n, and you can then run Prim on what's left in time
O(m + (n/log n) log n) = O(m + n). Total time is then
O(m loglog(n)). That's basically linear...right?
==================================
Linear time algorithm: High level idea is that if we could somehow
keep the density of the graph constant, so that after each step of
Boruvka the number of edges went down by a factor of 2 as well, then
our costs would telescope. So, we need to develop a complementary
algorithm that somehow removes edges that cannot possibly be in the
MST.
For various reasons, we'll want to allow for disconnected graphs G
and talk about minimum spanning forests (MSFs). All the discussion
above still works in that setting.
Def: Given a forest F for a graph G, we'll say that an edge (u,v) of
G is *heavy* for F if (a) u and v are connected in F, but (u,v)
itself is not an edge of F, and (b) weight(u,v) > the max edge weight
on the u-v path in F.
-> So, if F is a minimum spanning forest, then *all* edges not in F
are heavy for F.
-> Also, for any forest F, heavy edges cannot be in the MSF. Why?
Adding a heavy edge (u,v) to F closes a cycle in which (u,v) is the
heaviest edge, so the cycle property rules it out.
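As a point of comparison (my own illustration; the adjacency-dict
representation of F and the function names are assumptions), the
obvious way to test heaviness of a single edge is an O(n) walk in F,
which is exactly what must be beaten when all m edges are queried:

```python
def max_on_path(F, u, v):
    """Max edge weight on the u-v path in forest F, or None if u and v
    are not connected.  F is an adjacency dict {x: [(w, y), ...]}."""
    # DFS from u, tracking the heaviest edge weight seen along the way.
    stack = [(u, None, float('-inf'))]
    while stack:
        x, par, mx = stack.pop()
        if x == v:
            return mx
        for w, y in F.get(x, []):
            if y != par:                 # F is a forest: no cycles
                stack.append((y, x, max(mx, w)))
    return None

def is_heavy(F, u, v, w):
    """Is edge (u, v) of weight w heavy for forest F?"""
    mx = max_on_path(F, u, v)
    return mx is not None and w > mx
```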
Magic fact: Given a graph G and a forest F, can find all the heavy
edges for F in time O(m).
[Note 1: it's NOT easy -- maybe sketch at end]
[Note 2: This is sometimes called the "verification" problem, since if
someone handed us a supposed MSF F and we wanted to check it, we'd
need to verify that every edge not in F is heavy for F.]
Here's the plan for the algorithm: We'll do 2 Boruvka steps. Then, in
linear time we find a "pretty good" forest F (will use randomized
strategy for this). Then in linear time we get rid of the heavy
edges. If all went well, we're at a good density again.
Note: finding the "pretty good" forest will involve recursively
computing the MSF of a graph smaller than G. So, the running time
will have a feel like that of linear-time median finding.
More specifically, we're going to randomly sample edges from G,
taking each one independently with probability p. Call the new graph
H. Then we will find the MSF F of H. The key to this working is the
following lemma:
Lemma: The expected number of non-heavy edges in the above experiment
is at most n/p. [So, after discarding the heavy edges, the graph
will be sparse again.]
Proof: This is pretty neat. For purposes of analysis, imagine looking
at the edges in order from shortest to longest (we don't have time to
sort, but this is just for analysis) and using Kruskal to build F at
the same time as we pick H. Specifically, we start with F and H both
empty. Then for each edge, we flip a coin of bias p to decide whether
to put into H. If we do put it into H, we look to see if both
endpoints are in the same component of F, and if not, we put it into F
also. Now, here's the trick: suppose that conceptually we *first*
check to see if both endpoints are in the same component of F, and if
not, mark the edge, and *then* flip the coin of bias p. We still do
the same thing: if the coin is heads *and* the edge is marked we put
the edge into F and H; if heads and unmarked, then into H only; and
if tails, then neither. Note that an unmarked edge has both endpoints
already connected in F by strictly lighter edges (we process edges in
increasing order of weight), so all unmarked edges are heavy; thus
our job is to bound the number of marked edges. Now, here's the
point: each time we have a marked edge, we have prob p of putting it
into F, and F can have at most n-1 edges in the end. So, if we just
look at coin flips on marked edges, what we have is basically a
situation where we're tossing a coin of bias p until n-1 heads occur
and we want to know what is the expected number of flips this took.
(Actually, what we have is some more complicated stopping criterion,
but one that has certainly stopped by the time we see n-1 heads.) As
you might
expect, the expected number of flips is (n-1)/p.
[Proof of that last statement: what if we just flip until we see one
head? The expected number of flips this takes is 1/p, which we can
see by solving E[flips] = p*1 + (1-p)*(1 + E[flips]). Then linearity
of expectation over the n-1 heads gives (n-1)/p.]
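A quick Monte Carlo sanity check of that claim (my own illustration,
not part of the notes): flip a coin of bias p until k heads occur and
compare the empirical mean to k/p.

```python
import random

def flips_until_heads(k, p, rng):
    """Number of flips of a bias-p coin until k heads have occurred."""
    flips = heads = 0
    while heads < k:
        flips += 1
        if rng.random() < p:
            heads += 1
    return flips

rng = random.Random(750)                 # fixed seed, chosen arbitrarily
trials = [flips_until_heads(100, 0.5, rng) for _ in range(2000)]
avg = sum(trials) / len(trials)          # should be close to 100/0.5 = 200
```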
So, that proves the lemma. Now we just need to put all of this
together (and after that maybe go into "magic fact")
Algorithm:
1. Run two Boruvka steps.
2. Run sparsification step with p=1/2.
3. Repeat.
(Note: Step 2 involves a recursive run of the algorithm on H.)
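Here is a brute-force (decidedly not linear-time) rendition of the
sparsification step for small graphs, just to see the pieces fit
together; kruskal_msf and sparsify are my own stand-ins, and a
per-edge rerun of Kruskal replaces the O(m) verification from the
magic fact:

```python
import random

def kruskal_msf(n, edges):
    """MSF by Kruskal; edges are (weight, u, v) with distinct weights."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    F = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            F.append((w, u, v))
    return F

def sparsify(n, edges, p, rng):
    """Sample each edge with probability p, compute the MSF F of the
    sample, and discard every edge of G that is heavy for F."""
    H = [e for e in edges if rng.random() < p]
    F = kruskal_msf(n, H)
    # Edge e is heavy for F iff Kruskal on F + [e] rejects e: that
    # happens exactly when e's endpoints are already joined in F by
    # strictly lighter edges.  (The magic fact answers all m such
    # queries in O(m) total; here we simply rerun Kruskal per edge.)
    return [e for e in edges if e in kruskal_msf(n, F + [e])]
```

Since only heavy edges are discarded, and heavy edges can never be in
the MSF, the surviving edges have the same MSF as G.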
Running time analysis:
Let's first do a back-of-the-envelope calculation as if the algorithm
were deterministic and we hit all expectations exactly. Then we get a
recurrence T(m,n) <= c*m + T(m/2, n/4) + T(n/2, n/4), where the first
recursive call handles the sampled graph H and the second handles
whatever survives the heavy-edge filter. This looks messy, but
suppose we assume that m >= 2n at the start (add dummy edges if we
need to). Then this invariant keeps holding in the recurrence, and
the point is that n/2 <= m/4. So, we can ignore n and just write:
T(m) <= c*m + T(m/2) + T(m/4), and we have linear time.
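To double-check that T(m) <= c*m + T(m/2) + T(m/4) really is linear,
guess T(m) <= 4c*m and verify: c*m + 4c*(m/2) + 4c*(m/4) = 4c*m. A
quick numeric check with c = 1 (my own illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(m):
    """The idealized recurrence T(m) = m + T(m//2) + T(m//4), c = 1."""
    if m <= 1:
        return 1
    return m + T(m // 2) + T(m // 4)
```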
Now, let's do it right. Let's write out the recursion tree. We can
write it as a binary tree where left branch corresponds to step 2 and
right branch corresponds to step 3. In any particular node, can write
down the number of edges (which will be a random variable) and our
total running time is the sum of these. So, by linearity of
expectation, can get our expected running time by summing up the
expectations. If Y is the left child of X (the recursive call on the
sampled graph H), then since each edge survives the sampling with
probability 1/2, E[Y | X=k] <= k/2, and so E[Y] <= E[X]/2. The next
point is that a node at level i has at most n/4^i vertices (two
Boruvka steps per level). So, if Z is a right child at level i, then
by the lemma (with p=1/2) its expected number of edges is at most
2*(n/4^i). So, now we just need to add things up. In fact, we can
now go back to our handwavy argument by reinterpreting the recurrence
as a way of adding up all these expectations.
=============================
Basic idea of MSF verification (that "magic step"):
For each edge e = (u,v) with u and v connected in F, we want to find
the weight of the heaviest edge on the u-v path in F (so we can
compare it to e's weight). Let's call this Max_F(u,v). The problem
is we need to handle all m of these queries in total time O(m).
Here's part of the idea. Say we run Boruvka's alg on just the forest
F. This is weird since we know we'll just end up with F (since it is
its own MSF), but the point is to look at the order in which edges are
added in. The first thing is that we *can* run Boruvka in time O(n)
here, since each contracting phase cuts the number of edges at least
in half, so the total work is O(n + n/2 + n/4 + ...) = O(n).
The second interesting thing: suppose that for each node in F, we
keep track of the heaviest edge ever selected by its component in the
run of Boruvka so far. Then when u and v first become connected, we
can quickly compute Max_F(u,v). But it is not at all clear how to
update these quantities efficiently, and that's where all the hard
work comes in.