15-451 Algorithms                                              10/14/04

* Minimum Spanning Trees (Prim, Kruskal)            MIDTERM on TUES.
* Union-find
=========================================================================
A SPANNING TREE is a tree that touches all the nodes of a graph. (So, it
only makes sense in a connected graph.) A Minimum Spanning Tree is the
shortest spanning tree (the size of a tree is the sum of its edge
lengths). For instance, imagine you're setting up a communication
network. For example:

         5     8
      A-----B-----C
      |     |     |
     1|    1|     |6
      |  2  |  4  |
      D-----E-----F

What is the MST here? (Note: our defn is only for undirected graphs.)

Prim's algorithm
================
Prim's algorithm is the following simple MST algorithm:

  - Pick some arbitrary start node.
  - Add the shortest edge connecting the current tree to a new node.
  - Continue until the tree spans all nodes.

--> What does this do on the above graph?

[So, a lot like Dijkstra, except we just put in the shortest edge, rather
than adding the edge's length to the distance of its endpoint from the
start node. But even though the algorithm is simpler than Dijkstra's, the
analysis is a bit trickier...]

Now we need to prove that this works. First, a useful fact about
spanning trees: if you take any spanning tree and add an edge to it, you
get a cycle. That's because there was already one path between the
endpoints, and now there are two. If you then remove any edge in the
cycle, you get back a spanning tree.

Proof of correctness of Prim's alg: proof by contradiction. Say it
fails. Consider the first edge "e" chosen by the algorithm that is not
consistent with any MST. Let T be the tree we have built just before
adding e, and let M be an MST consistent with T. Now, suppose we add e
to M. This creates a cycle. Since e has one endpoint in T and one
outside T, if we trace around this cycle we must eventually get to an
edge e' that goes back into T. We know length(e') >= length(e) by
definition of the algorithm.
So, if we add e to M and remove e', we get a new tree that is no larger
than M was, contradicting our defn of e.

Running time: To make this efficient, we need a good way of storing the
edges into our current tree so we can quickly pull off the shortest one.
If we use a heap, each operation takes O(log n) time. So, the total time
is O(m log n), where m is the number of edges and n is the number of
vertices.

Kruskal's algorithm
===================
Here's another nice algorithm, called Kruskal's algorithm.

Algorithm: Sort the edges by length and examine them from shortest to
longest. Put an edge into the current forest (a forest is a set of
trees) if it doesn't form a cycle.

E.g., look at this graph:

         5     8
      A-----B-----C
      |     |     |
     1|    1|     |6
      |  2  |  4  |
      D-----E-----F

Proof of correctness: We can basically use the same argument as before.
Let e be the first edge we add that is inconsistent with any MST, and
let F be the forest just before adding e. Let M be an MST consistent
with F. Put e into M. This makes a cycle. Go around the cycle until you
reach some edge e' jumping between trees of F. By definition of our
algorithm, len(e) <= len(e'). So, putting in e and removing e' can only
make M better, which is a contradiction to our defn of e.

What about running time? It takes O(m log m) to sort. Then, for each
edge we need to test whether it connects two different components. This
seems like it should be a real pain: how can we tell if an edge has both
endpoints in the same component? It turns out there's a nice data
structure called the "union-find" data structure for doing this, so fast
that its cost will actually be low-order compared to the sorting.

UNION-FIND
==========
The general picture is that we maintain a collection {S_1, S_2, ..., S_k}
of disjoint sets over some universe. The operations are:

  MakeSet(x) - creates a set containing just x.
  Union(x,y) - replace the set S_x containing x and the set S_y
               containing y with the single set S_x U S_y.
  Find(x)    - return the unique ID for the set containing x (this is
               just some representative element of the set).

Given this, we begin with MakeSet(v) for all vertices v. When we
consider some edge (v,w) we just check whether Find(v)==Find(w). To put
in the edge (v,w) we do a Union.

Notation: n = # of MakeSet operations (so at most n-1 unions),
          m = # of total operations.

This matches n = # of vertices, m = # of edges pretty well, since we do
2 Finds for each edge.

Algorithm 1 (list-based): total cost will be O(m + n log n)
-----------------------------------------------------------
Each set will be just a linked list: each element has a pointer to the
next element in its list. But, we augment it so that each element also
has a pointer directly to the head of its list. The head of the list is
the representative element.

  Makeset(x) - constant time: x->head = x.
  Find(x)    - also constant time: return(x->head).
  Union(x,y) - to union the two lists (call them A and B, with lengths
               L_A and L_B), we append B onto the end of A. What is the
               time needed? Answer: we first have to walk down A to get
               to the bottom element, then hook up the head of B to it,
               then walk down B updating the pointers to the head. So
               the total time is O(L_A + L_B).

Q: Can we get just O(L_B)?
   Method 1: splice B into the middle of A, right at the element x.
   Method 2: if we store a pointer to the tail in the head, this also
             avoids walking down A.

Q: Can we reduce this to O(min(L_A, L_B))?
A: Yes. Just store the length of the list in the head. Then compare the
   lengths, append the shorter list onto the end of the longer one, and
   update the length count.

Claim: this gives us the O(m + n log n) total running time.

NICE PROOF!: Finds and Makesets are constant time, so they're covered by
the O(m). Each union costs O(length of the list we walk down). Key idea:
let's pay for this by charging O(1) to each element that we walked over.
Think of it as a charge for updating that element's head pointer. Now,
let's look at this from the point of view of some lowly element x
sitting in a list somewhere.
Over time, how many times does x get stepped on and have its head
pointer updated? Answer: at most O(log n) times. Reason: every time x
gets stepped on, the size of the list it is in at least doubles (since
we only walk down the shorter list). So, we were able to pay for unions
by charging the elements stepped on, and no element gets charged more
than O(log n), so the total cost is O(n log n) for unions, or
O(m + n log n) overall.

...This is a neat proof, I think. A good time for questions...

Notice that in Kruskal's algorithm, it's the sorting time that
dominates, since that was O(m log m).

Algorithm 2: total cost will be O(m lg^*(n))
============================================
-> lg^*(n) is the number of times you need to take the lg until you get
   down to 1. So,
        lg^*(2) = 1
        lg^*(2^2 = 4) = 2
        lg^*(2^2^2 = 16) = 3
        lg^*(2^2^2^2 = 65536) = 4
        lg^*(2^65536) = 5
-> So, basically, lg^*(n) is never bigger than 5.

Actually the running time of this algorithm is even better:
O(m alpha(m,n)), where alpha is the inverse-Ackermann function. This
grows even slower than lg^*. We'll prove the lg^*(n) bound because it's
a little easier. But it's still pretty tricky.

The idea of the algorithm is this: say we did the above in a lazy way,
just pointing the head of B to the head of A. Now, a Find might become
more expensive because we have to take multiple hops. So, as we do the
Find, let's point every element we traverse directly at the root
(basically doing Algorithm 1 lazily). We'll also need to make one
technical change: instead of deciding which of two roots should be the
new one based on the SIZE of the set, we'll do it based on which has
larger RANK, which we'll define in a minute. Also, notice that we no
longer need the downward pointers: what we have in general is a
collection of trees, with all links pointing UP.

[NOT SURE HOW MUCH WE WILL GET TO. THAT'S OK, SINCE THE POINT OF THIS IS
JUST TO SEE A REALLY INTRICATE AMORTIZED ANALYSIS (like climbing
Everest). WE'RE NOT GOING TO BE USING THIS LATER...]
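As a quick sanity check of the table above, here is a small sketch of a
function computing lg^*(n). The name lg_star is mine, not from the notes:

```python
import math

def lg_star(n):
    """lg*(n): how many times you apply lg (base 2) to n
    before the value drops to 1 (or below)."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

# Matches the table: lg_star(2) == 1, lg_star(16) == 3,
# lg_star(65536) == 4, and lg_star(2**65536) == 5.
```

Note that 2^65536 is vastly larger than the number of atoms in the
universe, which is why lg^*(n) <= 5 for any n you will ever see.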
- MakeSet(x) takes constant time: just set x->parent = x.

- To do a Find(x), we walk up the tree to the root, updating x and all
  the nodes we pass over to point directly to the root. This is called
  PATH COMPRESSION. The time is proportional to the (original) distance
  of x from the root.

- Union(x,y) = Link(Find(x), Find(y)). Link(root1, root2) makes the root
  of smaller RANK point to the other one (this is called UNION-BY-RANK).

RANK is defined by:
  - When you do a MakeSet(x), x gets rank = 0.
  - When you link a root of rank r1 to a root of rank r2: if r1 == r2,
    you add 1 to the rank of the new root; otherwise you don't.
  [E.g., rank = height if we never did path compression.]

Note: properties of ranks:
  1. If x isn't a root, then rank(x) is strictly less than
     rank(x->parent).
  2. rank(x) can only change while x is a root. Once x becomes a
     non-root, its rank never changes again. [Note: once you become a
     non-root, you're never a root again.]
  3. There are at most n/2^r nodes of rank >= r. The reason is that when
     you create a (root) node of rank r, the tree it's in must have at
     least 2^r nodes in it (you can see this by induction). The nodes in
     that tree (except for the root) must all have rank < r, and their
     ranks are never going to change. So, for each node of rank >= r, we
     have at least 2^r nodes of smaller rank, which means there can be
     at most n/2^r nodes of rank >= r.

Analysis:

* First of all, a Union does 2 Find operations plus a constant amount of
  extra work. So, we only need to worry about the time for the (at most
  2m) Find operations.

* When we do a Find(x), if x was a root then we pay $1, so that's ok. If
  x was a child of a root we pay $2, so that's ok. If x is lower, then
  the very rough idea is that (except for the last $2) every $ we spend
  is shrinking the tree because of our path compression, so we'll be
  able to amortize it somehow.
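The three operations just described (MakeSet, Find with path
compression, Union/Link with union-by-rank) can be sketched as follows.
This is a minimal illustration, not code from the lecture; the class and
method names are my own:

```python
class UnionFind:
    """Union-find with path compression and union-by-rank."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def make_set(self, x):
        self.parent[x] = x   # x is its own root...
        self.rank[x] = 0     # ...with rank 0

    def find(self, x):
        # First walk up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # PATH COMPRESSION: repoint everything on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, x, y):
        r1, r2 = self.find(x), self.find(y)
        if r1 == r2:
            return
        # UNION-BY-RANK: the smaller-rank root points to the other.
        if self.rank[r1] < self.rank[r2]:
            r1, r2 = r2, r1
        self.parent[r2] = r1
        if self.rank[r1] == self.rank[r2]:
            self.rank[r1] += 1   # equal ranks: new root gains a rank
```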
So, for the remaining part of the proof, we're only going to worry about
the steps taken in a Find(x) operation up until we reach the child of
the root, since the remainder is just constant time per operation. We'll
analyze using the ranks and the properties we figured out before.

Step 1: Let's imagine putting the ranks into buckets. Bucket 0 has just
rank 0, bucket 1 has just rank 1, bucket 2 has ranks 2 to 2^2 - 1,
bucket 3 has ranks 2^2 to 2^2^2 - 1, bucket 4 has ranks 2^2^2 to
2^2^2^2 - 1, etc. In general, a bucket has ranks r to 2^r - 1. In total,
there are about log^*(n) buckets. The point of this definition is that,
by property 3, the number of nodes with rank in any given bucket is at
most n/(upper bound of bucket + 1).

Step 2: When we walk up the tree in the Find(x) operation, we need to
charge our steps to somebody. Here's the rule: if the step we take moves
us from a node u to a node v such that rank(v) is in the same bucket as
rank(u), then we charge it to the node u we stepped on (the step-ee).
But, if the rank of v is in a higher bucket, we charge that step to x
(the step-er).

The easy part of this is the charge to x. We can move up in buckets at
most log^*(n) times, since there are only log^*(n) different buckets.
So the total cost of this kind (adding up over the m Find() operations)
is at most m log^*(n).

The harder part is the charge to the guy stepped on. The point here is
that, first of all, the guy u we charged isn't a root, so his rank isn't
ever going to change. Secondly, every time we charge him, the rank of
his new parent is at least 1 larger than the rank of his old parent:
path compression gives u a new parent (the root) that is a proper
ancestor of his old parent, and by property #1 ranks strictly increase
as you walk up a path. Now, increasing the rank of his parent could
conceivably happen log(n) times, which to us is a "big" number. *But*,
once his parent's rank becomes large enough to be in the next bucket, we
never charge u again for being stepped on. So, the maximum charge u gets
is the range of his bucket.
To finish the proof: we just said that a bucket with upper bound B-1 has
at most n/B elements in it, and the range of that bucket is <= B. So,
the total charge of this kind, adding up over all nodes u in that
bucket, is at most (n/B)*B = n. (Remember that the only elements being
charged are non-roots, so once they start getting charged, their rank is
fixed; they can't jump to some other bucket and start getting charged
there too, or something weird like that.) So the *total* charge of this
kind is at most n*(# of buckets) = n log^*(n).

QED (Whew!)
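To tie the two halves of the lecture together, here is a sketch of
Kruskal's algorithm run on the example graph from the start of the
notes. It is an illustration, not the lecture's code: the function name
is mine, and for brevity it uses a simple path-halving variant of path
compression with no union-by-rank:

```python
def kruskal(vertices, edges):
    """Kruskal's algorithm. edges is a list of (length, u, v) tuples;
    returns the list of MST edges in the order they were added."""
    parent = {v: v for v in vertices}   # MakeSet(v) for every vertex

    def find(x):
        # Walk up to the root, halving the path as we go (a simplified
        # form of path compression; union-by-rank is omitted here).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for length, u, v in sorted(edges):   # shortest to longest
        ru, rv = find(u), find(v)
        if ru != rv:                     # different trees => no cycle
            parent[ru] = rv              # Union the two trees
            mst.append((length, u, v))
    return mst

# The example graph from these notes:
edges = [(5, "A", "B"), (8, "B", "C"), (1, "A", "D"), (1, "B", "E"),
         (6, "C", "F"), (2, "D", "E"), (4, "E", "F")]
mst = kruskal("ABCDEF", edges)
print(sum(length for length, _, _ in mst))   # -> 14
```

On this graph the algorithm picks A-D, B-E, D-E, E-F, and C-F, skipping
A-B and B-C because they would form cycles, for a total length of 14.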