15-750 Grad Algorithms                                          1/30/01

  * Topological sorting
  * MST - Prim
  * MST - Kruskal
  * Intro to union-find
=======

We've been talking about data structures for storing information so that
it's easy to do lookups, inserts, deletes, and other natural operations.
In the next couple of lectures, we'll be talking about data structures
for other kinds of operations, where the motivation comes in large part
from graph algorithms. In the process of constructing an algorithm,
we'll want to be able to maintain data efficiently under certain kinds
of operations, and we'll need to come up with the right data structures
for that. Today we'll talk about some of these basic graph algorithms,
and then in the next couple of lectures we'll cover the data structures.

Warm-up: TOPOLOGICAL SORTING.

A "DAG" or directed acyclic graph is a directed graph without any
cycles. E.g.,

    A------>D--->E
    |\      ^    ^
    | \     |    |
    |  \    |    |
    |   \   |    |
    V    V  |    |
    C<----B<-----F

Given a DAG, the "topological sorting" problem is: find an ordering of
the vertices such that all edges go forward in the ordering. This
typically comes up when you are given a set of tasks to do with
precedence constraints (you need to do A and F before you can do B),
and you want to find a legal order in which to perform the jobs.

How to solve it #1:
  1. Find a node v with no in-edges and output it.
  2. Remove v and its out-edges.
  3. Go to 1.

This takes time O(n^2). How about a linear-time way? Here's a neat
algorithm based on depth-first search.

DFS-main:
  For v = 1 to n:
    if v is not yet visited, do DFS(v).

DFS(v):
  mark v as visited.
  for each unvisited neighbor w of v: do DFS(w).

How to solve it #2:
  1. Use depth-first search, and output the nodes as they are exited.
     [output v after the for-loop]
  2. Reverse the order.

Claim: If there is an edge from u to v, then v is exited first. (This
implies that when we reverse the order, we have a topological sorting.)

Proof of claim: [Think of u=B, v=D above.] This is easy to see if our
DFS started at u before starting at v. What if we started at v first?
In this case, we would finish v before even starting u, since there
can't be a path from v to u (else the graph wouldn't be acyclic).

Note: topological sorting takes linear time (unlike regular sorting).
Note: we can use the same algorithm to tell whether a graph has any
cycles.

MST - Prim

A SPANNING TREE is a tree that touches all the nodes of a graph. (So,
it only makes sense in a connected graph.) An MST is the shortest
spanning tree (the size of a tree is the sum of its edge lengths). For
instance, you can imagine you're setting up a communication network
between these nodes or something like that. For example:

       5     8
    A-----B-----C
    |     |     |
   1|    1|     |6
    |   2 |  4  |
    D-----E-----F

What is the MST here? (total length = 14)

Prim's algorithm is a simple MST algorithm that grows the tree one edge
at a time, beginning at some arbitrary starting node. In particular,
the algorithm is just this: at each step, add the shortest edge into
the tree that doesn't create a cycle.

--> run Prim on the above graph

Now we need to prove that this works. We will prove it by induction.
First, a useful fact about spanning trees: if you take a spanning tree
and add an edge to it, you get a cycle. If you then remove any edge in
the cycle, you get back a spanning tree.

Proof of correctness of Prim's algorithm:
  * Inductively, assume the tree built so far is consistent with an MST
    (there might be more than one). This is certainly true at the
    start.
  * We need to show this is maintained each time we add an edge. Say T
    is the tree so far, and M is an MST consistent with it. If the new
    edge is in M, then fine. If the new edge e is not in M, then e
    forms a cycle with M. This edge e goes from T to outside T, so if
    we follow the cycle we must eventually reach an edge e' that goes
    back into T. We know length(e') >= length(e) by definition of the
    algorithm. So, if we add e to M and remove e', we get a new
    spanning tree that is no longer than M was. This maintains the
    inductive hypothesis.
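Looking back at the warm-up for a moment: the DFS-based topological
sort (method #2) can be sketched in a few lines of Python. The
adjacency-list representation (a dict of lists) and the concrete edge
set below, based on the DAG drawn earlier, are just for illustration.

```python
def topological_sort(graph):
    """graph: dict mapping each vertex to a list of its out-neighbors."""
    visited = set()
    order = []            # nodes in order of DFS exit

    def dfs(v):
        visited.add(v)
        for w in graph[v]:
            if w not in visited:
                dfs(w)
        order.append(v)   # "output v after the for-loop"

    for v in graph:       # DFS-main
        if v not in visited:
            dfs(v)
    return order[::-1]    # reverse the exit order

# Illustrative edges for the example DAG:
dag = {
    "A": ["D", "B", "C"],
    "B": ["D", "C"],
    "C": [],
    "D": ["E"],
    "E": [],
    "F": ["E", "B"],
}
order = topological_sort(dag)
pos = {v: i for i, v in enumerate(order)}
# every edge should go forward in the ordering
assert all(pos[u] < pos[w] for u in dag for w in dag[u])
```

Note that for very deep graphs an explicit stack avoids Python's
recursion limit; the recursive form above just mirrors the pseudocode.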
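The greedy step of Prim's algorithm can be sketched directly in Python
using the standard-library heapq. One caveat: heapq has no
decrease-key operation, so this sketch pushes duplicate entries and
skips stale ones when popped, a common workaround that changes the
constants but not correctness. The graph is the lecture's example.

```python
import heapq

def prim_mst(graph, start):
    """graph: dict mapping vertex -> list of (neighbor, length) pairs.
    Returns the total length of the MST grown from start."""
    in_tree = set()
    total = 0
    heap = [(0, start)]   # (distance into tree, vertex)
    while heap:
        d, v = heapq.heappop(heap)
        if v in in_tree:
            continue      # stale entry left over from an earlier push
        in_tree.add(v)
        total += d
        for x, length in graph[v]:
            if x not in in_tree:
                heapq.heappush(heap, (length, x))
    return total

# The example graph, with the edge lengths as drawn:
g = {
    "A": [("B", 5), ("D", 1)],
    "B": [("A", 5), ("C", 8), ("E", 1)],
    "C": [("B", 8), ("F", 6)],
    "D": [("A", 1), ("E", 2)],
    "E": [("B", 1), ("D", 2), ("F", 4)],
    "F": [("C", 6), ("E", 4)],
}
assert prim_mst(g, "A") == 14
```

Since the MST is a property of the graph, any starting vertex gives
the same total length here.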
Running time: To make this efficient, we need a good way of storing the
edges into our current tree so that we can quickly pull off the
shortest one. In fact, it's a little more efficient to store vertices
instead of edges. E.g., like this:

Init: start.dist = 0, v.dist = infinity for v != start.
      Put all nodes into data structure H.
Repeat: pull out the v of minimum v.dist from H; mark v as being in
        the tree.
        for each unmarked neighbor x of v:
          set x.dist = min(x.dist, length(xv edge));
          if the new value is smaller, set x.parent = v.

Operations done on H:
  insert(x):       |V| times
  remove-min():    |V| times
  decrease-key(x): |E| times
plus we do O(|E|) additional work in the for loop.

Linked list: insert is O(1), decrease-key is O(1), remove-min is O(V),
so the total time is O(V^2).

Regular heap: all operations are O(log V). Total time is O(E log V).

Fibonacci heap: insert and decrease-key are O(1), remove-min is
O(log V) (amortized, but that's fine for us). So, the total time using
a Fibonacci heap is O(V log V + E). So, this is linear time for dense
graphs (average degree >= log V).

MST - Kruskal

Here's another nice algorithm, called Kruskal's algorithm.

Algorithm: Sort the edges by length and examine them from shortest to
longest. Put an edge into the current forest (a forest is a set of
trees) if it doesn't form a cycle. E.g., look at it in this graph:

       5     8
    A-----B-----C
    |     |     |
   1|    1|     |6
    |   2 |  4  |
    D-----E-----F

Proof of correctness: We can basically use the same argument as before:
assume inductively that there is some MST M consistent with the forest
F found so far. Now we add a new edge e between components C and C' of
our forest. We know that M gets between C and C' somehow. In
particular, either M has e in it, or else e forms a cycle with M, and
this cycle has some other edge e' in it that connects C to some other
component (maybe C', or maybe an intermediate one) in our forest. By
definition of our algorithm, len(e) <= len(e').
So either M has e already, or substituting e for e' produces a tree M'
that is just as good and maintains our inductive hypothesis.

What about running time? It takes O(E log E) to sort. Then, for each
edge, we need to test whether it connects two different components. It
turns out there's a nice data structure, called the "union-find" data
structure, for doing exactly this. The general picture of this ADT is
that we maintain a collection {S_1, S_2, ..., S_k} of disjoint sets
over some universe. The operations are:

  MakeSet(x) - create a set containing just x.
  Union(x,y) - replace the set S_x containing x and the set S_y
               containing y with the single set S_x U S_y.
  Find(x)    - return the unique ID for the set containing x (this is
               just some representative element of the set).

Given this, we begin with MakeSet(v) for all vertices v. When we
consider some edge (v,w), we just check whether Find(v) == Find(w). To
put in the edge (v,w), we do a Union. It turns out there's a nice data
structure for doing this with "almost constant" amortized time per
operation (we'll get into the specifics next time). So, the total time
is O(E log E + E * alpha(E,V)), where alpha(E,V) is "almost constant".
This is dominated by the sorting time and is O(E log E). But if for
some reason we have the edges already sorted by length, then we get a
running time of O(E * alpha(E,V)), where basically alpha(E,V) <= 4.
(To make alpha > 4, we would need E > 2^2^2^65536 (actually, a couple
more 2's in the tower).)
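To tie this together, here's a sketch of Kruskal's algorithm in Python
on top of a minimal union-find. The particular union-find shown (union
by rank with path halving in Find) is one common variant, written here
ahead of the detailed treatment next time; treat its internals as an
illustrative assumption, not the analysis we'll do.

```python
class UnionFind:
    """Disjoint sets with union by rank and path halving."""

    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # MakeSet for each x
        self.rank = {x: 0 for x in elements}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        """Merge the sets of x and y; return False if already merged."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx                     # hang shorter under taller
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def kruskal(vertices, edges):
    """edges: list of (length, u, v). Returns (total length, tree edges)."""
    uf = UnionFind(vertices)
    total, tree = 0, []
    for length, u, v in sorted(edges):   # the O(E log E) sort
        if uf.union(u, v):               # skip edges that would form a cycle
            total += length
            tree.append((u, v))
    return total, tree

# The example graph again:
edges = [(5, "A", "B"), (8, "B", "C"), (1, "A", "D"),
         (1, "B", "E"), (6, "C", "F"), (2, "D", "E"), (4, "E", "F")]
total, tree = kruskal("ABCDEF", edges)
assert total == 14 and len(tree) == 5
```

Note how the algorithm itself only ever calls union (which does the
Find(u) == Find(v) test internally); all the cleverness lives inside
the data structure, which is exactly the separation the ADT gives us.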