15-451 Algorithms 10/09/01 Graph algorithms I: * Topological sorting * MST - Prim * MST - Kruskal * Intro to union-find ========================================================================= A lot of algorithmic problems can be modeled as problems on graphs. Today we will talk about a couple basic ones and we will continue talking about graph algorithms for a lot of the rest of the course. Reminder of basic terminology: A graph is a set of NODES, or VERTICES, with edges between some of the nodes. V = set of vertices, E = set of edges. If there's an edge between two vertices, we call them NEIGHBORS. Degree of a vertex is the number of neighbors it has. standard notation: n = |V|, m = |E|. If we don't allow self-loops, what is max number of total edges? {n \choose 2} In a directed graph, each edge has a direction. Can have up to n(n-1) edges now. For each node, can talk about out-neighbors (and out-degree) and in-neighbors (and in-degree). -------------------------------------------------------- PROBLEM 1: TOPOLOGICAL SORTING. A "DAG" or directed-acyclic-graph is a directed graph without any cycles. E.g., A----->D--->E |\ ^ | \ | | \ | | | / V V/ C<--B<----F Given a DAG, the "topological sorting" problem is: find an ordering of the vertices such that all edges go forward in the ordering. Typically comes up when given a set of tasks to do with precedence constraints (need to do A and F before you can do B), and want to find a legal ordering of performing the jobs. How to solve it #1: 1. find a node v with no in-edges and output it. 2. remove v and its outedges 3. goto 1. If we don't use any special data-structures, then step 1 takes O(n) time and overall this takes O(n^2) time. Can improve this using a heap data structure to store in-degree, but here's a neat linear-time algorithm based on depth-first search: DFS(v): mark v as visited. for each unvisited neighbor w of v: do DFS(w). DFS-main: For v=1 to n: if v is not yet visited, do DFS(v). How to solve it #2: 1. Use depth-first search, and output the nodes as they are exited. [output v after for-loop] 2. reverse the order. Claim: If there is an edge from u to v, then v is exited first. (This implies that when we reverse the order, we have a topological sorting.) Proof of claim: [Think of u=B, v=D above.] Easy to see if our DFS started at u before starting at v. What if we started at v first? In this case, we would finish v before even starting u since there can't be a path from v to u (else wouldn't be acyclic). Note: topological sorting in linear time (unlike regular sorting) Note: can use same alg to tell if graph has any cycles. -------------------------------------------------------------------- PROBLEM 2: MST A SPANNING TREE is a tree that touches all the nodes of a graph. (So, it only makes sense in a connected graph.) A MST is the shortest spanning tree (size of tree is sum of its edge lengths) For instance, can imagine you're setting up a communication network between these nodes or something like that. For example: 5 8 A-----B-----C | | | 1| 1| |6 | 2 | 4 | D-----E-----F What is the MST here? (total length = 14) Prim's algorithm ---------------- Prim's algorithm is the following simple MST algorithm: - pick some arbitrary start node. - add shortest edge to current tree that doesn't create a cycle. - continue until tree spans all nodes. --> run Prim on above graph Now we need to prove that this works. Will prove by induction. First: useful fact about spanning trees: If you take a spanning tree and add an edge to it, you get a cycle. If you then remove any edge in the cycle, you get back a spanning tree. Proof of correctness of Prim's alg: * Inductively, assume tree built so far is consistent with an MST (there might be more than one). This is certainly true at the start. * Need to show this is maintained each time we add an edge. Say T is tree so far, and M is MST consistent with it. If new edge is in M, then fine. If new edge is not in M, then edge forms a cycle with M. This edge e goes from T to outside, so if we follow the cycle we must eventually get to an edge e' that goes back in to T. We know length(e') >= length(e) by definition of the algorithm. So, if we add e to M and remove e', we get a new tree that is no longer than M was. This maintains inductive hypothesis. Running time: To make this efficient, we need a good way of storing the edges into our current tree in such a way that we can quickly pull off the shortest one. In fact, it's a little more efficient to store vertices instead of edges. E.g., like this. For each node x, have two fields: x.dist = length of shortest edge into tree so far (or infinity if not nbr) x.parent = parent if we put this edge in. Init: start.dist=0, v.dist = infinity for v != start. Put all nodes in data structure H. Repeat: pull out v of min v.dist from H. Mark as being in the tree. for each unmarked neighbor x of v: set x.dist = min (x.dist, length(xv edge)) if new value is smaller, set x.parent = v. Operations done on H: insert(x): |V| times remove-min(): |V| times. decrease-key(x): |E| times. plus, we do O(|E|) additional work in the for loop. Linked list: insert is O(1). descrease-key is O(1). remove-min is O(V). Total time is O(V^2). Heap: all operations are O(log V). Total time is O(E*log(V)). Also something fancier called a "Fibonacci heap": Improves on regular heap in that insert and decrease-key are O(1) amortized. So, total time using a fibonacci heap is O(V*logV + E), which is better for dense graphs. ----------------------------------------------------------------------- MST - Kruskal Here's another nice algorithm, called Kruskal's algorithm. Algorithm: Sort edges by length and examine them from shortest to longest. Put edge into current forest (forest is set of trees) if it doesn't form a cycle. E.g., look at in this graph: 5 8 A-----B-----C | | | 1| 1| |6 | 2 | 4 | D-----E-----F Proof of correctness: Can basically use same argument as before: assume inductively there is some MST M consistent with forest F found so far. Now we add in a new edge e between components C and C' of our forest. We know that M gets between C and C' somehow. In particular, either M has e in it, or else e forms a cycle with M, and this cycle has some other edge e' in it that connects C to some other component (maybe C' or maybe an intermediate one) in our forest. By definition of our algorithm, len(e) <= len(e'). So either M has e already, or substituting e for e' produces a tree M' that is just as good and maintains our inductive hypothesis. What about running time? Takes O(E log E) to sort. Then, for each edge we need to test if it connects two different components. Turns out there's a nice data structure called "union-find" data structure for doing this. General picture is of this ADT is we maintain a collection {S_1, S_2, ..., S_k} of disjoint sets over some universe. Operations are: MakeSet(x) - creates set containing just x Union(x,y) - replace the set S_x containing x and the set S_y containing y with the single set S_x U S_y. Find(x) - return the unique ID for the set containing x (this is just some representative element of this set). Given this, we begin with MakeSet(v) for all vertices v. When we consider some edge (v,w) we just see if Find(v)==Find(w). To put in the edge(v,w) we do a union. Turns out there's a nice data structure for doing this with "almost constant" amortized time per operation (get into specifics next time). So, total time is O(E log E + E * alpha(E,V)) where alpha(E,V) is "almost constant". This is dominated by the sorting time and is O(E log E). But, if for some reason we have the edges already sorted by length, then get time of O(E * alpha(E,V)), where basically alpha(E,V) <= 4. (to make alpha > 4, would need E > 2^2^2^65536 (actually, a couple more 2's in the tower))