15-451 Algorithms 10/25/05 * finish SCC algorithm * max flow * bipartite matching ======================================================================= We ended last time looking at the problem: given a directed graph G, decompose it into its strongly-connected components (SCCs). Let's start today by finishing off our analysis of that. A SCC is a maximal set of vertices that can all reach each other. It's possible to decompose any graph G into its SCCs, and if you view each SCC as a "supernode" then the connections between these supernodes form a DAG. [insert example] Starting idea: do a DFS, keeping track of the order in which nodes are visited, which we will call their "dfsnum", and then see what we can do with it. Observation #1: if we look at the DFS-tree, each SCC has a unique "head" which is the first vertex of the component visited in the search. Furthermore, the component is connected in the tree. [Why can't the DFS go from the current component, off into some other component and then back before popping?] So, if you chopped the tree at every edge leading into a head, you would decompose it into the strongly connected components. So, all we need to do is identify the heads, then we can chop away... Observation 2: suppose we're about to exit some vertex v. All vertices with a lower dfsnum than v that have not yet been exited are ancestors of v, and vice-versa. So, v is the head of an SCC iff there is no path from v to a lower dfsnum w not yet exited. [Proof: if v is head then there better not be such a path since w is ancestor. If v is not the head then v's head needs to be an ancestor of v and by definition of SCC, that node had better be reachable from v.] So, this would be great if we could somehow get DFS to output the lowest reachable non-exited vertex when it returns. Unfortunately, this is not so easy.... Example: [leap-frog graph] o--->o--->o--->o--->o--->o ^ ^ | | +----|----+ | +---------+ Instead, let's define one more category for vertices: "outputted". When we find the head of some component, we then label all not-yet outputted vertices in its subtree as being in its SCC and mark them as outputted. Can actually do this without invoking new DFS and just using a stack, but let's not worry about that. Define low(v) = lowest dfsnum non-outputted vertex reachable from v by a path that uses at most one back-edge or cross-edge. [by "non-outputted we mean not yet outputted by the time we exit v] This we *can* get DFS to return, since we can compute it at the leaves of the DFS tree, and if we know the answers for all children of v, we can compute it for v. Claim: low numbers are sufficient. v is head iff low(v)=v. Direction 1: if v is head then low(v)=v. Proof: if there's a back edge from v's subtree to ancestor of v then clearly v can't be a head. But even if there is a cross-edge to some w, let u be the head of w's component. If u isn't an ancestor of v then (inductively) we would have outputted it already. So if w is not outputted, then u *is* an ancestor of v, and so v is not the head of its component. Direction 2: if v is not head then low(v) != v. Proof: if v is not head then there must be a path from subtree of v to an ancestor of v. This entire path must be inside v's SCC, so the first back- or cross-edge on it must be to an unoutputted vertex. So, that's it... ============================================================================= Today and next time we are going to talk about an important algorithmic problem called the Network Flow problem. It's important because it can be used to express and solve a lot of different things. In O.R. they have entire courses devoted to the topic. NETWORK FLOW ------------ - directed graph. Source node. Sink node. Each (directed) edge has a capacity. (Let's say these are integers.) Goal is to flow as much as possible from source to sink. E.g., s-->a, cap = 4. a-->c, cap=3. c-->t, cap = 2. s-->b, cap = 2. b-->d, cap=3. d-->t, cap = 4. c-->b, cap = 1. b-->c, cap=2. E.g., want to route message traffic from source to sink, and the capacities tell us how much bandwidth we're allowed on each edge. Formally, rules are: Capacity constraint: on any edge, f(u,v) <= c(u,v) Flow conservation: for any vertex except s and t, flow in = flow out. What is the max flow here? Answer: 5. How can you see that above max flow was really maximum? Notice, this flow saturates the a-->c and s-->b edges. And, if you remove these, you disconnect t from s. In other words, the graph has a "s-t cut" of size 5. (a set of edges of total capacity 5 that if you remove disconnects the source from the sink). So, the point is that any unit of stuff going from s to t must take up at least 1 unit of capacity in these pipes. So, we know we're optimal. We just argued that in general, max flow <= min cut. DEFINITION: an s-t cut is a set of edges whose removal disconnects t from s. Or, formally, a cut is a partition of vertex set into A and B where s is in A and t is in B. (The edges of the cut are then all edges going from A to B). DEFINITION: the capacity of cut is the sum of capacities of edges in the cut. Or, in the formal viewpoint, it is the sum of capacities of all edges going from A to B. (Don't include edges from B to A.) Easy fact we showed above: max s-t flow <= min s-t cut. How to find a maximum flow and prove it's correct? Here's a very natural strategy: find a path from s to t and push as much flow on it as possible. Then look at the leftover graph and repeat until there is no longer any path with capacity left to push any more flow on. Of course, we need to *prove* that this works: we can't somehow end up at a suboptimal solution by making bad choices along the way. And we will. This is called the Ford-Fulkerson algorithm. Basic Ford-Fulkerson algorithm ------------------------------ While there exists an s-->t path P of positive residual (leftover) capacity: push maximum possible flow along P (saturating at least one edge on it) [These are called "augmenting paths"] Definition: "RESIDUAL CAPACITY" c_f(u,v) = c(u,v) - f(u,v) where we make the convention that f(v,u) = -f(u,v). Definition: the RESIDUAL GRAPH is the directed graph with all edges of positive residual capacity, each one labeled by its residual capacity. Note: may include "back-edges" of original graph. --> Example on above graph: start with path s-->b-->c-->t. Draw flow using "[flow/capacity]" notation, and draw residual graph. Note: residual capacity can be *larger* than original capacity if we had flow going in opposite direction. E.g., residual capacity on the c-->b edge is now 3. CLAIM: Any legal flow in the Residual Graph can be added to the flow so far and get a legal flow in the original graph. PROOF: (1) flow-in = flow-out condition is always satisfied when you add two flows. (2) Capacity constraints satisfied by defn of residual. Why is the residual graph good to have? One reason: now left with same type of problem we started with. Can use any black-box path-finding method (e.g., DFS). --> Note: running time is not necessarily so great - it depends on the choices we make. E.g., standard bad-choice graph. After we prove correctness, we'll worry about about algorithms to fix this problem (next time). Algorithm by design finds a legal flow. Why is it maximum? Proof: Let's look at the final residual graph. This graph must have s and t disconnected by definition of the algorithm. Let A be component containing s and B = rest. Now, let C = capacity of the (A,B) cut in the *original* graph --- so we know we can't do better than C. The claim is that we in fact *did* find a flow of value C (which therefore implies it is maximum). Here's why: let's look at what happens to the residual capacity of the cut after each iteration of the algorithm. Say in some iteration we found a path with k units of flow. Then, even if the path zig-zagged between A and B, every time we went A-->B we added k to flow from A to B (and subtracted k from the residual capacity) and every time we went B-->A we took away k from this flow (and added k to the residual capacity). And, we went from A-->B exactly one more time than we went from B-->A. So, the residual capacity went down by exactly k. So, the drop in capacity is equal to the increase in flow. Since at the end the residual capacity is zero (remember how we defined A and B) this means the total flow is equal to C. So, we've found a flow of value EQUAL to the capacity of this cut. We know we can't do better, so this must be a max flow, and C must be a minimum cut. Note, we've actually proven the nonobvious MAXFLOW-MINCUT theorem: In any graph, max flow from s to t = capacity of minimum (s,t)-cut. We started with saying max flow <= min cut. But now we've argued that this algorithm finds a flow of value equal to SOME cut, so it can't be less than the MINIMUM. We have also proven the INTEGRAL-FLOW THEOREM: if all capacities are integers, then there is a max flow in which all flows are integers. [This seems obvious, but you'll use it to show something that's not at all obvious on hwk 5.] Bipartite Maximum matching problem ---------------------------------- Say we wanted to be more sophisticated about assigning homework presentation slots to groups. We could ask each group to list the slots acceptable to them, and then draw this as a bipartite graph: group slot ----- ---- A 1 B 2 C 3 Edge if acceptable: e.g, (A 1) (A 2) (B 1) (B 3) (C 2) (C 3) This is a BIPARTITE GRAPH: a graph with two sides L and R, and all edges go between L and R. A MATCHING is a set of edges with no endpoints in common. What we want here is a PERFECT MATCHING - a matching that connects every point on the left hand side with some point on the right hand side. What's a perfect matching here? More generally (say there is no perfect matching) we want a MAXIMUM MATCHING - a matching with maximum possible number of edges. Algorithm to solve: ------------------ 1. Set up fake "start" node S connected to all in L. Connect all in R to a fake "end" node T. Orient all edges left-to-right and give each a capacity of 1. 2. Find a max flow from S to T using Ford-Fulkerson. 3. Output matching corresponding to flow. This finds a legal matching because edges from R to T have capacity 1, so the matching can't have two edges into same node, and similarly the edges from S to L have cap=1, so you can't have two edges leaving same node in L. It's maximum because any matching gives you a flow of the same value. (So if there was a better matching, we wouldn't be at a maximum flow). What about the number of iterations of path-finding? This is at most the #edges in the matching since each augmenting path gives us 1 new edge. [Run alg on above example]. Notice a neat fact: say you start matching A->1, C->3. These are bad choices. But, the augmenting path automatically undoes them as it improves the flow! Matchings come up as subroutines in a lot of problems like matching up suppliers to customers, or cell-phones to cell-stations when you have overlapping cells. Also as a basic subroutine in other algorithmic problems.