15-750 Grad Algorithms                                          1/30/01

  * Topological sorting
  * MST - Prim
  * MST - Kruskal
  * Intro to union-find
=======

We've been talking about data structures for storing information so that
it's easy to do lookups, inserts, deletes, and other natural operations.
In the next couple of lectures, we'll be talking about data structures
for other kinds of operations, where the motivation comes in large part
from graph algorithms. In the process of constructing an algorithm,
we'll want to be able to maintain data efficiently under certain kinds
of operations, and we'll need to come up with the right data structures
for that. Today we'll talk about some of these basic graph algorithms,
and then in the next couple of lectures we'll cover the data structures.

Warm-up: TOPOLOGICAL SORTING.

A "DAG" or directed acyclic graph is a directed graph without any
cycles. E.g.,

    A------>D--->E
    |\      ^    ^
    | \     |    |
    |  \    |    |
    |   \   |    |
    V    V  |    |
    C<----B<-----F

Given a DAG, the "topological sorting" problem is: find an ordering of
the vertices such that all edges go forward in the ordering. This
typically comes up when you are given a set of tasks to do with
precedence constraints (you need to do A and F before you can do B),
and you want to find a legal order in which to perform the jobs.

How to solve it #1:
  1. Find a node v with no in-edges and output it.
  2. Remove v and its out-edges.
  3. Go to 1.

This takes time O(n^2). How about a linear-time way? Here's a neat
algorithm based on depth-first search.

DFS-main:
  For v = 1 to n:
    if v is not yet visited, do DFS(v).

DFS(v):
  mark v as visited.
  for each unvisited neighbor w of v: do DFS(w).

How to solve it #2:
  1. Use depth-first search, and output the nodes as they are exited.
     [output v after the for-loop]
  2. Reverse the order.

Claim: If there is an edge from u to v, then v is exited first. (This
implies that when we reverse the order, we have a topological sorting.)

Proof of claim: [Think of u=B, v=D above.] This is easy to see if our
DFS started at u before starting at v. What if we started at v first?
In this case, we would finish v before even starting u, since there
can't be a path from v to u (else the graph wouldn't be acyclic).

Note: topological sorting takes linear time (unlike regular sorting).
Note: we can use the same algorithm to tell whether a graph has any
cycles.

MST - Prim

A SPANNING TREE is a tree that touches all the nodes of a graph. (So,
it only makes sense in a connected graph.) An MST is the shortest
spanning tree (the size of a tree is the sum of its edge lengths). For
instance, you can imagine you're setting up a communication network
between these nodes or something like that. For example:

       5     8
    A-----B-----C
    |     |     |
   1|    1|     |6
    |   2 |  4  |
    D-----E-----F

What is the MST here? (total length = 14)

Prim's algorithm is a simple MST algorithm that grows the tree one edge
at a time, beginning at some arbitrary starting node. In particular,
the algorithm is just this: at each step, add the shortest edge into
the tree that doesn't create a cycle.

--> run Prim on the above graph

Now we need to prove that this works. We will prove it by induction.
First, a useful fact about spanning trees: if you take a spanning tree
and add an edge to it, you get a cycle. If you then remove any edge in
the cycle, you get back a spanning tree.

Proof of correctness of Prim's algorithm:
  * Inductively, assume the tree built so far is consistent with an MST
    (there might be more than one). This is certainly true at the
    start.
  * We need to show this is maintained each time we add an edge. Say T
    is the tree so far, and M is an MST consistent with it. If the new
    edge is in M, then fine. If the new edge e is not in M, then e
    forms a cycle with M. This edge e goes from T to outside T, so if
    we follow the cycle we must eventually reach an edge e' that goes
    back into T. We know length(e') >= length(e) by definition of the
    algorithm. So, if we add e to M and remove e', we get a new
    spanning tree that is no longer than M was. This maintains the
    inductive hypothesis.
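Looking back at the warm-up for a moment: the DFS-based topological
sort (method #2) can be sketched in a few lines of Python. The
adjacency-list representation (a dict of lists) and the concrete edge
set below, based on the DAG drawn earlier, are just for illustration.

```python
def topological_sort(graph):
    """graph: dict mapping each vertex to a list of its out-neighbors."""
    visited = set()
    order = []            # nodes in order of DFS exit

    def dfs(v):
        visited.add(v)
        for w in graph[v]:
            if w not in visited:
                dfs(w)
        order.append(v)   # "output v after the for-loop"

    for v in graph:       # DFS-main
        if v not in visited:
            dfs(v)
    return order[::-1]    # reverse the exit order

# Illustrative edges for the example DAG:
dag = {
    "A": ["D", "B", "C"],
    "B": ["D", "C"],
    "C": [],
    "D": ["E"],
    "E": [],
    "F": ["E", "B"],
}
order = topological_sort(dag)
pos = {v: i for i, v in enumerate(order)}
# every edge should go forward in the ordering
assert all(pos[u] < pos[w] for u in dag for w in dag[u])
```

Note that for very deep graphs an explicit stack avoids Python's
recursion limit; the recursive form above just mirrors the pseudocode.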
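The greedy step of Prim's algorithm can be sketched directly in Python
using the standard-library heapq. One caveat: heapq has no
decrease-key operation, so this sketch pushes duplicate entries and
skips stale ones when popped, a common workaround that changes the
constants but not correctness. The graph is the lecture's example.

```python
import heapq

def prim_mst(graph, start):
    """graph: dict mapping vertex -> list of (neighbor, length) pairs.
    Returns the total length of the MST grown from start."""
    in_tree = set()
    total = 0
    heap = [(0, start)]   # (distance into tree, vertex)
    while heap:
        d, v = heapq.heappop(heap)
        if v in in_tree:
            continue      # stale entry left over from an earlier push
        in_tree.add(v)
        total += d
        for x, length in graph[v]:
            if x not in in_tree:
                heapq.heappush(heap, (length, x))
    return total

# The example graph, with the edge lengths as drawn:
g = {
    "A": [("B", 5), ("D", 1)],
    "B": [("A", 5), ("C", 8), ("E", 1)],
    "C": [("B", 8), ("F", 6)],
    "D": [("A", 1), ("E", 2)],
    "E": [("B", 1), ("D", 2), ("F", 4)],
    "F": [("C", 6), ("E", 4)],
}
assert prim_mst(g, "A") == 14
```

Since the MST is a property of the graph, any starting vertex gives
the same total length here.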
Running time: To make this efficient, we need a good way of storing the
edges into our current tree so that we can quickly pull off the
shortest one. In fact, it's a little more efficient to store vertices
instead of edges. E.g., like this:

Init: start.dist = 0, v.dist = infinity for v != start.
      Put all nodes into data structure H.
Repeat: pull out the v of minimum v.dist from H; mark v as being in
        the tree.
        for each unmarked neighbor x of v:
          set x.dist = min(x.dist, length(xv edge));
          if the new value is smaller, set x.parent = v.

Operations done on H:
  insert(x):       |V| times
  remove-min():    |V| times
  decrease-key(x): |E| times
plus we do O(|E|) additional work in the for loop.

Linked list: insert is O(1), decrease-key is O(1), remove-min is O(V),
so the total time is O(V^2).

Regular heap: all operations are O(log V). Total time is O(E log V).

Fibonacci heap: insert and decrease-key are O(1), remove-min is
O(log V) (amortized, but that's fine for us). So, the total time using
a Fibonacci heap is O(V log V + E). So, this is linear time for dense
graphs (average degree >= log V).

MST - Kruskal

Here's another nice algorithm, called Kruskal's algorithm.

Algorithm: Sort the edges by length and examine them from shortest to
longest. Put an edge into the current forest (a forest is a set of
trees) if it doesn't form a cycle. E.g., look at it in this graph:

       5     8
    A-----B-----C
    |     |     |
   1|    1|     |6
    |   2 |  4  |
    D-----E-----F

Proof of correctness: We can basically use the same argument as before:
assume inductively that there is some MST M consistent with the forest
F found so far. Now we add a new edge e between components C and C' of
our forest. We know that M gets between C and C' somehow. In
particular, either M has e in it, or else e forms a cycle with M, and
this cycle has some other edge e' in it that connects C to some other
component (maybe C', or maybe an intermediate one) in our forest. By
definition of our algorithm, len(e) <= len(e').
So either M has e already, or substituting e for e' produces a tree M'
that is just as good and maintains our inductive hypothesis.

What about running time? It takes O(E log E) to sort. Then, for each
edge, we need to test whether it connects two different components. It
turns out there's a nice data structure, called the "union-find" data
structure, for doing exactly this. The general picture of this ADT is
that we maintain a collection {S_1, S_2, ..., S_k} of disjoint sets
over some universe. The operations are:

  MakeSet(x) - create a set containing just x.
  Union(x,y) - replace the set S_x containing x and the set S_y
               containing y with the single set S_x U S_y.
  Find(x)    - return the unique ID for the set containing x (this is
               just some representative element of the set).

Given this, we begin with MakeSet(v) for all vertices v. When we
consider some edge (v,w), we just check whether Find(v) == Find(w). To
put in the edge (v,w), we do a Union. It turns out there's a nice data
structure for doing this with "almost constant" amortized time per
operation (we'll get into the specifics next time). So, the total time
is O(E log E + E * alpha(E,V)), where alpha(E,V) is "almost constant".
This is dominated by the sorting time and is O(E log E). But if for
some reason we have the edges already sorted by length, then we get a
running time of O(E * alpha(E,V)), where basically alpha(E,V) <= 4.
(To make alpha > 4, we would need E > 2^2^2^65536 (actually, a couple
more 2's in the tower).)
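To tie this together, here's a sketch of Kruskal's algorithm in Python
on top of a minimal union-find. The particular union-find shown (union
by rank with path halving in Find) is one common variant, written here
ahead of the detailed treatment next time; treat its internals as an
illustrative assumption, not the analysis we'll do.

```python
class UnionFind:
    """Disjoint sets with union by rank and path halving."""

    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # MakeSet for each x
        self.rank = {x: 0 for x in elements}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        """Merge the sets of x and y; return False if already merged."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx                     # hang shorter under taller
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def kruskal(vertices, edges):
    """edges: list of (length, u, v). Returns (total length, tree edges)."""
    uf = UnionFind(vertices)
    total, tree = 0, []
    for length, u, v in sorted(edges):   # the O(E log E) sort
        if uf.union(u, v):               # skip edges that would form a cycle
            total += length
            tree.append((u, v))
    return total, tree

# The example graph again:
edges = [(5, "A", "B"), (8, "B", "C"), (1, "A", "D"),
         (1, "B", "E"), (6, "C", "F"), (2, "D", "E"), (4, "E", "F")]
total, tree = kruskal("ABCDEF", edges)
assert total == 14 and len(tree) == 5
```

Note how the algorithm itself only ever calls union (which does the
Find(u) == Find(v) test internally); all the cleverness lives inside
the data structure, which is exactly the separation the ADT gives us.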