15-451					11/28/00
* Approximation Algorithms

[We've switched the order of ZK proofs and Approximation algs from
what's on the schedule]
======================================================================
General Question:

For problems that are NP-hard, can we perhaps in polynomial time guarantee a
good *approximate* solution?  E.g., can we guarantee to find a
solution that's within 10% of optimal?  Or how about within a factor
of 2 of optimal?  Or, anything non-trivial?

Interesting thing: even though all NP-complete problems are equivalent
in terms of difficulty of finding optimal solution, the hardness or
easyness of getting a good approximation varies all over the map.

VERTEX COVER
============
   - Given a graph G, looking for smallest set of vertices such that
	every edge is incident to at least one vertex in the set. 

   - Example:                            +----+----+
                                         |    |    |
                                         +----+----+

   - Can think of as: what is the fewest # of guards we need to place
	in museum to cover all the corridors.

   - This problem is NP-hard.  But it turns out it's easy to get
        within a factor of 2.

Let's start with some strategies that *don't* work.

  Strawman #1: Pick arbitrary vertex with at least one uncovered edge
	incident, put into cover, and repeat.  What would be a bad example?
	[A: how about a star graph]

  Strawman #2: how about picking the vertex that covers the *most*
	uncovered edges.  Turns out this doesn't work either. [make bipartite
	graph where opt is size t, but this alg finds one of size t +
	floor(t/2) + floor(t/3) + floor(t/4) + ... + 1.  This is Theta(t log
	t).  Best examples are with t=6 or t=12.]

How to get factor of 2.  Two easy algorithms:

 Alg1:
        Pick arbitrary edge.  We know any VC must have at least 1
        endpt, so lets take both.  Then throw out all edges covered
        and repeat.   Keep going until no uncovered edges left.  What
	we've found in the end is a matching (a set of edges no two of
	which share an endpoint) that's "maximal" (meaning that you can't add
	any more edges to it and keep it a matching).  So, if we picked k
	edges, our VC has size 2k, but any VC must have size at least k since
	you need to have at least one endpoint of each of these edges.


Here's another 2-approximation algorithm for Vertex Cover:

 Alg2: 
        Step1: Solve a *fractional* version of the problem.  Have a variable
        x_i for each vertex.  Constraint 0<= x_i <= 1.  Each edge should be
        "fractionally covered": for each edge (i,j) we want x_i+x_j >= 1.
        Then our goal is to minimize sum_i x_i.  We can solve this using
        linear programming.

        E.g., triangle-graph.

        Step2: now pick each vertex i such that x_i >= 1/2.

   Claim 1: this is a VC.  Why?

   Claim 2: The size of this VC is at most twice the size of the optimal
        VC.  Why?  Because it's at most twice the value of the *fractional*
        solution we found.  And, the size of the optimal fractional solution
        is <= the size of the optimal integral solution (i.e., the
        optimal VC).

   Interesting fact: nobody knows any algorithm with approximation
   ratio 1.9.  Best known is 2 - O((loglog n)/(log n)), which is 2 - o(1).

   Current best hardness result: Hastad shows 7/6 - epsilon is NP-hard.


SET-COVER
---------
On hwk6, problem 1 talks about the set-cover problem:  
	Given a domain X of n points, and m subsets S_1, S_2, ..., S_m of
	these points.  Goal: find the fewest number of these subsets needed to
	cover all the points.

As you will be showing on hwk6, set-cover is NP-hard.  In fact, it's
been proven that getting an o(log n) approximation to the optimal
solution is also NP-hard.  Luckily, getting O(log n) is not so bad: a
simple greedy algorithm works.

	Greedy Algorithm: Pick the set that covers the most points.  Throw out
	all the points covered.  Repeat.
	
	What's an example where this *doesn't* find the best solution?
	
	Theorem: If the optimal solution uses k sets, the greedy algorithm
	finds a solution with at most k(ln(n/k) + 1) sets. So, it is an 
	O(log n) approximation -- or better if k is large.
	
	Proof: Since the optimal solution uses k sets, there must some set
	that covers at least a 1/k fraction of the points. Therefore, after
	the first iteration of the algorithm, there are at most n(1 - 1/k)
	points left.  After the second, there are at most n(1 - 1/k)^2 points
	left, etc.  After k rounds, there are at most n(1 - 1/k)^k < n*(1/e)
	points left.  After 2k rounds, there are < n*(1/e)^2
	points.  After k*ln(n) rounds, there is < 1 point, so we're done.  Get
	slightly tighter bound by noticing that after k*ln(n/k) rounds there
	are at most n*(1/e)^{ln(n/k)} = k points left, and after that, each
	round gets us at least one point each, which gives us the k(ln(n/k) +
	1) bound.
	

MAX-SAT:  Given a CNF formula (like in SAT), try to maximize the
number of clauses satisfied.

  To make things cleaner, let's assume we have reduced each clause [so,
  (x or x or y) would become just (x or y), and (x or not(x)) would be
  removed] 

  Let m = # of clauses in the formula, m_1 = # clauses of size 1, m_2 =
  # clauses of size 2, etc.  So, m = m_1 + m_2 + ...

Claim: There exists an assignment that satisfies at least 
  sum_k m_k(1 - 1/2^k) clauses.  Furthermore, there is an efficient
  algorithm to find one.

Corollary: if every clause has size 3 (this is sometimes called the
  MAX-exactly-3-SAT problem), then we can satisfy at least m(7/8) clauses.
  So, this is for sure a 7/8-approximation, since the optimal solution
  satisfies at most m clauses.  

Interesting fact: getting a 7/8 + epsilon approximation for any
  constant epsilon (like .001) is NP-hard. 

Proof of claim: Consider a random assignment to the variables.  For a
  clause of size k, the chance it is *not* satisfied is 1/2^k.  So, the
  expected # of clauses satisfied is sum_k m_k(1 - 1/2^k).  So, to find
  an assignment that satisfies at least this many clauses, we just keep
  trying random assignments, and it's not hard to show that whp, pretty
  soon we'll find one that's as least as good as this expectation.  This
  also proves that an assignment satisfying at least this many clauses
  exists.

How about a deterministic algorithm?  Here's what we can do.  Look at
  $x_1$: for each of the two possible settings (0 or 1) we can calculate
  the expected number of clauses satisfied if we were to go with that
  setting, and then set the rest of the variables randomly.  (It is just
  the number of clauses already satisfied plus $\sum_k m_k(1-1/2^k)$,
  where $m_k$ is the number of clauses of size $k$ in the ``formula to
  go''.)  Fix $x_1$ to the setting that gives us a larger expectation.
  Now go on to $x_2$ and do the same thing, setting it to the value with
  the highest expectation-to-go, and then $x_3$ and so on.  The point is
  that the expected number of clauses satisfied by a random assignment
  is the {\em average} of the expected number given $x_i = 0$ and the
  expected number given $x_i = 1$, so our ``expectation to go'' never
  decreases.

This is called the ``conditional expectation'' method.  The algorithm
  itself is completely deterministic --- in fact we could rewrite
  it to get rid of any hint of randomization by just viewing $\sum_k
  m_k(1-1/2^k)$ as a way of weighting the clauses to favor the small
  ones, but our motivation was based on the randomized method.


In general, the area of approximation algorithms and approx hardness
is a major area of algorithms research.  Occupies a good fraction of
major algorithms conferences.