15-451 Undergraduate Algorithms
Spring 2003
April 29, 2004
Approximation Algorithms
Danny Sleator (adapted from notes by Avrim Blum)
* NP-completeness recap
* Approximations to Metric TSP
* Approximations to Vertex Cover
* The MAX SAT problem
======================================================================
NP-completeness recap: (Not explained during lecture)
Going back to network flow and linear programming, one thing that
makes these problems important is that you can use them to represent a
lot of problems (i.e., you can reduce a lot of problems to them).
One way to think about NP-completeness is that a problem like 3-SAT is
so expressive, it can represent *any* problem in NP. I.e., you can
reduce any problem in NP to that. One way to show some other problem
X is NP-complete by showing how a magic algorithm to solve X could be
used to solve 3-SAT (which then can solve anything in NP). A good way
to think of this is to imagine someone has an instance of 3-SAT that
they will pay $1,000,000 to solve, and someone else has a magic box
for X they're willing to sell for $100,000. So, what you want to do
is show how you could convert the given 3-SAT instance to an instance
of problem X, run the magic algorithm, and have this give you a
solution to the original million-dollar 3-SAT instance.
Today: look at approximation algorithms.
General Question: Maybe we can't hope to always get the best solution,
but can we at least guarantee to get a "pretty good" solution? E.g.,
can we guarantee to find a solution that's within 10% of optimal? Or
how about within a factor of 2 of optimal? Or, anything non-trivial?
Interesting thing: even though all NP-complete problems are equivalent
in terms of difficulty of finding optimal solution, the hardness or
easiness of getting a good approximation varies all over the map.
METRIC-TSP
==========
Here we define a version of the Travelling Salesman Problem on a
metric space:
INPUT: a set of n points and a metric d(.,.) defined on these
points, along with a bound B.
QUESTION: Does there exist a travelling salesman tour (TST) of
length at most B? That is, is there a permutation of the points
s(1), s(2), ... s(n) such that
d(s(1),s(2)) + d(s(2),s(3)) + ... + d(s(n-1),s(n))+d(s(n), s(1)) <= B
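The tour cost above is just the sum of consecutive distances plus the edge closing the cycle. A small sketch (the function name and the dictionary representation of d are illustrative, not from the notes):

```python
def tour_length(tour, d):
    """Cost of the cyclic tour tour[0], tour[1], ..., tour[-1], tour[0],
    where d is a dict mapping ordered pairs of points to distances."""
    n = len(tour)
    return sum(d[(tour[i], tour[(i + 1) % n])] for i in range(n))
```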
Recall the definition of a _metric_ d(.,.):
For all X,Y: d(X,Y) >= 0
For all X,Y: d(X,Y) = d(Y,X)
For all X,Y,Z d(X,Y) + d(Y,Z) >= d(X,Z) (Triangle inequality)
An additional requirement (not usually important) is that for any
two distinct points X,Y, d(X,Y) > 0. In other words if d(X,Y)=0
then X=Y. If this does not hold we have a _semi-metric_.
Theorem: METRIC-TSP is NP-Complete:
Proof: Clearly METRIC-TSP is in NP, because one can exhibit a TST of
length at most B as a certificate, and its length is easy to check.
Let HAM-CYCLE be the hamiltonian cycle problem. We'll show that

	HAM-CYCLE <=_p METRIC-TSP
Suppose we're given a graph G=(V,E) on which we want to solve the
HAM-CYCLE problem.
We'll construct a set and a metric and a bound such that the resulting
METRIC-TSP is solvable if and only if the HAM-CYCLE problem on G is
solvable.
Let the points of the metric be the vertices of the graph G. Define
d(.,.) as follows:
{ 0 if x=y
d(x,y) = { 1 if (x,y) in E
{ 2 otherwise
Let the bound B=|V|. If G has a hamiltonian cycle, then there is a
solution to the TSP problem. (This ham cycle gives a TST of length
n). If there is a TST of length |V|, then every distance between
successive points must be 1. Therefore these moves correspond to
edges of G.
QED.
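The reduction is simple enough to write down directly. A sketch (function name is mine; the distances 0/1/2 are exactly those in the proof):

```python
def ham_cycle_to_metric_tsp(vertices, edges):
    """Build the metric d and bound B from the NP-completeness proof:
    d(x,y) = 0 if x = y, 1 if (x,y) is an edge of G, 2 otherwise,
    and B = |V|.  The 1/2 distances satisfy the triangle inequality."""
    edge_set = {frozenset(e) for e in edges}

    def d(x, y):
        if x == y:
            return 0
        return 1 if frozenset((x, y)) in edge_set else 2

    return d, len(vertices)
```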
Now we'll give two approximation algorithms for METRIC-TSP.
Approximation Algorithm 1:
1) Compute a Minimum Spanning Tree of the set of points.
2) Traverse the tree using the "right hand rule".
3) remove extra visits to vertices.
Here's an example. Suppose the MST looks like this:
4 5
| |
1----------2----------3-------6------8
| |
7 9
Let Cmst be the cost of the minimum spanning tree. Compute the
following tour in step 2:
1 2 4 2 3 5 3 6 8 6 9 6 3 2 7 2 1
Let C2 be the cost of this tour. Now removing the vertices that have
already been visited we get the following tour:
1 2 4 3 5 6 8 9 7 1
Let C3 be the cost of this tour. Let Cs be the cost of the optimum
solution of the METRIC-TSP.
Here's what we know:
C2 = 2*Cmst   (the traversal uses each edge twice).
C3 <= C2      (by the triangle inequality, removing vertices
               from the tour only makes it shorter)
Cmst < Cs     (take any TST and remove any edge. The result
               is a spanning tree, whose cost is at least
               that of the MST.)
Combining these we get:
C3 <= 2*Cmst < 2*Cs
Therefore the length of the tour computed by this algorithm is at most
twice that of the optimum tour.
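Algorithm 1 can be sketched in a few lines: build the MST with Prim's algorithm, then do a preorder walk of the tree, which is exactly the doubled-edge traversal with repeat visits shortcut away (step 3). Names are mine; d(x,y) is the metric:

```python
def mst_tsp_approx(points, d):
    """2-approximation for metric TSP: MST + preorder walk."""
    points = list(points)
    # Prim's algorithm: repeatedly attach the cheapest outside vertex,
    # recording the tree as parent -> children lists.
    in_tree = {points[0]}
    children = {p: [] for p in points}
    while len(in_tree) < len(points):
        u, v = min(((u, v) for u in in_tree
                    for v in points if v not in in_tree),
                   key=lambda e: d(*e))
        children[u].append(v)
        in_tree.add(v)
    # Preorder DFS from the root; visiting each vertex once in this
    # order is the shortcut version of the doubled-edge tour.
    tour, stack = [], [points[0]]
    while stack:
        x = stack.pop()
        tour.append(x)
        stack.extend(reversed(children[x]))
    return tour
```

By the analysis above, the cycle through the vertices in this order costs at most 2*Cmst < 2*Cs.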
Approximation Algorithm 2:
1) Compute a Minimum Spanning Tree of the set of points.
2) Look at the set of odd degree vertices of the MST. Say
there are K of them. Construct a minimum cost matching
	   among these K points. (K must be even, since the sum of
	   all vertex degrees in any graph is even.)
3) Form the union of the edges of the MST and M, and construct an
Euler tour of this graph.
4) Remove duplicate visits to vertices.
Claim: this algorithm is within a factor of 3/2 of optimum.
Arguing as above, we know that the cost of the resulting tour is at
most Cmst + Cm (cost of the matching M). From above we know that
Cmst < Cs. It also turns out that Cm <= 1/2 Cs. Proving this will
complete the proof of the claim.
Here's why Cm <= 1/2 Cs. Take any TST among the K odd vertices we're
working with. The cost of this TST is at most the cost of the
original TST on the whole set of points. (Triangle inequality)
The edges of this TST on K points form a cycle of even length, so
taking alternate edges partitions them into two matchings. These two
matchings sum to the cost of the TST, so one of them must cost at
most 1/2 of the TST. And the minimum matching among these K points
costs at most this particular matching. This proves the claim.
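The new ingredient in Algorithm 2, relative to Algorithm 1, is the minimum-cost perfect matching on the K odd-degree vertices. Polynomial-time matching algorithms exist but are involved; here is a brute-force sketch (exponential in K, so only for illustration; names are mine):

```python
def min_cost_perfect_matching(odd_vertices, d):
    """Brute-force minimum-cost perfect matching on an even-size set.
    Returns (total cost, list of matched pairs)."""
    def best(remaining):
        if not remaining:
            return 0, []
        x = remaining[0]
        best_cost, best_pairs = float('inf'), None
        # Try matching x with each other remaining vertex.
        for i in range(1, len(remaining)):
            y = remaining[i]
            rest = remaining[1:i] + remaining[i + 1:]
            cost, pairs = best(rest)
            if d(x, y) + cost < best_cost:
                best_cost = d(x, y) + cost
                best_pairs = [(x, y)] + pairs
        return best_cost, best_pairs
    return best(tuple(odd_vertices))
```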
VERTEX COVER
============
- Given a graph G, looking for smallest set of vertices such that
every edge is incident to (touches) at least one vertex in the set.
- Example: +----+----+
| | |
+----+----+
- Can think of as: what is the fewest # of guards we need to place
in museum to cover all the corridors.
- This problem is NP-hard for general graphs. (You did this on the
quiz for the case G is a *tree*). But it turns out we can get
within a factor of 2.
Let's start with some strategies that *don't* work.
Strawman #1: Pick arbitrary vertex with at least one uncovered edge
incident, put into cover, and repeat. What would be a bad example?
[A: how about a star graph]
Strawman #2: how about picking the vertex that covers the *most*
uncovered edges. Turns out this doesn't work either. [make bipartite
graph where opt is size t, but this alg finds one of size t +
floor(t/2) + floor(t/3) + floor(t/4) + ... + 1. This is Theta(t log
t). Best examples are with t=6 or t=12.]
How to get factor of 2. Two algorithms:
Alg1:
	Pick an arbitrary edge. We know any VC must contain at least
	one of its endpoints, so let's take both. Then throw out all
	edges covered and repeat. Keep going until no uncovered edges
	are left. What we've found in the end is a matching (a set of
	edges no two of which share an endpoint) that's "maximal"
	(meaning that you can't add any more edges to it and keep it a
	matching). This means if we take all endpoints, we have a VC.
	So, if we picked k edges, our VC has size 2k, but any VC must
	have size at least k, since it needs at least one endpoint of
	each of these edges.
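Alg1 is a few lines of code. A sketch (function name is mine; iterating over the edges in any order and greedily taking both endpoints of each still-uncovered edge is exactly the maximal-matching argument above):

```python
def vertex_cover_2approx(edges):
    """Repeatedly pick an uncovered edge and add both endpoints.
    Returns a vertex cover of size at most twice the minimum."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:  # edge still uncovered
            cover.add(u)
            cover.add(v)
    return cover
```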
Here's another 2-approximation algorithm for Vertex Cover:
Alg2:
Step1: Solve a *fractional* version of the problem. Have a variable
x_i for each vertex. Constraint 0<= x_i <= 1. Each edge should be
"fractionally covered": for each edge (i,j) we want x_i+x_j >= 1.
Then our goal is to minimize sum_i x_i. We can solve this using
linear programming.
E.g., on the triangle graph the LP can set each x_i = 1/2 for a total
of 3/2, while the best integral cover has size 2. E.g., on the star
graph, setting x=1 at the center covers every edge.
Step2: now pick each vertex i such that x_i >= 1/2.
Claim 1: this is a VC. Why? [For each edge (i,j) we have
x_i + x_j >= 1, so at least one of x_i, x_j is >= 1/2 and gets picked.]
Claim 2: The size of this VC is at most twice the size of the
optimal VC. Why? Explanation in two parts: First, notice that
it's at most twice the value of the *fractional* solution we found
(because it's no worse than doubling and then rounding down).
Then, notice that the size of the optimal fractional solution is <=
the size of the optimal integral solution, i.e., the optimal VC,
because the optimal VC *is* a legal fractional solution.
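Alg2 can be sketched with an off-the-shelf LP solver. This assumes SciPy is available (its linprog minimizes c.x subject to A_ub @ x <= b_ub, so the constraint x_i + x_j >= 1 is written as -x_i - x_j <= -1); the function name is mine:

```python
import numpy as np
from scipy.optimize import linprog

def lp_rounding_vertex_cover(n, edges):
    """Solve the fractional relaxation of VC on vertices 0..n-1, then
    keep every vertex whose LP value is at least 1/2."""
    c = np.ones(n)                    # minimize sum_i x_i
    A = np.zeros((len(edges), n))     # one row per edge: -x_i - x_j <= -1
    for k, (i, j) in enumerate(edges):
        A[k, i] = A[k, j] = -1.0
    b = -np.ones(len(edges))
    res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, 1)] * n)
    return {i for i in range(n) if res.x[i] >= 0.5 - 1e-9}
```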
Interesting fact: nobody knows any algorithm with approximation
ratio 1.9. Best known is 2 - O((loglog n)/(log n)), which is 2 - o(1).
Current best hardness result: Hastad shows 7/6 is NP-hard.
Recently improved to 1.361 by Dinur and Safra. Beating 2-epsilon
has been related to some other open problems, but not known to be
NP-hard.
MAX-SAT: Given a CNF formula (like in SAT), try to maximize the
number of clauses satisfied.
To make things cleaner, let's assume we have reduced each clause [so,
(x or x or y) would become just (x or y), and (x or not(x)) would be
removed]
Let m = # of clauses in the formula, m_1 = # clauses of size 1, m_2 =
# clauses of size 2, etc. So, m = m_1 + m_2 + ...
Claim: There exists an assignment that satisfies at least
sum_k m_k(1 - 1/2^k) clauses. Furthermore, there is an efficient
algorithm to find one.
Corollary: if every clause has size 3 (this is sometimes called the
MAX-exactly-3-SAT problem), then we can satisfy at least m(7/8) clauses.
So, this is for sure a 7/8-approximation, since the optimal solution
satisfies at most m clauses.
Interesting fact: getting a 7/8 + epsilon approximation for any
constant epsilon (like .001) is NP-hard.
Proof of claim: Consider a random assignment to the variables. For a
clause of size k, the chance it is *not* satisfied is 1/2^k. So, the
expected # of clauses satisfied is sum_k m_k(1 - 1/2^k). Since a
random assignment satisfies this many in expectation, surely an
assignment that does this well must exist. Also, if we just keep
trying random assignments, it's not hard to show that whp we'll
pretty soon find one that's at least as good as this expectation.
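The randomized procedure is short. A sketch with a common clause encoding (a clause is a list of signed integers, so [1, -2] means (x_1 or not x_2); names and encoding are mine):

```python
import random

def count_satisfied(clauses, assignment):
    """assignment maps variable number -> True/False; a literal l is
    satisfied when assignment[abs(l)] agrees with l's sign."""
    return sum(any(assignment[abs(l)] == (l > 0) for l in clause)
               for clause in clauses)

def random_max_sat(clauses, variables, tries=200):
    """Keep drawing uniformly random assignments and return the best
    (count, assignment) pair seen.  Each size-k clause is satisfied
    with probability 1 - 1/2^k per try."""
    best = None
    for _ in range(tries):
        a = {v: random.random() < 0.5 for v in variables}
        s = count_satisfied(clauses, a)
        if best is None or s > best[0]:
            best = (s, a)
    return best
```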
How about a deterministic algorithm? Here's what we can do. Look at
x_1: for each of the two possible settings (0 or 1) we can calculate
the expected number of clauses satisfied if we were to go with that
setting, and then set the rest of the variables randomly. (It is just
the number of clauses already satisfied plus sum_k m_k(1-1/2^k),
where m_k is the number of clauses of size k in the ``formula to
go''.) Fix x_1 to the setting that gives us a larger expectation.
Now go on to x_2 and do the same thing, setting it to the value with
the highest expectation-to-go, and then x_3 and so on. The point is
that since we always pick the setting whose expectation-to-go is
larger, this expectation-to-go never decreases (since our current
expectation is the average of the ones we get by setting the next
variable to 0 or 1).
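The conditional-expectation argument above translates directly into code. A sketch (same hypothetical clause encoding as before: a clause is a list of signed variable numbers):

```python
def derandomized_max_sat(clauses, variables):
    """Method of conditional expectations: fix variables one at a time,
    always choosing the setting with the larger expectation-to-go."""
    def expected_satisfied(assignment):
        # Expected number of satisfied clauses if every still-unset
        # variable were assigned uniformly at random.
        total = 0.0
        for clause in clauses:
            unset, satisfied = 0, False
            for lit in clause:
                v = abs(lit)
                if v in assignment:
                    if assignment[v] == (lit > 0):
                        satisfied = True
                else:
                    unset += 1
            # Satisfied already: contributes 1.  Otherwise it is
            # satisfied with probability 1 - 1/2^unset.
            total += 1.0 if satisfied else 1.0 - 0.5 ** unset
        return total

    assignment = {}
    for v in variables:
        e_true = expected_satisfied({**assignment, v: True})
        e_false = expected_satisfied({**assignment, v: False})
        assignment[v] = e_true >= e_false
    return assignment
```

Since each choice keeps the expectation-to-go from decreasing, the final assignment satisfies at least sum_k m_k(1 - 1/2^k) clauses.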
This is called the ``conditional expectation'' method. The algorithm
itself is completely deterministic --- in fact we could rewrite
it to get rid of any hint of randomization by just viewing sum_k
m_k(1-1/2^k) as a way of weighting the clauses to favor the small
ones, but our motivation was based on the randomized method.
In general, the area of approximation algorithms and approximation
hardness is a major area of algorithms research. Occupies a good
fraction of major algorithms conferences.