15-854 Approximation and Online Algorithms	01/24/00

* Recap					       Handouts: Shmoys survey
* Set cover: greedy + rand rounding                      [PV] paper
* p-center problem					 hwk1
========================================================================
Note: Readings on web page in case you find lecture notes confusing.  

Recap
=====
 Last time looked at vertex cover as example of approximation problem,
rent/buy and pursuer/evader as examples of online problems.  We saw
that in the online setting there are different reasonable measures of
performance.  Competitive ratio is an approximation measure: want to
guarantee you never did much worse than best possible in hindsight.
These philosophical issues of what do you want to optimize don't arise
so much in approx algs since given two alternative strategies, you can
always just do both.  We'll get back to these questions later.  For
now, going to focus on approximation algorithms.

Last time: vertex cover.  Get factor of 2 approx using simple maximal
matching, also simple LP rounding.  (Surprisingly, greedy vertex
selection doesn't work).

Today: start with related problem called set-cover problem.

Set-Cover
=========
Given a domain X of n points, and m subsets S_1, S_2, ..., S_m of
these points.  Goal: find the fewest number of these subsets needed to
cover all the points.  E.g.,...

Comes up in lots of different settings.

E.g., facility location problems.  Say you are planning the locations
of fire-stations in a city.  You want that every house in the city can
be reached in < 5 min from some firestation.  So, points are the
houses, and then for each possible fire-station location, you
construct the set of houses reachable in < 5 minutes from that
location.  Then set-cover problem = minimum number of firestations
needed.   Another example: what is the smallest number of features
needed to uniquely identify a given entry in a database.

Set-cover is also a generalization of vertex cover.  How would you write a
vertex cover problem this way?

Greedy Algorithm: Pick the set that covers the most points.  Throw out
all the points covered.  Repeat.

What's an example where this *doesn't* find the best solution?

Theorem: If the optimal solution uses k sets, the greedy algorithm
finds a solution with at most k(ln(n/k) + 1) sets. So, it is an 
O(log n) approximation -- or better if k is large.

Proof: Since the optimal solution uses k sets, there must some set
that covers at least a 1/k fraction of the points. Therefore, after
the first iteration of the algorithm, there are at most n(1 - 1/k)
points left.  After the second, there are at most n(1 - 1/k)^2 points
left, etc.  After k rounds, there are at most n(1 - 1/k)^k < n*(1/e)
points left.  After k*ln(n/k) rounds there are at most k points left.
So there are at most k iterations to go before we're done.

Here's another algorithm: first solve the fractional version --- have
variable x_i for each set S_i which is allowed to be in range
[0,1]. For each point p, have constraint that sum of x_i's such that
S_i contains p must be at least 1.  Then solve.  Get a solution whose
total sum is some value k' (value of the optimal fractional solution)
which is <= k (value of the optimal integral solution).
Unfortunately, unlike vertex-cover, we can't just take all values > x,
since some point might be covered by lots of small fractions that
happen to sum to 1.  But, one thing we *can* do is treat fractions as
probabilities -- this is called randomized rounding.

There are a couple ways to do randomized rounding.  Here's the one I
think is cleanest.  First, divide all the numbers by k' so
that they now all sum to 1.  Now, view this as a probability
distribution and make independent draws from it.   

Let's fix some point p.  Sum of values on sets covering it is at
least 1/k'.  So, after k' draws, the probability it *hasn't* been
covered is at most (1 - 1/k')^k' <= 1/e.  So, expected number of
points still uncovered is <= n/e.  After k'*ln(n/k') draws, 
Pr(p is uncovered) <= k'/n, so expected number of points uncovered is
<= k'.  Once we have <= k' points remaining, can just be greedy and
cover the rest with <= k' more sets.  So, the algorithm is:

	Draw k'*ln(n/k') sets from the distribution.  If > ceiling{k'} points
	uncovered, repeat.  Else, cover remaining greedily.

	==> Total number of sets is at most k'*ln(n/k') + ceiling{k'}.

One detail to verify: we showed that the *expected* number of points
uncovered after the random draws is <= ceil{k'}.  What we need is that this
actually happens with some reasonable probability.  But, we're OK --
since the random variable (# points remaining) is nonnegative, the
event we want has to occur with prob >= 1/(k'+1).  (Worst case in
terms of the probability is that with prob 1/(k'+1) we have 0 points
remaining, and with prob k'/(k'+1) we have ceil{k'}+1 points remaining). 

So, this bound is in some sense better than the greedy algorithm
bound, since we replace k by k'.  

Aside: the maximum ratio (OPT integral)/(OPT fractional) is often
called the "integrality gap".  We have shown here that the integrality
gap is at most log(n) since we actually *find* an integral solution
within log(n) of the optimal fractional.

[Feige '96]: Approximating set-cover to (1-epsilon)*ln(n) in poly time
would put NP into "almost polynomial" n^{O(log log n)} time.

p-center problem
================
Let's go back to the "firehouse location problem".  We talked about
fixing the radius and optimizing the number, but can also turn it
around (fix quantity, optimize quality).  This is called the p-center
problem.

p-center problem: Given graph G and number p, find set of p "center"
vertices (firehouses) such that sets of radius R from these vertices
cover all of G, for as small a value of R as possible.

Symmetric version: G is undirected.  Asymmetric version: G is directed.

Symmetric version: can approximate to factor of 2.  (ie., find p
points covering G with radius 2R).

Algorithm:  Let's say optimal R is known (there are at most n^2
meaningful values of R so can just try them all).  Now, repeat the
following: 

	Pick an arbitrary uncovered point x.
	Mark everything within radius 2R of x as "covered".

Analysis: Say the first point we pick is x1.  We know one of the true
centers y1 covers x1 within radius R.  So, going 2R from x1 covers
everything that y1 covers in radius R.  Then we pick x2 --- again, if
y2 is the true center covering x2, then going out 2R from x2 covers
everything that y2 covers in radius R.  But let's be careful: if y2 =
y1 then we're in trouble.  How do we know y2 != y1?  Similarly y_i is
different from all yj's for j<i.  Therefore, we stop after at most p
rounds.


Asymmetric case is not so easy.  In fact, the best result known is
pretty neat:

Theorem: can approximate asymmetric p-center problem to factor of log^*(n).

[log^*(n) = number of logs until get down to 1. E.g., if base 2, then
lg*(4) = 2, lg*(16) = 3, lg*(65536) = 4, lg*(2^65536) = 5]

Weaker Theorem: there is an algorithm that finds 2p points that covers
G in radius O(R log^*(n)).

Proof of weaker theorem:

Step 1: Run greedy set cover to cover all of G in radius R, using 
p1 <= p + p*ln(n/p) points.

Step 2: Now, let's again run greedy set cover, but just trying to
cover *those* p1 points in radius R.  This will cover *all* of G in
radius 2R.  Number of points used is 
	p2 <= p + p*ln(p1/p) <= p + p*ln(1 + ln(n/p))

Let's repeat.  In next step, we cover all of G in radius 3R with p3 points.
	p3 <= p + p*ln(p2/p) <= p + p*ln(1 + ln(1 + ln(n/p)))

If we ignore the "1+" part, then can see that this will get down to 2p
in at most ln^*(n/p) iterations.  The "1+" just makes it a little more
annoying.  Can use the fact that for x >= 4, ln(1+x) < lg(x).
Finally, when get to x=4, ln(1+4) = 1.609, and ln(1 + 1.609) = 0.95.

Proof of stronger theorem [sketch]:

Suppose we could show that whenever there are p points that R-cover G,
there must also be p/2 points that 5R-cover G.  Then we'd be done.
This isn't always true but almost.  What we'll first do is see if the
graph has any vertex v such that for radius R, v covers everything
that covers it.  Then, we can pick v and look out distance 2R as in
the symmetric case.  Suppose we've done that as much as we can and
there are no longer any such vertices in the remaining graph.  The
claim is that what we want holds for the graph where you actually
remove everything within radius 5R from the vertices we have picked.
Some messy details, but here is the basic idea: if A is one of the
optimal centers of points in the remaining graph, then there must be
some node x that covers A, but A doesn't cover x.  This means some
center B != A covers x.  Again there is some y that covers B but B
doesn't cover y so some center C != B covers y.  Eventually we get
into a cycle or into the "already covered" region.  If we get into a
cycle, then can take every other (with one "every other other" if the
cycle has an odd number of centers, which gives us 5R).  Case of going
into already-covered region is similar...

Open problem: can you get constant for asymmetric p-center?  For
natural notion of "fractional version" it appears possible to create
an instance with integrality gap of log^*(n), which suggests this
might be tricky.