15-854 Approximation and Online Algorithms                      01/24/00

* Recap                                       Handouts: Shmoys survey
* Set cover: greedy + rand rounding                     [PV] paper
* p-center problem                                      hwk1
========================================================================

Note: Readings are on the web page in case you find the lecture notes
confusing.

Recap
=====
Last time we looked at vertex cover as an example of an approximation
problem, and rent/buy and pursuer/evader as examples of online problems.
We saw that in the online setting there are several different reasonable
measures of performance.  Competitive ratio is an approximation measure:
you want to guarantee you never did much worse than the best possible in
hindsight.  These philosophical issues of what you want to optimize
don't arise so much in approximation algorithms, since given two
alternative strategies you can always just run both.  We'll get back to
these questions later.  For now, we're going to focus on approximation
algorithms.

Last time: vertex cover.  We got a factor-2 approximation using a simple
maximal matching, and also by simple LP rounding.  (Surprisingly, greedy
vertex selection doesn't work.)  Today: we start with a related problem
called the set-cover problem.

Set-Cover
=========
Given a domain X of n points, and m subsets S_1, S_2, ..., S_m of these
points.  Goal: find the fewest number of these subsets needed to cover
all the points.  E.g., ...

This comes up in lots of different settings, e.g., facility location
problems.  Say you are planning the locations of fire stations in a
city.  You want every house in the city to be reachable in < 5 minutes
from some fire station.  So, the points are the houses, and for each
possible fire-station location you construct the set of houses reachable
in < 5 minutes from that location.  Then the set-cover problem gives the
minimum number of fire stations needed.

Another example: what is the smallest number of features needed to
uniquely identify a given entry in a database?

Set-cover is also a generalization of vertex cover.  How would you write
a vertex cover problem this way?
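One way to answer that question (a minimal sketch; the function name is
hypothetical): make one point per edge, and for each vertex v a set S_v
containing the edges incident to v.  Then vertex covers of G correspond
exactly to set covers of this instance:

```python
def vertex_cover_as_set_cover(edges):
    """edges: list of pairs (u, v).  Returns (points, sets): the points
    are the edges of the graph, and sets[v] is the set of edges
    incident to v."""
    points = {frozenset(e) for e in edges}
    sets = {}
    for (u, v) in edges:
        e = frozenset((u, v))
        sets.setdefault(u, set()).add(e)
        sets.setdefault(v, set()).add(e)
    return points, sets

# Example: the path a - b - c.  The single set sets["b"] covers both
# points, matching the minimum vertex cover {b}.
points, sets = vertex_cover_as_set_cover([("a", "b"), ("b", "c")])
```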
Greedy Algorithm: Pick the set that covers the most points.  Throw out
all the points covered.  Repeat.

What's an example where this *doesn't* find the best solution?

Theorem: If the optimal solution uses k sets, the greedy algorithm finds
a solution with at most k(ln(n/k) + 1) sets.  So, it is an O(log n)
approximation -- or better if k is large.

Proof: Since the optimal solution uses k sets, there must be some set
that covers at least a 1/k fraction of the points.  Therefore, after the
first iteration of the algorithm, there are at most n(1 - 1/k) points
left.  After the second, there are at most n(1 - 1/k)^2 points left,
etc.  After k rounds, there are at most n(1 - 1/k)^k < n/e points left.
After k*ln(n/k) rounds, there are at most k points left.  So, there are
at most k iterations to go before we're done.

Here's another algorithm: first solve the fractional version --- we have
a variable x_i for each set S_i, which is allowed to be in the range
[0,1].  For each point p, we have the constraint that the sum of the
x_i's over sets S_i containing p must be at least 1.  Then solve.  We
get a solution whose total sum is some value k' (the value of the
optimal fractional solution), which is <= k (the value of the optimal
integral solution).

Unfortunately, unlike vertex cover, we can't just take all the sets
whose x_i is above some threshold, since a point might be covered by
lots of small fractions that happen to sum to 1.  But, one thing we
*can* do is treat the fractions as probabilities -- this is called
randomized rounding.  There are a couple of ways to do randomized
rounding; here's the one I think is cleanest.  First, divide all the
numbers by k' so that they now sum to 1.  Now, view this as a
probability distribution and make independent draws from it.  Let's fix
some point p.  The sum of the values on sets covering it is at least
1/k'.  So, after k' draws, the probability it *hasn't* been covered is
at most (1 - 1/k')^{k'} <= 1/e.  So, the expected number of points
still uncovered is <= n/e.
After k'*ln(n/k') draws, Pr(p is uncovered) <= k'/n, so the expected
number of points uncovered is <= k'.  Once we have <= k' points
remaining, we can just be greedy and cover the rest with <= k' more
sets.  So, the algorithm is: Draw k'*ln(n/k') sets from the
distribution.  If > ceil(k') points are uncovered, repeat.  Else, cover
the remaining points greedily.
==> The total number of sets is at most k'*ln(n/k') + ceil(k').

One detail to verify: we showed that the *expected* number of points
uncovered after the random draws is <= k'.  What we need is that the
event "at most ceil(k') points uncovered" actually happens with some
reasonable probability.  But, we're OK -- since the random variable
(# of points remaining) is nonnegative, the event we want has to occur
with probability >= 1/(k'+1).  (The worst case in terms of this
probability is that with probability 1/(k'+1) we have 0 points
remaining, and with probability k'/(k'+1) we have ceil(k')+1 points
remaining.)

So, this bound is in some sense better than the greedy algorithm bound,
since we replace k by k'.

Aside: the maximum ratio (OPT integral)/(OPT fractional) is often called
the "integrality gap".  We have shown here that the integrality gap is
at most O(log n), since we actually *find* an integral solution within
O(log n) of the optimal fractional one.

[Feige '96]: Approximating set-cover to a factor of (1-epsilon)*ln(n)
in polynomial time would put NP into "almost polynomial"
n^{O(log log n)} time.

p-center problem
================
Let's go back to the "firehouse location problem".  We talked about
fixing the radius and optimizing the number, but we can also turn it
around (fix the quantity, optimize the quality).  This is called the
p-center problem.

p-center problem: Given a graph G and a number p, find a set of p
"center" vertices (firehouses) such that balls of radius R from these
vertices cover all of G, for as small a value of R as possible.

Symmetric version: G is undirected.  Asymmetric version: G is directed.

Symmetric version: we can approximate to a factor of 2 (i.e., find p
points covering G with radius 2R).
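Before moving on, here is a minimal sketch of the two set-cover
algorithms above (function names are hypothetical).  It assumes the
fractional values x_i have already been computed by some LP solver (not
shown), assumes the instance is feasible, and skips the "repeat if too
many points remain uncovered" step for simplicity:

```python
import math
import random

def greedy_set_cover(points, sets):
    """Greedy: repeatedly pick the set covering the most uncovered
    points.  Assumes the points can actually be covered."""
    uncovered = set(points)
    chosen = []
    while uncovered:
        i = max(range(len(sets)), key=lambda j: len(sets[j] & uncovered))
        chosen.append(i)
        uncovered -= sets[i]
    return chosen

def randomized_rounding_cover(points, sets, x):
    """Randomized rounding: scale the fractional solution x to a
    probability distribution, draw about k'*ln(n/k') sets from it,
    then cover whatever remains greedily."""
    kprime = sum(x)                    # value of the fractional solution
    probs = [xi / kprime for xi in x]  # divide by k' so values sum to 1
    n = len(points)
    draws = max(1, math.ceil(kprime * math.log(max(n / kprime, math.e))))
    chosen = set(random.choices(range(len(sets)), weights=probs, k=draws))
    uncovered = set(points) - set().union(*(sets[i] for i in chosen))
    for i in greedy_set_cover(uncovered, sets):  # finish greedily
        chosen.add(i)
    return chosen
```

For example, on points {1,...,5} with sets [{1,2,3}, {3,4}, {4,5}], the
greedy algorithm picks sets 0 and 2.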
Algorithm: Let's say the optimal R is known (there are at most n^2
meaningful values of R, so we can just try them all).  Now, repeat the
following: Pick an arbitrary uncovered point x.  Mark everything within
radius 2R of x as "covered".

Analysis: Say the first point we pick is x1.  We know one of the true
centers y1 covers x1 within radius R.  So, going out 2R from x1 covers
everything that y1 covers in radius R.  Then we pick x2 --- again, if y2
is the true center covering x2, then going out 2R from x2 covers
everything that y2 covers in radius R.  But let's be careful: if y2 = y1
then we're in trouble.  How do we know y2 != y1?  Because everything y1
covers within radius R was already covered when we went out 2R from x1,
and x2 was still uncovered, so y1 can't be the center that covers x2.
Similarly, y_i is different from all y_j's for j < i, so after at most p
iterations everything is covered.

Asymmetric version: the above argument breaks down, since "x covers y
within R" no longer implies "y covers x within R".  The best known
result here is much weaker: the problem can be approximated to a factor
of O(log*(n)) [PV].  The analysis is based on a recursion of the form
x -> ln(1+x): note that for x >= 4, ln(1+x) < lg(x), so the recursion
drops quickly.  Finally, when we get down to x=4, ln(1+4) = 1.609, and
ln(1 + 1.609) = 0.95 < 1.

Proof of stronger theorem [sketch]: Suppose we could show that whenever
there are p points that R-cover G, there must also be p/2 points that
5R-cover G.  Then we'd be done.  This isn't always true, but almost.
What we'll first do is see if the graph has any vertex v such that, for
radius R, v covers everything that covers it.  Then, we can pick v and
look out distance 2R as in the symmetric case.  Suppose we've done that
as much as we can, and there are no longer any such vertices in the
remaining graph.  The claim is that what we want holds for the graph
where you actually remove everything within radius 5R of the vertices we
have picked.

Some messy details, but here is the basic idea: if A is one of the
optimal centers of points in the remaining graph, then there must be
some node x that covers A, but that A doesn't cover.  This means some
center B != A covers x.  Again, there is some y that covers B but that B
doesn't cover, so some center C != B covers y.  Eventually we get into a
cycle or into the "already covered" region.  If we get into a cycle,
then we can take every other center (with one "every other other" if the
cycle has an odd number of centers, which is what gives us 5R).  The
case of going into the already-covered region is similar...
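The symmetric 2-approximation above can be sketched as follows (a
minimal sketch; the function name is hypothetical, and dist is assumed
to hold precomputed shortest-path distances).  For a guessed radius R,
it repeatedly grabs an arbitrary uncovered point and marks everything
within 2R of it as covered:

```python
def two_approx_centers(dist, R):
    """dist[u][v]: shortest-path distance from u to v (symmetric).
    Returns centers that 2R-cover every vertex; by the analysis above,
    if p centers can R-cover the graph, at most p are returned."""
    uncovered = set(dist)
    centers = []
    while uncovered:
        x = uncovered.pop()  # an arbitrary uncovered point
        centers.append(x)
        # mark everything within radius 2R of x as covered
        uncovered = {v for v in uncovered if dist[x][v] > 2 * R}
    return centers
```

Since the optimal R isn't known, one would try each of the O(n^2)
pairwise distances as R and keep the smallest value for which at most p
centers come back.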
Open problem: can you get constant for asymmetric p-center? For natural notion of "fractional version" it appears possible to create an instance with integrality gap of log^*(n), which suggests this might be tricky.
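As a quick numeric sanity check of the recursion x -> ln(1+x) from the
asymmetric analysis: iterating it from even a huge starting value drops
below 1 in a handful of steps, which is what the log* bounds reflect.

```python
import math

x, steps = 10.0**9, 0
while x >= 1:
    x = math.log(1 + x)  # one round of the recursion
    steps += 1
# Starting from 10^9, only a few iterations are needed, mirroring the
# very slow growth of log*(n).
```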