15-451 Algorithms, Spring 2004                        April 13, 2006
Danny Sleator

Randomized competitiveness and paging.

Summary:
    The paging problem
    Server problems
    Pursuer/Evader or Cat/Mouse games

----------------------------------------------------------

Paging
======

This is a classic problem in computer systems.  There are N pages of
data.  There is room for k of the pages to be in fast memory (k < N).
A sequence of requests for pages arrives.  A request for a page that
is in fast memory costs nothing.  A request for a page that is not in
fast memory is a page fault: the page must be brought into fast
memory, and some other page must be evicted to make room for it.  The
cost of an algorithm on a request sequence is the number of page
faults it incurs.

A standard on-line algorithm for this problem is LRU (Least Recently
Used): on a fault, evict the page whose most recent request is
furthest in the past.

The optimal off-line algorithm is LFD (Longest Forward Distance): on
a fault, evict the page whose next request is furthest in the future.

Server problems
===============

A metric space is a set of points S along with a distance function d
satisfying:

    d(u,v) = d(v,u)                      for all u,v in S
    d(u,v) >= 0, with d(u,v) = 0 iff u=v for all u,v in S
    d(u,v) + d(v,w) >= d(u,w)            for all u,v,w in S

The 3rd condition is called the triangle inequality.

Here's the k-server problem: There is a metric space S of at least
k+1 points.  There are k servers that occupy k distinct points in S.
There is a sequence of requests for points in the metric space,

    sigma = sigma_1, sigma_2, ...

In response to a request to a point x, a server must be moved to x
(if there isn't one there already).  The cost of the move is simply
the distance between the starting point and the ending point of the
server that moved.

An example of a 2-server problem is a two-headed disk:

    +--------------------------------+
    0      ^                 ^       1
           |                 |
         head 1            head 2

The metric space is the real interval [0,1].  Requests come in to
locations on the line (requests to read or write a file there), and
the system has to decide which head to move to satisfy each request.

Another example is the paging problem.  Here the metric space
consists of N points (one for each page), and the distance function
is simply

             / 0  if u = v
    d(u,v) = |
             \ 1  otherwise

The points of the metric space correspond to pages of the address
space, and the points occupied by the k servers correspond to the
pages that are in fast memory.

The k-server problem has been studied extensively.  Here are some
things that are known about it:

Theorem: The competitive factor of any deterministic on-line
algorithm for the k-server problem is at least k.

Theorem: There is a 2-competitive algorithm for the 2-server problem.

Theorem: There is a 2k-competitive algorithm for the k-server
problem.

Conjecture: There exists a k-competitive algorithm for the k-server
problem.

---------------------------------------------------------------------

Back to paging.  How bad can LRU be from the competitive point of
view?  We can see that its competitive factor is at least k, as
follows.  Let k = N-1 and consider the sequence

    sigma = 1, 2, 3, 4, ..., N, 1, 2, 3, ..., N, 1, ...

Say that LRU and the adversary each start out with pages 1,...,N-1 in
fast memory.  After the first N-1 requests, every request causes LRU
to take a page fault.  On the other hand, LFD only gets 1 page fault
for each block of N-1 consecutive requests in this sequence -- on a
fault it just evicts the page whose next use is N-1 requests in the
future.

In fact, from the lower bound theorem above, we know that ANY
deterministic on-line algorithm for the paging problem has a
competitive factor of at least k.  (Actually it is not hard to prove
this directly for paging.)

So a competitive factor of k is not that useful.  We're going to
change the model a little bit: we're going to consider randomized
algorithms.

Definition: A randomized algorithm A is c-competitive if there exists
a constant a such that for all sequences sigma, and for all
algorithms B (possibly off-line), we have

    E[C_A(sigma)]  <=  c * C_B(sigma) + a

Here the expectation E[] is taken over all the random choices of the
algorithm A.  The adversary still gets to choose sigma and B.

It turns out that this model allows us to devise a paging algorithm
with a much better competitive factor.  In fact the competitive
factor drops to O(log k).  Before we do this we'll convert the
problem to yet another form....
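As an aside, the LRU vs. LFD gap above is easy to check empirically.
Here's a short Python sketch (my own illustration, not part of these
notes; the names lru_faults and lfd_faults are made up).  On the
cyclic sequence with k = N-1, LRU faults on every request while LFD
faults on roughly 1 in N-1 of them, so the ratio approaches k as the
sequence grows.

    def lru_faults(requests, k):
        """LRU: on a fault, evict the least recently used page."""
        cache, last_used, faults = set(), {}, 0
        for t, page in enumerate(requests):
            if page not in cache:
                faults += 1
                if len(cache) == k:
                    # Evict the page whose last use is furthest in the past.
                    cache.remove(min(cache, key=lambda p: last_used[p]))
                cache.add(page)
            last_used[page] = t
        return faults

    def lfd_faults(requests, k):
        """LFD: on a fault, evict the page requested furthest in the future."""
        cache, faults = set(), 0
        for t, page in enumerate(requests):
            if page not in cache:
                faults += 1
                if len(cache) == k:
                    def next_use(p):
                        for s in range(t + 1, len(requests)):
                            if requests[s] == p:
                                return s
                        return float('inf')   # never requested again
                    cache.remove(max(cache, key=next_use))
                cache.add(page)
        return faults

    N = 8
    seq = [t % N + 1 for t in range(20 * N)]   # 1,2,...,N repeated
    print(lru_faults(seq, N - 1), lfd_faults(seq, N - 1))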
Pursuer/evader game
===================

Here's something called the pursuer/evader game (or cat and mouse
game).  This is a special case of the paging problem and of the
k-server problem.

You're a mouse hiding in one of N hiding places.  At each time step,
the cat comes to one of the places.  If it's the one you're hiding
in, you have to move.  Your cost is the number of times you move.
OPT is the fewest number of times you could have moved in hindsight.

E.g., say N=4 and you start in location 1.  What is best in hindsight
if the sequence is (2,1,3,4,2,1,3)?  (Answer: 2 moves.  Stay put
until the 2nd request hits you, then move to 2; when the 5th request
hits you, move to 4, which is never probed again.  One move is not
enough, since every location is probed after the 2nd request.)

Note that this problem is exactly the same as the paging problem
where k = N-1.  The mouse's location is just the one page that IS NOT
in fast memory.

QUESTION: Is there a randomized strategy for us, such that for any
sequence of moves of the cat, E[our cost]/OPT is "small"?  (Note: for
any deterministic algorithm, there exists a sequence for the cat that
causes us to move every time.  That's why we're looking at randomized
algorithms.)

What's the simplest randomized algorithm for this problem?  How about
this:

RAND: The mouse starts in a random place.  Each time the mouse is hit
by the cat, it moves to a random other place.

It turns out that this algorithm is NOT so good.  What is a strategy
for the *cat* that causes our expected cost to be a lot more than our
optimal cost in hindsight?

Answer: the cat visits points 1,2,...,N-1 repeatedly.  We should have
moved to point N at the start, for a cost of 1.  But we expect to
move about N-1 times.  Here's why.  The probability that we're on N
at the start is 1/N, and our cost is 0 in that case.  The probability
that we're not on N is (N-1)/N.  In this case, what's the expected
number of times we get hit before we land on N?  It's the expected
number of times we have to roll an (N-1)-sided die until we get a
specific number, which is just N-1.  Putting this together, our
expected cost is

    (1/N) * 0  +  ((N-1)/N) * (N-1)  =  N - 2 + 1/N

So this algorithm is Theta(N)-competitive, which is not what we're
looking for.  (Actually, a more careful lower bound shows that its
competitive factor is N-1.)

Here's a lower bound showing that *no* algorithm can get a ratio
better than log(N):

Claim 1: No algorithm can be o(log N)-competitive.

Proof: Suppose the cat probes randomly.  Then no matter what the
mouse does, on each probe it has a 1/N chance of being forced to
move.  So in t time steps, the expected cost to the mouse is t/N.
But how long does it take the cat to hit every point?  (This is the
coupon collector's problem.  The answer is

    N/N + N/(N-1) + ... + N/1  =  N * H_N  ~=  N log N.)

So, in hindsight, OPT was to move to the last point hit and pay only
once every ~N log N probes.  The ratio is therefore about
(t/N) / (t/(N log N)) = log N, so no algorithm can do better than
Omega(log N).

In fact, we can get O(log N) with the following online algorithm,
called "Marking":

  * Start at a random place.
  * When a place is probed, mark it.
  * When your spot gets probed, move to a random unmarked spot.
  * If you're at the last unmarked spot when it's probed, clear all
    the marks and restart.

Claim 2: The randomized competitive ratio of Marking is O(log N).

Proof: We divide the analysis into phases.  The last probe of a phase
is the one that causes all of the marks to be cleared.  Note that the
set of probes in a phase must hit every spot.

Initially our probability distribution is 1/N at each point.  We're
going to consider only the probes that the cat makes to unmarked
points.  (Probes to marked points do not contribute to the completion
of the phase, nor do they cause the mouse to incur any cost.)  When
the cat probes the first point, we have a 1/N probability of being
hit.  After this probe, our distribution is 1/(N-1) at each unmarked
point.  After the next probe to an unmarked point, we have a 1/(N-1)
probability of being hit, and our distribution is 1/(N-2) at each
unmarked point.  And so on.  So the total expected number of moves
per phase is

    1/N + 1/(N-1) + ... + 1/1  =  H_N  =  O(log N)

But any algorithm (even in hindsight) must move at least once per
phase, since every point got hit.  So the ratio is O(log N).
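The contrast between RAND and Marking is easy to see empirically.
Here's a small Python simulation (my own sketch, not from these
notes; rand_mouse and marking_mouse are made-up names, and the
"restart" step below is one reasonable reading of the rule above).
It runs both mice against the cat that is bad for RAND: cycle through
spots 0,...,N-2 and never probe spot N-1, so OPT is 1 (move to spot
N-1 at the start).

    import random

    def rand_mouse(probes, N):
        """RAND: whenever hit, move to a uniformly random other spot."""
        spot, moves = random.randrange(N), 0
        for p in probes:
            if p == spot:
                moves += 1
                spot = random.choice([x for x in range(N) if x != p])
        return moves

    def marking_mouse(probes, N):
        """Marking: mark probed spots; when hit, move to a random
        unmarked spot; if none remain, clear marks and restart."""
        spot, moves, marked = random.randrange(N), 0, set()
        for p in probes:
            marked.add(p)
            if p == spot:
                moves += 1
                unmarked = [x for x in range(N) if x not in marked]
                if not unmarked:                  # phase ends here
                    marked = set()
                    unmarked = [x for x in range(N) if x != p]
                spot = random.choice(unmarked)
        return moves

    N, trials = 16, 200
    probes = [t % (N - 1) for t in range(2000)]   # spot N-1 never probed
    print(sum(rand_mouse(probes, N) for _ in range(trials)) / trials)
    print(sum(marking_mouse(probes, N) for _ in range(trials)) / trials)

The first average should come out near N-2 (here about 14), while the
second stays at a small O(log N) value, matching the analysis above.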
Note 1: Actually, there's an H_{N-1}-competitive algorithm for the
cat and mouse game, and this is the best possible.  It deviates just
slightly from the one described above.

--------------------------------------------------------------------

Back to paging again.  How do we interpret the marking algorithm for
the cat/mouse game in the world of paging?  You have to change things
a little bit, because there are now effectively N-k mice (one for
each page that is not in fast memory).  Here's the marking algorithm
for paging:

  * Start with all pages in fast memory unmarked.
  * When a page is requested, mark it.  If the request is a page
    fault, bring the page in and evict a *random* unmarked page.
  * When all the pages in fast memory are marked, clear the marks and
    start over on the next page fault.

It's like a randomized 1-bit version of LRU, where we keep track of
just "was this page recently used or not?".

The competitive ratio you can prove for this algorithm is 2H_k.
There's a much more complex algorithm that achieves the optimal
competitive ratio of H_k.  I don't know if this algorithm has been
tested on data from real systems.  It would be interesting to see how
well it works.
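To make the paging version concrete, here's a minimal Python sketch
of the marking algorithm described above (my own code, not from these
notes; marking_paging is a made-up name, and the request stream in
the usage line is arbitrary).

    import random

    def marking_paging(requests, k):
        """Marking for paging: mark each requested page; on a fault,
        evict a random unmarked page; when everything in fast memory
        is marked, clear the marks and start a new phase."""
        cache, marked, faults = set(), set(), 0
        for page in requests:
            if page not in cache:
                faults += 1
                if len(cache) == k:
                    unmarked = list(cache - marked)
                    if not unmarked:        # all marked: new phase begins
                        marked = set()
                        unmarked = list(cache)
                    cache.remove(random.choice(unmarked))
                cache.add(page)
            marked.add(page)
        return faults

    random.seed(451)
    reqs = [random.randrange(20) for _ in range(10000)]
    print(marking_paging(reqs, 5))

Note that a marked page is never evicted within a phase, so the
marked set is always a subset of the pages in fast memory -- this is
exactly the 1-bit "recently used" information mentioned above.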