15-451 Algorithms
Spring 2004
April 13, 2006
Danny Sleator
Randomized competitiveness and paging.
Summary:
The paging problem
Server problems
Persuer/Evader or Cat/Mouse games
----------------------------------------------------------
Paging
======
This is a classic problem in computer systems. There are N pages of
data. There is room for k of the pages to be in fast memory. (k= 0
For all u,v,w in S d(u,v) + d(v,w) >= d(u,w)
The 3rd condition is called the triangle inequality.
Here's the k-server problem:
There is a metric space S of at least k+1 points. There are k
servers that occupy k distinct points in S. There is a sequence of
requests for points in the metric space, sigma = sigma_1, sigma_2,
...
In response to a request to a point x, a server must be moved to x
(if there isn't already there). The cost of the move is simply the
distance between the starting point and the ending point of the
server that moved.
An example of a 2-server problem is a two headed disk.
+--------------------------------+
0 ^ ^ 1
| |
head 1 head 2
The metric space is the real interval [0,1]. Requests come in to
locations on the line (requests to read or write a file there).
the system has to decide which head to move to satisfy the request.
Another example is the paging problem. Here the metric space is the
one consisting of N points (one for each page), and the distance
function is simply
/ 0 if u=v
d(u,v) = | 1 otherwise
\
The points of the metric space correspond to pages of the address
space, and the points occupied by the k servers correspond to the
pages that are in fast memory.
The k-server problem has been studied extensively. To avoid trivial
Here are some things that are known about it:
Theorem: The competitive factor is at least k for any k-server
problem.
Theorem: There is a 2-competitive algorithm for the 2-server problem.
Theorem: There is a 2k-competitive algorithm for the k-server problem.
Conjecture: There exists a k-competitive algorithm for the k-server
problem.
---------------------------------------------------------------------
Back to paging.
How bad can LRU be from the competitive point of view. We can see that
it will be at least k as follows. Consider the sequence
sigma = 1, 2, 3, 4 ... N, 1, 2, 3 ... N, 1 ....
Say that LRU and the adversary each start out with 1....N-1 in fast
memory. Now after the first N-1 requests, each request for LRU will
cause a page fault. On the other hand, LFD will only get 1 page fault
for each block of N-1 consecutive requests in this sequence -- it just
evicts the page used N-1 requests in the future.
In fact, from the lower bound theorem above, we know that ANY
deterministic on-line algorithm for the paging problem will have a
competitive factor of at least k. (Actually it is not hard to prove
this directly for paging.)
So a competitive factor of k is not that useful. So we're going to
change the model a little bit. We're going to consider randomized
algorithms.
Definition: A randomized algorithm A is c-competitive if there exists
a constant a such that for all sequences sigma, and for all algorithms
B (possibly off-line) we have
E[C (sigma)] <= c C (sigma) + a
A B
Here the expectation E[] is taken over all the random choices of the
algorithm A. The adversary still gets to choose sigma and B.
It turns out that this model allows us to devise a paging algorithm
with a much better competitive factor. In fact the competitive factor
is O(log N).
Before we do this we'll convert the problem to yet another form....
Pursuer/evader game
===================
Here's something called the pursuer/evader game. (Or cat and mouse
game). This is a special case of paging problem and the k-server
problem.
You're a mouse in one of N hiding places. At each time step, cat
comes to one of the places. If it's the one you're hiding in, you
need to move. Say your cost is the number of times you've moved. OPT
is the fewest number of times you could have moved in hindsight.
E.g., say n=4 and you start in location 1. What is best in hindsight
if sequence is (2,1,3,4,2,1,3)?
Note that this problem is exactly the same as the paging problem where
k=N-1. The mouse's location is just the specific page that IS NOT in
fast memory.
QUESTION: Is there a randomized strategy for us, such that for any
sequence of moves of the cat, E[(our cost)]/OPT is "small"?
(Note: for any deterministic algorithm, there exists a sequence for
the cat that causes us to move every time. That's why we're looking
at randomized algs).
What's the simplest randomized algorithm for this problem? How about
this:
RAND: Mouse starts in a random place. Each time the mouse is hit by
the cat, the mouse moves to a random other place.
It turns out that this algorithm that is NOT so good. What is a
strategy for the *cat* that causes our expected cost to be a lot more
than our optimal cost in hindsight?
Answer: cat visits points 1,2,...,N-1 repeatedly. We should have
moved to point N at the start for a cost of 1. But, we expect to move
an N-1 times. Here's why. The probability that we're on N at the
start is 1/N and our cost is 0 in that case. The probability that
we're not on N is (N-1)/N. In this case what's the expected number of
times we get hit before we land on N? It's the expected number of
times we have to roll an N-1 sided die until we get a specific number.
This is just N-1. Putting this together we get:
1 N-1 1
- 0 + --- (N-1) = N-2+ -
N N N
So this algorithm is Theta(N) competitive, which is not what we're
looking for. (Actually a more careful lower bound shows that its
competitive factor is N-1.)
Here's a lower bound showing that *no* algorithm can get a ratio
better than log(n):
Claim 1: no algorithm can get o(log N)
Proof: What if cat probes randomly. Then, no matter what mouse does,
mouse has 1/n prob of being forced to move. So, in t time steps,
expected cost to mouse is t/N. But, how long does it take cat to hit
every point? (This is called the coupon collector's problem. The
anser is N/N+ N/(N-1) + ... + N/1 ~= N*log(N). So, in hindsight, OPT
was to move to that last point and only pay once every N*log(N)
probes. So, ratio is O(log N).
In fact, we can get O(log N) with the following online algorithm
called "Marking".
* start at a random place
* when a place is probed, mark it.
* when your spot gets probed, move to random unmarked spot
* if you're at last unmarked spot, clear marks and restart.
Claim 2: Randomized competitive ratio for Marking is O(log N)
Proof: We divide the analysis into phases. The last probe of a phase is
the one that causes all of the marks to be cleared. Note that the set
of probes in a phase must hit every spot.
Initially our prob dist is 1/N at each point. We're going to consider
the probes that the cat makes to unmarked points. (Probes to marked
points do not contribute to the completion of the phase, nor do they
cause the mouse to incur any cost.) When cat probes the first point, we
have 1/N prob of being hit. After this probe, our distrib is 1/(N-1) at
each unmarked point. After next probe to unmarked point we have 1/(N-1)
prob of being hit and our distrib is 1/(N-2) at each unmarked. And so
on. So, total expected number of moves per phase is 1/N + 1/(N-1) +
... + 1/1 = H_N = O(log N). But, any algorithm (even in hindsight) must
move at least once per phase since every point got hit. So, ratio is
O(log n).
Note 1: Actually, there's a H_{N-1} competitive algorithm for the cat
and mouse game. And this is the best possible. It deviates just
slightly from the one described above.
--------------------------------------------------------------------
Back to paging again
How do we interpret the marking algorithm for the cat/moust game
in the world of paging? You have to change things a little bit,
because there are basically N-k mice.
Here's the marking algorithm for paging:
start with all pages in fast memory as unmarked. When a request is
made, mark the page and if it is a page fault, bring it in and throw
out a *random* unmarked page. When all pages get marked, clear the
marks and start over on the next page fault. Like a randomized 1-bit
version of LRU, where just keeping track of "was it recently used or
not?".
The actual competitive ratio you can prove for this problem is 2H_k.
There's a much more complex algorithm that achieves the optimal
competitive ratio of H_k.
I don't know if this algorithm has been tested on data from real
systems. It would be interesting to see how well it worked.