15-451  Algorithms
Spring 2013
April 23, 2013
Danny Sleator

Randomized competitiveness and paging.

Summary:
  The paging problem
  Server problems
  Persuer/Evader or Cat/Mouse games

----------------------------------------------------------

Paging
======

This is a classic problem in computer systems.  There are N pages of
data.  There is room for k of the pages to be in fast memory.  (k<N).
There is a sequence of requests for pages sigma = sigma_1, sigma_2, ... sigma_m

Initially some set of pages is in fast memory.  The algorithm must
ensure that each requested page be made to be in fast memory.  The
algorithm can move pages in and out of fast memory at will.  Each time
a page is moved into fast memory a cost of 1 is incurred.  When a
request is processed if the requested page is already in fast memory,
nothing needs to happen.  If it's not in fast memory, then it must be
moved into fast memory, and a page that is in fast memory must be
"evicted" to make room for it.

Here are some basic paging algorithms:

  LRU = Least Recently Used
        Evict the page that has not been used in the longest time.
        This is a deterministic on-line algorithm


  LFD = Longest Forward Distance
        Looking at the future requests, evict the page that is not
        needed until longest into the future.
        This is a deterministic off-line algorithm.

In our definition of algorithms for the paging problem, we allowed an
algorithm to move a page into fast memory (and evict a page) before
the page was requested.  This is called a non-lazy algorithm.  (A lazy
one is one which moves a page into fast memory only immediately after
that page has been requested.)

Theorem: A non-lazy algorithm for the paging problem can be converted
into a lazy algorithm that costs no more.

Theorem: LFD incurrs no more page faults than any other algorithm for
processing a sequence of requests.

These theorems are not that difficult to prove, but we don't have time
to do that here.

We'll come back to paging again below.


Server Problems  (This is optional material)
============================================

Recall that a metric space S is a set of points with a distance
measure d(.,.) on the points.  The distance measure, or metric, must
satisfy the following conditions:

          For all u in S     d(u,u) = 0
          For all u,v in S   d(u,v) >= 0
          For all u,v,w in S   d(u,v) + d(v,w) >= d(u,w)

The 3rd condition is called the triangle inequality.

Here's the k-server problem:

  There is a metric space S of at least k+1 points.  There are k
  servers that occupy k distinct points in S.  There is a sequence of
  requests for points in the metric space, sigma = sigma_1, sigma_2,
  ...

  In response to a request to a point x, a server must be moved to x
  (if there isn't already there).  The cost of the move is simply the
  distance between the starting point and the ending point of the
  server that moved.

An example of a 2-server problem is a two headed disk.  


        +--------------------------------+
        0     ^               ^          1
              |               |
           head 1          head 2

The metric space is the real interval [0,1].  Requests come in to
locations on the line (requests to read or write a file there). 
the system has to decide which head to move to satisfy the request. 


Another example is the paging problem.  Here the metric space is the
one consisting of N points (one for each page), and the distance
function is simply 

                    /  0  if u=v
          d(u,v) = |   1  otherwise
                    \

The points of the metric space correspond to pages of the address
space, and the points occupied by the k servers correspond to the
pages that are in fast memory.

The k-server problem has been studied extensively.  To avoid trivial
Here are some things that are known about it:

Theorem: The competitive factor is at least k for any k-server
problem.

Theorem:  There is a 2-competitive algorithm for the 2-server problem.

Theorem:  There is a 2k-competitive algorithm for the k-server problem.

Conjecture: There exists a k-competitive algorithm for the k-server
problem.

---------------------------------------------------------------------

Back to paging.

How bad can LRU be from the competitive point of view. We can see that
it will be at least k as follows.  Consider the sequence

   sigma = 1, 2, 3, 4 ... N, 1, 2, 3 ... N, 1 ....

Say that LRU and the adversary each start out with 1....N-1 in fast
memory.  Now after the first N-1 requests, each request for LRU will
cause a page fault.  On the other hand, LFD will only get 1 page fault
for each block of N-1 consecutive requests in this sequence -- it just
evicts the page used N-1 requests in the future.

In fact, from the lower bound theorem above, we know that ANY
deterministic on-line algorithm for the paging problem will have a
competitive factor of at least k.  (Actually it is not hard to prove
this directly for paging.)

So a competitive factor of k is not that useful.  So we're going to
change the model a little bit.  We're going to consider randomized
algorithms.

Definition: A randomized algorithm A is c-competitive if there exists
a constant a such that for all sequences sigma, and for all algorithms
B (possibly off-line) we have

       E[C (sigma)]  <= c C (sigma) + a
          A                B

Here the expectation E[] is taken over all the random choices of the
algorithm A.  The adversary still gets to choose sigma and B.

It turns out that this model allows us to devise a paging algorithm
with a much better competitive factor.  In fact the competitive factor
is O(log N).

Before we do this we'll convert the problem to yet another form....


Pursuer/evader game
===================
Here's something called the pursuer/evader game.  (Or cat and mouse
game).  This is a special case of paging problem and the k-server
problem.

You're a mouse in one of N hiding places.  At each time step, cat
comes to one of the places.  If it's the one you're hiding in, you
need to move.  Say your cost is the number of times you've moved.  OPT
is the fewest number of times you could have moved in hindsight.
E.g., say n=4 and you start in location 1.  What is best in hindsight
if sequence is (2,1,3,4,2,1,3)?  

Note that this problem is exactly the same as the paging problem where
k=N-1.  The mouse's location is just the specific page that IS NOT in
fast memory.

QUESTION: Is there a randomized strategy for us, such that for any
sequence of moves of the cat, E[(our cost)]/OPT is "small"?

(Note: for any deterministic algorithm, there exists a sequence for
the cat that causes us to move every time.  That's why we're looking
at randomized algs).

What's the simplest randomized algorithm for this problem?  How about
this:

  RAND: Mouse starts in a random place.  Each time the mouse is hit by
  the cat, the mouse moves to a random other place.

It turns out that this algorithm that is NOT so good.  What is a
strategy for the *cat* that causes our expected cost to be a lot more
than our optimal cost in hindsight? 

Answer: cat visits points 1,2,...,N-1 repeatedly.  We should have
moved to point N at the start for a cost of 1.  But, we expect to move
an N-1 times.  Here's why.  The probability that we're on N at the
start is 1/N and our cost is 0 in that case.  The probability that
we're not on N is (N-1)/N.  In this case what's the expected number of
times we get hit before we land on N?  It's the expected number of
times we have to roll an N-1 sided die until we get a specific number.
This is just N-1.  Putting this together we get:

        1      N-1                1
        - 0  + --- (N-1)  =  N-2+ -
        N       N                 N

So this algorithm is Theta(N) competitive, which is not what we're
looking for.  (Actually a more careful lower bound shows that its
competitive factor is N-1.)


Here's a lower bound showing that *no* algorithm can get a ratio
better than log(n):

Claim 1: no algorithm can get o(log N)

Proof: What if cat probes randomly.  Then, no matter what mouse does,
mouse has 1/n prob of being forced to move.  So, in t time steps,
expected cost to mouse is t/N.  But, how long does it take cat to hit
every point?  (This is called the coupon collector's problem.  The
anser is N/N+ N/(N-1) + ... + N/1 ~= N*log(N).  So, in hindsight, OPT
was to move to that last point and only pay once every N*log(N)
probes.  So, ratio is O(log N).

In fact, we can get O(log N) with the following online algorithm
called "Marking".   

  * start at a random place
  * when a place is probed, mark it.
  * when your spot gets probed, move to random unmarked spot
  * if you're at last unmarked spot, clear marks and restart.

Claim 2: Randomized competitive ratio for Marking is O(log N)

Proof: We divide the analysis into phases.  The last probe of a phase is
the one that causes all of the marks to be cleared.  Note that the set
of probes in a phase must hit every spot.

Initially our prob dist is 1/N at each point.  We're going to consider
the probes that the cat makes to unmarked points.  (Probes to marked
points do not contribute to the completion of the phase, nor do they
cause the mouse to incur any cost.)  When cat probes the first point, we
have 1/N prob of being hit.  After this probe, our distrib is 1/(N-1) at
each unmarked point. After next probe to unmarked point we have 1/(N-1)
prob of being hit and our distrib is 1/(N-2) at each unmarked. And so
on. So, total expected number of moves per phase is 1/N + 1/(N-1) +
... + 1/1 = H_N = O(log N).  But, any algorithm (even in hindsight) must
move at least once per phase since every point got hit. So, ratio is
O(log n).

Note 1: Actually, there's a H_{N-1} competitive algorithm for the cat
and mouse game.  And this is the best possible.  It deviates just
slightly from the one described above.

--------------------------------------------------------------------
Back to paging again

How do we interpret the marking algorithm for the cat/moust game
in the world of paging?  You have to change things a little bit,
because there are basically N-k mice.

Here's the marking algorithm for paging:
start with all pages in fast memory as unmarked.  When a request is
made, mark the page and if it is a page fault, bring it in and throw
out a *random* unmarked page.  When all pages get marked, clear the
marks and start over on the next page fault.  Like a randomized 1-bit
version of LRU, where just keeping track of "was it recently used or
not?".  

The actual competitive ratio you can prove for this problem is 2H_k.
There's a much more complex algorithm that achieves the optimal
competitive ratio of H_k.

I don't know if this algorithm has been tested on data from real
systems.  It would be interesting to see how well it worked.

-------------------Note for Future Lectures-------------------------

Let's look at the case n=k+1 (i.e. the cat/mouse game).  It would be
nice to really nail this down and prove the H_k competitive factor for
the proper version of the marking algorithm.  (Instead of the H_n
algorithm above.)  Here's how to do this.

Define a phase as follows.  Starting with the empty set, continue to
add requests into this set.  The moment the set becomes of size n,
you've gone to far.  The end of the phase is right before the request
that makes the set too big.  Then start over with the empty set again,
starting at the next request after the end of the phase we just found.
In each phase k (i.e. n-1) different pages are requested, but if we
added the first request of the next phase it would contain requests
for all n elements.

It would be nice in a future lecture to include the proof that LFD is
optimal.  Because we're going to need that right now.  To make the
definition complete, let's define the initial state of the LFD to have
all the pages of the first phase in fast memory.

Theorem: LFD incurrs a cost of 1 on the first request of each phase
(except the first phase).

Proof: So the requests of the first phase cost it nothing.  Now it's
easy to see that from that point on, the LFD costs precisely 1 for
each request that is the beginning of the phase.  That's because the
page that is evicted (the one not needed for the longest time in the
future) is precisely the first page of the next phase.  QED.

Let's nail down the correct marking algorithm.  Initialize with a
random set of pages in fast memory.  Now when a page is requested,
mark it.  If it's not in fast memory, then evict a random page
from among the unmarked ones.  If there are no unmarked pages,
then mark the current request (which must be the beginning of
a new phase), erase all the other marks, and evict a random
page.

If we write the probability vector of the n pages being in fast
memory, it looks like this (illustrated for n=4) initially:


       3    3    3    3
      ---  ---  ---  ---
       4    4    4    4

After a request to the first page we have (* = a mark)

            2    2    2
       1   ---  ---  ---
            3    3    3
       *

This illustrates that the first page is in fast memory and is marked.
The other pages have a probability of 2/3 of being in fast memory.

Say that the next request is to the 2nd page.  Then we have:

                1    1
       1    1  ---  ---
                2    2
       *    *

If the next request is to the fourth page we have:

       1    1   0    1
       *    *        *

After a request to the third page we have:

       2    2         2
      ---  ---   1   ---
       3    3         3
                 *

Of these four requests, the first three are in phase 1.  The expected
cost of the first phase is 1/4 + 1/3 + 1/2 = H_4 - 1.

The last request shown in the beginning of phase 2, and it costs 1.
The remaining two non-trivial requests in phase 2 have expected cost
1/3 and 1/2 respectively.  This exact analysis repeats for each
subsequente phase.

So the expected cost of phase 1 is (H_n)-1, and the expected cost of
all subsequent phases is H_k.

So for both the marking algorithm and the LFD the first phase is
anomalous, and all the rest of the phases behave exactly the same.

Thus we have proven:

Theorem: The marking algorithm is H_k competitive.