15-859(A) Machine Learning Theory                                   04/01/04

* Kalai-Vempala algorithm for online optimization
=========================================================================

In linear programming, you have a convex set S of feasible points in R^n, you
are given an objective function c (say to minimize), and the goal is to find
the point x in S that minimizes c.x.

Consider now the following online version of linear programming. You have a
feasible region S, but you don't have the objective function. Instead, you
have to pick a point x_1 in S first. Then you are given the cost vector c_1
and you pay c_1 . x_1. Then you pick a new point x_2, are given a cost vector
c_2, pay c_2 . x_2, and so on. The question is: can you find a strategy that
does nearly as well as the best *fixed* x in hindsight? Formally, we want,
for any sequence c_1,...,c_T, to get:

   E[Alg's cost] <= min_{x in S} (c_1 + ... + c_T).x + [regret term]

For this to make sense, we'll need to assume that the set S and the entries
in c are bounded.

What are some things we can model this way?
===========================================

- One thing we can model this way is the standard experts problem: the
  feasible region S is the simplex in R^n. Each dimension is an "expert", and
  the points in S correspond to probability distributions over experts. The
  vector c is the loss vector over the experts (which ones made a mistake,
  etc.).

- Another problem we can model this way is an online shortest-path problem.
  Imagine each day you have to drive between two points s and t. You have a
  map (a graph G), but you don't know what traffic will be like (what the
  cost of each edge will be that day). Say the costs are only revealed to you
  after you get to t (on the evening news). We can model this by having one
  dimension per edge and representing a path by the indicator vector of the
  edges on the path. We let S be the convex hull of these indicator vectors
  (probability distributions over paths), and the cost vector c is just the
  vector of edge costs for that day.

  Notice that you *could* represent this as an experts problem too, with one
  expert per path, but the number of s-t paths can easily be exponential in
  the number of edges in the graph (e.g., think of the case of a grid).
  Actually, since the experts problem gave bounds that depend only
  logarithmically on the number of experts, this is not bad as far as the
  bounds are concerned, but it's not a great way to do things in terms of
  running time.

KV setting
==========

More generally, Kalai-Vempala allow any set S (not necessarily convex), so
long as we can solve the *offline* problem of minimizing the dot product
x . c for a *given* c. That is:

- We have an arbitrary set S of feasible points in R^n. At each time step t,
  we (probabilistically) pick x_t in S. We are then given c_t and pay the dot
  product c_t . x_t.

- We have a magic offline algorithm M that, given any cost vector c, finds
  the x in S that minimizes c.x.

We want to use M to produce an *online* algorithm whose expected cost is not
too much worse than that of the best fixed x in S in hindsight. In
particular, we can at least use M to figure out what the best x in hindsight
*was*: argmin_{x in S} (c_1 + ... + c_t).x. For the shortest-path problem, M
would just be a shortest-path algorithm (a small code sketch of this is
given below).

Another example
===============

- Consider the problem of a self-adjusting binary search tree. As an experts
  problem, we could instantiate one expert for each possible tree.
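Before developing this example, here is a minimal sketch of what the oracle
M from the previous section might look like in the shortest-path case; the
graph representation (a list of directed (u, v) edge pairs) and the names
here are just illustrative choices, not anything specified in these notes.

```python
import heapq

def shortest_path_oracle(n_nodes, edges, cost, s, t):
    """One possible form of the oracle M for the shortest-path example:
    given a cost for each edge, return the indicator vector (one coordinate
    per edge) of a cheapest s-t path, i.e. the x in S minimizing c.x.
    Assumes non-negative costs and that t is reachable from s."""
    # adjacency list: node -> list of (neighbor, edge index)
    adj = [[] for _ in range(n_nodes)]
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))

    dist = {s: 0.0}
    prev_edge = {}                    # node -> index of the edge used to reach it
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, i in adj[u]:
            nd = d + cost[i]
            if nd < dist.get(v, float("inf")):
                dist[v], prev_edge[v] = nd, i
                heapq.heappush(heap, (nd, v))

    # walk back from t to s, marking the edges used on the cheapest path
    x = [0.0] * len(edges)
    node = t
    while node != s:
        i = prev_edge[node]
        x[i] = 1.0
        node = edges[i][0]
    return x
```

Feeding this oracle the sum c_1 + ... + c_t of the costs seen so far is
exactly the "best fixed path in hindsight" computation referred to above.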
Continuing with the search-tree example: when an item is requested, a tree
incurs a loss equal to the depth of the requested item (so losses are in the
range [0,n]). We would maintain weights over trees and, at each time step,
probabilistically decide whether to stay with the current tree or move to a
new one (this can be done in a way such that the movement cost is bounded by
the lookup cost). Now, if you don't worry about the computation time needed
to maintain all those weights, then on any sequence of requests the cost of
RWM (randomized weighted majority) is close to that of the best static tree
in hindsight. But of course this is crazy in terms of computation time.

In the KV setting, this is a bit less crazy. To formally fit our framework,
we have one dimension per item. A tree is represented by the vector of
depths of the items, and a request for an item is represented by the unit
vector in that item's coordinate direction (so the dot product is exactly
the depth of the requested item). The algorithm M is one that, given a *set*
of requests, computes the best static tree for that set; this is something
you can do with dynamic programming. So the question is: given that for any
set of requests we can compute the best static tree in hindsight, can we use
that to do nearly as well in an online setting?

The KV Algorithm
================

One natural thing to try is to just run M to pick the best x in hindsight,
argmin_{x in S} (c_1 + ... + c_{t-1}).x, and use that on day t. But this
doesn't work. E.g., imagine there are two routes to work, one fast and one
slow, but which is which alternates from day to day. This procedure will
always choose the wrong one (a toy experiment with this example appears at
the end of these notes).

Instead, we will "hallucinate" a day 0 on which we have random costs, drawn
from an appropriate distribution. Then we will run M to pick the x that
minimizes (c_0 + c_1 + ... + c_{t-1}).x. (A code sketch of this appears
right after the preliminaries below.)

To analyze this, we will argue in stages.

Step 1: First, we will show that picking argmin_{x in S} (c_1 + ... + c_t).x
*does* work. I.e., if you could find the best x in hindsight through
tomorrow and use that today, you would do well. This is maybe not
surprising, but it is also not obvious. Note that this is *not* the same as
the x that minimizes c_t . x, which obviously would be super-optimal.

Now, let algorithm A be the one that minimizes (c_1 + ... + c_{t-1}).x, and
let B be the one that minimizes (c_1 + ... + c_t).x. They both go to the bar
and start drinking. As they drink, their objective functions get fuzzier and
fuzzier. We end up with:

- A', which minimizes (c_0A + c_1 + ... + c_{t-1}).x, and
- B', which minimizes (c_0B + c_1 + ... + c_t).x,

where c_0A and c_0B are picked uniformly at random from
[-1/epsilon, 1/epsilon]^n. (We could use a ball instead of a cube and get
somewhat different bounds; the cube is simpler anyway.) As epsilon gets
smaller, A' and B' start behaving more and more alike.

Step 2 is to show that, for an appropriate value of epsilon, B' is not too
much worse than B, and A' is close to B'. This then shows that A' is a good
algorithm.

Preliminaries
=============

- When we analyzed the experts algorithm, we got bounds like:

     E[Alg's cost] < OPT(1+epsilon) + (1/epsilon)*log(n).

  If you then set epsilon = sqrt(log(n)/T), where T is the number of time
  steps you plan to run the algorithm, you get:

     E[Alg's cost] < OPT + OPT*sqrt(log(n)/T) + sqrt(T*log(n)).

  Using the fact that OPT <= T, this gives:

     E[Alg's cost] < OPT + 2*sqrt(T*log(n)).

  (This is the way a game-theory person would usually phrase it: in terms of
  additive regret.)

- In our case, we will assume that the maximum L_1 length (sum of absolute
  values) of any cost vector c_t is 1.
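Here is a minimal code sketch of the algorithm just described, using the
conventions above (a single hallucinated day 0 drawn from the cube
[-1/epsilon, 1/epsilon]^n). The oracle argument plays the role of M; the
names and exact interface are illustrative assumptions, not taken from the
Kalai-Vempala paper.

```python
import random

def kalai_vempala(oracle, cost_vectors, n, epsilon):
    """Sketch of the KV algorithm: hallucinate a random day 0, then each day
    feed the perturbed total of past costs to the offline oracle M.
    Returns the total cost paid over the sequence."""
    # day 0: each coordinate uniform in [-1/epsilon, 1/epsilon]
    c0 = [random.uniform(-1.0 / epsilon, 1.0 / epsilon) for _ in range(n)]

    total = list(c0)              # running sum c_0 + c_1 + ... + c_{t-1}
    paid = 0.0
    for c_t in cost_vectors:
        x_t = oracle(total)       # argmin over S of (c_0 + ... + c_{t-1}).x
        paid += sum(ci * xi for ci, xi in zip(c_t, x_t))
        total = [a + b for a, b in zip(total, c_t)]
    return paid
```

With epsilon set to sqrt(2/T), the analysis below says the expected total
paid is at most OPT + D*sqrt(2T).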
If D is the L_1 diameter of S, the bound we will show is:

   E[Alg's cost] < OPT + D*(epsilon*T/2 + 1/epsilon).

So, setting epsilon = sqrt(2/T), we get:

   E[Alg's cost] < OPT + D*sqrt(2T).

If you convert this to the experts setting, where c is in [0,1]^n, the
result you get has an "n" inside the square root, which is not as good as
the standard experts bounds. Kalai and Vempala give a different distribution
for hallucinating c_0 that fixes that problem.

Step 1
======

Define x_t = argmin_{x in S} (c_1 + ... + c_t).x. We need to show that:

   c_1.x_1 + ... + c_T.x_T <= (c_1 + ... + c_T).x_T.

In other words, we do at least as well using algorithm B as the optimal
fixed x in hindsight does. We can show this by induction on T:

- By induction, we can assume:
     c_1.x_1 + ... + c_{T-1}.x_{T-1} <= (c_1 + ... + c_{T-1}).x_{T-1}.
- And we know, by the definition of x_{T-1}, that:
     (c_1 + ... + c_{T-1}).x_{T-1} <= (c_1 + ... + c_{T-1}).x_T.
- Putting these together, and adding c_T.x_T to both sides, we get what we
  want.

Step 2
======

Let's start with the easier part, showing that A' and B' are close.

- A' picks a random objective function from a box of side length 2/epsilon
  centered at c_1 + ... + c_{t-1}, and B' picks one from a box of side
  length 2/epsilon centered at c_1 + ... + c_t. Call these box(A) and
  box(B).

- Each box has volume V = (2/epsilon)^n. Since their centers differ by a
  vector of L_1 length at most 1, the intersection I of the two boxes has
  volume at least (2/epsilon)^n - (2/epsilon)^{n-1} = V(1 - epsilon/2).

- This means that A' can't be much worse than B'. In particular, say alpha
  is your expected loss if you pick x by feeding M a random point c in I. We
  can view B' as, with probability 1 - epsilon/2, doing exactly this and
  getting expected loss alpha, and, with probability epsilon/2, choosing a
  random c in box(B) - I and at best getting a loss of 0. We can view A' as,
  with probability 1 - epsilon/2, getting expected loss alpha, and, with
  probability epsilon/2, getting at worst a loss of D. [Actually, we never
  said the losses had to be non-negative. The point is that the L_1 diameter
  of S is D and each c_t has L_1 length at most 1, so the biggest possible
  gap between the losses of two points of S on any given c_t is D.] So the
  difference between the two is at most D*epsilon/2 per time step. This
  happens for T time steps, which accounts for the D*epsilon*T/2 part of the
  bound.

Now, to finish, we need to compare B and B'. This will give us the D/epsilon
part, and it turns out to be pretty easy:

- View the random offset as an initial day 0. By the argument from Step 1
  (run with the fake day included), we know that:

     c_0.x_0 + c_1.x_1 + ... + c_T.x_T <= (c_0 + c_1 + ... + c_T).x_T,

  where now we define x_t = argmin_{x in S} (c_0 + c_1 + ... + c_t).x.

- But the left-hand side equals cost(B') + c_0.x_0 (B' doesn't actually have
  to pay for day 0). The right-hand side is at most OPT + c_0.x_opt, where
  x_opt is what OPT picks if you don't count the fake day 0. Putting these
  together, we get:

     cost(B') <= OPT + c_0.(x_opt - x_0).

  Now, each entry of c_0 has magnitude at most 1/epsilon, and x_opt - x_0
  has L_1 length at most D, so the last term is at most D/epsilon. Done!

Extensions
==========

- Can also get multiplicative bounds if you assume costs are non-negative
  and use a different distribution for c_0.

- If you make L_2 assumptions on c and D, then it's more natural to pick the
  random vector from a ball instead of a cube. It all depends on the
  application.

- If you have movement costs (as in the BST example), you can implement the
  algorithm in a lazy way and only run M every once in a while.
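Finally, as a toy illustration of the two-routes example used to motivate
the hallucinated day 0, here is a small experiment (an illustrative setup,
not from the lecture): plain follow-the-leader pays on essentially every
day, while the perturbed version stays close to the best fixed route.

```python
import math
import random

def two_route_experiment(T=1000, seed=0):
    """Two routes; route 1 is bad on even days, route 2 on odd days.
    The best fixed route pays about T/2.  Plain follow-the-leader
    flip-flops and pays every day; the perturbed leader does not."""
    random.seed(seed)
    costs = [[1.0, 0.0] if t % 2 == 0 else [0.0, 1.0] for t in range(T)]
    # The oracle M for this two-point S: pick the route with smaller total cost.
    oracle = lambda c: [1.0, 0.0] if c[0] <= c[1] else [0.0, 1.0]
    pay = lambda c, x: c[0] * x[0] + c[1] * x[1]
    epsilon = math.sqrt(2.0 / T)

    ftl = fpl = 0.0
    ftl_total = [0.0, 0.0]
    # perturbed leader: hallucinated day 0 drawn from [-1/epsilon, 1/epsilon]^2
    fpl_total = [random.uniform(-1 / epsilon, 1 / epsilon) for _ in range(2)]
    for c in costs:
        ftl += pay(c, oracle(ftl_total))
        fpl += pay(c, oracle(fpl_total))
        ftl_total = [a + b for a, b in zip(ftl_total, c)]
        fpl_total = [a + b for a, b in zip(fpl_total, c)]

    opt = min(sum(c[0] for c in costs), sum(c[1] for c in costs))
    print("best fixed route:", opt, " follow-the-leader:", ftl,
          " perturbed leader:", fpl)

two_route_experiment()
```

Running this, one should see follow-the-leader paying on essentially every
day, while the perturbed leader's cost stays within roughly D*sqrt(2T) of
the best fixed route, as the analysis above predicts.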