15-859(A) Machine Learning Theory 04/01/04
* Kalai-Vempala algorithm for online optimization
=========================================================================
In linear programming, you have a convex set S of feasible points in
R^n, you are given an objective function c (say to minimize) and the
goal is to find the point x in S that minimizes c.x.
Consider now the following online version of linear programming. You
have a feasible region S, but you don't have the objective function.
Instead, you have to pick a point x_1 in S first. Then you are given
the cost vector c_1 and you pay c_1 . x_1. Then you pick a new
point x_2 and get a cost vector c_2 and pay c_2 . x_2, and so on. The
question is: can you find a strategy that does nearly as well as the
best *fixed* x in hindsight? Formally, we want for any sequence
c_1,...,c_T to get:
E[Alg's cost] <= min_{x in S} (c_1 + ... + c_T).x + [regret term]
For this to make sense, we'll need to assume that the set S and the
entries in c are bounded.
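As a concrete picture of the protocol, here is a minimal sketch in Python. (The `strategy` callback and the finite list of candidate points are illustrative assumptions, not part of the lecture; `strategy` sees only past cost vectors, matching the rule that x_t is picked before c_t is revealed.)

```python
def play(strategy, cost_vectors, feasible_points):
    """Run the online protocol: pick x_t from past costs only, then
    see c_t and pay c_t . x_t.  Returns (algorithm's total cost,
    cost of the best *fixed* point in hindsight)."""
    total = 0.0
    history = []
    for c in cost_vectors:
        x = strategy(history)        # choose before seeing today's cost
        total += sum(ci * xi for ci, xi in zip(c, x))
        history.append(c)
    # best fixed x minimizes (c_1 + ... + c_T) . x
    cum = [sum(col) for col in zip(*cost_vectors)]
    best = min(sum(ci * xi for ci, xi in zip(cum, x)) for x in feasible_points)
    return total, best               # regret = total - best
```

With two "experts" as the feasible points, a strategy that stubbornly plays the first point pays for every day the second point was better.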
What are some things we can model this way?
===========================================
- One thing we can model this way is the standard experts problem:
the feasible region S is the simplex in R^n. Each dimension is an
"expert", and the points in S correspond to picking a probability
distribution over experts. The vector c corresponds to the loss
vector over experts (which ones made a mistake, etc).
- Another problem we can model this way is an online shortest path
problem. Imagine each day you have to drive between two points s
and t. You have a map (a graph G) but you don't know what traffic
will be like (what the cost of each edge will be that day). Say the
costs are only revealed to you after you get to t (on the evening
news). We can model this by having one dimension per edge, and
representing a path by the indicator vector listing the edges in the
path. We can let S be the convex hull over these paths (probability
distributions over paths) and the cost vector c is just the vector
with the costs of each edge that day. Notice that you *could*
represent this as an experts problem also, with one expert per path,
but the number of s-t paths can easily be exponential in the number
of edges in the graph (e.g., think of the case of a grid).
Actually, since the experts problem gave bounds that depend only
logarithmically on the number of experts, this is not bad as far as
the bounds are concerned, but it's not a great way to do things in
terms of running time.
KV setting
==========
More generally, Kalai-Vempala allow for any set S (not necessarily
convex) so long as we can solve the *offline* problem of minimizing
the dot product x . c for a *given* c. That is:
- We have an arbitrary set S of feasible points in R^n. Each time
step t, we (probabilistically) pick x_t in S. We are then given c_t
and pay the dot-product c_t . x_t.
- We have a magic offline algorithm M that, given any cost vector c,
finds the x in S that minimizes c.x. We want to use M to produce an
*online* algorithm whose expected cost is not too much worse than
that of the best x in S in hindsight.
In particular, we can at least use M to figure out: what *was* the
best x in hindsight anyway? argmin_{x in S} (c_1 + ... + c_t).x.
So, for the shortest-path problem, M would just be a shortest-path
algorithm.
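For the shortest-path case, the oracle M is literally a shortest-path routine. A minimal sketch, using the lecture's encoding (one coordinate per edge, a path as a 0/1 indicator vector); Dijkstra assumes the costs fed to it are non-negative, so everything else here is an illustrative choice:

```python
import heapq

def shortest_path_oracle(edges, s, t):
    """Build M: given a cost vector c (one entry per edge), return the
    0/1 indicator vector of a cheapest s-t path in the directed graph
    given as a list of (u, v) edges.  Uses Dijkstra."""
    adj = {}
    for i, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, i))

    def M(c):
        dist, prev = {s: 0.0}, {}
        pq = [(0.0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue                      # stale queue entry
            for v, i in adj.get(u, []):
                if d + c[i] < dist.get(v, float("inf")):
                    dist[v] = d + c[i]
                    prev[v] = (u, i)
                    heapq.heappush(pq, (dist[v], v))
        x = [0] * len(edges)                  # indicator vector of the path
        v = t
        while v != s:
            u, i = prev[v]
            x[i] = 1
            v = u
        return x

    return M
```

Since the algorithm below feeds M perturbed costs that can be negative, in practice one would swap in Bellman-Ford; Dijkstra is just the simplest illustration of an oracle.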
Another example
===============
- Consider the problem of a self-adjusting binary search tree. As
an experts problem, we could instantiate one expert for each
possible tree. When a request is made, a tree experiences a loss
based on depth of the requested item. (so losses are in the range
[0,n]). We would maintain weights, and at each time step
probabilistically decide to stay with current tree or move to new
one (can do this in a way that movement cost is bounded by lookup
cost). Now, if you don't worry about computation time to maintain
all those weights, then on any sequence of requests, the cost of RWM
is close to that of the best static tree in hindsight. But of
course this is crazy in terms of computation time.
In the KV setting, this is a bit less crazy. To formally fit our
framework, we have one dimension per item. A tree is represented by
the vector of depths of the items. A request for an item is
represented by the unit vector in its coordinate direction (so the
dot-product c.x is exactly the depth of the requested item).
The algorithm M is something that given a *set* of requests computes
the best static tree for that set. This is something you can do
with dynamic programming. So the question is: given that for any
set of requests we can compute the best static tree in hindsight,
can we use that to do nearly as well as that in an online setting?
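For concreteness, here is a sketch of what M could look like in this example: the standard O(n^3) optimal-BST dynamic program, counting the root as depth 1 so that one request to an item costs its depth. (The lecture only assumes such an M exists; this particular implementation is an illustration.)

```python
def best_static_bst_cost(freq):
    """Total lookup cost (depth of each requested item, root = depth 1)
    of the best static BST over items 1..n, given request counts freq.
    Standard optimal-BST dynamic program."""
    n = len(freq)
    # w[i][j]: total requests to items i..j; e[i][j]: min cost of a BST on i..j
    w = [[0.0] * (n + 1) for _ in range(n + 2)]
    e = [[0.0] * (n + 1) for _ in range(n + 2)]
    for i in range(n, 0, -1):
        for j in range(i, n + 1):
            w[i][j] = w[i][j - 1] + freq[j - 1]
            # every item in i..j sits one level deeper under the root r
            e[i][j] = w[i][j] + min(e[i][r - 1] + e[r + 1][j]
                                    for r in range(i, j + 1))
    return e[1][n]
```

E.g., with requests only to items 1 and 3, the best tree roots at one of them (cost 1 + 2 = 3), not at item 2 (cost 2 + 2 = 4).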
The KV Algorithm
================
One natural thing to try is to just run M to pick the best x in
hindsight, argmin_{x in S} (c_1 + ... + c_{t-1}).x, and use that on
day t. But this doesn't work. E.g., imagine there are two routes to
work, one fast and one slow, but they alternate which is fast and
which is slow from day to day. This procedure will always choose the
wrong one. Instead, we will "hallucinate" a day 0 in which we have
random costs, from an appropriate distribution. Then we will run M to
pick the x that minimizes (c_0 + c_1 + ... + c_{t-1}).x.
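Putting this together, a sketch of the algorithm in Python. The uniform-cube perturbation is the one analyzed below; drawing a fresh perturbation each day has the same expected cost per day as drawing c_0 once up front, so either reading of "day 0" works.

```python
import random

def ftpl(M, cost_vectors, n, epsilon, seed=0):
    """Follow-the-perturbed-leader sketch: on day t, feed the oracle M
    the hallucinated day-0 costs plus the real costs seen so far, and
    play whatever point it returns.  Returns total cost paid."""
    rng = random.Random(seed)
    cum = [0.0] * n                      # c_1 + ... + c_{t-1} so far
    total = 0.0
    for c in cost_vectors:
        # hallucinated day-0 cost, uniform in [-1/epsilon, 1/epsilon]^n
        c0 = [rng.uniform(-1 / epsilon, 1 / epsilon) for _ in range(n)]
        x = M([a + b for a, b in zip(c0, cum)])
        total += sum(ci * xi for ci, xi in zip(c, x))
        cum = [a + b for a, b in zip(cum, c)]
    return total
```

With a two-point S and a constant cost vector, the cumulative cost soon dominates the bounded perturbation, so the algorithm locks onto the cheap point after at most 2/epsilon days.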
To analyze this, we will argue in stages.
Step 1: First, we will show that picking
argmin_{x in S} (c_1 + ... + c_t).x
*does* work. I.e., if on day t you could peek ahead and use the point
that is best in hindsight *through* day t, you would do at least as
well as the best fixed x. This is maybe not surprising, but it is also
not obvious. Note this is *not* the same as the x that minimizes
c_t . x, which obviously would be super-optimal.
Now, let algorithm A be the one that minimizes (c_1 + ... + c_{t-1}).x.
Let B be the one that minimizes (c_1 + ... + c_t).x.
They both go to the bar and start drinking. As they drink, their
objective functions get fuzzier and fuzzier. We end up getting:
- A' that minimizes (c_0A + c_1 + ... + c_{t-1}).x.
- B' that minimizes (c_0B + c_1 + ... + c_t).x.
where c_0A, c_0B are picked at random in [-1/epsilon,1/epsilon]^n.
(Could use a ball instead of a cube and get somewhat different
bounds. Cube is simpler anyway). As epsilon gets smaller, A and B
start behaving more and more alike.
Step 2 is to show that for an appropriate value of epsilon, B' is not
too much worse than B, and A' is close to B'. This then shows that
A' is a good algorithm.
Preliminaries
=============
- When we analyzed the experts algorithm, we got bounds like:
E[Alg's cost] < OPT(1+epsilon) + (1/epsilon)*(log n).
If you then set epsilon = sqrt(log(n)/T), where T is the number of time
steps you plan to run the algorithm, you get:
E[Alg's cost] < OPT + OPT*sqrt(log(n)/T) + sqrt(T*log(n)).
If you then use the fact that OPT <= T, you get:
E[Alg's cost] < OPT + 2*sqrt(T log n).
(This is the way a game-theory person would usually phrase it: in
terms of additive regret.)
- In our case, we will assume that the maximum L_1 length (sum of
absolute values) of any cost vector is 1. If D is the L_1 diameter
of S, the bound we will show is:
E[Alg's cost] < OPT + D(epsilon*T/2 + 1/epsilon)
So, setting epsilon = sqrt(2/T), we get:
E[Alg's cost] < OPT + D*sqrt(2T)
If you convert to the experts setting, where c \in [0,1]^n, the
result you will get is an "n" inside the square root. This is not
as good as the standard experts bounds. They give a different
distribution for hallucinating c_0 that fixes that problem.
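A quick sanity check of the tuning: the regret term D(epsilon*T/2 + 1/epsilon) is minimized at epsilon = sqrt(2/T), where it equals D*sqrt(2T). A small numerical check (T = 10000 is an arbitrary choice):

```python
import math

T = 10000

def f(e):
    """The regret term, divided out by D."""
    return e * T / 2 + 1 / e

opt_eps = math.sqrt(2 / T)
# closed form: the minimum value is sqrt(2T)
assert abs(f(opt_eps) - math.sqrt(2 * T)) < 1e-9
# no epsilon on a fine grid does better (f is convex, by AM-GM)
assert all(f(k / 10000) >= f(opt_eps) - 1e-9 for k in range(1, 10000))
```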
Step 1
======
Define x_t = argmin_{x in S} (c_1 + ... + c_t) . x.
We need to show that:
c_1 . x_1 + ... + c_T . x_T <= (c_1 + ... + c_T) . x_T
In other words, we do at least as well using algorithm B as the
optimal fixed x in hindsight does. We can do this by induction:
- By induction, we can assume:
c_1.x_1 + ... + c_{T-1}.x_{T-1} <= (c_1+...+c_{T-1}).x_{T-1}.
And we know, by definition of x_t, that:
(c_1 + ... + c_{T-1}) . x_{T-1} <= (c_1 + ... + c_{T-1}) . x_T
So, putting these together, and adding c_T . x_T to both sides, we
get what we want.
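The inequality is also easy to check numerically; a quick sketch with a random finite S and random cost vectors (any seed works, since Step 1 holds for *every* sequence):

```python
import random

rng = random.Random(1)
# a small arbitrary feasible set S in R^3 and a random cost sequence
S = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(5)]
costs = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(20)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

cum = (0.0, 0.0, 0.0)
lhs = 0.0                  # sum of c_t . x_t, x_t the leader through day t
for c in costs:
    cum = tuple(u + v for u, v in zip(cum, c))
    x_t = min(S, key=lambda x: dot(cum, x))   # best in hindsight through t
    lhs += dot(c, x_t)
rhs = min(dot(cum, x) for x in S)             # best fixed x in hindsight
assert lhs <= rhs + 1e-9                      # algorithm B beats OPT
```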
Step 2
======
Let's start with the easier part, showing that A' and B' are close.
- A' is picking a random objective function from a box of
side-length 2/epsilon centered at c_1 + ... + c_{t-1}. B' is
picking a random objective function from a box of side-length
2/epsilon centered at c_1 + ... + c_t. Let's call the boxes box(A)
and box(B).
- These boxes each have volume V = (2/epsilon)^n. Their centers
differ by a vector of L_1 length at most 1, and shifting a cube by a
total of 1 along the axes removes at most 1*(2/epsilon)^{n-1} of the
overlap, so the intersection I of the boxes has volume at least
(2/epsilon)^n - (2/epsilon)^{n-1} = V(1 - epsilon/2).
- This means that A' can't be much worse than B'. In particular, say
alpha is your expected loss if you were to pick x by feeding to M a
random point c in I. We can view B' as with probability 1-epsilon/2
doing exactly this and getting expected loss alpha, and with
probability epsilon/2 choosing a random c in box(B) - I, and at best
getting a loss of 0. We can view A' as with probability 1-epsilon/2
getting an expected loss of alpha, and with probability epsilon/2
getting at worst a loss of D. [Actually, we never said the losses
had to be non-negative. The point is that the L_1 diameter of S is
D, so the biggest possible gap between the losses of 2 points in S
on any given c is D.] The difference between the two is D*epsilon/2.
This happens for T time steps, so that counts for the DT*epsilon/2 part.
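The overlap claim is easy to verify exactly, since the intersection of two axis-aligned cubes factors across coordinates; a small check (the particular epsilon and center offset are arbitrary):

```python
def overlap_fraction(side, delta):
    """Exact fraction of a cube's volume shared with a copy of itself
    whose center is shifted by the vector delta."""
    frac = 1.0
    for d in delta:
        frac *= max(0.0, side - abs(d)) / side
    return frac

eps = 0.1
side = 2 / eps                      # box side-length 2/epsilon
delta = [0.5, -0.3, 0.2]            # center offset of L_1 length 1
assert overlap_fraction(side, delta) >= 1 - eps / 2
```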
Now, to finish, we need to argue about B versus B'. This will give us
the D/epsilon part. This turns out to be pretty easy:
- View the random offset as an initial day 0. By the argument from
Step 1, we know that:
c_0 . x_0 + ... + c_T . x_T <= (c_0 + ... + c_T) . x_T,
where now we define x_t as argmin_{x in S} (c_0 + ... + c_t) . x.
- But the left-hand-side = cost(B') + c_0 . x_0 (B' didn't actually
have to pay for day 0). The RHS is at most OPT + c_0 . x_opt (where
x_opt is what OPT picks if you don't count the fake day 0). Putting
this together, we get:
cost(B') <= OPT + c_0 . (x_opt - x_0).
Now, since each entry in c_0 has magnitude at most 1/epsilon, and
x_opt - x_0 has L_1 length at most D, the dot product is at most
D/epsilon, so the RHS is at most OPT + D/epsilon.
Done!
Extensions
==========
- Can also get multiplicative bounds if you assume costs are
non-negative and use a different distribution.
- If you make L_2 assumptions on c,D, then it's more natural to pick
random vector from ball instead of cube. It all depends on the
application.
- If you have movement costs (like in BST example) you can implement
in a lazy way, and only run M every once in a while.