15-859(B) Machine Learning Theory 02/06/06
* Recap
* Kalai-Vempala algorithm
=========================================================================
Recap from last time
====================
"Combining expert advice": Have n prediction rules and want to perform
nearly as well as best of them. Or, in game-theoretic setting, have
n strategies and want to do nearly as well of best of them in
repeated play.
Randomized weighted majority algorithm:
- Give each a weight (starting at 1).
- Choose rule at random with probability proportional to its weight.
- When you are told the correct label, penalize rules that
made a mistake by multiplying weight by 1-epsilon. (If costs in
[0,1] then multiply by (1-epsilon)^cost).
Get: E[Alg cost] <= (1+epsilon)*OPT + (1/epsilon)*log(n).
"Sleeping experts"/"specialists": at each time step, only some rules fire.
Goal: for all i, E[cost_i(Alg)] <= (1+epsilon)*cost(i) + (1/epsilon)*log(n).
where cost_i(Alg) = cost of algorithm in the times when rule i fires.
- Generalized version of randomized WM.
- Initialize all rules to have weight 1.
- At each time step, of the rules i that fire, select one with
probability p_i proportional to w_i
To update weights:
- If didn't fire, leave weight alone.
- If did fire, raise or lower depending on performance compared to
weighted average:
* R_i = [sum_j p_j cost(j)]/(1+epsilon) - cost(i)
* w_i <- w_i(1+epsilon)^{R_i}
So, if rule i does exactly as well as weighted average, its weight
drops a little. Weight increases if does better than weighted average
by more than a (1+epsilon) factor. Can then prove that total sum of
weights never goes up.
Now notice that one way to look at weights is:
* w_i = (1+epsilon)^{E[cost_i(alg)]/(1+epsilon) - cost_i(i)}
I.e., we are explicitly giving large weights to rules for which we
have large regret. Since sum of weights <= n, exponent must be at most
log_{1+epsilon}(n )
==========================================================================
Today: Notice that RWM bounds are good even if "n" is exponential in
the natual parameters of the problem. But the running time in that
case is bad because we have to explicitly maintain a probability
distribution over experts. Are there interesting settings where we
can get these kinds of bounds *efficiently*?
Today we will discuss an algorithm that can do this in many such cases.
Kalai-Vempala Algorithm
========================
The Kalai-Vempala algorithm applies to the following setting.
- We have a set S of feasible points in R^n. Each time step t, we
(probabilistically) pick x_t in S. We are then given a cost vector
c_t and we pay the dot-product c_t . x_t.
- The set S may have exponentially-many points, so we don't have time
even to list its elements. However, we have a magic offline
algorithm M that, given any cost vector c, finds the x in S that
minimizes c.x. Want to use to produce an *online* algorithm so that
our overall expected cost is not too much worse than for best x in S
in hindsight. Specifically, we want for any sequence c_1,...,c_T
to get:
E[Alg's cost] <= min_{x in S} (c_1 + ... + c_T).x + [regret term]
- To have any hope, we will need to assume that S is bounded and so
are the cost vectors c_t.
What are some things we can model this way?
===========================================
- Consider the following adaptive route-choosing problem. Imagine
each day you have to drive between two points s and t. You have a
map (a graph G) but you don't know what traffic will be like (what
the cost of each edge will be that day). Say the costs are only
revealed to you after you get to t (on the evening news). We can
model this by having one dimension per edge, and each path is represented
by the indicator vector listing the edges in the path. Then the
cost vector c is just the vector with the costs of each edge that
day. Notice that you *could* represent this as an experts problem
also, with one expert per path, but the number of s-t paths can
easily be exponential in the number of edges in the graph (e.g.,
think of the case of a grid). However, given any set of edge
lengths, we can efficiently compute the best path for that cost
vector, since that is just a shortest-path problem. You don't have
to explicitly list all possible paths.
- We can also model the standard experts problem this way by having
each expert be its own coordinate, and c_t is the vector of costs at
time t. (But bounds as a fn of n might be worse).
- More generally, this is like online linear programming.
- Can also model the problem of performing search-tree lookups in a
way that is competitive with the best binary search tree in
hindsight. For this problem, there is a cost of "movement" too
(choosing x_{t+1} that is different from x_t), but that can be
handled by running the algorithm at a very slow learning rate. For
this problem, we have one dimension per element, and a given search
tree is represented as the vector of depths of its elements.
Requests are indicator vectors. Even though there are
exponentially-many search trees on n elements, if we are given a
vector c (which can be viewed as listing how many times each element
is requested), we can compute the optimal tree for c using dynamic
programming. So, we can plug into this framework.
The KV Algorithm
================
One natural thing to try is just run M to pick the best x in
hindsight, argmin_{x in S} (c_1 + ... + c_{t-1}).x, and use that on
day t. But this doesn't work. E.g., imagine there are two routes to
work, one fast and one slow, but they alternate which is fast and
which is slow from day to day. This procedure will always choose the
wrong one. Instead, we will "hallucinate" a day 0 in which we have
random costs, from an appropriate distribution. Then we will run M to
pick the x that minimizes (c_0 + c_1 + ... + c_{t-1}).x.
To analyze this, we will argue in stages.
Step 1: First, we will show that picking
argmin_{x in S} (c_1 + ... + c_t).x
*does* work. I.e., if you find the best in hindsight tomorrow and use
that today. This is maybe not surprising but also not obvious. Note
this is *not* the same as the x that minimizes c_t . x, which
obviously would be super-optimal.
Now, let algorithm A be the one that minimizes (c_1 + ... + c_{t-1}).x.
Let B be the one that minimizes (c_1 + ... + c_t).x.
They both go to the bar and start drinking. As they drink, their
objective functions get fuzzier and fuzzier. We end up getting:
- A' that minimizes (c_0A + c_1 + ... + c_{t-1}).x.
- B' that minimizes (c_0B + c_1 + ... + c_t).x.
where c_0A, c_0B are picked at random in [-1/epsilon,1/epsilon]^n.
(Could use a ball instead of a cube and get somewhat different
bounds. Cube is simpler anyway). As epsilon gets smaller, A and B
start behaving more and more alike.
Step 2 is to show that for an appropriate value of epsilon, B' is not
too much worse than B, and A' is close to B'. This then shows that
A' is a good algorithm.
Preliminaries
=============
- When we analyzed the experts algorithm, we got bounds like:
E[Alg's cost] < OPT(1+epsilon) + (1/epsilon)*(log n).
If you then set epsilon = sqrt(log(n)/T), where T is the number of time
steps you plan to run the algorithm, you get:
E[Alg's cost] < OPT + OPT*sqrt(log(n)/T) + sqrt(T*log(n)).
If you then use the fact that OPT <= T, you get:
E[Alg's cost] < OPT + 2*sqrt(T log n).
(This is the way a game-theory person would usually phrase it: in
terms of additive regret.)
- In our case, we will assume that the maximum L_1 length (sum of
absolute values) of any cost vector is 1. If D is the L_1 diameter
of S, the bound we will show is:
E[Alg's cost] < OPT + D(epsilon*T/2 + 1/epsilon)
So, setting epsilon = sqrt(2/T), we get:
E[Alg's cost] < OPT + D*sqrt(2T)
If you convert to the experts setting, where c \in [0,1]^n, the
result you will get is an "n" inside the square root. This is not
as good as the standard experts bounds. They give a different
distribution for hallucinating c_0 that fixes that problem.
Step 1
======
Define x_t = argmin_{x in S} (c_1 + ... + c_t) . x.
We need to show that:
c_1 . x_1 + ... + c_T . x_T <= (c_1 + ... + c_T) . x_T
In other words, we do at least as well using algorithm B as the
optimal fixed x in hindsight does. We can do this by induction:
- By induction, we can assume:
c_1.x_1 + ... + c_{T-1}.x_{T-1} <= (c_1+...+c_{T-1}).x_{T-1}.
And we know, by definition of x_{T-1}, that:
(c_1 + ... + c_{T-1}) . x_{T-1} <= (c_1 + ... + c_{T-1}) . x_T
So, putting these together, and adding c_T . x_T to both sides, we
get what we want.
Step 2:
=======
Let's start with the easier part, showing that A' and B' are close.
- A' is picking a random objective function from a box of
side-length 2/epsilon centered at c_1 + ... + c_{t-1}. B' is
picking a random objective function from a box of side-length
2/epsilon centered at c_1 + ... c_t. Let's call the boxes box(A)
and box(B).
- These boxes each have volume V = (2/epsilon)^n. Since their centers
differ by a vector of L_1 length at most 1, the intersection I of
the boxes has volume at least (2/epsilon)^n - (2/epsilon)^{n-1} =
V(1 - epsilon/2).
- This means that A' can't be much worse than B'. In particular, say
alpha is your expected loss if you were to pick x by feeding to M a
random point c in I. We can view B' as with probability 1-epsilon/2
doing exactly this and getting expected loss alpha, and with
probability epsilon/2 choosing a random c in box(B) - I, and at best
getting a loss of 0. We can view A' as with probability 1-epsilon/2
getting an expected loss of alpha, and with probability epsilon/2
getting at worst a loss of D. [Actually, we never said the losses
had to be non-negative. The point is that the diameter of S is at most
D, so the biggest possible gap between the losses of 2 points in S
on any given c is D. Actually, for this part of the argument, we
only need L_infty diameter <= D and not L_1 diameter.] The
difference between the two is D*epsilon/2. This happens for T time
steps, so that counts for the DT*epsilon/2 part.
Now, to finish, we need to argue about B versus B'. This will give us
the D/epsilon part. This turns out to be pretty easy:
- View the random offset as an initial day 0. By the argument from
Step 1, we know that:
c_0 . x_0 + ... + c_T . x_T <= (c_0 + ... + c_T) . x_T.
where we define x_t as argmin[x . (c_0 + ... + c_T)]
- But the left-hand-side = cost(B') + c_0 . x_0 (B' didn't actually
have to pay for day 0). The RHS is at most OPT + c_0 . x_opt (where
x_opt is what opt picks if you don't count the fake day 0). Putting
this together, we get:
cost(B') <= OPT + c_0(x_opt - x_0).
Now, since each entry in c_0 has magnitude at most 1/epsilon, the
RHS is at most OPT + D/epsilon.
Done!
Extensions
==========
- Can also get multiplicative bounds if you assume costs are
non-negative and use a different distribution.
- If you make L_2 assumptions on c,D, then it's more natural to pick
random vector from ball instead of cube. It all depends on the
application.
- If you have movement costs (like in BST example) you can implement
in a lazy way, and only run M every once in a while.