15-859(A) Machine Learning Theory                                   04/01/04

* Kalai-Vempala algorithm for online optimization
=========================================================================

In linear programming, you have a convex set S of feasible points in R^n, you
are given an objective function c (say to minimize), and the goal is to find
the point x in S that minimizes c.x.

Consider now the following online version of linear programming. You have a
feasible region S, but you don't have the objective function. Instead, you
have to pick a point x_1 in S first. Then you are given the cost vector c_1
and you pay c_1 . x_1. Then you pick a new point x_2, are given a cost vector
c_2, pay c_2 . x_2, and so on. The question is: can you find a strategy that
does nearly as well as the best *fixed* x in hindsight? Formally, we want,
for any sequence c_1,...,c_T, to get:

   E[Alg's cost] <= min_{x in S} (c_1 + ... + c_T).x + [regret term]

For this to make sense, we'll need to assume that the set S and the entries
in c are bounded.

What are some things we can model this way?
===========================================

- One thing we can model this way is the standard experts problem: the
  feasible region S is the simplex in R^n. Each dimension is an "expert", and
  the points in S correspond to probability distributions over experts. The
  vector c is the loss vector over the experts (which ones made a mistake,
  etc.).

- Another problem we can model this way is an online shortest-path problem.
  Imagine each day you have to drive between two points s and t. You have a
  map (a graph G), but you don't know what traffic will be like (what the
  cost of each edge will be that day). Say the costs are only revealed to you
  after you get to t (on the evening news). We can model this by having one
  dimension per edge and representing a path by the indicator vector of the
  edges on the path. We let S be the convex hull of these indicator vectors
  (probability distributions over paths), and the cost vector c is just the
  vector of edge costs for that day.

  Notice that you *could* represent this as an experts problem too, with one
  expert per path, but the number of s-t paths can easily be exponential in
  the number of edges in the graph (e.g., think of the case of a grid).
  Actually, since the experts problem gave bounds that depend only
  logarithmically on the number of experts, this is not bad as far as the
  bounds are concerned, but it's not a great way to do things in terms of
  running time.

KV setting
==========

More generally, Kalai-Vempala allow any set S (not necessarily convex), so
long as we can solve the *offline* problem of minimizing the dot product
x . c for a *given* c. That is:

- We have an arbitrary set S of feasible points in R^n. At each time step t,
  we (probabilistically) pick x_t in S. We are then given c_t and pay the dot
  product c_t . x_t.

- We have a magic offline algorithm M that, given any cost vector c, finds
  the x in S that minimizes c.x.

We want to use M to produce an *online* algorithm whose expected cost is not
too much worse than that of the best fixed x in S in hindsight. In
particular, we can at least use M to figure out what the best x in hindsight
*was*: argmin_{x in S} (c_1 + ... + c_t).x. For the shortest-path problem, M
would just be a shortest-path algorithm (a small code sketch of this is
given below).

Another example
===============

- Consider the problem of a self-adjusting binary search tree. As an experts
  problem, we could instantiate one expert for each possible tree.
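Before developing this example, here is a minimal sketch of what the oracle
M from the previous section might look like in the shortest-path case; the
graph representation (a list of directed (u, v) edge pairs) and the names
here are just illustrative choices, not anything specified in these notes.

```python
import heapq

def shortest_path_oracle(n_nodes, edges, cost, s, t):
    """One possible form of the oracle M for the shortest-path example:
    given a cost for each edge, return the indicator vector (one coordinate
    per edge) of a cheapest s-t path, i.e. the x in S minimizing c.x.
    Assumes non-negative costs and that t is reachable from s."""
    # adjacency list: node -> list of (neighbor, edge index)
    adj = [[] for _ in range(n_nodes)]
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))

    dist = {s: 0.0}
    prev_edge = {}                    # node -> index of the edge used to reach it
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, i in adj[u]:
            nd = d + cost[i]
            if nd < dist.get(v, float("inf")):
                dist[v], prev_edge[v] = nd, i
                heapq.heappush(heap, (nd, v))

    # walk back from t to s, marking the edges used on the cheapest path
    x = [0.0] * len(edges)
    node = t
    while node != s:
        i = prev_edge[node]
        x[i] = 1.0
        node = edges[i][0]
    return x
```

Feeding this oracle the sum c_1 + ... + c_t of the costs seen so far is
exactly the "best fixed path in hindsight" computation referred to above.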
Continuing with the search-tree example: when an item is requested, a tree
incurs a loss equal to the depth of the requested item (so losses are in the
range [0,n]). We would maintain weights over trees and, at each time step,
probabilistically decide whether to stay with the current tree or move to a
new one (this can be done in a way such that the movement cost is bounded by
the lookup cost). Now, if you don't worry about the computation time needed
to maintain all those weights, then on any sequence of requests the cost of
RWM (randomized weighted majority) is close to that of the best static tree
in hindsight. But of course this is crazy in terms of computation time.

In the KV setting, this is a bit less crazy. To formally fit our framework,
we have one dimension per item. A tree is represented by the vector of
depths of the items, and a request for an item is represented by the unit
vector in that item's coordinate direction (so the dot product is exactly
the depth of the requested item). The algorithm M is one that, given a *set*
of requests, computes the best static tree for that set; this is something
you can do with dynamic programming. So the question is: given that for any
set of requests we can compute the best static tree in hindsight, can we use
that to do nearly as well in an online setting?

The KV Algorithm
================

One natural thing to try is to just run M to pick the best x in hindsight,
argmin_{x in S} (c_1 + ... + c_{t-1}).x, and use that on day t. But this
doesn't work. E.g., imagine there are two routes to work, one fast and one
slow, but which is which alternates from day to day. This procedure will
always choose the wrong one (a toy experiment with this example appears at
the end of these notes).

Instead, we will "hallucinate" a day 0 on which we have random costs, drawn
from an appropriate distribution. Then we will run M to pick the x that
minimizes (c_0 + c_1 + ... + c_{t-1}).x. (A code sketch of this appears
right after the preliminaries below.)

To analyze this, we will argue in stages.

Step 1: First, we will show that picking argmin_{x in S} (c_1 + ... + c_t).x
*does* work. I.e., if you could find the best x in hindsight through
tomorrow and use that today, you would do well. This is maybe not
surprising, but it is also not obvious. Note that this is *not* the same as
the x that minimizes c_t . x, which obviously would be super-optimal.

Now, let algorithm A be the one that minimizes (c_1 + ... + c_{t-1}).x, and
let B be the one that minimizes (c_1 + ... + c_t).x. They both go to the bar
and start drinking. As they drink, their objective functions get fuzzier and
fuzzier. We end up with:

- A', which minimizes (c_0A + c_1 + ... + c_{t-1}).x, and
- B', which minimizes (c_0B + c_1 + ... + c_t).x,

where c_0A and c_0B are picked uniformly at random from
[-1/epsilon, 1/epsilon]^n. (We could use a ball instead of a cube and get
somewhat different bounds; the cube is simpler anyway.) As epsilon gets
smaller, A' and B' start behaving more and more alike.

Step 2 is to show that, for an appropriate value of epsilon, B' is not too
much worse than B, and A' is close to B'. This then shows that A' is a good
algorithm.

Preliminaries
=============

- When we analyzed the experts algorithm, we got bounds like:

     E[Alg's cost] < OPT(1+epsilon) + (1/epsilon)*log(n).

  If you then set epsilon = sqrt(log(n)/T), where T is the number of time
  steps you plan to run the algorithm, you get:

     E[Alg's cost] < OPT + OPT*sqrt(log(n)/T) + sqrt(T*log(n)).

  Using the fact that OPT <= T, this gives:

     E[Alg's cost] < OPT + 2*sqrt(T*log(n)).

  (This is the way a game-theory person would usually phrase it: in terms of
  additive regret.)

- In our case, we will assume that the maximum L_1 length (sum of absolute
  values) of any cost vector c_t is 1.
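Here is a minimal code sketch of the algorithm just described, using the
conventions above (a single hallucinated day 0 drawn from the cube
[-1/epsilon, 1/epsilon]^n). The oracle argument plays the role of M; the
names and exact interface are illustrative assumptions, not taken from the
Kalai-Vempala paper.

```python
import random

def kalai_vempala(oracle, cost_vectors, n, epsilon):
    """Sketch of the KV algorithm: hallucinate a random day 0, then each day
    feed the perturbed total of past costs to the offline oracle M.
    Returns the total cost paid over the sequence."""
    # day 0: each coordinate uniform in [-1/epsilon, 1/epsilon]
    c0 = [random.uniform(-1.0 / epsilon, 1.0 / epsilon) for _ in range(n)]

    total = list(c0)              # running sum c_0 + c_1 + ... + c_{t-1}
    paid = 0.0
    for c_t in cost_vectors:
        x_t = oracle(total)       # argmin over S of (c_0 + ... + c_{t-1}).x
        paid += sum(ci * xi for ci, xi in zip(c_t, x_t))
        total = [a + b for a, b in zip(total, c_t)]
    return paid
```

With epsilon set to sqrt(2/T), the analysis below says the expected total
paid is at most OPT + D*sqrt(2T).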
If D is the L_1 diameter of S, the bound we will show is:

   E[Alg's cost] < OPT + D*(epsilon*T/2 + 1/epsilon).

So, setting epsilon = sqrt(2/T), we get:

   E[Alg's cost] < OPT + D*sqrt(2T).

If you convert this to the experts setting, where c is in [0,1]^n, the
result you get has an "n" inside the square root, which is not as good as
the standard experts bounds. Kalai and Vempala give a different distribution
for hallucinating c_0 that fixes that problem.

Step 1
======

Define x_t = argmin_{x in S} (c_1 + ... + c_t).x. We need to show that:

   c_1.x_1 + ... + c_T.x_T <= (c_1 + ... + c_T).x_T.

In other words, we do at least as well using algorithm B as the optimal
fixed x in hindsight does. We can show this by induction on T:

- By induction, we can assume:
     c_1.x_1 + ... + c_{T-1}.x_{T-1} <= (c_1 + ... + c_{T-1}).x_{T-1}.
- And we know, by the definition of x_{T-1}, that:
     (c_1 + ... + c_{T-1}).x_{T-1} <= (c_1 + ... + c_{T-1}).x_T.
- Putting these together, and adding c_T.x_T to both sides, we get what we
  want.

Step 2
======

Let's start with the easier part, showing that A' and B' are close.

- A' picks a random objective function from a box of side length 2/epsilon
  centered at c_1 + ... + c_{t-1}, and B' picks one from a box of side
  length 2/epsilon centered at c_1 + ... + c_t. Call these box(A) and
  box(B).

- Each box has volume V = (2/epsilon)^n. Since their centers differ by a
  vector of L_1 length at most 1, the intersection I of the two boxes has
  volume at least (2/epsilon)^n - (2/epsilon)^{n-1} = V(1 - epsilon/2).

- This means that A' can't be much worse than B'. In particular, say alpha
  is your expected loss if you pick x by feeding M a random point c in I. We
  can view B' as, with probability 1 - epsilon/2, doing exactly this and
  getting expected loss alpha, and, with probability epsilon/2, choosing a
  random c in box(B) - I and at best getting a loss of 0. We can view A' as,
  with probability 1 - epsilon/2, getting expected loss alpha, and, with
  probability epsilon/2, getting at worst a loss of D. [Actually, we never
  said the losses had to be non-negative. The point is that the L_1 diameter
  of S is D and each c_t has L_1 length at most 1, so the biggest possible
  gap between the losses of two points of S on any given c_t is D.] So the
  difference between the two is at most D*epsilon/2 per time step. This
  happens for T time steps, which accounts for the D*epsilon*T/2 part of the
  bound.

Now, to finish, we need to compare B and B'. This will give us the D/epsilon
part, and it turns out to be pretty easy:

- View the random offset as an initial day 0. By the argument from Step 1
  (run with the fake day included), we know that:

     c_0.x_0 + c_1.x_1 + ... + c_T.x_T <= (c_0 + c_1 + ... + c_T).x_T,

  where now we define x_t = argmin_{x in S} (c_0 + c_1 + ... + c_t).x.

- But the left-hand side equals cost(B') + c_0.x_0 (B' doesn't actually have
  to pay for day 0). The right-hand side is at most OPT + c_0.x_opt, where
  x_opt is what OPT picks if you don't count the fake day 0. Putting these
  together, we get:

     cost(B') <= OPT + c_0.(x_opt - x_0).

  Now, each entry of c_0 has magnitude at most 1/epsilon, and x_opt - x_0
  has L_1 length at most D, so the last term is at most D/epsilon. Done!

Extensions
==========

- Can also get multiplicative bounds if you assume costs are non-negative
  and use a different distribution for c_0.

- If you make L_2 assumptions on c and D, then it's more natural to pick the
  random vector from a ball instead of a cube. It all depends on the
  application.

- If you have movement costs (as in the BST example), you can implement the
  algorithm in a lazy way and only run M every once in a while.
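Finally, as a toy illustration of the two-routes example used to motivate
the hallucinated day 0, here is a small experiment (an illustrative setup,
not from the lecture): plain follow-the-leader pays on essentially every
day, while the perturbed version stays close to the best fixed route.

```python
import math
import random

def two_route_experiment(T=1000, seed=0):
    """Two routes; route 1 is bad on even days, route 2 on odd days.
    The best fixed route pays about T/2.  Plain follow-the-leader
    flip-flops and pays every day; the perturbed leader does not."""
    random.seed(seed)
    costs = [[1.0, 0.0] if t % 2 == 0 else [0.0, 1.0] for t in range(T)]
    # The oracle M for this two-point S: pick the route with smaller total cost.
    oracle = lambda c: [1.0, 0.0] if c[0] <= c[1] else [0.0, 1.0]
    pay = lambda c, x: c[0] * x[0] + c[1] * x[1]
    epsilon = math.sqrt(2.0 / T)

    ftl = fpl = 0.0
    ftl_total = [0.0, 0.0]
    # perturbed leader: hallucinated day 0 drawn from [-1/epsilon, 1/epsilon]^2
    fpl_total = [random.uniform(-1 / epsilon, 1 / epsilon) for _ in range(2)]
    for c in costs:
        ftl += pay(c, oracle(ftl_total))
        fpl += pay(c, oracle(fpl_total))
        ftl_total = [a + b for a, b in zip(ftl_total, c)]
        fpl_total = [a + b for a, b in zip(fpl_total, c)]

    opt = min(sum(c[0] for c in costs), sum(c[1] for c in costs))
    print("best fixed route:", opt, " follow-the-leader:", ftl,
          " perturbed leader:", fpl)

two_route_experiment()
```

Running this, one should see follow-the-leader paying on essentially every
day, while the perturbed leader's cost stays within roughly D*sqrt(2T) of
the best fixed route, as the analysis above predicts.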