15-859(B) Machine Learning Theory 04/02/14 Generalizations of "combining expert advice" results * Online LP and the Kalai-Vempala algorithm ========================================================================= Recap from earlier ================== "Combining expert advice": Have N prediction rules and want to perform nearly as well as best of them. Or, in game-theoretic setting, have N strategies and want to do nearly as well of best of them in repeated play. Randomized weighted majority algorithm: - Give each a weight (starting at 1). - Choose rule at random with probability proportional to its weight. - When you are told the correct label, penalize rules that made a mistake by multiplying weight by 1-epsilon. (If costs in [0,1] then multiply by (1-epsilon)^cost or (1-epsilon*cost) (both work). Get: E[Alg cost] <= (1+epsilon)*OPT + (1/epsilon)*log(N). Today: Notice that RWM bounds are good even if "N" is exponential in the natual parameters of the problem. But the running time in that case is bad because we have to explicitly maintain a probability distribution over experts. Are there interesting settings where we can get these kinds of bounds *efficiently*? Today we will discuss an algorithm that can do this in many such cases. [Do motivating example first] Kalai-Vempala Algorithm ======================== The Kalai-Vempala algorithm applies to the following setting. - We have a set S of feasible points in R^m. Each time step t, we (probabilistically) pick x_t in S. We are then given a cost vector c_t and we pay the dot-product c_t . x_t. - The set S may have exponentially-many points, so we don't have time even to list its elements. However, we have a magic offline algorithm M that, given any cost vector c, finds the x in S that minimizes c.x. Want to use to produce an *online* algorithm so that our overall expected cost is not too much worse than for best x in S in hindsight. Specifically, we want for any sequence c_1,...,c_T to get: E[Alg's cost] <= min_{x in S} (c_1 + ... + c_T).x + [regret term] - To have any hope, we will need to assume that S is bounded and so are the cost vectors c_t. What are some things we can model this way? =========================================== - Consider the following adaptive route-choosing problem. Imagine each day you have to drive between two points s and t. You have a map (a graph G) but you don't know what traffic will be like (what the cost of each edge will be that day). Say the costs are only revealed to you after you get to t (on the evening news). We can model this by having one dimension per edge, and each path is represented by the indicator vector listing the edges in the path. Then the cost vector c is just the vector with the costs of each edge that day. Notice that you *could* represent this as an experts problem also, with one expert per path, but the number of s-t paths can easily be exponential in the number of edges in the graph (e.g., think of the case of a grid). However, given any set of edge lengths, we can efficiently compute the best path for that cost vector, since that is just a shortest-path problem. You don't have to explicitly list all possible paths. - Imagine a search engine that given a query q outputs a list of web pages. User clicks on one, and we define cost as the depth in the list of the web-page clicked. The next time q is queried you give a different ordering. Want to do nearly as well as best fixed ordering in hindsight. We could model this by having S be the set of permutation vectors (describing the depth of each element), and c_t will be a coordinate vector (indicating the item the user wanted). - We can also model the standard experts problem this way by having each expert be its own coordinate, and c_t is the vector of costs at time t. (But bounds as a fn of n might be worse). - More generally, this is like online linear programming. The KV Algorithm ================ One natural thing to try is just run M to pick the best x in hindsight, argmin_{x in S} (c_1 + ... + c_{t-1}).x, and use that on day t. But this doesn't work. E.g., imagine there are two routes to work, one fast and one slow, but they alternate which is fast and which is slow from day to day. This procedure will always choose the wrong one. Instead, we will "hallucinate" a day 0 in which we have random costs, from an appropriate distribution. Then we will run M to pick the x that minimizes (c_0 + c_1 + ... + c_{t-1}).x. To analyze this, we will argue in stages. Step 1: First, we will show that picking argmin_{x in S} (c_1 + ... + c_t).x *does* work. I.e., if you find the best in hindsight tomorrow and use that today. This is maybe not surprising but also not obvious. Note this is *not* the same as the x that minimizes c_t . x, which obviously would be super-optimal. Now, let algorithm A be the one that minimizes (c_1 + ... + c_{t-1}).x. Let B be the one that minimizes (c_1 + ... + c_t).x. They both go to the bar and start drinking. As they drink, their objective functions get fuzzier and fuzzier. We end up getting: - A' that minimizes (c_0A + c_1 + ... + c_{t-1}).x. - B' that minimizes (c_0B + c_1 + ... + c_t).x. where c_0A, c_0B are picked at random in [-1/epsilon,1/epsilon]^m. (Could be random in [0,2/epsilon]^m if we don't want negatives, e.g., in shortest path, but conceptually earier to think of centered box) (Could also use a ball instead of a cube and get somewhat different bounds. Cube is simpler anyway). As epsilon gets smaller, A and B start behaving more and more alike. Step 2 is to show that for an appropriate value of epsilon, B' is not too much worse than B, and A' is close to B'. This implies A' is good. Preliminaries ============= - We will assume that the maximum L_1 length (sum of absolute values) of any cost vector is 1. If D is the L_1 diameter of S, the bound we will show is: E[Alg's cost] < OPT + D(epsilon*T/2 + 1/epsilon) So, setting epsilon = sqrt(2/T), we get: E[Alg's cost] < OPT + D*sqrt(2T) If you convert to the experts setting, where c \in [0,1]^m, the result you will get is an "m" inside the square root. This is not as good as the standard experts bounds. They give a different distribution for hallucinating c_0 that fixes that problem. Step 1 ====== Define x_t = argmin_{x in S} (c_1 + ... + c_t) . x. We need to show that: c_1 . x_1 + ... + c_T . x_T <= (c_1 + ... + c_T) . x_T In other words, we do at least as well using algorithm B as the optimal fixed x in hindsight does. We can do this by induction: - By induction, we can assume: c_1.x_1 + ... + c_{T-1}.x_{T-1} <= (c_1+...+c_{T-1}).x_{T-1}. And we know, by definition of x_{T-1}, that: (c_1 + ... + c_{T-1}) . x_{T-1} <= (c_1 + ... + c_{T-1}) . x_T So, putting these together, and adding c_T . x_T to both sides, we get what we want. Step 2: ======= Let's start with the easier part, showing that A' and B' are close. - A' is picking a random objective function from a box of side-length 2/epsilon centered at c_1 + ... + c_{t-1}. B' is picking a random objective function from a box of side-length 2/epsilon centered at c_1 + ... c_t. Let's call the boxes box(A) and box(B). - These boxes each have volume V = (2/epsilon)^m. Since their centers differ by a vector c_t of L_1 length at most 1, the intersection I of the boxes has volume at least (2/epsilon)^m - |c_t|(2/epsilon)^{m-1} >= V(1 - epsilon/2). [[this is where the L_1 length of the cost vectors comes in]] - This means that A' can't be much worse than B'. In particular, say alpha is your expected loss if you were to pick x by feeding a random point c in I to our offline oracle M. We can view B' as with probability 1-epsilon/2 doing exactly this and getting expected loss alpha, and with probability epsilon/2 choosing a random c in box(B)-I, and at best getting a loss of 0. We can view A' as with probability 1-epsilon/2 getting an expected loss of alpha, and with probability epsilon/2 getting at worst a loss of D. [Actually, we never said the losses had to be non-negative. The point is that the diameter of S is at most D, so the biggest possible gap between the losses of 2 points in S on any given c is D. Actually, for this part of the argument, we only need L_infty length of c is at most 1, not L_1 length]. The difference between the two is D*epsilon/2. This happens for T time steps, so that counts for the DT*epsilon/2 part. Now, to finish, we need to argue about B versus B'. This will give us the D/epsilon part. This turns out to be pretty easy: - View the random offset as an initial day 0. By the argument from Step 1, we know that: c_0 . x_0 + ... + c_T . x_T <= (c_0 + ... + c_T) . x_T. where we define x_t as argmin[x . (c_0 + ... + c_T)] - But the left-hand-side = cost(B') + c_0 . x_0 (B' didn't actually have to pay for day 0). The RHS is at most OPT + c_0 . x_opt (where x_opt is what opt picks if you don't count the fake day 0). Putting this together, we get: cost(B') <= OPT + c_0(x_opt - x_0). Now, since each entry in c_0 has magnitude at most 1/epsilon, the RHS is at most OPT + D/epsilon due to the diameter of the space. Done! ==========================================================================