Notes for Jan 21, 1998
----------------------------------------------------------------------

Today: talk about the problem of "predicting from expert advice" and a
nice algorithm called the weighted majority algorithm.

Setting: each day we get a prediction from some number of experts about
whether it is going to rain.  Then, we use these predictions to make
our own.  Our goal: do nearly as well as the best of the experts in
hindsight.

Can view this as basically a fancy way of saying the following.  Let's
say we extend the mistake-bound model to allow noise.  Originally, we
said our goal was to make at most some bounded number of mistakes,
assuming that the data is consistent with some concept in C.  Could
extend this to say: "our goal is to make at most some poly(n) number of
mistakes, plus a small constant times the number of mistakes made by
the best concept in C".  Then, the experts setting (and in particular,
the goal of doing nearly as well as the best expert in hindsight) is
really the case of C = {single variable concepts}.

Weighted-majority Algorithm (2 versions)
----------------------------------------

Version 1: initialize all experts to weight 1.  Given an example, we
poll all the experts and predict based on a weighted majority vote of
their predictions.  We then cut in half the weights of all experts that
made a mistake.  (No rewards.)

Claim: this makes at most some small constant times the number of
mistakes of the best expert, plus an additive amount that is
logarithmic in the number of experts.

Analysis:  Let n = number of experts
               M = # mistakes we've made so far
               m = # mistakes the best expert has made so far
               W = total weight

Each time we make a mistake, at least half of the total weight was on
the wrong answer and gets cut in half, so the total weight drops to at
most 3/4 of what it was: W <-- W*(3/4).  So, W <= n*(3/4)^M.  The
weight of the best expert is (1/2)^m.  Now, notice that the total
weight cannot be smaller than the weight of the best expert.  Solve:

    n*(3/4)^M >= (1/2)^m,   or,   (4/3)^M <= n*2^m.

    M <= 2.4(log_2 n + m).

Comparison with Halving Alg
---------------------------

How does this compare with the halving algorithm from last time?  Can
think of each concept in a class C as an expert.  In the no-noise MB
setting (there exists *some* correct concept), the halving alg made
log(|C|) mistakes.  That's gone up by a factor of about 2.4.  But now
we're robust to the case where the optimal concept is occasionally
faulty: we make at most 2.4 mistakes for every one made by the best
concept in hindsight.

Randomized WM
-------------

If the best concept (or best expert) has, say, a 40% error rate, then
this guarantee is not so good.  We'd like to improve the dependence
on m.

Version 2: instead of predicting based on a majority vote, let's use
the weights as probabilities (Randomized Weighted Majority).  E.g.,
w_0 = weight on outcome 0, w_1 = weight on outcome 1, etc., and
W = their sum.  Then, we'll predict outcome i with probability w_i/W.

Goal: bound our worst-case expected number of mistakes, and let's
assume that the adversary (the world) has to select one of the answers
as correct before we make our coin toss.

Why is this better in the worst case?  Idea: the worst case for the
deterministic algorithm was when the weights split 50/50.  But now
that's not so bad, since we also have a 50/50 chance of getting it
right.

Also, to trade off between the dependence on m and on log(n), we'll
generalize to multiply the weights of mistaken experts by beta < 1,
instead of necessarily 1/2.
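Here is a small sketch of Version 2 in Python, just to make the
procedure concrete (the function name and interface are illustrative
assumptions, not part of the lecture):

import random

def randomized_weighted_majority(expert_predictions, correct_answers,
                                 beta=0.5):
    """Sketch of randomized weighted majority (Version 2 above).

    expert_predictions: list of rounds; each round is a list giving
        each expert's prediction (0 or 1).
    correct_answers: the true outcome (0 or 1) for each round.
    beta: penalty factor in (0,1) applied to mistaken experts.
    Returns the number of mistakes we made (a random quantity).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n                  # all experts start at weight 1
    mistakes = 0
    for preds, answer in zip(expert_predictions, correct_answers):
        # Predict outcome 1 with probability (weight on 1)/W.
        w1 = sum(w for w, p in zip(weights, preds) if p == 1)
        W = sum(weights)
        our_prediction = 1 if random.random() < w1 / W else 0
        if our_prediction != answer:
            mistakes += 1
        # Multiply the weight of each mistaken expert by beta (no rewards).
        weights = [w * beta if p != answer else w
                   for w, p in zip(weights, preds)]
    return mistakes

(With beta = 1/2 this is exactly the version described above; moving
beta closer to 1 trades a better dependence on m for a worse dependence
on ln(n), as the analysis below shows.)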
Analysis: Define F_i to be the fraction of the weight on the WRONG
answers at the ith example.  So, F_i is the probability we make a
mistake on the ith example.  Let M be a random variable denoting our
total number of mistakes.  So E[M] = sum(F_i), using the fact that
expectation is additive.

On the ith example, W <-- W(1 - (1-beta)F_i).  Reason: the F_i fraction
of the weight (the part on the wrong answer) gets multiplied by beta,
and the rest is unchanged.  So, the final weight is

    W = n(1 - (1-beta)F_1)(1 - (1-beta)F_2)...

Let's say that m is the number of mistakes of the best expert so far.
Since the total weight is at least the weight of the best expert, we
can use the inequality W >= beta^m.  Now we solve.  First take the
natural log of both sides.  We get:

    m ln(beta) <= ln(n) + sum[ln(1 - (1-beta)F_i)]

Simplify: ln(1-x) = -x - x^2/2 - x^3/3 - ..., so
ln(1 - (1-beta)F_i) <= -(1-beta)F_i.  This gives:

    m ln(beta) <= ln(n) - (1-beta)*sum(F_i)

Now, use E[M] = sum(F_i).  The result is:

            m ln(1/beta) + ln(n)
    E[M] <= --------------------
                  1 - beta

Same rough form as before.  Have we made progress?  Yes: e.g.,

    beta = 1/2  gives  1.39m + 2 ln(n)
    beta = 3/4  gives  1.15m + 4 ln(n)

Roughly, this is of the form (1+epsilon)*m + (1/epsilon)*ln(n).

------------------------------------------------------------------------

Uses of Randomized WM
---------------------

One nice feature of RWM is that it can be applied in settings where the
predictions are not things that can be combined (or can't be combined
easily).  Example: each expert is telling you a different way to drive
to work.  You pick one using RWM.  Later you find out how well you
would have done (as you drive you listen to the traffic report) and
penalize appropriately.

To do this right, we want to generalize from just a "loss" of 0 or 1 to
losses in [0,1].  Goal: have our expected loss be not too much worse
than the loss of the best expert.

How to generalize?  Answer: penalize by beta^loss.  I.e., having two
examples of loss 1/2 gives the same weight as one example of loss 1 and
one example of loss 0.

The analysis still goes through: basically, we just need to replace F_i
by the dot product p_i*l_i, where p_i is our probability distribution
at time i and l_i is the loss vector at time i.  (So, if losses are in
{0,1} this is the same thing as F_i.)  E.g., our expected loss at time
i is p_i*l_i, so our total expected loss is sum(p_i*l_i).

The weight-change part is a little trickier:

    New weight of expert j is  (old weight) * beta^(its loss).
    This is at most  (1 - (1-beta)*(its loss)) * (old weight).

So, this means we can still use the formula

    (new W) <= (old W)*(1 - (1-beta) p_i*l_i)

(A code sketch of this general-loss version appears at the end of these
notes.)

Neat use of RWM:
----------------

--> Use in setting of 2-player games.
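To make the [0,1]-loss generalization above concrete, here is a minimal
Python sketch (the function name, argument names, and return values are
illustrative assumptions, not anything fixed by the notes):

import random

def rwm_general_losses(loss_vectors, beta=0.5, seed=None):
    """Sketch of randomized weighted majority with losses in [0,1].

    loss_vectors: list of rounds; round i is a list l_i giving the loss
        in [0,1] of each expert (e.g., each way to drive to work).
    beta: weight multiplier in (0,1).
    Returns (our total loss, total loss of the best expert in hindsight).
    """
    rng = random.Random(seed)
    n = len(loss_vectors[0])
    weights = [1.0] * n                  # all experts start at weight 1
    our_loss = 0.0
    expert_loss = [0.0] * n
    for losses in loss_vectors:
        # Follow expert j with probability w_j / W; in expectation our
        # loss this round is the dot product p_i*l_i.
        j = rng.choices(range(n), weights=weights, k=1)[0]
        our_loss += losses[j]
        # Penalize every expert by beta^(its loss): loss 0 leaves the
        # weight unchanged, loss 1 multiplies it by beta.
        for k in range(n):
            expert_loss[k] += losses[k]
            weights[k] *= beta ** losses[k]
    return our_loss, min(expert_loss)

By the bound derived above, the expectation of our total loss is at
most (m ln(1/beta) + ln(n)) / (1 - beta), where m is now the total loss
of the best expert in hindsight.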