Notes for Feb 16, 1998 * Boosting. ============================================================================== Def: Let's say that alg A WEAK-LEARNS class C if for all c in C, all distributions D, THERE EXISTS epsilon, delta > 1/poly(n,s) s.t. A achieves error 1/2 - epsilon with prob at least delta. I.e., with some probability you get noticeable correlation. Question: supposed we defined PAC model this way, does this change the notion of what is learnable and what is not? View as algorithm: have a black box (neural net) that can get some fraction correct, and you want to use it to either find a disrib that's inherently hard for that algorithm, or else to boost up your accuracy. Dist specific: can't do it. Imagine uniform dist. concept = x1 ^ (hard). Can get 75% right, but hard to do better. --> Use the fact that it must get weak learning over ALL distributions. An easy case ------------ Suppose our weak-learning algorithm only makes one-sided error. For instance, it gets all of the negatives right and at least 1/poly of the positives right. (E.g., when learning a disjunction, greedy alg picks a variable with this property). Then, boosting is easy: We first find h_1 that gets at least 1/poly of the positives right over distribution D. We then define D_2 to be D restricted to examples x such that h_1(x) = 0 (i.e., throw out the positives we got right) and learn over that, etc. Our final hypothesis is the OR of all the h_i(x)'s. For boosting hypotheses that make 2-sided error, though, it's a little more tricky..... Boosting: [Shapire '90] (a more detailed description of this is in the book) ----------------------- Suppose we have an algorithm A that over any distribution will get at least 70% accuracy with high probability. Say we want to boost it up a bit (which we can then apply recursively). We start by running A on initial distribution D_1 = D and getting hypothesis h_1. Say its accuracy is 70%. Now we want a new distribution. One try: filter D to only look at examples where h_1 predicts incorrectly and try to learn over that. This DOESN'T work. Why? Instead: we filter to create a distribution where h_1 behaves like random guessing. Specifically, let S_C be the set of examples on which h_1 predicts correctly and S_I be the set on which h_1 predicts incorrectly. What we want is a distribution D_2 where S_C and S_I both have equal weight, but besides that is just like D_1 in that Pr_{D_2}(x | x in S_C) = Pr_{D_1}(x | x in S_C) and the same for S_I. For example, we can construct D_2 by flipping a coin and if heads we wait until we see an example from D_1 where h_1 is correct, and if tails we wait until we see one where h_1 is incorrect. Or we can draw examples from D_1 and place them into two queues, one for S_C and one for S_I, and then to get an example from D_2 we randomly choose a queue to select from. Let's write D_2[x] as a function of D_1[x] (this is the probability associated with x, or the density function if we're in a continuous space). To create D_2 we reduced the probability on S_C from 70% to 50%, so if x in S_C then D_2[x] = (5/7)D_1[x]. We increased the probability on S_I from 30% to 50% so if x in S_I then D_2[x] = (5/3)D_1[x]. Now we get a hypothesis h_2 with accuracy 70% on D_2. Finally, we filter D only allowing examples x where h_1(x) != h_2(x). We then get hypothesis h_3 with accuracy 70% on this distribution D_3. Resulting hypothesis: if h_1 and h_2 agree, then go with that, else go with h_3. I.e., majority vote. Analysis -------- What's the accuracy of the new hypothesis? Easiest way is to draw a picture. Divide space into 4 regions: R_1 in which h_1 = c and h_2 = c, R_2 in which h_1 = c and h_2 != c, R_3 in which h_1 != c and h_2 != c, and R_4 in which h_1 != c and h_2 = c. We predict correctly on R_1, incorrectly on R_3 and get 70% correct on R_2 U R_4. Let's say that D_2[R_2] = gamma. By the fact that h_2 has 30% error on D_2 we know that D_2[R_3] = 0.3 - gamma, and by definition of D_2 we know that D_2[R_1 U R_2] = D[R_3 U R_4] = 1/2, so D[R_1] = 7/5 * (1/2 - gamma) and D[R_3] = 3/5 * (3/10 - gamma). We can now work out our failure probability as: Pr_D[fail] = D[R_3] + (3/10)(1 - D[R_1] - D[R_3]) = (7/10)D[R_3] + (3/10)(1 - D[R_1]) = (7/10)[3/5 * (3/10 - gamma)] + (3/10)[3/10 + (7/5)gamma] = (3/10)^2[3 - 3/5] which is approx 0.216 More generally, if we work this through, we get: error <- error^2(3 - 2*error) This is always a decrease assuming that our original error was strictly between 0 and 1/2. ============================================================================ Better algorithm called "AdaBoost" by Freund and Schapire. Idea: use Randomized-WM algorithm, but in a strange way. Setup: first draw large sample that will be sufficient by Occam arguments. Will then have distributions over this fixed sample. E.g., we could have done it this way above, by just re-weighting the points. - Use Weighted-majority, but in peculiar way. Experts will correspond to examples. Weighting of experts will be the current distribution. "trials" will correspond to the weak hypotheses. Specifically, the algorithm is: 0. Draw n examples. 1. Start with weight w_i = 1 on each example x_i. The associated prob dist D has P(x_i) = w_i/W, where W = sum(w_i). 2. Feed distribution to weak algorithm, get back hypothesis h_t. Now update: Multiply w_i by beta_t if h(x_i) = c(x_i). beta_t = (1/2 - gamma_t)/(1/2 + gamma_t), where 1/2-gamma_t is the error of the hypothesis h_t over distrib D_t. 3. Final hypothesis: say "positive" if sum_t[ alpha_t * h_t(x) ] > 1/2 where alpha_t = ln(1/beta_t) / sum_s[ ln(1/beta_s) ] Claim: training error(h_final) < e^{-2*sum_t[ (gamma_t)^2 ]} (So, if each time we have error at most 1/2 - gamma, and our goal is to achieve error epsilon, then we need to run this O((1/gamma)^2 * log(1/epsilon)) times.) E.g., say all have error < 25%, so gamma = 0.25. Say we want error 1%. Solve: T = 8*ln(100), which is about 37. Need to multiply number of samples from VC bounds by T to have training error be guaranteed to be close to test error (whp). Let's analyze a simpler version: use fixed beta = 1/2 - 2*gamma. Final hypothesis is just majority vote over the h_t's. Analysis -------- Let's recall our analysis of RWM. We did this by looking at the total weight W_final at the end of all the trials. If F_t is the fraction of weight that got multiplied by beta on trial t, then an upper bound on W_final is W_final <= n*(1-(1-beta)F_1)*(1-(1-beta)F_2)*... so, taking logs and using ln(1-x) > -x, we got: ln(W_final) <= ln(n) - (1-beta) * sum[ F_t ] Also, if the best expert made m mistakes, then we got a lower bound: ln(W_final) >= ln(beta^m). These two inequalities were useful because sum[F_t] was equal to our expected number of mistakes, so we could then solve for that. In our current setting, F_t is the fraction of the current distribution that the current hypothesis got *correct*. So, F_t > 1/2 + gamma, so ln(W_final) <= ln(n) - (1-beta)*T*(1/2 + gamma) Also, we can lower bound W_final by noticing that any example on which our final hypothesis makes a mistake must have weight at least beta^{T/2}, since if our majority vote is wrong then we could have multiplied it by beta *at most* T/2 times. So, if in the end our error is E (i.e., we make E*n mistakes), then W_final >= beta^{T/2} * E * n or ln(W_final) >= T/2 * ln(beta) + ln(E) + ln(n) Using the upper and lower bounds on ln(W_final) together, we can cross off the ln(n) term and we get: T/2 * ln(beta) + ln(E) <= -(1-beta)*T*(1/2 + gamma) or ln(E) <= -T*(ln(beta)/2 + (1-beta)(1/2 + gamma) now, using beta = 1-2*gamma, we get: ln(E) <= -T*(ln(1-2*gamma)/2 + gamma + 2*gamma^2) now using ln(1-x) is approx -x - x^2/2 (this is cheating a bit...) we get ln(E) <= -T*gamma^2 This is the form of inequality we wanted.