Notes for 15-681 Machine Learning, 10/24/96

Today's topic: Boosting
   - theory
   - an application

=============================================================================

THE IDEA:

In the PAC model, we proved PAC guarantees for learning simple things:
conjunctions, decision lists, rectangles, intervals, linear separators
(we didn't actually prove this last one...).

In practice, we often want more expressive hypotheses, even though we can
no longer prove PAC-style convergence guarantees:
   - decision trees (ID3)
   - neural nets (Backprop)

(It's not just that we can't prove it -- in fact, the algorithms don't
guarantee PAC convergence. E.g., can you think of a target concept
representable as a small decision tree, and a distribution on examples,
where ID3 fails, in the sense that it needs exponentially many examples
to learn?)

Today: an idea to bridge the gap: BOOSTING.

                               +----------+
   Distrib D on examples ----> |  alg A   | ---> hypothesis
                               +----------+
            |                       |
            |                       V
            |                  +----------+
            +----------------> | Boosting |
                               +----------+
                                    |
                   +----------------+----------------+
                   |               OR                |
                   V                                 V
      Bad distrib D' for A              Hypothesis with 99%
      (A gets < 51% accuracy)           accuracy over D

In other words, given any algorithm at all, we can use it either to
achieve a PAC-style guarantee for our problem, or else to get a witness
that the algorithm doesn't PAC-learn.

Note: We will need to run A several times on independent sets of examples,
so in practice boosting is only useful when we have lots of data.

Corollary: Suppose there exists an algorithm A that PAC-learns concept
class C, but only for epsilon = 1/3 (i.e., for all c in C, all
distributions D, and all delta > 0, A achieves error < 1/3 in polynomial
time and samples). THEN C is PAC-learnable. Proof: *ask*

In other words, the PAC model is robust: if we fix a specific
epsilon < 1/2 instead of saying "for all epsilon", this doesn't change
the set of learnable things.

==========================================================================

How can we do this? Let's do an easy case first.

Suppose algorithm A has only one-sided error: given a set of data, it
tries to find a hypothesis that gets all of the positives right and at
least 10% of the negatives right too. E.g., suppose the target concept is
a circle, with positives inside and negatives outside, and A looks for a
line that has all the positives on one side and at least 10% of the
negatives on the other.

How to boost? Run A on the data. It finds a line L_1. Throw out the
negatives "outside" L_1 and repeat on the data remaining. If A continues
to get at least 10% of the remaining negatives right, then after S lines
the intersection of our positive regions gets all the positive points
right and gets at most a (0.9)^S fraction of the negative points wrong.
So, to get error epsilon, we only need to run A O(log 1/epsilon) times.
(If A ever fails, then we've found a dataset that's hard for A.)

Are we overfitting? Not too badly, since the number of degrees of freedom
(and the VC dimension) is proportional to S, which is only logarithmic in
1/epsilon. So the amount of data needed goes up, but not by too much.
A sketch of this loop appears below.
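Here is a minimal Python sketch of that loop. It is not code from the
lecture: "weak_learner" is a hypothetical stand-in for algorithm A (it
must return a hypothesis h that is True on every positive and False on
at least 10% of the remaining negatives, or None if it can't).

    import math

    def boost_one_sided(weak_learner, positives, negatives, epsilon):
        """Return hypotheses whose AND errs on < epsilon of the negatives."""
        hypotheses = []
        # Need (0.9)^S <= epsilon, i.e. S >= log(1/epsilon) / log(1/0.9).
        rounds = math.ceil(math.log(1 / epsilon) / math.log(1 / 0.9))
        for _ in range(rounds):
            h = weak_learner(positives, negatives)
            if h is None:
                raise RuntimeError("found a dataset that is hard for A")
            hypotheses.append(h)
            # Keep only the negatives h still gets wrong (h says True on
            # them); A just got rid of at least 10% of the rest.
            negatives = [x for x in negatives if h(x)]
            if not negatives:
                break
        return hypotheses

    def predict(hypotheses, x):
        # Final hypothesis: the intersection (AND) of all the h_i.
        return all(h(x) for h in hypotheses)

E.g., for epsilon = 0.01 this makes at most 44 calls to A, since
(0.9)^44 < 0.01.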
==========================================================================

General case:

Suppose algorithm A claims to get at least 70% accuracy over every
probability distribution. The goal is to boost this up a bit (and we can
then apply the construction recursively).

We start by running A on the initial distribution D and getting a
hypothesis h_1. Say its accuracy is exactly 70%.

***Now we want a new distribution that forces A to give us more
information.*** So, filter D to create a distribution on which h_1
behaves like random guessing. Specifically, let CORRECT be the set of
examples on which h_1 predicts correctly and INCORRECT be the set on
which h_1 predicts incorrectly. What we want is a distribution D_2 in
which CORRECT and INCORRECT have equal weight, but which otherwise looks
just like D.

Notation: D[x] is the probability of x under distribution D (or the
density of x if we're in a continuous setting). D[S], for a set of
examples S, is the sum (or integral) of D over x in S.

Right now, D[CORRECT] = 0.7 and D[INCORRECT] = 0.3. So we want D_2
defined like this:

    D_2[x] = 5/7 * D[x], for x in CORRECT
    D_2[x] = 5/3 * D[x], for x in INCORRECT

But all we have is access to D. How can we use D to create D_2? One way:
for each example desired, flip a coin. If heads, wait until we see an
example from D on which h_1 is correct; if tails, wait until we see one
on which h_1 is incorrect. Another way: draw examples from D and place
them into two queues, one for CORRECT and one for INCORRECT; to get an
example from D_2, randomly choose a queue to select from.

Now we run A to get a hypothesis h_2 with accuracy 70% on D_2. Finally,
we filter D, allowing through only examples x where h_1(x) != h_2(x), and
run A to get a hypothesis h_3 with accuracy 70% on this distribution D_3.

RESULTING HYPOTHESIS: if h_1 and h_2 agree, then go with that, else go
with h_3. I.e., majority vote. (A sketch of the whole construction
appears at the end of this section.)

========================================================================

Analysis: What's the accuracy of the new hypothesis? The easiest way is
to draw a picture. Divide the space into 4 regions:

    R_1: h_1 is correct and h_2 is correct
    R_2: h_1 is correct and h_2 is incorrect
    R_3: h_1 is incorrect and h_2 is correct
    R_4: h_1 is incorrect and h_2 is incorrect

We predict correctly on R_1, incorrectly on R_4, and get 70% correct on
R_2 and R_3. So,

    accuracy = D[R_1] + 0.7*D[R_2] + 0.7*D[R_3]

We know that:

    (1) D[R_1] + D[R_2] = 0.7
    (2) D_2[R_1] + D_2[R_3] = 0.7

Combining (2) with our definition of D_2 gives us:

    (3) 5/7 * D[R_1] + 5/3 * D[R_3] = 0.7

Now we just rewrite our equation for accuracy like this:

    accuracy = 3/10*D[R_1] + 7/10*D[R_3] + 7/10*D[R_1] + 7/10*D[R_2]

The first two terms are in the same proportions as eqn (3) (multiply (3)
through by 21/50), and the last two terms let us use eqn (1). So we get

    accuracy = 3/5 * (7/10)^2 + (7/10)^2 = 0.784

In general, if you work it out, you get:

    new accuracy = 3*(old accuracy)^2 - 2*(old accuracy)^3

(think of 3/5 as 2*(1 - old accuracy)). This is always an improvement,
assuming our original accuracy was strictly between 1/2 and 1.

Note: in getting D_2, we might have tried: filter D to look only at
examples where h_1 predicts incorrectly, and try to learn over that. This
DOESN'T work. Why? (Hint: A could just return the complement of h_1,
which is 100% accurate on that filtered distribution but tells us nothing
new.)

Details: If our hypotheses h_i did better than 70% accuracy, then it's
not hard to see that this method does even better. As the accuracies get
close to 1, it gets harder to simulate the distribution D_2, but this
extra time is linear in 1/epsilon, so it's OK.
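Here is a minimal Python sketch of the whole construction. Again, it is
not the lecture's code: "draw" is a hypothetical stand-in for example
access to D (returning one labeled pair (x, y)), and "weak_learn" stands
in for algorithm A (taking a list of labeled examples and returning a
hypothesis).

    import random

    def draw_from_D2(draw, h1):
        """Coin-flip filter: give CORRECT and INCORRECT equal weight."""
        want_correct = random.random() < 0.5
        while True:
            x, y = draw()                     # one labeled example from D
            if (h1(x) == y) == want_correct:  # on the side the coin chose?
                return x, y

    def draw_from_D3(draw, h1, h2):
        """D_3: condition D on the event h_1(x) != h_2(x)."""
        while True:
            x, y = draw()
            if h1(x) != h2(x):
                return x, y

    def boost_once(draw, weak_learn, m):
        h1 = weak_learn([draw() for _ in range(m)])
        h2 = weak_learn([draw_from_D2(draw, h1) for _ in range(m)])
        h3 = weak_learn([draw_from_D3(draw, h1, h2) for _ in range(m)])
        # Majority vote: if h_1 and h_2 agree, go with them; else h_3.
        def h(x):
            return h1(x) if h1(x) == h2(x) else h3(x)
        return h

    # Numerical check of the recurrence p -> 3p^2 - 2p^3:
    for p in (0.55, 0.7, 0.9):
        print(p, "->", 3 * p**2 - 2 * p**3)
    # 0.55 -> 0.57475, 0.7 -> 0.784, 0.9 -> 0.972

Note that draw_from_D2 loops longer as h_1's accuracy approaches 1 --
this is the simulation cost mentioned in the Details paragraph above.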
==========================================================================

Application to recognizing handwritten digits: see slides.

==========================================================================

Refs:

Robert Schapire, "The strength of weak learnability", Machine Learning,
5(2):197-227, 1990.

Yoav Freund, "Boosting a weak learning algorithm by majority",
Information and Computation, 121(2):256-285, 1995.

Yoav Freund and Robert Schapire, "Experiments with a new boosting
algorithm", Proceedings of the Thirteenth International Conference on
Machine Learning (ICML '96), 1996.