Topics in Machine Learning Theory                              09/24/14

Boosting:
 - A great practical algorithm
 - A great theoretical result about basic definitions in the PAC model
 - A surprising connection between topics in online and distributional
   learning
========================================================================

Let's start by recalling the PAC model.

Def 1: Alg A efficiently PAC-learns class C if for any c in C, any
distribution D, and any epsilon, delta > 0, A runs in poly time, uses
poly many samples, and with probability at least 1-delta produces a
poly-time-evaluatable hypothesis h with error at most epsilon.

Def 2: Let's say that alg A WEAK-LEARNS class C if for any c in C and
any distribution D, THERE EXIST gamma, tau > 1/poly(n) such that A
achieves error at most 1/2 - gamma with probability at least tau.
I.e., with some noticeable probability you get noticeable correlation.
[We use "gamma" to denote the "gap".]

Question: suppose we defined the PAC model this way; does this change
the notion of what is learnable and what is not?  Answer: it doesn't.
Given an algorithm that satisfies Def 2, we can "boost" it to an
algorithm that satisfies Def 1.  This was the weak-to-strong learning
result of Rob Schapire, later turned into the very practical algorithm
AdaBoost by Yoav Freund and Rob Schapire.

Note: we can handle tau -> delta easily: just repeat (1/tau)log(1/delta)
times, so that whp at least one run was successful, then draw fresh data
and use it to pick out the good hypothesis.  The big issue is
gamma -> epsilon.  In the discussion below we'll ignore delta and assume
that each time we run A we get a hypothesis of error <= 1/2 - gamma.

Boosting: discussion
--------------------
We're going to have to use the fact that algorithm A works (in this weak
way) over *every* distribution.  If you look at the
"distribution-specific" version of this question, the answer is NO.
E.g., say your target function is "x is positive if its first bit equals
1, or if x satisfies some hard crypto property (e.g., viewed as an
integer, x is a quadratic residue mod some large N)".  Then it's easy to
get error 25% over uniform random examples: predict positive if x_1 = 1
and whatever you want otherwise.  But you can't do any better.  The
problem is that there is a hard core to the target function.

Boosting shows this situation is in a sense universal.  If you have some
arbitrary learning algorithm that supposedly works well, we'll view it
as a black box and either boost up its accuracy as much as we like, or
else find a distribution D on which that algorithm gets error
> 1/2 - gamma.

As a practical matter, the algorithm we give for boosting can be thought
of as one for creating good "challenge distributions".  Basically, we
will propose a series of challenge distributions that A will either
succeed or fail on.  Each success produces an improvement in our overall
prediction rule; a failure means A wasn't really a weak learner over all
distributions.  So it's a great practical tool, especially if you have a
bunch of different learning procedures to try.

An easy case: algorithms that know when they don't know
-------------------------------------------------------
Just to get a feel for a simple case, imagine our weak-learning
algorithm produces a hypothesis that on any given example x either makes
a prediction or says "not sure".  Say it is always correct when it makes
a prediction, and furthermore it says "not sure" at most a 1 - epsilon'
fraction of the time (it's trivial to produce a hypothesis that always
says "not sure"!).  Here we can do boosting by creating a decision list.

We first run A on D to find h_1, and put it as the top rule in the DL.
We then run A on D subject to the constraint that h_1 says "not sure"
(i.e., when A asks for an example, we repeatedly draw from D until we
get one where h_1 says "not sure") and get h_2.  We then run A on D
conditioned on both h_1 and h_2 outputting "not sure" and get h_3.  Each
new rule chops off at least an epsilon' fraction of the remaining
probability mass, so we get down to error epsilon after only
O((1/eps')log(1/eps)) runs.  The basic idea: focus on data where the
previous hypotheses had trouble, forcing the next one to learn something
*new*.
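To make this concrete, here is a minimal sketch of the decision-list
booster in Python.  Everything in it is an illustrative assumption
rather than anything from the lecture: it works over a fixed sample of
labeled examples standing in for D, and "weak_learner" is a hypothetical
black box that, given labeled examples, returns a hypothesis h with
h(x) in {+1, -1, None}, where None means "not sure", predictions are
always correct, and h predicts on at least an eps' fraction of the
examples it was given.

def boost_abstaining_learner(weak_learner, sample, rounds):
    """Decision-list booster for the "knows when it doesn't know" case.

    weak_learner: hypothetical black box; takes a list of (x, y) pairs
      and returns h with h(x) in {+1, -1, None} (None = "not sure").
    sample: list of (x, y) pairs standing in for the distribution D.
    rounds: number of rules to build; O((1/eps') log(1/eps)) suffices.
    """
    rules = []
    remaining = list(sample)          # examples all previous rules abstain on
    for _ in range(rounds):
        if not remaining:             # nothing left to cover
            break
        h = weak_learner(remaining)   # "run A conditioned on not-sure so far"
        rules.append(h)
        remaining = [(x, y) for (x, y) in remaining if h(x) is None]

    def predict(x):
        for h in rules:               # first rule that makes a prediction wins
            pred = h(x)
            if pred is not None:
                return pred
        return 1                      # arbitrary default if every rule abstains
    return predict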
The general case
================
Now let's go to the general case of boosting weak learners.  I am going
to describe AdaBoost, which is the most popular boosting algorithm
(though there are now many variations).

It will be most convenient to draw a sample of data S and then do all
our work over distributions defined on S.  So, let's assume that A's
hypotheses come from a class C with C[m] = O(m^d).  Our final rule will
be from a more complicated class H, but with
H[m] = m^{O(d * (1/gamma^2) * log(1/epsilon))}.  So we will start by
drawing S sufficiently large to get uniform convergence over H, and then
we can just worry about our performance on S.

Before giving the algorithm, let's get a feel for what is going on.  The
first step is just to run A on the uniform distribution over S, getting
some hypothesis h_1 with error rate at most 1/2 - gamma on S.

Now we want a new distribution.  One idea would be to zero out the
examples that h_1 got right and use a challenge distribution that is
uniform over h_1's mistakes.  But that's a bad idea.  What might A do in
order to do well on this distribution but give us no new information?
(For instance, A could just return the complement of h_1, which is
perfect on that distribution but tells us nothing new.  In general, you
should think of A as being controlled by an *adversary* who is
restricted only by the requirement that its hypotheses have error at
most 1/2 - gamma.)

Instead, we are just going to down-weight the examples that h_1 got
correct, to make their total weight the same as the weight of the
examples h_1 got incorrect.  Now h_1 is just 50/50.  In fact, this is
exactly what we are going to do each round: down-weight the examples the
previous h_t got correct so that the previous h_t becomes 50/50.

Notice that this looks a *lot* like the Randomized Weighted Majority
algorithm, where the examples are the "experts", and h_t being correct
on x_i is like expert i making a mistake at time t.  Weird, huh?  In
fact, the analysis will be very similar.

Before doing the algorithm and analysis, here is *why* the algorithm and
analysis are so similar to RWM.  Consider the following 2-player
zero-sum game.  We have one row for each example in S and one column for
each h in C (at most C[m] columns).  If h(x) is correct then that entry
has value +1 for the column player, else value 0.  Our promise is that
for any distribution on rows, there exists a column (and moreover, A can
find it) with expected payoff at least 1/2 + gamma.  By the minimax
theorem, this means there must exist a *distribution* q on columns such
that for *every* row, the expected payoff to the column player is at
least 1/2 + gamma.  That means the majority vote over these h's,
weighted by their probability under q, gets *every* example right (in
fact by an L_1 margin, in the sense defined later, of at least gamma).
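As a small sanity check of this minimax claim, here is a sketch that
computes the column player's maximin mixture q for a toy 0/1 payoff
matrix by linear programming.  The toy matrix, the variable names, and
the use of scipy's LP solver are all just illustrative assumptions, not
anything from the lecture.  If the weak-learning promise holds for every
distribution on rows, the computed value will be at least 1/2 + gamma,
i.e., the q-weighted vote is right on every example.

import numpy as np
from scipy.optimize import linprog

# Toy payoff matrix: M[i, j] = 1 if hypothesis j is correct on example i.
# Here: 3 examples, 3 hypotheses, each hypothesis correct on 2 of the 3.
M = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1]], dtype=float)
m, k = M.shape

# Solve  max_q min_i (M q)_i  subject to  q >= 0, sum(q) = 1.
# LP variables x = (q_1, ..., q_k, v); maximize v, i.e. minimize -v.
c = np.zeros(k + 1); c[-1] = -1.0
A_ub = np.hstack([-M, np.ones((m, 1))])         # v - (M q)_i <= 0 for each row i
b_ub = np.zeros(m)
A_eq = np.zeros((1, k + 1)); A_eq[0, :k] = 1.0  # the q_j sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * k + [(None, None)]       # q >= 0, v unconstrained
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

q, value = res.x[:k], res.x[-1]
print("mixture over hypotheses q:", np.round(q, 3))        # uniform mixture here
print("worst-case q-weighted accuracy:", round(value, 3))  # 2/3 = 1/2 + 1/6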
Now, when we did RWM, we viewed it as an algorithm for finding an
approximately minimax-optimal strategy for the row player; but if you
play it against a best-response oracle (i.e., A), then the empirical
distribution over the columns played is an approximately minimax-optimal
strategy for the column player too.

OK, now let's do the algorithm and analysis!

========================================================================
AdaBoost
--------
0. Draw a large sample of size n that will be sufficient by Occam/VC
   arguments.

1. Start with weight w_i = 1 on each example x_i.  The associated
   probability distribution D_1 has P(x_i) = w_i/W, where W = sum_i w_i.

2. For t = 1 to T do:
     Feed D_t to the weak algorithm and get back a hypothesis h_t with
     error rate 1/2 - gamma_t on D_t.
     Let beta_t = (1/2 - gamma_t)/(1/2 + gamma_t).
     Now update the weights: multiply w_i by beta_t if h_t(x_i) is
     correct.  (Notice: h_t now has 50% error on the new distribution
     D_{t+1}.)

3. Final hypothesis: let alpha_t = ln(1/beta_t) and output a majority
   vote weighted by the alphas.  I.e., h(x) = sign(sum_t alpha_t *
   h_t(x)) if we view labels as +/-1.

(A short code sketch of this loop appears just before the margin
analysis below.)
==========

What could we hope for?  Suppose all gamma_t = gamma and consider a
super-easy case where all errors were *independent* (each h_t, on any
given example, gives the wrong answer independently with probability
1/2 - gamma).  What would be the error rate of the majority vote?
Hoeffding bounds say it would be at most e^{-2*T*gamma^2}.  We will get
exactly this.

Claim: training error(h) < e^{-2*sum_t (gamma_t)^2}.  So, if all
gamma_t = gamma, we get e^{-2*T*gamma^2}.  This actually means that as a
byproduct we are proving the Hoeffding bound for this setting (since
giving independent errors is one option for A).

ANALYSIS
========
To simplify things, let's assume all the gamma_t's are the same; call it
gamma.  Then beta is fixed, and the final hypothesis is just a straight
majority vote over the h_t's.  Let's now analyze as in Randomized
Weighted Majority, by putting upper and lower bounds on the final total
weight.

For the upper bound, suppose we currently have total weight W.  Since a
(1/2 + gamma) fraction of W gets multiplied by beta, after this step
that portion has weight only (1/2 - gamma)*W, so the new total weight is
2*(1/2 - gamma)*W = (1 - 2*gamma)*W.  Since the total weight starts at
n, this means that:

   W_final <= n*(1 - 2*gamma)^T.

Now the lower bound.  Notice that any example on which our majority vote
makes a mistake must have weight at least beta^{T/2}, since if the
majority vote is wrong then that example's weight was multiplied by beta
at most T/2 times.  So, if in the end our error rate is epsilon (we make
a mistake on epsilon*n points), then

   W_final >= beta^{T/2} * epsilon * n.

Combining the upper and lower bounds gives us:

   epsilon <= [4*(1/2 - gamma)^2 / beta]^{T/2}
            = [4*(1/2 - gamma)*(1/2 + gamma)]^{T/2}
            = [1 - 4*gamma^2]^{T/2}.

Now we use our favorite inequality, 1 - x <= e^{-x}:

   epsilon <= e^{-2*T*gamma^2}.   QED

=======================
What about true error versus empirical error?  What we really have at
this point is a bound on empirical error.  To get a bound on true error
we need our sample to be large enough that we get uniform convergence
not only over the hypothesis space C used by A, but also over "majority
votes of k = O((1/gamma^2)*log(1/epsilon)) hypotheses in C".  But notice
that if there are C[m] ways of splitting m points using functions in C,
then there are at most {C[m] choose k} ways of splitting those points
using majorities of size k.  So, the sample size bound just needs to
increase by a multiple of k.
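Here is the promised code sketch of the AdaBoost loop above, in Python.
The weak_learner(X, y, D) interface, the use of numpy, and the
assumption that hypotheses are returned as +/-1-valued vectorized
functions are all illustrative assumptions; the small floor on the error
is just a guard so that a perfect h_t (beta_t = 0) doesn't break the
arithmetic.

import numpy as np

def adaboost(weak_learner, X, y, T):
    """Minimal AdaBoost loop following the pseudocode above.

    weak_learner(X, y, D): hypothetical black box returning a function h
      with h(X) an array of +/-1 predictions and error at most
      1/2 - gamma_t under the weights D.
    X: array of examples, y: array of +/-1 labels, T: number of rounds.
    """
    n = len(y)
    w = np.ones(n)                          # step 1: weight 1 per example
    hyps, alphas = [], []
    for _ in range(T):
        D = w / w.sum()                     # induced distribution D_t
        h = weak_learner(X, y, D)
        correct = (h(X) == y)
        err = D[~correct].sum()             # = 1/2 - gamma_t on D_t
        err = max(err, 1e-12)               # guard: perfect h_t gives beta = 0
        beta = err / (1.0 - err)            # (1/2 - gamma_t) / (1/2 + gamma_t)
        w[correct] *= beta                  # down-weight what h_t got right
        hyps.append(h)
        alphas.append(np.log(1.0 / beta))   # alpha_t = ln(1/beta_t)

    def predict(Xnew):                      # weighted majority vote, labels +/-1
        votes = sum(a * h(Xnew) for a, h in zip(alphas, hyps))
        return np.sign(votes)
    return predict, hyps, alphas

Returning hyps and alphas alongside the voting rule makes it easy to
look at the margins, which is what the next section is about.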
Margin analysis
===============
Actually, one thing people noticed in practice is that if you keep
running AdaBoost past this limit (i.e., for more than the
O((1/gamma^2)*log(1/epsilon)) rounds needed to drive the training error
bound down to epsilon), you don't end up overfitting.  There is a nice
way of looking at this in terms of "L_1 margins".

Let's define g = alpha_1*h_1 + ... + alpha_T*h_T, so our final
hypothesis is h(x) = sign(g(x)), and let's rescale the alpha_t's so that
they sum to 1.  Now define the L_1 margin of g (or h) on an example x
with label y to be y*g(x).  (We call it the "L_1 margin" because we have
scaled g to have L_1 length equal to 1.)

If you look carefully at our analysis of AdaBoost, it ends up showing
not only that there can't be too many examples where the majority vote
is wrong (negative margin), but in fact that there can't be too many
where the majority vote is only barely right.  Indeed, if you keep
running the algorithm, the fraction of examples with margin less than
0.99*gamma goes to 0 (because those examples have too high a weight
compared to the upper bound of n*(1 - 2*gamma)^T on the total weight).

What does this mean?  One thing about having margin gamma is that even
if T is huge, you can whp produce the exact same prediction that h makes
by viewing g as a probability distribution over the h_t's, sampling from
it k = O((1/gamma^2)*log(1/epsilon)) times, and taking a majority vote
over the sample.  So, as long as we have enough data that these "small
majority votes" aren't overfitting, h can't be overfitting by much
either.

Let's do this more formally.  Let h' be the randomized prediction
function that, given a new example x, predicts the majority vote over k
random draws from the distribution g.  Then the true error rates must
satisfy err(h') >= err(h)/2: if h(x) is incorrect, then each random draw
from g has at least a 50% chance of giving the wrong answer on x, so h'
has at least a 50% chance of being incorrect on x.  But we argued that
the empirical error of h' is low (because if h has margin gamma on x,
then whp h'(x) = h(x)).  Furthermore, we have enough data that whp *all*
small majority-vote functions have empirical error close to their true
error, which implies the empirical error of h' is close to its true
error.  Therefore the true error of h must be low too.
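To tie the margin discussion back to code, here is a small follow-on to
the AdaBoost sketch earlier (same illustrative assumptions: hyps and
alphas as returned there, +/-1 labels, vectorized hypotheses).  It
computes the L_1 margins y*g(x) with the alphas rescaled to sum to 1,
and implements the randomized predictor h' that takes a majority vote
over k hypotheses sampled from g (use an odd k to avoid ties).

import numpy as np

def l1_margins(hyps, alphas, X, y):
    """L_1 margins y*g(x), where g = sum_t alpha_t h_t and the alphas sum to 1."""
    a = np.array(alphas, dtype=float)
    a = a / a.sum()                      # rescale so g has L_1 length 1
    g = sum(at * h(X) for at, h in zip(a, hyps))
    return y * g                         # one margin per example

def sampled_majority_vote(hyps, alphas, X, k, seed=0):
    """h'(x): majority vote over k hypotheses drawn i.i.d. from the distribution g."""
    rng = np.random.default_rng(seed)
    a = np.array(alphas, dtype=float)
    a = a / a.sum()
    idx = rng.choice(len(hyps), size=k, p=a)
    return np.sign(sum(hyps[i](X) for i in idx))

For examples with margin at least gamma, h' agrees with h whp once k is
on the order of (1/gamma^2)*log(1/epsilon), which is exactly the
sampling argument above.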