15-859(B) Machine Learning Theory 02/17/14 * Better MB->Batch conversion * SVMs * L_2 Margin bounds ===================================================================== We've seen several algorithms for learning linear separators (Winnow, Perceptron). Today will talk about SVMs and more about sample-size bounds for learning over iid data. Recap of facts so far ===================== 1. VC-dim of linear separators in n-dim space is n+1. So this says that O(1/epsilon[n*log(1/epsilon)+log(1/delta)] examples are sufficient for learning. 2. But, we've seen that if there's a large margin, can get away with less. Let's assume target is w^* . x > 0, and has zero error. |w^*| = 1, and all ||x|| <=1. Given a sample S, define the L_2 margin to be: gamma = min_{x in S} |w^* . x| In this case, the Perceptron algorithm makes at most M = 1/gamma^2 mistakes. Can then convert this into a sample-size bound for iid data. Topic 1: Better MB=>Batch conversion ==================================== Let's do the improved online->batch conversion formally, since we didn't quite do it earlier in class. Theorem: If have conservative alg with mistake-bound M, can use to get PAC sample-complexity O((1/epsilon)[M + log(1/delta)]) Proof: To do this, we will split data into a training set'' S_1 of size max[(4M/epsilon), (16/epsilon)*ln(1/delta)] and a test set'' S_2 of size (32/epsilon)*ln(M/delta) We will run the algorithm on S_1 and test all hypotheses produced on S_2. Claim 1: w.h.p., at least one hyp produced on S_1 has error < epsilon/2. Proof (formally, use martingales): - If all are >= epsilon/2 then the expected number of mistakes >= 2M. - By Chernoff, Pr[# mistakes <= M] <= e^{(-expect)/8} <= delta. - View as game: after M mistakes, alg forced to reveal target. If alg keeps giving bad hyps, then whp will be forced to do it. Claim 2: w.h.p., best one on S_2 has error < epsilon. Proof: Suffices to show that good one is likely to look better than 3*epsilon/4, and all with true error > epsilon are likely to look worse than 3*epsilon/4. Just apply Chernoff again to the set of M hypotheses as in your homework. Note that this could be better or worse than the dimension bound. In fact, can show that whp *any* large-margin separator you can find will generalize well from roughly this much data. So, this motivates why SVMs find large margin separators. Support-Vector Machines ======================= Support vector machines do convex optimization to find the maximum margin separator, and more generally to optimize a given tradeoff between margin and hinge-loss. Let's first do the easier case, where we assume data is linearly separable and we want the separator of maximum margin. Then we could write that as: minimize |w|^2 subject to the constraint that f(x)*(w.x) >= 1 for all examples x in our training set (f(x) is the label of x). This is a convex optimization problem, so we can do it. Equivalently we could fix |w|^2 <= 1 and maximize gamma s.t. l(x)(w.x) >= gamma, but people like to do it this way, for reasons that will make more sense in a minute. Note that if we set the RHS to 1, then 1/|w| is the margin and so |w|^2 = 1/gamma^2. More generally, we want a tradeoff betwen margin and hinge-loss, so what SVMs really do is (C is a given constant): minimize: |w|^2 + C sum_{x_i in S} epsilon_i subject to: f(x_i)(w \cdot x_i) >= 1 - epsilon_i, for all x_i in S. and epsilon_i >= 0 for all i. These "epsilon_i" variables are called slack variables. [also write down objective function divided by |S| since will be easier conceptually for motivation below] Here is the motivation. What we *really* want is to minimize the true error err(h), but what we observe is our empirical error err_S(h). So, let's split err(h) into two parts: (1) err(h) - err_S(h), which is the amount we're overfitting, and (2) err_S(h). One bound on part (1) is approximately (1/gamma^2)/|S|, if err_S(h) is small (since our sample complexity bound is approximately |S| = (1/gamma^2)(1/epsilon) which means epsilon = (1/gamma^2)/|S|). So, this is 1/|S| times the first part of the objective function. The second part of the optimization (for C=1) is our total hinge-loss, summed over all the examples. Dividing by |S| we get our average hinge-loss, which is an upper bound on err_S(h). So, an upper bound on (2) is 1/|S| times the second part of the objective function. So, the two parts of the objective function are upper bounds on the two quantities we care about: overfitting and empirical error. And if you feel your upper bound on (1) is too loose, you might want to set C to be larger. More about margins ================== - We've seen that having a large margin is a good thing because then the perceptron algorithm will behave well. It turns out another thing we can say is that whp, *any* separator with a large margin over the data will have low error. This provides more motivation for finding a large-margin separator. Sample complexity analysis ========================== The sample complexity analysis is done in two steps. First thing to show: what is the maximum number of points in the unit ball that can be split in all possible ways by a separator of margin at least gamma? a.k.a., "gamma fat-shattering dimension". Ans: O(1/gamma^2). Can anyone see a simple proof? Proof: Consider the perceptron algorithm. Suppose the gamma fat-shattering dimension is d. Then we can force perceptron to make d mistakes, and yet still have a separator w^* of margin gamma. But we know the number of mistakes is at most 1/gamma^2. So, that's it. Second part: now want to apply this to get a sample-complexity bound. Sauer's lemma still applies [to bound the number of ways to split a set of m points by margin gamma, as a function of the fat-shattering dimension] so seems like analysis we used for VC-dimension should just go right through, but it's actually quite not so easy. Plus there's one technical fact we'll need. Let's do the analysis and will just give a citation for the technical fact we need. Analysis: Draw 2m points from D. Want to show it is unlikely there exists a separator that gets first half correct by margin gamma, but has more than epsilon*m mistakes on the 2nd half. This then implies the conclusion we want, by same reasoning as when we argued the VC bounds. As in VC proof, will show that for *any* sets S1, S2 of size m each, whp this is true over randomization of pairs {x_i, x_i'} into T1, T2. (Let S = S1 \union S2). In VC argument, we said: fix some h that makes at least epsilon*m mistakes. Said that Prob(all mistakes are in T2) is at most 2^{-epsilon*m}. Then applied union bound over all labelings of data using h in C. For us, it's tempting to say: let's count the number of separators of S with margin gamma over all of S. But this might be undercounting since what about separators where h only has margin gamma on T1? Instead, we'll do the following more complicated thing. Let's group the separators together. Define h(x) = but truncated at +/- gamma. Let dist_S(h1,h2) to be max_{x in S}|h1(x) - h2(x)|. We want a "gamma/2-cover": a set H of separators such that every other separator is within gamma/2 of some separator in H. Claim is: there exists an H that is not too large, as a function of fat-shattering dimension [Alon et al]. Can view this as a souped-up version of Sauer's lemma. Roughly you get |H| ~ (m/gamma^2)^(log(m)/gamma^2). Now, for these functions, define "correct" as "correct by margin at least gamma/2" and define "mistake" as "mistake OR correct by less than gamma/2". Our standard VC argument shows that so long as m is large compared to (1/epsilon)*log(|H|/delta), whp, none of these will get T1 all correct, and yet make > epsilon*m "mistakes" on T2. This then implies (by defn of H) that whp *no* separator gets T1 correct by margin >= gamma and has > epsilon*m real mistakes on T2. log(|H|) is approximately log^2(m)/gamma^2, so in the end you get a bound of m = O(1/epsilon [1/gamma^2 log^2(1/(gamma*epsilon)) + log(1/delta)]). Notice this is almost as good as the Perceptron bound. Luckiness functions =================== Basic idea of margins was in essense to view some separators as "simpler" than others, using margin as the notion of "simple". What makes this different from our Occam bounds, is that the notion of "simple" depends on the data. Basically, we have a data-dependent ordering of functions such that if we're lucky and the the target has low complexity in this ordering, then we don't need much training data. More generally, things like this are called "luckiness functions". If a function is a "legal notion of luckiness" (basically, the ordering depends only on the data points and not their labels, and not too many splits of data with small complexity) then you can apply sample complexity bounds.