15-859(B) Machine Learning Theory 02/10/14 * Tighter sample complexity bounds: Rademacher bounds * A great tool: McDiarmid's inequality =========================================================================== Boosting recap - view as dot-game: we put prob distrib on columns, and then adversary gives a row with dots in at most 40% of the entries by probability mass. What we want is to adjust the probability distribution to ensure than nearly all the *columns* have dots in less than half the entries. =========================================================================== Let's recap our VC-dimension bounds: Let C[m] = max # ways of splitting m examples using concepts in C. THEOREM 1: For any class C, distrib D, if the number of examples seen m satisfies: m > (2/epsilon)[lg(2C[2m]) + log_2(1/delta)], then with prob (1-delta), all bad (error > epsilon) hypotheses in C are inconsistent with data. THEOREM 2: For any class C, distrib D, if the number of examples seen m satisfies: m > (8/epsilon^2)[ln(2C[2m]) + ln(1/delta)] then with prob 1-delta, all h in C have |err_D(h)-err_S(h)| < epsilon. SAUER'S LEMMA: C[m] < m^d, where d = VCdim(C). So can use Sauer's lemma to convert to bounds in terms of VC dimension. Resulting bound is O((1/epsilon^2)[d*ln(1/epsilon) + ln(1/delta)]). --------------------------------------------------------------------- Motivation and plan =================== These bounds are nice but have two drawbacks that we'd like to address: 1. computability/estimability: say we have a hypothesis class C that we don't understand very well. It might be hard to compute or estimate C[m], or even |C[S]| for our sample S. 2. tightness. Our bounds have two sources of loss. One is that we did a union bound over the splittings of the double-sample S, which is overly pessimistic if many of the splittings are very similar to each other. A second is that we did worst-case over S, whereas we would rather do expected case over S, or even just have a bound that depends on our actual training set. We will be able to address both of these, at least in the context of THM 2. In particular, we will show the following remarkable fact. Suppose you replaced the true labels with random coin flips and then found the lowest-empirical error function in the class for these. (Imagine we have a great optimizer that, given any set of labels, will find the h in C that minimizes empirical error with respect to those labels). Say it gives error 45%. That means that we are overfitting by 5% wrt the "coin flip function" since every hypothesis has true error 1/2 wrt that. Take that gap (5%), double it, and the result (plus some low-order term) is an upper bound on how much any h in C is overfitting wrt the true target f. This is pretty remarkable since up to low-order terms we're within a factor of 2 of the best possible target-independent bound (since we do indeed overfit by 5% wrt the coin-flip function). And there do exist cases where you indeed have to lose the factor of 2. E.g., consider the class of all boolean functions over {0,1}^n, with a target that is all negative (describe). One caveat: This is along the "THEOREM 2" style - e.g., best you could hope for is overfitting of 1/sqrt(m). In the realizable case where you just care about the true error of functions whose empirical error is zero, these bounds could be quite a bit worse, since then you get overfitting on the order of (complexity term)/m. Note: rather than writing m as a function of epsilon, let's write epsilon as a function of m. In particular, we can write Thm 2 as: For any class C and distribution D, whp all h in C satisfy err_D(h) <= err_S(h) + sqrt{8ln(2C[2m]/delta) / m}. (And we bound in the other direction as well, but let's just focus on this direction - i.e., how much can we be overfitting the sample). ---------------------------------------------------------------------- Rademacher averages =================== For a given set of data S = {x_1, ..., x_m} and class of functions F, define the "empirical Rademacher complexity" of F as: R_S(F) = E_sigma [ max_{h in F} (1/m)(sum_i sigma_i * h(x_i)) ] where sigma = (sigma_1, ..., sigma_m) is a random {-1,1} labeling. I.e., if you pick a random labeling of the points in S, on average how well correlated is the most-correlated h in F to it? Note: h:X->{-1,1}, so sigma_i * h(x_i) = 1 if they agree and -1 if they disagree. Note that "correlation" = "agreement" - "disagreement" so error 45% means correlation of 10%. We then define the (distributional) Rademacher complexity of F as: R_D(F) = E_S R_S(F). We will then prove the following theorem: THEOREM 3: For any class C, distrib D, if we see m examples then with probability at least 1-delta every h in C satisfies err_D(h) <= err_S(h) + R_D(C) + sqrt(ln(2/delta)/2m). <= err_S(h) + R_S(C) + 3*sqrt(ln(2/delta)/2m). ---------------------------------------------------------------------- Relating Rademacher and VC: We can see that this is never much worse than our VC bounds. In particular, let's consider how big can R_S(C) be? If you fix some h and then pick a random labeling, Hoeffding bounds say the probability the correlation is more than 2epsilon is at most e^{-2*m*epsilon^2}. So setting this to delta/C[m] and doing a union bound, we have at most a delta chance that any of the splittings in F have correlation more than 2*sqrt(ln(C[m]/delta)/(2m)), so the expected max correlation R_S(C) can't be much higher than that. On the other hand, it could be a lot lower if many of the splittings are very similar to each other. OK, now on to the proof of Theorem 3... ---------------------------------------------------------------------- First, a useful tool we will use is McDiarmid's inequality. This generalizes Hoeffding bounds to the case where, rather than considering the *average* of a set of independent RVs, we are considering some other function of them. Specifically, McDiarmid's inequality ====================== Say x_1, ..., x_m are independent RVs, and phi(x_1,...,x_m) is some real-valued function. Assume that phi satisfies the Lipschitz condition that if x_i is changed, phi can change by at most c_i. Then McDiarmid's inequality states that: Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 / sum_i c_i^2} so if all c_i <= 1/m (like if phi = "average" and x_i \in {0,1}), we get: Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 * m} just like Hoeffding. Proof of THEOREM 3 ================== STEP 1: The first step of the proof is to simplify the quantity we care about. Specifically, let's define MAXGAP(S) = max_{h in C} [err_D(h) - err_S(h)]. We want to show that with probability at least 1-delta, MAXGAP(S) is at most some epsilon. As a first step, we can use McDiarmid to say that whp, MAXGAP(S) will be close to its expectation. In particular, the examples x_j are independent RVs and MAXGAP can change by at most 1/m if any individual x_j in S is replaced (because the gap for any specific h can change by at most 1/m). So, using this as our "phi", with probability at least 1 - delta/2, we get: MAXGAP(S) <= E_S [MAXGAP(S)] + sqrt(ln(2/delta)/(2m)). So, to prove the first (main) line of THEOREM 3, we just need to show that E_S[ MAXGAP(S) ] <= R_D(C). In fact, the second line of the theorem follows immediately from the first line plus an application of McDiarmid to the random variable R_S(C) [how much can changing a training example change R_S(C)?]. So, really we just need to prove the first line. STEP 2: The next step is to do a double-sample argument just like we did with VC-dimension. Specifically, let's rewrite err_D(h) as E_{S'} [err_{S'}(h)], where S' is a (new) set of m points drawn from D (this is the "ghost sample"). So, we can rewrite E_S[MAXGAP(S)] as: E_S [ max_{h in C} [ E_{S'} [ err_{S'}(h) - err_{S}(h) ] ] ] <= E_{S,S'} [ max_{h in C} [ err_{S'}(h) - err_{S}(h) ] ] (i.e., get to pick h after seeing both S *and* S') If we let S = {x_1,...,x_m} and let S' = {x_1', ..., x_m'} then we can rewrite this as: E_{S,S'} [ max_{h in C} [ sum_i [err_{x_i'}(h) - err_{x_i}(h)]/m] ] where I'm using err_{x}(h) to denote the loss of h on x (1 if h makes a mistake and 0 if h gets it right). Now, as in the VC proof, let's imagine that for each index i, we flip a coin to decide whether to swap x_i and x_i' or not before taking the max. This doesn't affect the expectation since everything is iid. So, letting sigma_i in {-1,1} at random, we can rewrite our quantity as: E_{S,S',sigma}[max_{h in C}[sum_i sigma_i[err_{x_i'}(h) - err_{x_i}(h)]/m]] <= E_{S',sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i'}(h)/m] ] - E_{S,sigma} [ min_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ] (i.e., gap is only larger if allow the two h's to differ) = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ] (by symmetry, since sigma_i is random {-1,1}) Now, we're just about done. The quantity here is a lot like the Rademacher complexity of C except instead of looking at the dot-product (viewing each as a vector of m entries), we are looking at where loss(h) is the vector of losses of h. In addition we have an extra factor of 2. To translate, suppose that in the definition of Rademacher complexity we replace with where "*" is component-wise multiplication and f is the target function. That is, instead of viewing sigma as a random labeling, we view it as a random perturbation of the target. Notice that "sigma * f" has the same distribution as sigma, so this doesn't change the definition. Now, = . This is nice because f * h is the vector whose entry i is 1 if h(x_i)=f(x_i) and equals -1 if h(x_i) != f(x_i). In other words, f*h(x) = 1 - 2err_x(h). So, R_D(C) = E_{S,sigma} [max_{h in C} [sum_i sigma_i[1-2err_{x_i}(h)]/m]] = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m]] as we wanted. So, this gives us what we want. -----------------------------------------------------------------------