15-859(B) Machine Learning Theory 02/04/09 * Tighter sample complexity bounds: Rademacher bounds * A great tool: McDiarmid's inequality =========================================================================== Let's recap our VC-dimension bounds: Let C[m] = max # ways of splitting m examples using concepts in C. THEOREM 1: For any class C, distrib D, if the number of examples seen m satisfies: m > (2/epsilon)[lg(2C[2m]) + log_2(1/delta)], then with prob (1-delta), all bad (error > epsilon) hypotheses in C are inconsistent with data. THEOREM 2: For any class C, distrib D, if the number of examples seen m satisfies: m > (8/epsilon^2)[ln(2C[2m]) + ln(1/delta)] then with prob 1-delta, all h in C have |err_D(h)-err_S(h)| < epsilon. SAUER'S LEMMA: C[m] < m^d, where d = VCdim(C). So can use Sauer's lemma to convert to bounds in terms of VC dimension. Resulting bound is O((1/epsilon^2)[d*ln(1/epsilon) + ln(1/delta)]). --------------------------------------------------------------------- Motivation and plan =================== These bounds are nice but have two drawbacks that we'd like to address: 1. computability/estimability: say we have a hypothesis class C that we don't understand very well. It might be hard to compute or estimate C[m], or even |C[S]| for our sample S. Say we have a great optimizer that, given any set of labels, will find the h in C that minimizes empirical error with respect to those labels. Then, one approach we could imagine trying is to pick random labels for the points in S and see how often we can achieve them. This will estimate C[m]/2^m. Unfortunately, this quantity will be extremely small for the values of m of interest. In particular, we care about the case where C[m] is approximately 2^{epsilon*m} so we would have to repeat this an exponential number of times. Instead, an easier thing to do would be to pick random labels and see, on average, how close to them we can get: how much better than 50/50. This we could estimate. And if you think about it, if you can on average get empirical error of, say, 40% on random labels, it means you must be overfitting by at least 10%. Rademacher bounds will be exactly in terms of that quantity. 2. tightness. Our bounds have two sources of loss. One is that we did a union bound over the splittings of the double-sample S, which is overly pessimistic if many of the splittings are very similar to each other. A second is that we did worst-case over S, whereas we would rather do expected case over S, or even just have a bound that depends on our actual training set. We will be able to address both of these. Note: we are going to be extending THEOREM 2. Also, rather than writing m as a function of epsilon, let's write epsilon as a function of m. In particular, we can write Thm 2 as: For any class C and distribution D, whp all h in C satisfy err_D(h) <= err_S(h) + sqrt{2ln(2C[2m]/delta) / m}. (And we bound in the other direction as well, but let's just focus on this direction - i.e., how much can we be overfitting the sample). ---------------------------------------------------------------------- Rademacher averages =================== For a given set of data S = {x_1, ..., x_m} and class of functions F, define the "empirical Rademacher complexity" of F as: R_S(F) = E_sigma [ max_{h in F} (1/m)(sum_i sigma_i * h(x_i)) ] where sigma = (sigma_1, ..., sigma_m) is a random {-1,1} labeling. I.e., if you pick a random labeling of the points in S, on average how well correlated is the most-correlated h in F to it? We then define the (distributional) Rademacher complexity of F as: R_D(F) = E_S R_S(F). We will then prove the following theorem: THEOREM 3: For any class C, distrib D, if we see m examples then with probability at least 1-delta every h in C satisfies err_D(h) <= err_S(h) + R_D(C) + sqrt(ln(2/delta)/2m). <= err_S(h) + R_S(C) + 3*sqrt(ln(2/delta)/2m). ---------------------------------------------------------------------- Relating Rademacher and VC: We can see that this is never much worse than our VC bounds: If you fix some h and then pick a random labeling, Hoeffding bounds say the probability the correlation is more than 2epsilon is at most e^{-2*m*epsilon^2}. So setting this to delta/C[m] and doing a union bound, we have at most a delta chance that any of the splittings in F have correlation more than 2*sqrt(ln(C[m]/delta)/(2m)), so the expected max correlation R_S(C) can't be much higher than that. On the other hand, it could be a lot lower if many of the splittings are very similar to each other. OK, now on to the proof of Theorem 3... ---------------------------------------------------------------------- First, a useful tool we will use is McDiarmid's inequality. This generalizes Hoeffding bounds to the case where, rather than considering the *average* of a set of independent RVs, we are considering some other function of them. Specifically, McDiarmid's inequality ====================== Say x_1, ..., x_m are independent RVs, and phi(x_1,...,x_m) is some real-valued function. Assume that phi satisfies the Lipschitz condition that if x_i is changed, phi can change by at most c_i. Then McDiarmid's inequality states that: Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 / sum_i c_i^2} so if all c_i <= 1/m (like in the case that phi = "average"), we get: Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 * m} just like Hoeffding. Proof of THEOREM 3 ================== STEP 1: The first step of the proof is to simplify the quantity we care about. Specifically, let's define MAXGAP(S) = max_{h in C} [err_D(h) - err_S(h)]. We want to show that with probability at least 1-delta, MAXGAP(S) is at most some epsilon. As a first step, we can use McDiarmid to say that whp, MAXGAP(S) will be close to its expectation. In particular, the examples x_j are independent RVs and MAXGAP can change by at most 1/m if any individual x_j in S is replaced (because the gap for any specific h can change by at most 1/m). So, using this as our "phi", with probability at least 1 - delta/2, we get: MAXGAP(S) <= E_S [MAXGAP(S)] + sqrt(ln(2/delta)/(2m)). So, to prove the first (main) line of THEOREM 3, we just need to show that E_S[ MAXGAP(S) ] <= R_D(C). In fact, the second line of the theorem follows immediately from the first line plus an application of McDiarmid to the random variable R_S(C), so really we just need to prove the first line. STEP 2: The next step is to do a double-sample argument much like we did with VC-dimension, but with a little twist in the "symmetrization" step. Specifically, let's rewrite err_D(h) as E_{S'} [err_{S'}(h)], where S' is a (new) set of m points drawn from D (sometimes called a "ghost sample"). So, we can rewrite E_S[MAXGAP(S)] as: E_S [ max_{h in C} [ E_{S'} [ err_{S'}(h) - err_{S}(h) ] ] ] <= E_{S,S'} [ max_{h in C} [ err_{S'}(h) - err_{S}(h) ] ] If we let S = {x_1,...,x_m} and let S' = {x_1', ..., x_m'} then we can rewrite this as: E_{S,S'} [ max_{h in C} [ sum_i [err_{x_i'}(h) - err_{x_i}(h)]/m] ] where I'm using err_{x}(h) to denote the loss of h on x (1 if h makes a mistake and 0 if h gets it right). Now, the twist in the symmetrization step is let's imagine that for each index i, we flip a coin to decide whether to swap x_i and x_i' or not. This doesn't affect the expectation since everything is iid. So, letting sigma_i in {-1,1} at random, we can rewrite our quantity as: E_{S,S',sigma}[max_{h in C}[sum_i sigma_i[err_{x_i'}(h) - err_{x_i}(h)]/m]] <= E_{S',sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i'}(h)/m] ] - E_{S,sigma} [ min_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ] (i.e., gap is only larger if allow the two h's to differ) = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ] (by symmetry, since sigma_i is random {-1,1}) Now, we're just about done. The quantity here is a lot like the Rademacher complexity of C except instead of looking at the dot-product (viewing each as a vector of m entries), we are looking at where loss(h) is the vector of losses of h. In addition we have an extra factor of 2. To translate, suppose that in the definition of Rademacher complexity we replace with where "*" is component-wise multiplication and f is the target function. That is, instead of viewing sigma as a random labeling, we view it as a random perturbation of the target. Notice that "sigma * f" has the same distribution as sigma, so this doesn't change the definition. Now, = . This is nice because f * h is the vector whose entry i is 1 if h(x_i)=f(x_i) and equals -1 if h(x_i) != f(x_i). In other words, f*h(x) = 1 - 2err_x(h). So, R_D(C) = E_{S,sigma} [max_{h in C} [sum_i sigma_i[1-2err_{x_i}(h)]/m]] = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m]] as we wanted. So, this gives us what we want. -----------------------------------------------------------------------