Topics in Machine Learning Theory                              09/17/14

VC-dimension and Uniform Convergence

* Sauer's lemma
* Proving the main result

========================================================================

First: going over the preliminaries for uniform convergence and
VC-dimension in the slides.

Plan for the rest of today: two very nice proofs.
(1) Sauer's lemma: the number of different ways of labeling m points
    using functions in class C is at most {m \choose \leq d}, where
    d = VCdim(C).
(2) The uniform convergence bound, where we (roughly) replace |C| with
    C[2m].

Given a concept class C and a set of examples S, let C[S] be the set of
possible labelings of S using concepts in C.  Define
C[m] = max_{S:|S|=m} |C[S]|.

SAUER'S LEMMA: C[m] <= {m \choose \leq d}, where d = VCdim(C) and we
define {m \choose \leq d} = the number of ways of choosing d or fewer
items out of m.

Note: {m \choose \leq d} = {m-1 \choose \leq d} + {m-1 \choose \leq d-1}.
Proof: to choose d or fewer items out of m, either you choose the first
element and then d-1 or fewer out of the remaining m-1, or you don't,
and then you choose d or fewer out of the remaining m-1.

PROOF OF SAUER'S LEMMA:

Say we have a set S of m examples and we want to bound |C[S]|.  Pick
some x in S.  By induction, C[S - {x}] has at most {m-1 \choose \leq d}
distinct labelings.  How many more labelings of S are there than of
S - {x}?

|C[S]| - |C[S - {x}]| = the number of pairs in C[S] that differ only on
x, since these are exactly the labelings that collapse together when we
remove x.  For each such pair, let's focus on the one that labels x
positive.  Let C' = {c in C[S] that label x positive and for which
there is another concept in C[S] identical to c except that it labels x
negative}.  (Remember, these are concepts defined only on S.)

Claim: VCdim(C) >= VCdim(C') + 1.
Proof: Pick a set S' of points in S that is shattered by C'.  S' does
not include x, since all functions in C' label x positive.  Now,
S' U {x} is shattered by C because of how C' was defined: every c in C'
has a partner in C[S] that flips only the label of x, so every labeling
of S' U {x} is achievable.

This means VCdim(C') <= d-1, so by induction
|C'| <= {m-1 \choose \leq d-1} (view C' as a class over the m-1 points
of S - {x}; this loses nothing since every concept in C' labels x the
same way).  Putting the pieces together,
  |C[S]| = |C[S - {x}]| + |C'|
        <= {m-1 \choose \leq d} + {m-1 \choose \leq d-1}
         = {m \choose \leq d}.   QED

------------------------------------------------------------------------

THEOREM 1: For any class C and distribution D, if we draw a sample S
from D of size
    m > (2/epsilon)[log_2(2 C[2m]) + log_2(1/delta)],
then with probability at least 1-delta, all h in C with
err_D(h) > epsilon have err_S(h) > 0.

Proof: Given a set S of m examples, define
  A = event that there exists h in C with err_D(h) > epsilon but
      err_S(h) = 0.
We want to show Pr[A] is low.

Now consider drawing *two* sets S, S' of m examples each.  Let A be
defined as before (on S), and define
  B = event that there exists a concept h in C with err_S(h) = 0 but
      err_{S'}(h) >= epsilon/2.

Claim: Pr[A]/2 <= Pr[B].  So if Pr[B] is low then Pr[A] is low.
Pf: Pr[B] >= Pr[B|A] Pr[A].  Conditioned on event A (pick some such h),
Pr[B|A] > 1/2 by Chernoff since m > 8/epsilon: the bad h has true error
> epsilon, so it is unlikely to have observed error < epsilon/2 on S'.
This means Pr[B] >= Pr[A]/2.  QED

Now we show Pr[B] is low.  (Note: we no longer need to talk about "bad"
hypotheses.  Reason: really good hypotheses are unlikely to satisfy the
second part of B.)

To do this, consider a related event: draw S = {x1, x2, ..., xm} and
S' = {x1', x2', ..., xm'}, and now create sets T, T' using the
following procedure Swap:

  For each i, flip a fair coin:
   - If heads, put xi in T and put xi' in T'.
   - If tails, put xi' in T and put xi in T'.

Claim: (T, T') has the same distribution as (S, S').
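A minimal sketch of this Swap procedure in Python (the function name
swap, the list representation of the samples, and the use of Python's
random module are just illustrative choices, not part of the notes):

    import random

    def swap(S, S_prime, rng=random):
        """Independently exchange each aligned pair (xi, xi') with probability 1/2."""
        T, T_prime = [], []
        for xi, xi_prime in zip(S, S_prime):
            if rng.random() < 0.5:      # heads: keep the pair in place
                T.append(xi)
                T_prime.append(xi_prime)
            else:                       # tails: exchange xi and xi'
                T.append(xi_prime)
                T_prime.append(xi)
        return T, T_prime

Since S and S' are i.i.d. draws from the same distribution D,
exchanging the i-th coordinates leaves the joint distribution
unchanged, which is why (T, T') is distributed exactly like (S, S').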
So it is equivalent to bound the probability of
  B_T = event that there exists a concept h in C with err_T(h) = 0 but
        err_{T'}(h) >= epsilon/2.

What's the point of all this?  Instead of Pr_{S,S'}[B] we now have
Pr_{S,S',swap}[B_T].  We will show this is small by showing that for
*all* S, S', Pr_{swap}[B_T] is small.  In particular, the key here is
that even if there are infinitely many concepts in C, once we have
drawn S, S' the number of different labelings we have to worry about is
at most C[2m], and we will argue that with high probability (over the
randomness in "swap") none of them will hurt us.

Now, fix S, S' and fix some labeling h.
 - If, for any i, h makes a mistake on *both* xi and xi', then
   Pr_{swap}(err_T(h) = 0) = 0.
 - If h makes a mistake on fewer than epsilon*m/2 points of S U S',
   then Pr_{swap}(err_{T'}(h) >= epsilon/2) = 0.
 - Otherwise, the mistakes sit on at least epsilon*m/2 distinct indices
   i, and for err_T(h) = 0 every one of them must be swapped into T',
   each independently with probability 1/2.  So
   Pr_{swap}(err_T(h) = 0 AND err_{T'}(h) >= epsilon/2) <= 2^{-epsilon*m/2}.

Now apply the union bound: Pr[B] <= C[2m] * 2^{-epsilon*m/2}.  Set this
to delta/2 and rearrange.  That's it: this gives us Theorem 1.

========================================================================

How about an agnostic/uniform-convergence version of Theorem 1?

Theorem 1': For any class C and distribution D, if
    m > (8/epsilon^2)[ln(4 C[2m]) + ln(1/delta)],
then with probability at least 1-delta, all h in C have
|err_D(h) - err_S(h)| < epsilon.

We just need to redo the proof using Hoeffding.  Draw 2m examples and
define:
  A = event that on the 1st m examples, there exists a hypothesis in C
      whose empirical and true error differ by at least epsilon.  (This
      is what we want to bound for the theorem.)
  B = event that there exists a concept in C whose empirical error on
      the 1st half differs from its empirical error on the 2nd half by
      at least epsilon/2.

Pr[B|A] >= 1/2 (given A, fix such an h; with probability at least 1/2
its empirical error on the 2nd half is within epsilon/2 of its true
error, and then the two halves differ by at least epsilon/2), so
Pr[A] <= 2 Pr[B].

Now we show Pr[B] is low.  As before, pick S, S' and run the random
procedure Swap to construct T, T'.  We show that for *any* S, S',
  Pr_{swap}(there exists h in C with |err_T(h) - err_{T'}(h)| > epsilon/2)
is small.  Again, there are at most C[2m] labelings of S U S', so fix
one such h.  Think of each index i such that exactly one of h(xi),
h(xi') is correct as a "coin".  (Indices where h is right on both or
wrong on both contribute equally to T and T' no matter how the coin
lands, so they do not affect the difference.)  We are asking: if we
flip m' <= m coins, what is Pr(|#heads - #tails| > epsilon*m/2)?  Write
this as being off from expectation by more than
epsilon*m/4 = (1/4)(epsilon*m/m')m'.  By Hoeffding, the probability is
at most 2e^{-(epsilon*m/m')^2 m'/8}, and this is always
<= 2e^{-epsilon^2 m/8}.  Now multiply by C[2m], set to delta/2, and
rearrange.

========================================================================

To get Cor 3, use
    C[2m] <= {2m \choose \leq d} <= (2em/d)^d.
Then just some rearranging... (a small numerical sketch of this
rearranging appears at the end of these notes).

========================================================================

More notes:

It's pretty clear we can replace C[2m] with a "high probability" bound
with respect to the distribution D, i.e., a number such that with high
probability a random set S, S' of 2m points drawn from D wouldn't have
more than this many labelings.  With extra work, you can reduce it
further to the *expected* number of labelings of a random set of 2m
points.  In fact, there has been a lot of work on reducing it even
further, to the number of possible labelings of your actual sample.
See, e.g., the paper "Sample Based Generalization Bounds".
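To make the "just some rearranging" for Cor 3 concrete, here is a
minimal numerical sketch in Python (the function name sample_size and
the example values of d, epsilon, delta are purely illustrative): it
plugs the bound C[2m] <= (2em/d)^d into Theorem 1 and searches for the
smallest m satisfying the resulting implicit inequality.

    from math import e, log2

    def sample_size(eps, delta, d):
        """Smallest integer m with m > (2/eps)*[log2(2*C2m) + log2(1/delta)],
        where C2m is replaced by the Sauer bound (2*e*m/d)**d."""
        m = d  # start at m = d so that the (2em/d)^d form of Sauer's bound applies
        # log2(2*(2em/d)^d) = 1 + d*log2(2em/d), written this way to avoid overflow
        while m <= (2.0 / eps) * (1.0 + d * log2(2.0 * e * m / d) + log2(1.0 / delta)):
            m += 1
        return m

    # Illustrative values: d = 10, eps = 0.1, delta = 0.05 gives m around two thousand.
    print(sample_size(0.1, 0.05, 10))

Because the right-hand side grows only logarithmically in m, the loop
always terminates, and the resulting m scales roughly like
(d/epsilon) log(1/epsilon) + (1/epsilon) log(1/delta).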