15-859(B) Machine Learning Theory                              02/02/09
VC-dimension II
 * Sauer's lemma
 * Proving the main result
========================================================================

Plan for today: two proofs.
 (1) Sauer's lemma: the number of different ways of labeling m points
     using functions in class C is at most (m <=d), where d = VCdim(C).
 (2) The uniform convergence bound, where we (roughly) replace |C| with
     C[2m].

Given a concept class C and a set of examples S, let C[S] be the set of
possible labelings of S using concepts in C.  Define
C[m] = max_{S : |S|=m} |C[S]|.

SAUER'S LEMMA: C[m] <= (m <=d), where d = VCdim(C) and we define
(m <=d) = the number of ways of choosing d or fewer items out of m.
I.e., (m <=d) = (m 0) + (m 1) + (m 2) + ... + (m d).

Recall: (m <=d) = (m-1 <=d) + (m-1 <=d-1).  Proof: to choose d or fewer
items out of m, either you choose the first element and then d-1 or
fewer out of the remaining m-1, or you don't, and then you choose d or
fewer out of the remaining m-1.

Last time we gave the main idea of the proof of Sauer's lemma.  We'll do
the formal proof now.

PROOF OF SAUER'S LEMMA: Say we have a set S of m examples; we want to
bound |C[S]|.  Pick some x in S.  By induction, C[S - {x}] has at most
(m-1 <=d) distinct labelings.  How many more labelings of S are there
than of S - {x}?  As in the special case we examined last class,
|C[S]| - |C[S - {x}]| = the number of pairs in C[S] that differ only on
x, since these are the ones that collapse when you remove x.  For each
such pair, let's focus on the one that labels x positive.  Let
C' = {c in C[S] that label x positive for which there is another concept
in C[S] that is identical to c except it labels x negative}.
(Remember, these are concepts that are only defined on S.)  So
|C[S]| = |C[S - {x}]| + |C'|.

Claim: VCdim(C) >= VCdim(C') + 1.
Proof: Pick a set S' of points in S that is shattered by C'.  S' doesn't
include x, since all functions in C' label x positive.  Now, S' U {x} is
shattered by C because of how C' was defined: any labeling of S' is
achieved by some c in C', and c has a partner in C[S] that agrees with
it on S' but labels x negative.  This means VCdim(C') <= d-1.  QED

Since the concepts in C' all agree on x, they remain distinct when
restricted to S - {x}, so applying induction again (to C' on the m-1
points of S - {x}, with VCdim(C') <= d-1):
   |C[S]| = |C[S - {x}]| + |C'| <= (m-1 <=d) + (m-1 <=d-1) = (m <=d).
QED

---------------------------------------------------------------------

THEOREM 1: For any class C and distribution D, if the number of examples
seen m satisfies
   m > (2/epsilon)[log_2(2 C[2m]) + log_2(1/delta)],
then with probability 1-delta, all bad (error > epsilon) hypotheses in C
are inconsistent with the data.  [We can remove the "2" over the epsilon
if we like.]

Proof: (The book has a very nice proof.  We'll follow it, but simplify
further by not introducing notation.  We'll use the term "bad concept"
to mean one with error > epsilon.)

Given m examples, define A = the event that there exists a bad
consistent hypothesis in C.  We want to show Pr[A] is low.

Define event B as follows: B = the event that, in 2m examples, there
exists a concept in C that is consistent on the first m but has error at
least epsilon/2 on the second m.

Claim: Pr[A]/2 <= Pr[B].  So, if Pr[B] is low then Pr[A] is low.
Proof: Say we pick 2m examples, and let A = the event that there is a
bad consistent hypothesis on the first m.  Then Pr[B|A] > 1/2 by a
Chernoff bound, since m > 8/epsilon: a bad h has true error > epsilon,
so it is unlikely to have observed error < epsilon/2 on the second m
points.  This means that Pr[B] >= Pr[A]/2.  QED
(We could actually have defined B with floor(epsilon*m) mistakes rather
than epsilon*m/2, since the median of a binomial lies between the floor
and the ceiling of its mean.)
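
As a quick sanity check of the Chernoff step in the claim above, here is
a minimal Python sketch (illustrative only; the function name and the
example values eps = 0.1, m = 80 are not from the lecture).  It computes
the exact probability that a hypothesis with true error exactly epsilon
makes at least a given number of mistakes on m fresh points; a
hypothesis with error > epsilon can only make this probability larger.

    from math import comb

    def prob_at_least(m, eps, k0):
        # Pr[ Binomial(m, eps) >= k0 ]: the chance that a hypothesis whose
        # true error is exactly eps makes at least k0 mistakes on m fresh
        # examples drawn from D.
        return sum(comb(m, k) * eps**k * (1 - eps)**(m - k)
                   for k in range(k0, m + 1))

    # Pr[B|A] > 1/2: with eps = 0.1 and m = 80 (= 8/eps), the chance of
    # seeing at least eps*m/2 = 4 mistakes is well above 1/2.
    print(prob_at_least(80, 0.1, 4))

    # The binomial-median remark: even demanding floor(eps*m) = 8 mistakes,
    # the probability is still at least 1/2.
    print(prob_at_least(80, 0.1, 8))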
Now we show Pr[B] is low.  (Note: we no longer need to talk about "bad"
hypotheses.  Reason: really good hypotheses are unlikely to satisfy the
second part.)

Re-write event B: draw a set S of 2m examples, and then split it
randomly into S1, S2 of equal size.
B = the event that there exists a hypothesis in C consistent on S1 that
makes a mistake on at least epsilon*m/2 points in S2.

How can we bound this?  There are LOTS of concepts in C.  But we'll use
the fact that there aren't so many splittings of S using concepts in C.

So, let's fix some set S; it can be arbitrary.  There are at most C[2m]
splittings of S using concepts in C.  Now, fix some splitting h that
disagrees with the target on at least epsilon*m/2 points in S (this is a
candidate for causing event B to occur).  We want to show that
Pr[h is consistent on S1] is small, where the probability is over the
random split of S into S1 and S2.  Then we will apply a union bound over
the at most C[2m] <= (2m <=d) such splittings.

Lemma: Suppose h disagrees with the target on at least L points in S.
Then Pr[h is consistent on S1] <= 1/2^L.

Intuition: Suppose we split S by putting each point independently into
S1 or S2 (i.e., we didn't require equal sizes).  Then Pr[all L points
fall into S2] = 1/2^L.  It turns out the experiment we care about is
only more helpful to us.  E.g., if L = m+1 then we get Pr[...] = 0.

Proof: Pr[all L points fall into S2] = (m L)/(2m L).
Reason: instead of fixing the L points and randomly splitting S into
equal-sized S1, S2, we can equivalently think of it as fixing the split
and then picking the L points at random.  Now just compute:
   (m L)/(2m L) = [m(m-1)...(m-L+1)] / [2m(2m-1)...(2m-L+1)] <= 1/2^L,
since each of the L factors (m-i)/(2m-i) is at most 1/2.  QED

Now we apply the union bound:
   Pr[B] <= C[2m] * (1/2)^{epsilon*m/2}.
Set this to delta/2.  That's it.  This gives us Theorem 1.

Note the key trick in the argument here.  We said that an equivalent
experiment for looking at event B is to (1) draw 2m examples and then
(2) *randomly* partition those examples into a "training" set and a
"test" set.  The reason this is important is that after step (1) the set
C[S] is fixed, but we've still saved a lot of our randomness to use in
getting a confidence bound.

===========================================

How about a uniform-convergence version of Theorem 1?

THEOREM 1': For any class C and distribution D, if
   m > (8/epsilon^2)[ln(2 C[2m]) + ln(1/delta)],
then with probability 1-delta, all h in C have
|err_D(h) - err_S(h)| < epsilon.

We just need to redo the proof using Hoeffding.  Draw 2m examples.
A = the event that on the first m there exists a hypothesis in C whose
empirical and true error differ by at least epsilon.  (This is what we
want to bound for the theorem.)
B = the event that there exists a concept in C whose empirical error on
the first half differs from its empirical error on the second half by at
least epsilon/2.
Pr[B|A] >= 1/2, so Pr[A] <= 2*Pr[B].

Now, show Pr[B] is low.  As before, let's first pick S, then randomly
decide which points go in the first half and which in the second.  Once
S is determined, there are only C[2m] h's we need to worry about.  Now,
fix some h, making L mistakes overall on S.  The bad event is that at
least L/2 + epsilon*m/4 of these mistakes fall into S1 (or into S2).  We
can assume L <= m, else we focus on the non-mistakes instead.

Let's pretend we were putting each point iid into S1 or S2 (rather than
splitting them up exactly 50/50).  As before, this can only increase the
chance of the bad event.  (Informal proof: the actual distribution can
be thought of as going through the points in S one at a time, where if
we currently have k1 slots left in S1 and k2 slots left in S2, then the
next point is put into S1 with probability k1/(k1+k2) and into S2 with
probability k2/(k1+k2).  So if we just watch the mistakes, whenever more
have appeared in S1 than in S2 so far, the next one is more likely to go
into S2, and vice versa.  This only makes the bad event less likely than
in the iid game.)
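
As a sanity check on this domination claim, here is a minimal Python
sketch (illustrative only; the function names and the example values
m = 200, L = 150, eps = 0.1 are not from the lecture).  For a fixed set
of L mistake-points among the 2m points, it compares the exact-split
probability of the bad event (a hypergeometric tail) with the iid-split
probability (a Binomial(L, 1/2) tail).

    from math import comb

    def tail_exact_split(m, L, t):
        # Pr[ |X - L/2| >= t ] when S1 is a uniformly random half of the
        # 2m points and X = number of the L mistake-points landing in S1
        # (a hypergeometric distribution).
        total = comb(2 * m, m)
        return sum(comb(L, k) * comb(2 * m - L, m - k) / total
                   for k in range(max(0, L - m), min(L, m) + 1)
                   if abs(k - L / 2) >= t)

    def tail_iid_split(L, t):
        # Pr[ |X - L/2| >= t ] when each point is tossed independently
        # into S1 or S2, so X ~ Binomial(L, 1/2).
        return sum(comb(L, k) / 2**L for k in range(L + 1)
                   if abs(k - L / 2) >= t)

    # e.g. m = 200, L = 150, eps = 0.1, so the deviation is t = eps*m/4 = 5:
    # the exact-split tail should come out no larger than the iid-split tail.
    print(tail_exact_split(200, 150, 5), tail_iid_split(150, 5))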
So, by just focusing on where the mistakes go, we can think of it as
follows: we flip L <= m fair coins, and we want to bound the chance of
getting epsilon*m/4 more (or fewer) heads than we expect.  Using
Hoeffding (and L <= m), this is at most 2e^{-epsilon^2 m / 8}.  Set this
to delta/C[2m] and we're done.

==========================================================================

To get Cor 3, use: C[2m] <= (2m <=d) <= (2em/d)^d.  Then just some
rearranging...  (A small numerical sketch at the end of these notes
plugs in concrete values.)

===========================================================================

More notes:

It's pretty clear we can replace C[2m] with a "high probability" bound
with respect to the distribution D, i.e., a number such that, with high
probability, a random set S of 2m points drawn from D wouldn't have more
than this many labelings.  With extra work, you can reduce it further to
the *expected* number of labelings of a random set S of 2m points.  In
fact, there has been a lot of work on reducing it even further to "the
number of possible labelings of your actual sample".  See, e.g., the
paper "Sample Based Generalization Bounds".
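
To close, a minimal numerical sketch (the function names and the example
values d = 10, eps = 0.1, delta = 0.05 are just illustrative, not from
the lecture).  It evaluates the (n <=d) quantity from Sauer's lemma,
compares it against the (2em/d)^d estimate used for Cor 3, and finds the
smallest m satisfying the condition of Theorem 1 by a simple scan.

    from math import comb, e, log2

    def phi(n, d):
        # (n <=d): the number of ways of choosing d or fewer items out of n.
        return sum(comb(n, i) for i in range(d + 1))

    def sample_size(eps, delta, d):
        # Smallest m with m > (2/eps) * [log2(2*(2m <=d)) + log2(1/delta)],
        # i.e. the condition of Theorem 1 with C[2m] bounded via Sauer's
        # lemma.  The right-hand side grows only logarithmically in m, so
        # a simple linear scan terminates.
        m = 1
        while m <= (2 / eps) * (log2(2 * phi(2 * m, d)) + log2(1 / delta)):
            m += 1
        return m

    # e.g. d = 10, eps = 0.1, delta = 0.05: print the sample size, the
    # Sauer bound (2m <=d), and the cruder (2em/d)^d estimate at that m.
    d, eps, delta = 10, 0.1, 0.05
    m = sample_size(eps, delta, d)
    print(m, phi(2 * m, d), (2 * e * m / d) ** d)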