15-859(A) Machine Learning Theory                                02/05/04

* VC-dimension: proving the main result
========================================================================
[show online DL algorithm]
[discuss implication of result proved in problem #3 of hwk 2]

Where we are right now: see slides.

Theorem 2: For any class C and distribution D, if the number of examples
seen m satisfies:

    m > (2/epsilon)[log_2(2C[2m]) + log_2(1/delta)],

then with probability (1-delta), all bad (error > epsilon) hypotheses in
C are inconsistent with the data.

Proof: (The book has a very nice proof. We will follow it, but simplify
further by not introducing notation. We will use the term "bad concept"
to mean one with error > epsilon.)

Given m examples, define A = event that there exists a bad consistent
hypothesis in C. We want to show Pr[A] is low. Define event B as follows:

  B = event that in 2m examples, there exists a concept in C that is
      consistent on the first m but makes mistakes on at least an
      epsilon/2 fraction of the second m.

Claim: Pr[A] <= 2*Pr[B]. So, if Pr[B] is low then Pr[A] is low.

Pf: Say we pick 2m examples. Let A = event that there is a bad consistent
hypothesis on the first m. Then Pr[B|A] >= 1/2 by Chernoff or a standard-
deviation argument (a bad h has true error > epsilon, so it is unlikely
to have observed error < epsilon/2 on the second m). This means that
Pr[B] >= Pr[B|A]*Pr[A] >= Pr[A]/2. QED

(We could actually have defined B with epsilon rather than epsilon/2,
since the median of a binomial is within 1 of its mean.)

Now we show Pr[B] is low. (Note: we no longer need to talk about "bad"
hypotheses. Reason: really good hypotheses are unlikely to satisfy the
second part of B.)

Re-write event B: Draw a set S of 2m examples, and then split it randomly
into S1, S2 of equal size.

  B = event that there exists a hypothesis in C consistent on S1 but
      erring on at least epsilon*m/2 points of S2.

How do we bound this? There are LOTS of concepts in C. But we will use
the fact that there are not so many splittings of S using concepts in C.
So, fix some set S (it can be arbitrary); there are at most C[2m]
splittings.

Now, fix some splitting h that disagrees with the target on at least
epsilon*m/2 points of S. (This is a candidate for causing event B to
occur.) We want to show that Pr[h is consistent on S1] is small, where
the probability is over the splitting of S into S1 and S2. Then we will
apply a union bound over the at most C[2m] <= ((2m d)) such splittings
(where ((2m d)) denotes the number of ways to choose at most d of the 2m
points).

Lemma: Suppose h disagrees with the target on at least L points of S.
Then Pr[h is consistent on S1] <= 1/2^L.

Intuition: Suppose we split S by putting each point randomly into S1 or
S2 (i.e., we did not require equal sizes); then Pr[all L points fall
into S2] = 1/2^L. It turns out that the experiment we actually care
about is only more helpful to us. E.g., if L = m+1, then we get
Pr[...] = 0.

Proof: Pr[all L in S2] = (m choose L)/(2m choose L).
Reason: instead of fixing the L points and randomly splitting S into
equal-sized S1, S2, we can equivalently think of it as fixing the split
and then picking the L points randomly. Now just expand:

  (m choose L)/(2m choose L)
    = [m/(2m)] * [(m-1)/(2m-1)] * ... * [(m-L+1)/(2m-L+1)]
    <= 1/2^L,

since each factor is at most 1/2. QED

Now we apply the union bound:

  Pr[B] <= C[2m] * (1/2)^{epsilon*m/2},

and we want this to be < delta/2. That's it: solving for m gives us
Theorem 2.
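[Aside, not from the lecture: a quick numerical sanity check of the
lemma. The exact probability is (m choose L)/(2m choose L), and the
claim is that this is at most 1/2^L. A minimal Python sketch, assuming
Python 3.8+ for math.comb:]

    # Check the lemma: Pr[all L disagreement points land in S2]
    #   = C(m,L)/C(2m,L) <= 1/2^L.
    from math import comb

    def prob_all_in_s2(m, L):
        """Probability that all L disagreement points fall in S2 when S
        (of size 2m) is split uniformly into equal halves S1, S2."""
        # comb(m, L) is 0 when L > m, matching the L = m+1 case above.
        return comb(m, L) / comb(2 * m, L)

    for m in (10, 50, 200):
        for L in (1, 5, m // 2, m, m + 1):
            exact = prob_all_in_s2(m, L)
            bound = 0.5 ** L
            assert exact <= bound
            print(f"m={m:4d} L={L:4d}  exact={exact:.3e}  1/2^L={bound:.3e}")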
==========================================================================

How about a uniform-convergence version of Theorem 2?

Theorem 2': For any class C and distribution D, if:

    m > (2/epsilon^2)[ln(2C[2m]) + ln(1/delta)],

then with probability 1-delta, all h in C have empirical error within
epsilon of their true error.

We just need to redo the proof using Hoeffding. Draw 2m examples.

  A = event that on the 1st m, there exists a hypothesis in C whose
      empirical and true error differ by at least epsilon. (This is
      what we want to bound for the theorem.)

  B = event that there exists a concept in C whose empirical error on
      the 1st half differs from its empirical error on the 2nd half by
      at least epsilon/2.

Pr[B|A] >= 1/2, so Pr[A] <= 2*Pr[B].

Now we show Pr[B] is low. As before, let's first pick S, then randomly
decide what is the 1st half and what is the 2nd half. Once S is
determined, there are only C[2m] h's we need to worry about. Now fix
some h, making L mistakes overall on S. The bad event is that at least
L/2 + epsilon*m/2 of them fall into S1 (or S2). We can assume L <= m,
since otherwise we can focus on the non-mistakes instead.

Let's pretend we were putting each point iid into S1 or S2 (rather than
splitting them up exactly 50/50). This can only increase the chance of
the bad event. (Informal proof: the actual distribution can be thought
of as going through the points in S one at a time, where if we currently
have k1 slots left in S1 and k2 slots left in S2, then the next point is
put into S1 with probability k1/(k1+k2) and into S2 with probability
k2/(k1+k2). This means that if we just watch the mistakes, then whenever
more have appeared in S1 than in S2, the next one is more likely to go
to S2, and vice versa. So this only makes the bad event less likely than
in the iid game.)

So we can think of it as just: we flip L <= m fair coins, and we want to
bound the chance that the number of heads deviates from L/2 by at least
epsilon*m/2. Using Hoeffding, this is at most 2e^{-epsilon^2 m/2} (using
L <= m). Set this to delta/C[2m], apply the union bound over the C[2m]
splittings, and we're done.

==========================================================================

To get Theorem 3, use:

  C[2m] <= ((2m d)) <= (2em/d)^d.

Then it is just some rearranging. Set

  m = (4d/epsilon)*log_2(1/epsilon) + (2/epsilon)*log_2(2/delta),

assume the left term is larger than the right term (so that m <=
(8d/epsilon)*log_2(1/epsilon)), and assume epsilon is small. Rewrite
(1/2)^{epsilon*m/2} as:

  (1/2)^{epsilon*m/2} = epsilon^{2d} * delta/2.

So the whole bound C[2m] * (1/2)^{epsilon*m/2} becomes:

  (2e * (8d/epsilon)*log_2(1/epsilon) * epsilon^2 / d)^d * delta/2
    = (16e * log_2(1/epsilon) * epsilon)^d * delta/2
    < delta/2   for small enough epsilon.
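[Aside, not from the lecture: the claim in the proof of Theorem 2' that
the exact 50/50 split makes the bad event no more likely than an iid
coin-flip split can be checked by simulation. A Python sketch; parameter
values here are illustrative only:]

    # Compare Pr[bad event] under an exact 50/50 split (hypergeometric)
    # vs. an iid split (binomial). The proof claims exact <= iid.
    import random

    def mistakes_in_s1_exact(m, L):
        """# of the L mistakes landing in S1 under an exact split of 2m points."""
        slots = [1] * m + [0] * m      # 1 = point goes to S1, 0 = to S2
        random.shuffle(slots)
        return sum(slots[:L])          # the mistakes occupy the first L positions

    def mistakes_in_s1_iid(L):
        """# of the L mistakes landing in S1 when each point flips a fair coin."""
        return sum(random.random() < 0.5 for _ in range(L))

    def bad_freq(trial, threshold, runs=100_000):
        # One-sided version of the bad event (the S2 side is symmetric).
        return sum(trial() >= threshold for _ in range(runs)) / runs

    m, L, eps = 50, 50, 0.3
    threshold = L / 2 + eps * m / 2    # the cutoff from the proof of Theorem 2'
    print("exact split:", bad_freq(lambda: mistakes_in_s1_exact(m, L), threshold))
    print("iid split:  ", bad_freq(lambda: mistakes_in_s1_iid(L), threshold))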
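[Also not from the lecture: to make the rearranging for Theorem 3
concrete, the smallest m satisfying the Theorem 2 condition, with
Sauer's bound C[2m] <= (2em/d)^d plugged in, can be found by simple
iteration. A Python sketch; the function names are mine:]

    # Find the smallest m with m > (2/eps)*(log2(2*C[2m]) + log2(1/delta)),
    # bounding the growth function by C[2m] <= (2em/d)^d.
    from math import e, log2

    def growth_bound(m, d):
        return (2 * e * m / d) ** d

    def sample_size(eps, delta, d):
        # The right-hand side grows only logarithmically in m, so
        # fixed-point iteration converges quickly.
        m = max(d, 10)
        for _ in range(100):
            rhs = (2 / eps) * (log2(2 * growth_bound(m, d)) + log2(1 / delta))
            if m > rhs:
                return m
            m = int(rhs) + 1
        return m

    # Roughly on the order of (d/eps)*log(1/eps), as Theorem 3 predicts.
    print(sample_size(eps=0.1, delta=0.05, d=10))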