VC-dimension continued. ----------------------------------------------------------- Today: Give some intuition for what we proved last time, finish proof of VC-dim upper bound. Recap ----- VC-dim(C) = d means there exists a set of d points shattered by C and there is no set of d+1 points that can be shattered. E.g., axis-parallel rectangles: VCdim = 4. How about triangles? --> 7 VC-dim provides a good measure of complexity of a class: 1. Lower bound on # examples needed for uniform convergence (all bad h in C to look bad) of O(VCdim/epsilon). If you see fewer examples, there is some distribution on examples and target concepts in C under which your expected error will be at least epsilon. 2. If we define C[m] = max number of ways of splitting m points with C, then we have: Theorem: C[m] <= ((m d)), where d = VCdim(C) and we define ((m d)) = # ways of choosing d or fewer out of m. In other words, the "effective number of concepts with respect to the unlabeled data points" is growing like m^d. Today: prove that can use C[m] to bound number of examples needed for all bad hypotheses in C to be inconsistent with data. Intuition about proof of (2) above: -------------------------- Look at C = {[a,b]}. Show that C[m] - C[m-1] is not too large ( like m^{d-1}, or specifically, ((m-1 d-1)) ) Add new point in front. How does # splittings increase? Increases by # splittings of previous m-1 that extend in two ways to the new point. Notice that all of these splittings must be "initial subintervals". This is a class of VCdim = 1 (more generally, VCdim = d-1) and so there cannot be too many by induction on d. Upper bound Theorem: m = O(1/epsilon[VCdim(C)*log(1/epsilon) + log(1/delta)]) examples are sufficient for consistency ==> PAC. In fact, use of VCdim is ONLY to get bound on C[m]. Will show that just need m to be large enough so that 2^{-epsilon*m/2} < delta / 2*C[2m] I.e., can replace |C| with C[2m]. Book has a very nice proof. Much simpler than usual one. Will follow it, but simplify further by not introducing notation. "bad" concept = one with error > epsilon. ------ Proof: Given m examples, define A = event there exists a bad consistent h in C. We want to show Pr[A] is low. Define event B as follows. B = event that in 2m examples, there exists a concept h in C that is consistent on the first m but has error at least epsilon/2 on the second m. Claim: if Pr[B] is low then Pr[A] is low. Pr[B] >= Pr[A and B] = Pr[A]*Pr[B|A]. Notice that Pr[B|A] is at least 1/2 by Chernoff bounds. So, this means that Pr[A] < 2*Pr[B]. QED Now, show Pr[B] is low. (Note: no longer need to talk about "bad" hyps. Reason: for really good hyps, unlikely to satisfy the second part) Re-write event B: Draw set S of 2m examples, and then split randomly into S1, S2 of equal size. B = event that exists h in C consistent on S1 but errs on at least epsilon*m/2 points in S2. How to bound this? There are LOTS of concepts in C. But, we'll use the fact that there aren't so many splittings of S using concepts in C. So, let's fix some set S. Can be arbitrary. Now, fix some splitting \hat{h} that disagrees with the target on at least epsilon*m/2 points in S. (Event B implies splitting must exist) Want to show that Pr[\hat{h} is consistent on S1] is small. (Probability is over the splitting of S into S1 and S2) Then will apply union bound over the at most C[2m] <= ((2m d)) such splittings. Lemma: Suppose \hat{h} disagrees with target on at least L points in S. Then, Pr[\hat{h} is consistent on S1] <= 1/2^L. Intuition: Suppose we split S by putting each point randomly into S1 or S2 (i.e., didn't require equal size), then Pr[all the L points fall into S2] = 1/2^L. Prob only DECREASES when we move to the model of splitting into equal sized pieces, since given that some points fell into S2, the probability that the next point also falls into S2 gets smaller. E.g., think of L = m+1. Get Pr[...] = 0. Proof: Instead of fixing the L points and randomly splitting into equal-sized S1, S2, we can equivalently think of as fixing the split and then picking the L points randomly. So, prob[all L in S2] = (m L)/(2m L) = ... < 1/2^L. Now, we apply union bound: Pr[B] <= C[2m] * (1/2)^{epsilon*m/2} want < delta/2. ------------------------------------------------------------------------- That's it. Just some details to solve the equation... Use: C[2m] <= ((2m d)) <= (2em/d)^d. Set m = (4d/epsilon)*log_2(1/epsilon) + 2/epsilon*log_2(2/delta) and assume left part is larger than right part and epsilon is small rewrite (1/2)^{epsilon*m/2} as: epsilon^{2d} * delta/2 So, rewrite the whole thing as: (2e * 8d/epsilon * log_2(1/epsilon) * epsilon^2 / d)^d * delta/2 = (16e * log_2(1/epsilon) * epsilon)^d * delta/2. < delta/2 for small enough epsilon. ------------------------------------------------------------------------