15-859(B) Machine Learning Theory                              02/03/04

* Chernoff/Hoeffding recap
* VC-dimension

========================================================================

[See Slides]

In the rest of class we'll prove Theorem 4 and then Theorem 1. Next time
we'll do Theorem 2 (and 3). Lots of proofs, so slow me down if it gets
confusing....

Theorem 4: there exists a distribution of examples and target concepts
under which any algorithm requires Omega(VCdim/epsilon) examples.

Consider d = VCdim(C) points that can be shattered, and a target that is
random on these. Define the distribution D like this: probability
1 - 4*epsilon on one point, and 4*epsilon/(d-1) on each of the rest.

Let's first argue informally: in m examples we expect to see only
4*epsilon*m examples of the rare points. For m = (d-1)/(8*epsilon) this
is (d-1)/2, so on average at least half of the rare points are still
unseen. So there is at least a (4*epsilon)*(1/2) chance that the next
example is a new rare point, and if so we have a 1/2 chance of making a
mistake (since the target is random). More formally,

  Pr(error) >= Pr(example is in rare set) * Pr(unseen | rare) * 1/2
            =  (4*epsilon) * [(1 - 4*epsilon/(d-1))^m] * (1/2)
            >= (4*epsilon) * [1 - m*(4*epsilon/(d-1))] * (1/2)
            >= (4*epsilon) * (1/2) * (1/2)   [since m <= (d-1)/(8*epsilon)]
            =  epsilon.

(A numeric check of this chain appears after the proof of Theorem 1
below.)

Proof of Theorem 1: The theorem says we can bound the number of
splittings using the VC-dimension. In particular, we'll prove that
C[m] <= ((m d)), where d = VCdim(C) and we define ((m d)) = the number of
ways of choosing d or fewer items out of m, i.e.,
(m 0) + (m 1) + (m 2) + ... + (m d).

Note: ((m d)) = ((m-1 d)) + ((m-1 d-1)). Can anyone see why? Proof: to
choose d or fewer items out of m, either you take the first element and
then choose d-1 or fewer out of the remaining m-1, or you don't take it
and then choose d or fewer out of the remaining m-1.

Before proving the theorem, let's look at an example. Take C = "intervals
[a,b] of the real line" (so VCdim = 2). Say we have m-1 points, named
1,2,3,...,m-1, and we now add a new point m at the end. How does the
number of splittings increase? If we look at the old splittings, some of
them, like [2,5], are committed: they have to label m negative. But some,
like [2,m-1], can be extended to m in both ways. So the new number of
splittings equals the old number, which is at most ((m-1 d)) by
induction, plus the number of old splittings that can be extended in two
ways. But these extendable ones correspond to "final subintervals" of the
previous point set, which is a class of VC-dim 1 (or d-1 in general). So
there are at most ((m-1 d-1)) of them, by induction. The proof is exactly
along these lines.

Proof of Theorem 1: Say we have a set S of m examples. We want to bound
the number of ways of splitting S using concepts in C. Define C[S] = the
set of different ways of splitting (partitioning) S by concepts in C.
Think of these as concepts defined only on S.

Pick some x in S. By induction, C[S - {x}] has at most ((m-1 d))
partitions. How many more splittings of S are there than of S - {x}? As
before, the increase is exactly the number of pairs in C[S] that differ
only on x, since these are the ones that collapse when you remove x. For
each such pair, let's focus on the one that labels x negative. Let
C' = {c in C[S] that label x negative for which there is another concept
in C[S] that is identical to c except it labels x positive}. (These are
concepts that are only defined on S.) Now we just need to show that
VCdim(C') <= d-1, so that by induction there are at most ((m-1 d-1)) such
pairs. Proof by contradiction: if VCdim(C') were d, then there would be a
set S' of d points in S shattered by C' (remember, C' is defined only on
S). Furthermore, x is not in S' (since every concept in C' labels x
negative). But then S' U {x} would be shattered by C, which is a
contradiction since VCdim(C) = d.  QED
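As a quick sanity check of the Theorem 4 chain above, here is a minimal
Python sketch (not from the lecture; d = 11 and epsilon = 0.05 are just
illustrative values) that evaluates the bound at m = (d-1)/(8*epsilon):

    # Sanity check of: (4*eps) * (1 - 4*eps/(d-1))^m * (1/2) >= eps
    # for m = (d-1)/(8*eps).  Illustrative parameters, not from the lecture.
    d, epsilon = 11, 0.05
    m = int((d - 1) / (8 * epsilon))          # m = 25 here

    p_rare = 4 * epsilon                      # total mass on the d-1 rare points
    p_each = p_rare / (d - 1)                 # mass on each rare point
    p_unseen = (1 - p_each) ** m              # Pr[a given rare point unseen in m draws]

    lower_bound = p_rare * p_unseen * 0.5     # rare, unseen, and a wrong guess
    print(lower_bound, ">=", epsilon, "?", lower_bound >= epsilon)   # True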
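Similarly, for Theorem 1 here is a small sketch (again not from the
lecture) that brute-force counts the splittings of m collinear points by
intervals and compares against ((m 2)), also checking the recurrence; for
intervals the bound actually holds with equality:

    from math import comb

    def sauer(m, d):
        # ((m d)) = number of ways of choosing d or fewer items out of m
        return sum(comb(m, i) for i in range(d + 1))

    def interval_splittings(m):
        # Distinct labelings of points 1..m by intervals [a,b], plus all-negative
        labelings = {frozenset()}
        for a in range(1, m + 1):
            for b in range(a, m + 1):
                labelings.add(frozenset(range(a, b + 1)))
        return len(labelings)

    for m in range(1, 12):
        assert sauer(m, 2) == sauer(m - 1, 2) + sauer(m - 1, 1)  # the recurrence
        assert interval_splittings(m) <= sauer(m, 2)             # Theorem 1, d = 2
        assert interval_splittings(m) == sauer(m, 2)             # tight for intervals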
======================just in case there's time...======================

Proof of Theorem 2: (The book has a very nice proof. We'll follow it, but
simplify further by not introducing notation. We'll use the term "bad
concept" to mean one with error > epsilon.)

Given m examples, define A = the event that there exists a bad consistent
hypothesis in C. We want to show Pr[A] is low.

Define event B as follows: B = the event that in 2m examples, there
exists a concept in C that is consistent on the first m but has error at
least epsilon/2 on the second m.

Claim: Pr[A] <= 2*Pr[B]. So, if Pr[B] is low then Pr[A] is low.
Proof: Say we pick 2m examples, and let A = the event that there is a bad
consistent hypothesis on the first m. Then Pr[B|A] >= 1/2, by Chernoff or
the standard deviation (a bad h has true error > epsilon, so it is
unlikely to have observed error < epsilon/2 on the second m). This means
Pr[B] >= Pr[B|A]*Pr[A] >= Pr[A]/2.  QED
(We could actually have defined B with epsilon instead of epsilon/2,
since the median of a binomial is within 1 of its mean.)

Now we show that Pr[B] is low. (Note: we no longer need to talk about
"bad" hypotheses. Reason: really good hypotheses are unlikely to satisfy
the second part of B.)

Re-write event B: draw a set S of 2m examples, then split it randomly
into S1, S2 of equal size. B = the event that there exists a hypothesis
in C consistent on S1 that errs on at least epsilon*m/2 points in S2.

How do we bound this? There are LOTS of concepts in C. But we'll use the
fact that there aren't so many splittings of S using concepts in C. So
fix some set S; it can be arbitrary. There are at most C[2m] splittings.
Now fix some splitting h that disagrees with the target on at least
epsilon*m/2 points in S (this is a candidate for causing event B to
occur). We want to show that Pr[h is consistent on S1] is small, where
the probability is over the random split of S into S1 and S2. Then we
will apply the union bound over the at most C[2m] <= ((2m d)) such
splittings.

Lemma: Suppose h disagrees with the target on at least L points in S.
Then Pr[h is consistent on S1] <= 1/2^L.

Intuition: Suppose we split S by putting each point independently at
random into S1 or S2 (i.e., without requiring equal sizes). Then
Pr[all L points fall into S2] = 1/2^L. It turns out the experiment we
actually care about is only more helpful to us. E.g., if L = m+1 then we
get Pr[...] = 0.

Proof: Pr[all L points fall in S2] = (m L)/(2m L). Reason: instead of
fixing the L points and randomly splitting S into equal-sized S1, S2, we
can equivalently think of it as fixing the split and then picking the L
points at random. Now just solve:

  (m L)/(2m L) = [m(m-1)...(m-L+1)] / [(2m)(2m-1)...(2m-L+1)] <= 1/2^L,

since each factor (m-i)/(2m-i) <= 1/2. (A numeric check of this lemma
appears at the end of these notes.)

Now we apply the union bound:

  Pr[B] <= C[2m] * (1/2)^{epsilon*m/2},  and we want this < delta/2.

That's it. This gives us Theorem 2.
--------------------------
To get Theorem 3, use: C[2m] <= ((2m d)) <= (2em/d)^d. Then it's just
some rearranging. Set

  m = (4d/epsilon)*log_2(1/epsilon) + (2/epsilon)*log_2(2/delta),

assume the left term is the larger of the two (so m <=
(8d/epsilon)*log_2(1/epsilon)), and assume epsilon is small. Rewrite
(1/2)^{epsilon*m/2} as epsilon^{2d} * delta/2. So the whole bound becomes

  (2e * (8d/epsilon)*log_2(1/epsilon) * epsilon^2 / d)^d * delta/2
    = (16e * epsilon * log_2(1/epsilon))^d * delta/2
    < delta/2   for small enough epsilon.
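Going back to the Lemma in the proof of Theorem 2, here is a minimal
numeric check (not from the lecture; m = 20 is an arbitrary illustrative
value) that (m L)/(2m L) <= 1/2^L for all L:

    from math import comb

    m = 20                                    # illustrative half-sample size
    for L in range(1, m + 2):
        exact = comb(m, L) / comb(2 * m, L)   # Pr[all L points land in S2]
        assert exact <= 0.5 ** L
    # At L = m + 1, comb(m, L) = 0, matching the Pr[...] = 0 remark above.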
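And here is a check of the Theorem 3 rearranging (again a sketch with
illustrative parameters; note that "small enough epsilon" here means
16e * epsilon * log_2(1/epsilon) < 1, so epsilon = 0.001 works while
epsilon = 0.01 does not):

    import math

    d, epsilon, delta = 10, 0.001, 0.05       # illustrative values
    m = ((4 * d / epsilon) * math.log2(1 / epsilon)
         + (2 / epsilon) * math.log2(2 / delta))

    # The union bound from the proof of Theorem 2, with C[2m] <= (2em/d)^d:
    union_bound = (2 * math.e * m / d) ** d * 0.5 ** (epsilon * m / 2)
    # The simplified form after the rearranging above:
    simplified = (16 * math.e * epsilon * math.log2(1 / epsilon)) ** d * delta / 2

    print(union_bound, "<", delta / 2, "?", union_bound < delta / 2)   # True
    print("simplified form:", simplified)                              # also < delta/2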