15-859(B) Machine Learning Theory                               01/27/10

* Chernoff/Hoeffding bounds and uniform convergence
* VC-dimension
=========================================================================
[this is for the last 1/4 or so of the lecture]

In the rest of class we'll prove theorem 4 and then theorem 2. Next time
we'll do 1 (and 3). Lots of proofs, so slow me down if it gets
confusing....

Theorem 4: For any algorithm A, there exists a distribution D and target
f in C such that if A sees fewer than (VCdim(C)-1)/(8*epsilon) examples,
its expected error is > epsilon.

PROOF: Consider d = VCdim(C) points that can be shattered. Define the
distribution D that puts prob 1-4*epsilon on one point, and prob
4*epsilon/(d-1) on each of the rest (call those the rare points). Pick a
random labeling of the d points as the target.

Let's first argue informally: in m < (d-1)/(8*epsilon) examples we expect
only 4*epsilon*m < (d-1)/2 of them to be rare ones. So, even if all went
to different rare points, we would still have (d-1)/2 rare points left
unseen, and at least a 2*epsilon chance that the next example is a new
one. Since the target is random, this implies at least an epsilon chance
of a mistake.

More formally,

  Pr(error) >= Pr(example is rare) * Pr(unseen | rare) * 1/2
            =  (4*epsilon) * [(1 - 4*epsilon/(d-1))^m] * (1/2)
            >= (4*epsilon) * [1 - m*(4*epsilon/(d-1))] * (1/2)
            >= (4*epsilon) * (1/2) * (1/2)
            =  epsilon,

where the second inequality uses (1-x)^m >= 1 - mx, and the third uses
m <= (d-1)/(8*epsilon).
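To make the construction concrete, here is a quick Monte Carlo sketch of
the lower bound (not from the lecture). It assumes a particular learner
that memorizes the labels of points it has seen and guesses negative on
unseen ones; since the target labeling is uniformly random, any fixed
guessing rule has the same expected error. With d = 21 and epsilon = 0.05,
the measured error should come out around 0.06 > epsilon.

    import random

    def lower_bound_error(d=21, epsilon=0.05, trials=20000):
        m = round((d - 1) / (8 * epsilon))   # sample size at the threshold
        points = list(range(d))
        # prob 1-4*epsilon on point 0; prob 4*epsilon/(d-1) on each rare one
        weights = [1 - 4 * epsilon] + [4 * epsilon / (d - 1)] * (d - 1)
        errors = 0
        for _ in range(trials):
            target = {x: random.randint(0, 1) for x in points}  # random f
            sample = random.choices(points, weights, k=m)       # draws ~ D
            seen = {x: target[x] for x in sample}
            x = random.choices(points, weights, k=1)[0]         # test pt ~ D
            guess = seen.get(x, 0)     # memorize seen labels, else guess 0
            errors += (guess != target[x])
        return errors / trials

    print(lower_bound_error())   # about 0.06 > epsilon = 0.05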
Now let's do Thm 2 (SAUER'S LEMMA). This says we can bound the number of
splittings using the VC-dimension. In particular, we'll prove
C[m] <= (m choose <=d), where d = VCdim(C) and we define
(m choose <=d) = # of ways of choosing d or fewer items out of m. I.e.,
(m choose 0) + (m choose 1) + (m choose 2) + ... + (m choose d).

Note: (m choose <=d) = (m-1 choose <=d) + (m-1 choose <=d-1). Can anyone
see why? Proof: to choose d or fewer out of m, you either choose the
first element and then d-1 or fewer out of the remaining m-1, or you
don't, and then choose d or fewer out of the remaining m-1.

Before proving the theorem, let's look at an example. Look at
C = "intervals [a,b] of the real line" (so, VCdim = 2). Say we have m
points, named 1,2,3,...,m in order, and we now delete point m. How does
the # of splittings go down? The answer is that some pairs of splittings,
like the ones given by [a,m] and [a,m-1], collapse to become the same
splitting. So, the number of splittings of the m points equals the number
of splittings of the m-1 points, which is at most (m-1 choose <=d) by
induction, plus the number of pairs that collapsed. But the splittings
that collapsed correspond to "final subintervals" of the remaining set,
which is a class of VC-dim 1 (or d-1 in general). So there are at most
(m-1 choose <=d-1) of these, by induction. The proof follows exactly
these lines.

PROOF OF THEOREM 2: Let's say we have a set S of m examples. We want to
bound the number of ways of splitting S using concepts in C. Define
C[S] = the set of different ways of splitting (partitioning) S by
concepts in C. Think of these as concepts defined only on S. Pick some
x in S. We know that C[S - {x}] has at most (m-1 choose <=d) splittings
by induction. How many more splittings of S are there than of S - {x}?
As before, the increase is exactly the number of pairs in C[S] that
differ only on x, since these are the ones that collapse when you remove
x. For each pair, let's focus on the one that labels x negative. Let
C' = {c in C[S] that label x negative, for which there is another concept
in C[S] that is identical to c except that it labels x positive}. (These
are concepts defined only on S.)

Claim: VCdim(C) >= VCdim(C') + 1. Proof: Pick a set S' of points in S
that is shattered by C'. S' doesn't include x, since all functions in C'
label x negative. Now, S' U {x} is shattered by C, because of how C' was
defined.

This means VCdim(C') <= d-1. So, by induction, C' gives at most
(m-1 choose <=d-1) splittings of S - {x} (every concept in C' labels x
negative, so it is determined by its behavior on S - {x}). Putting it
together:

  |C[S]| <= (m-1 choose <=d) + (m-1 choose <=d-1) = (m choose <=d),

and we are done.
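As a sanity check on both the lemma and the interval example, here is a
small Python sketch (the helper names are mine, not the lecture's). It
counts the distinct splittings of m collinear points by intervals, a
class with VCdim = 2, and compares against (m choose <=2); for intervals
the bound happens to hold with equality.

    from math import comb

    def binom_leq(m, d):
        # (m choose <=d): # of ways to choose d or fewer items out of m
        return sum(comb(m, k) for k in range(d + 1))

    def interval_splittings(m):
        # Distinct labelings of points 1..m by intervals [a,b]: the
        # positive set is always a contiguous (possibly empty) run.
        splits = {frozenset()}                   # the all-negative labeling
        for i in range(1, m + 1):
            for j in range(i, m + 1):
                splits.add(frozenset(range(i, j + 1)))
        return len(splits)

    for m in range(1, 11):
        assert interval_splittings(m) <= binom_leq(m, 2)  # Sauer, d = 2
        print(m, interval_splittings(m), binom_leq(m, 2)) # equal here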