15-859(B) Machine Learning Theory 02/06/07
* Chernoff/Hoeffding bounds and uniform convergence
* VC-dimension
========================================================================
[See Slides and handout for 1st half of class]
In the rest of class we'll prove Theorem 4 and then Theorem 2. Next
time we'll do 1 (and 3). Lots of proofs, so slow me down if it gets
confusing....
Theorem 4: For any algorithm A, there exists a distribution D and a
target f in C such that if A sees fewer than (VCdim(C)-1)/(8*epsilon)
examples, its expected error is > epsilon.
PROOF: Consider d = VC-dim(C) points that can be shattered. Define
the distribution D that puts prob 1-4*epsilon on one point, and
prob 4*epsilon/(d-1) on each of the rest (call those the rare points).
Pick a random labeling of the d points as the target. Let's first
argue informally: in m examples we expect only 4*epsilon*m to be
rare ones, which is less than (d-1)/2 since m < (d-1)/(8*epsilon).
So, even if every rare example hit a different rare point, at least
(d-1)/2 rare points are still unseen, giving at least a 2*epsilon
chance that the next example is an unseen rare point. Since the
target is random, an unseen point is misclassified with probability
1/2, so there is at least an epsilon chance of a mistake.
More formally,
  Pr(error) >= Pr(example is rare) * Pr(unseen | rare) * (1/2)
            =  (4*epsilon) * [(1 - 4*epsilon/(d-1))^m] * (1/2)
            >  (4*epsilon) * [1 - m*(4*epsilon/(d-1))] * (1/2)
            >= (4*epsilon) * (1/2) * (1/2)  =  epsilon,
where the strict inequality uses (1-x)^m > 1 - mx, and the last step
uses m < (d-1)/(8*epsilon).
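As a quick sanity check, each inequality in the chain above can be verified
numerically. This is a small Python sketch; the particular values of epsilon,
d, and m are my own illustrative choices, not from the lecture.

```python
# Check the chain of inequalities in the proof of Theorem 4 for one
# illustrative setting (epsilon, d, m chosen here for illustration).
epsilon, d = 0.05, 41
m = 99                                   # any m < (d-1)/(8*epsilon) = 100 works
assert m < (d - 1) / (8 * epsilon)

# exact middle term: (4*eps) * (1 - 4*eps/(d-1))^m * (1/2)
exact = 4 * epsilon * (1 - 4 * epsilon / (d - 1)) ** m * 0.5
# linearized term, using (1-x)^m > 1 - m*x
linear = 4 * epsilon * (1 - m * 4 * epsilon / (d - 1)) * 0.5

assert exact > linear                    # (1-x)^m > 1 - m*x
assert linear >= 4 * epsilon * 0.5 * 0.5 # since m < (d-1)/(8*epsilon)
print(exact, linear)                     # both exceed epsilon = 0.05
```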
Now let's do Thm 2 (SAUER'S LEMMA). This says we can bound the number
of splittings using the VC-dimension. In particular, we'll prove C[m] <=
(m <=d), where d = VCdim(C) and we define (m <=d) = # ways of choosing
d or fewer items out of m, i.e., (m choose 0) + (m choose 1) + ... + (m choose d).
Note: (m <=d) = (m-1 <=d) + (m-1 <=d-1). Can anyone see why?
Proof: to choose d or fewer out of m, you either include the first
element and then choose d-1 or fewer out of the remaining m-1, or you
don't include it and choose d or fewer out of the remaining m-1.
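The identity is easy to check exhaustively. Here is a small Python sketch;
`phi(m, d)` is my name for the quantity (m <=d) above.

```python
from math import comb

def phi(m, d):
    """(m <=d): number of ways to choose d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

# Check the identity (m <=d) = (m-1 <=d) + (m-1 <=d-1) for small m, d.
# Note phi(m, -1) = 0 (the empty sum), which handles the d = 0 case.
for m in range(1, 15):
    for d in range(0, m + 1):
        assert phi(m, d) == phi(m - 1, d) + phi(m - 1, d - 1)
```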
Before proving the theorem, let's look at an example.
Look at C = "intervals [a,b] of the real line" (so, VCdim = 2).
Say we have m points, named 1,2,3,...,m in order, and we now delete
point m. How does the number of splittings go down? The answer is
that some pairs, like [a,m] and [a,m-1], collapse to become the same
splitting. So, the number of splittings of the m points equals the
number of splittings of the m-1 points, which is at most (m-1 <=d) by
induction, plus the number of pairs that collapsed. But the pairs that
collapsed correspond to "final subintervals" of the previous set,
which is a class with VC-dim 1 (or d-1 in general). So there are at
most (m-1 <=d-1) of these, by induction.
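The interval example can be confirmed by brute force. In this Python sketch
(names are mine), the class is represented by the set of labelings it induces
on m points, and the counts match (m <=2) exactly, with the drop from deleting
the last point equal to (m-1 <=1).

```python
from math import comb

def phi(m, d):
    """(m <=d): number of ways to choose d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

def interval_splittings(m):
    """All distinct labelings of points 0..m-1 achievable by intervals [a,b]."""
    labelings = {(False,) * m}                 # the empty interval
    for a in range(m):
        for b in range(a, m):
            labelings.add(tuple(a <= i <= b for i in range(m)))
    return labelings

for m in range(1, 10):
    n = len(interval_splittings(m))
    # number of splittings = 1 + m(m+1)/2 = (m <=2), matching VCdim = 2
    assert n == 1 + m * (m + 1) // 2 == phi(m, 2)
    # deleting the last point loses exactly the collapsed pairs: (m-1 <=1) of them
    assert n - len(interval_splittings(m - 1)) == phi(m - 1, 1)
```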
The proof in general follows exactly these lines.
PROOF OF THEOREM 2: Let's say we have a set S of m examples. We want
to bound number of ways of splitting S using concepts in C. Define
C[S] = set of different ways of splitting (partitioning) S by concepts
in C. Think of these as concepts defined only on S.
Pick some x in S. By induction, C[S - {x}] has at most (m-1 <=d)
partitions.
How many more splittings does S have than S - {x}? As before, the
increase is exactly the number of pairs in C[S] that differ only on x,
since these are the pairs that collapse when you remove x.
For each pair, let's focus on the one that labels x negative.
Let C' = {c in C[S] that label x negative for which there is another
concept in C[S] that is identical to c except it labels x positive}.
(these are concepts that are only defined on S)
Claim: VCDIM(C) >= VCDIM(C') + 1.
Proof: Pick a set S' of points in S that is shattered by C'. S'
doesn't include x, since all functions in C' label x negative. Now,
S' U {x} is shattered by C[S]: every labeling of S' is achieved by
some c in C' (which labels x negative), and by the definition of C'
also by a matching concept that labels x positive.
This means VCDIM(C') <= d-1, so by induction |C'| <= (m-1 <=d-1).
Putting it together, |C[S]| <= (m-1 <=d) + (m-1 <=d-1) = (m <=d),
and we are done.
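Finally, Sauer's Lemma itself can be checked by brute force on small random
concept classes. This Python sketch (names and parameters are mine) computes
the VC-dimension of a class given as a set of labelings of range(m) and
verifies |C[S]| <= (m <=d).

```python
from itertools import combinations
from math import comb
import random

def phi(m, d):
    """(m <=d): number of ways to choose d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

def vcdim(concepts, m):
    """VC-dimension of a class given as a set of labelings of range(m)."""
    d = 0
    for k in range(1, m + 1):
        # a size-k set S is shattered iff all 2^k patterns appear on it
        if any(len({tuple(c[i] for i in S) for c in concepts}) == 2 ** k
               for S in combinations(range(m), k)):
            d = k
        else:
            break   # no size-k set shattered => no larger set shattered
    return d

random.seed(0)
m = 8
for _ in range(20):
    concepts = {tuple(random.random() < 0.5 for _ in range(m))
                for _ in range(random.randrange(1, 60))}
    # Sauer's Lemma: number of splittings <= (m <=d)
    assert len(concepts) <= phi(m, vcdim(concepts, m))
```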