15-859(B) Machine Learning Theory 02/08/07
VC-dimension II
* Sauer's lemma
* Proving the main result
========================================================================
Where we are right now: [See Slides]
Given a concept class C and set of examples S, define C[S] as the set
of distinct functions over S produced by concepts in C. Then let's
define C[m] = max_{S:|S|=m} |C[S]|.
SAUER'S LEMMA: C[m] <= (m <=d), where d = VCdim(C) and we define
(m <=d) = # ways of choosing d or fewer items out of m. I.e.,
(m <=d) = (m 0) + (m 1) + (m 2) + ... + (m d).
Recall: (m <=d) = (m-1 <=d) + (m-1 <=d-1).
Proof: to choose d or fewer out of m, you either include the first
element and then choose d-1 or fewer out of the remaining m-1, or you
don't include it and choose d or fewer out of the remaining m-1.
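As a quick sanity check, here is a small Python sketch (the helper
name phi is made up for illustration) that computes (m <=d) directly
and verifies the recurrence above:

```python
from math import comb

def phi(m, d):
    """(m <=d): the number of ways of choosing d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

# Verify the recurrence (m <=d) = (m-1 <=d) + (m-1 <=d-1).
for m in range(1, 12):
    for d in range(1, m + 1):
        assert phi(m, d) == phi(m - 1, d) + phi(m - 1, d - 1)
```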
Last time we gave the main idea of the proof of Sauer's lemma. We'll
do the formal proof now.
PROOF OF SAUER'S LEMMA: Let's say we have a set S of m examples. We
want to bound |C[S]|.
Pick some x in S. So, we know that C[S - {x}] has at most (m-1 <=d)
distinct partitions by induction.
How many more splittings of S are there than of S - {x}? As in
the special case we examined last class, |C[S]| - |C[S - {x}]| = the
number of pairs in C[S] that differ only on x, since these are the
ones that collapse when you remove x.
For each pair, let's focus on the one that labels x negative.
Let C' = {c in C[S] that label x negative for which there is another
concept in C[S] that is identical to c except it labels x positive}.
(Remember, these are concepts that are only defined on S)
Claim: VCDIM(C) >= VCDIM(C') + 1.
Proof: Pick set S' of pts in S that are shattered by C'. S' doesn't
include x since all functions in C' label x negative. Now, S' U {x}
is shattered by C because of how C' was defined.
This means VCDIM(C') <= d-1, so by induction |C'| <= (m-1 <=d-1).
Putting it together, |C[S]| = |C[S - {x}]| + |C'| <=
(m-1 <=d) + (m-1 <=d-1) = (m <=d). QED
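To see the bound in action, here is a sketch (an example I'm adding,
not from the lecture) using intervals [a,b] on the line, a class with
VC-dimension 2. On m distinct points the achievable labelings are the
contiguous runs plus the empty labeling, which turns out to match
(m <=2) exactly, so Sauer's lemma is tight for this class:

```python
from math import comb

def phi(m, d):
    """(m <=d) = sum of (m choose i) for i = 0..d."""
    return sum(comb(m, i) for i in range(d + 1))

def interval_labelings(m):
    """Distinct labelings of m distinct points on the line produced by
    intervals [a,b]; the positives are always a contiguous run."""
    labelings = {(False,) * m}               # the empty interval
    for i in range(m):
        for j in range(i, m):
            labelings.add(tuple(i <= k <= j for k in range(m)))
    return labelings

for m in range(1, 10):
    assert len(interval_labelings(m)) == phi(m, 2)  # Sauer bound, tight
```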
---------------------------------------------------------------------
THEOREM 1: For any class C, distrib D, if the number of examples seen
m satisfies: m > (2/epsilon)[log_2(2C[2m]) + log_2(1/delta)], then
with prob (1-delta), all bad (error > epsilon) hypotheses in C are
inconsistent with data. [can remove the "2" over the epsilon if we like]
Proof: (Book has a very nice proof. Will follow it, but simplify
further by not introducing notation. Will use the term "bad concept"
to mean one with error > epsilon.)
Given m examples, define A = event there exists a bad consistent hyp in C.
We want to show Pr[A] is low.
Define event B as follows. B = event that in 2m examples, there exists
a concept in C that is consistent on the first m but has error at
least epsilon/2 on the second m.
Claim: Pr[A] < 2*Pr[B]. So, if Pr[B] is low then Pr[A] is low.
Pf: Say we pick 2m examples. Note that A depends only on the first
m of them: A = event there is a bad consistent hyp on the first m.
Then Pr[B|A] > 1/2 by Chernoff or std dev.
(the bad h has true error > epsilon so unlikely it will have
observed error < epsilon/2). This means that Pr[B] >= Pr[A]/2.
QED
(could have actually defined B with epsilon, not epsilon/2, since
for epsilon<1/2, median of binomial is <= mean)
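The Chernoff step can be checked numerically with an exact binomial
tail sum; the numbers below (true error 0.1, m = 500) are made-up
example values, not from the notes:

```python
from math import comb

def pr_below(m, p, thresh):
    """Pr[Bin(m, p) < thresh]: chance that a hyp with true error p
    makes fewer than thresh mistakes on m fresh examples."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(thresh))

# True error p = 0.1; observed error < epsilon/2 with epsilon = 0.1
# means fewer than (epsilon/2)*m = 25 mistakes out of m = 500.
p_fail = pr_below(500, 0.1, 25)
assert p_fail < 0.5   # so Pr[B|A] > 1/2, as the claim needs
```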
Now, show Pr[B] is low. (Note: no longer need to talk about "bad"
hyps. Reason: for really good hyps, unlikely to satisfy the second part)
Re-write event B:
Draw set S of 2m examples, and then split randomly into S1, S2 of
equal size. B = event that exists hyp in C consistent on S1 but
makes a mistake on at least epsilon*m/2 points in S2.
How to bound this? There are LOTS of concepts in C. But, we'll use
the fact that there aren't so many splittings of S using concepts in
C. So, let's fix some set S; it can be arbitrary. There are at most
C[2m] splittings of S by concepts in C.
Now, fix some splitting h that disagrees with the target on at least
epsilon*m/2 points in S. (this is a candidate for causing event B to occur)
Want to show that Pr[h is consistent on S1] is small.
(Probability is over the splitting of S into S1 and S2)
Then will apply union bound over the at most C[2m] <= (2m <=d) such splittings.
Lemma: Suppose h disagrees with target on at least L points
in S. Then, Pr[h is consistent on S1] <= 1/2^L.
Intuition: Suppose we split S by putting each point randomly into S1
or S2 (i.e., didn't require equal size), then Pr[all the L points fall
into S2] = 1/2^L. Turns out that the experiment we care about is
only more helpful to us. E.g., if L = m+1 then we get Pr[...] = 0.
Proof: Pr[all L in S2] = (m L)/(2m L). Reason: instead of fixing
the L points and randomly splitting into equal-sized S1, S2, we can
equivalently think of it as fixing the split and then picking the L
points randomly. Expanding, (m L)/(2m L) =
[m(m-1)...(m-L+1)] / [2m(2m-1)...(2m-L+1)] <= 1/2^L, since each
factor (m-i)/(2m-i) is at most 1/2. QED
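The lemma's probability can be checked exhaustively for small m (a
quick sketch; the helper name is mine):

```python
from math import comb

def p_all_in_S2(m, L):
    """Pr[all L disagreement points land in S2] when 2m points are
    split into equal halves S1, S2: (m choose L) / (2m choose L)."""
    if L > m:
        return 0.0
    return comb(m, L) / comb(2 * m, L)

# The equal-split experiment is never worse than 1/2^L.
for m in range(1, 15):
    for L in range(0, 2 * m + 1):
        assert p_all_in_S2(m, L) <= 0.5 ** L
```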
Now, we apply union bound:
Pr[B] <= C[2m] * (1/2)^{epsilon*m/2}. Set this to delta/2.
That's it. Solving for m gives us Theorem 1.
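Plugging Sauer's lemma into the theorem, one can solve for the sample
size numerically. A sketch (epsilon, delta, d below are made-up
example values; it scans upward, relying on the right-hand side
growing only logarithmically in m):

```python
from math import comb, log2

def phi(m, d):
    return sum(comb(m, i) for i in range(d + 1))

def sample_bound(epsilon, delta, d):
    """Smallest m with m > (2/epsilon)[log2(2*phi(2m,d)) + log2(1/delta)],
    using C[2m] <= (2m <=d) = phi(2m, d)."""
    m = 1
    while m <= (2 / epsilon) * (log2(2 * phi(2 * m, d)) + log2(1 / delta)):
        m += 1
    return m

m = sample_bound(0.1, 0.05, 3)   # epsilon=0.1, delta=0.05, VC-dim 3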
Note the key trick in the argument here. We said that an equivalent
experiment for looking at event B is to (1) draw 2m examples and then
(2) *randomly* partition those examples into a "training" and "test"
set. The reason this is important is that after step (1) the set C[S]
is fixed, but we've still saved a lot of our randomness to use in
getting a confidence bound.
===================Do this only if time======================================
How about a uniform-convergence version of Theorem 1?
Theorem 1': For any class C, distrib D, if:
m > (2/epsilon^2)[ln(2C[2m]) + ln(1/delta)]
then with prob 1-delta, all h in C have |err_D(h)-err_S(h)| < epsilon.
Just need to redo proof using Hoeffding. Draw 2m examples:
A = event that on 1st m, there exists hyp in C with empirical and
true error that differ by at least epsilon. (This is what we want to
bound for the theorem).
B = event that there exists a concept in C whose empirical error on
1st half differs from empirical error on 2nd half by at least
epsilon/2.
Pr[B|A] >= 1/2 so Pr[A] <= 2*Pr[B]. Now, show Pr[B] is low:
As before, let's first pick S, then randomly decide what is 1st half
and what is 2nd half. Once S is determined, there are only C[2m] h's
we need to worry about.
Now, fix some h, making L mistakes overall on S. The bad event is
that >= L/2 + epsilon*m/4 of them fall into S1 (or S2), so that the
counts in the two halves differ by at least epsilon*m/2. Can assume
L <= m,
else we focus on the non-mistakes. Let's pretend we were putting each
point iid into S1 or S2 (rather than splitting them up exactly 50/50).
As before, this can only increase the chance of the bad event.
(informal proof of this: the actual distribution can be thought of as going
through the points in S one at a time, where if we currently have
k1 slots left in S1 and k2 slots left in S2, then the next point is
put into S1 with probability k1/(k1+k2) and S2 with prob
k2/(k1+k2). This means if we just watch the mistakes, if more
appear in S1 than S2, the next one is more likely to be in S2 and
vice versa. So this only makes the bad event less likely than for
the iid game).
So, by just focusing on where the mistakes go, can think of as just:
we flip L<=m fair coins, and we want to bound the chance of having
epsilon*m/2 more heads than tails or vice versa. Using Hoeffding, this
is at most 2e^{-epsilon^2 m / 2}. Set to delta/C[2m] and we're done.
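The "iid game only hurts" claim can be checked exactly for small
cases: under the exact 50/50 split the number of mistakes landing in
S1 is hypergeometric, while the iid version is Binomial(L, 1/2), and
the upper tail of the former is dominated. A sketch of that
comparison (function names are mine):

```python
from math import comb

def tail_exact_split(m, L, t):
    """Pr[>= t of the L marked points land in S1] when 2m points are
    split into equal halves (hypergeometric tail)."""
    total = comb(2 * m, m)
    return sum(comb(L, k) * comb(2 * m - L, m - k)
               for k in range(t, min(L, m) + 1)) / total

def tail_iid(L, t):
    """Same tail if each point goes to S1 independently with prob 1/2."""
    return sum(comb(L, k) for k in range(t, L + 1)) / 2 ** L

# The exact split never makes the bad event more likely than iid.
for m in [5, 10, 20]:
    for L in range(1, m + 1):
        for t in range(L // 2 + 1, L + 1):
            assert tail_exact_split(m, L, t) <= tail_iid(L, t) + 1e-12
```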
==========================================================================
To get Cor 3, use: C[2m] <= (2m <=d) <= (2em/d)^d. Then just some
rearranging...
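The inequality (2m <=d) <= (2em/d)^d can be spot-checked numerically
(a quick sketch, reusing the phi helper from above):

```python
from math import comb, e

def phi(m, d):
    return sum(comb(m, i) for i in range(d + 1))

# Check (2m <=d) <= (2em/d)^d for a range of d and m >= d.
for d in range(1, 8):
    for m in range(d, 50):
        assert phi(2 * m, d) <= (2 * e * m / d) ** d
```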
===========================================================================
More notes: It's pretty clear we can replace C[2m] with a "high
probability" bound with respect to the distribution D. I.e. a number
such that with high probability, a random set S of 2m points drawn
from D wouldn't have more than this many labelings. With
extra work, you can reduce it further to the *expected* number of
labelings of a random set S of 2m points. In fact, there's been a lot
of work on reducing it even further to "the number of possible
labelings of your actual sample". See, e.g., the paper "Sample Based
Generalization Bounds".