15-859(A) Machine Learning Theory 02/22/06
* VC-dimension: proving the main result
[Hand out / discuss hwk2 solutions]
========================================================================
Where we are right now: [See Slides]
Recall: using C[m] to denote the maximum number of different ways of
labeling a set of m points using functions in C. (maximum over all
possible sets S of m points)
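To make this concrete, here is a small sketch (not from the lecture; it uses
the toy class of threshold functions h_t(x) = 1 iff x >= t on the line, which
has VC-dimension 1) that enumerates the achievable labelings of a point set,
giving C[m] = m+1:

```python
# Toy growth-function computation for the class of threshold functions
# h_t(x) = 1 if x >= t else 0 on the real line (VC-dimension 1).
# For m distinct points, exactly m+1 labelings are achievable, so C[m] = m+1.

def threshold_labelings(points):
    """Distinct labelings of `points` achievable by threshold functions."""
    pts = sorted(points)
    # Sweeping t from -infinity to +infinity, the labeling only changes as
    # t crosses a point, so it suffices to try one t in each gap.
    candidates = [pts[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    candidates += [pts[-1] + 1.0]
    return {tuple(1 if x >= t else 0 for x in points) for t in candidates}

print(len(threshold_labelings([0.3, 1.1, 2.5, 4.0])))  # prints 5, i.e. m + 1
```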
Theorem 2: For any class C, distrib D, if the number of examples seen
m satisfies: m > (2/epsilon)[log_2(2C[2m]) + log_2(1/delta)], then
with prob (1-delta), all bad (error > epsilon) hypotheses in C are
inconsistent with data.
[can get rid of the "2" over the epsilon if we like]
Proof: (Book has a very nice proof. Will follow it, but simplify
further by not introducing notation. Will use the term "bad concept"
to mean one with error > epsilon.)
Given m examples, define A = event there exists a bad consistent hyp in C.
We want to show Pr[A] is low.
Define event B as follows. B = event that in 2m examples, there exists
a concept in C that is consistent on the first m but has error at
least epsilon/2 on the second m.
Claim: Pr[A] <= 2*Pr[B]. So, if Pr[B] is low then Pr[A] is low.
Pf: Say we pick 2m examples. Let A = event there is a bad
consistent hyp on first m. Then Pr[B|A] > 1/2 by Chernoff or std dev.
(the bad h has true error > epsilon so unlikely it will have
observed error < epsilon/2). This means that Pr[B] >= Pr[A]/2.
QED
(could have actually defined B with epsilon instead of epsilon/2: a bad h
makes more than epsilon*m mistakes in expectation on the second half, and
the median of a binomial is within 1 of its mean, so it makes at least
epsilon*m mistakes with probability at least 1/2, up to rounding)
Now, show Pr[B] is low. (Note: no longer need to talk about "bad"
hyps. Reason: for really good hyps, unlikely to satisfy the second part)
Re-write event B:
Draw set S of 2m examples, and then split randomly into S1, S2 of
equal size. B = event that exists hyp in C consistent on S1 but errs on
at least epsilon*m/2 points in S2.
How to bound this? There are LOTS of concepts in C. But, we'll use
the fact that there aren't so many splittings of S using concepts in
C. So, let's fix some set S; it can be arbitrary. By definition of the
growth function, there are at most C[2m] splittings (labelings) of S
using concepts in C.
Now, fix some splitting h that disagrees with the target on at least
epsilon*m/2 points in S. (this is a candidate for causing event B to occur)
Want to show that Pr[h is consistent on S1] is small.
(Probability is over the splitting of S into S1 and S2)
Then will apply union bound over the at most C[2m] <= ((2m d)) such
splittings, where ((2m d)) denotes sum_{i=0}^{d} (2m choose i), the
Sauer's lemma bound for a class of VC-dimension d.
Lemma: Suppose h disagrees with target on at least L points
in S. Then, Pr[h is consistent on S1] <= 1/2^L.
Intuition: Suppose we split S by putting each point randomly into S1
or S2 (i.e., didn't require equal size), then Pr[all the L points fall
into S2] = 1/2^L. Turns out that the experiment we care about is
only more helpful to us. E.g., if L = m+1 then we get Pr[...] = 0.
Proof: Pr[all L in S2] = (m choose L)/(2m choose L). Reason: instead of
fixing the L points and randomly splitting into equal-sized S1, S2, we can
equivalently think of it as fixing the split and then picking the L
points randomly. Now, just solve: (m choose L)/(2m choose L) =
prod_{i=0}^{L-1} (m-i)/(2m-i) <= (1/2)^L, since each factor is at most 1/2.
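The lemma's inequality can be checked exactly for small m (a sketch, not part
of the proof):

```python
from math import comb

def consistent_prob(m, L):
    """Pr[all L disagreement points fall in S2] under a random equal split
    of 2m points into S1, S2 -- equals (m choose L)/(2m choose L)."""
    return comb(m, L) / comb(2 * m, L)

# Verify (m choose L)/(2m choose L) <= (1/2)^L for all L <= m, m < 20:
for m in range(1, 20):
    for L in range(0, m + 1):
        assert consistent_prob(m, L) <= 0.5 ** L
print("lemma bound verified for all m < 20")
```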
Now, we apply union bound:
Pr[B] <= C[2m] * (1/2)^{epsilon*m/2}. Set this to delta/2.
That's it. Gives us Theorem 2.
==========================================================================
How about a uniform-convergence version of Theorem 2?
Theorem 2': For any class C, distrib D, if:
m > (2/epsilon^2)[ln(2C[2m]) + ln(1/delta)]
then with probability 1-delta, all h in C have empirical error within
epsilon of their true error.
Just need to redo proof using Hoeffding. Draw 2m examples:
A = event that on 1st m, there exists hyp in C with empirical and
true error that differ by at least epsilon. (This is what we want to
bound for the theorem).
B = event that there exists a concept in C whose empirical error on
1st half differs from empirical error on 2nd half by at least
epsilon/2.
Pr[B|A] >= 1/2 so Pr[A] <= 2*Pr[B]. Now, show Pr[B] is low:
As before, let's first pick S, then randomly decide what is 1st half
and what is 2nd half. Once S is determined, there are only C[2m] h's
we need to worry about.
Now, fix some h, making L mistakes overall on S. The bad event is
that at least L/2 + epsilon*m/4 of them fall into S1 (or into S2), i.e.,
that the two halves differ by at least epsilon*m/2 mistakes. Can assume L <= m,
else we focus on the non-mistakes. Let's pretend we were putting each
point iid into S1 or S2 (rather than splitting them up exactly 50/50).
As before, this can only increase the chance of the bad event.
(informal proof of this: the actual distribution can be thought of as going
through the points in S one at a time, where if we currently have
k1 slots left in S1 and k2 slots left in S2, then the next point is
put into S1 with probability k1/(k1+k2) and S2 with prob
k2/(k1+k2). This means if we just watch the mistakes, if more
appear in S1 than S2, the next one is more likely to be in S2 and
vice versa. So this only makes the bad event less likely than for
the iid game).
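The parenthetical claim -- that the exact 50/50 split makes the bad event no
more likely than independent coin flips -- can be checked exactly for small
cases by comparing the hypergeometric tail (exact split) to the binomial tail
(iid split). A sketch, not a proof:

```python
from math import comb

def exact_split_tail(m, L, k):
    """Pr[at least k of L marked points land in S1] when 2m points are
    split into equal halves S1, S2 uniformly at random (hypergeometric)."""
    total = comb(2 * m, m)
    return sum(comb(L, j) * comb(2 * m - L, m - j)
               for j in range(k, min(L, m) + 1)) / total

def iid_split_tail(L, k):
    """Pr[at least k of L marked points land in S1] when each point is
    sent to S1 or S2 by an independent fair coin (binomial)."""
    return sum(comb(L, j) for j in range(k, L + 1)) / 2 ** L

# For deviations above the mean L/2, the exact split is never worse:
for m in range(2, 10):
    for L in range(1, m + 1):
        for k in range(L // 2 + 1, L + 1):
            assert exact_split_tail(m, L, k) <= iid_split_tail(L, k) + 1e-12
print("ok: exact split never beats iid split in the cases checked")
```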
So, by just focusing on where the mistakes go, can think of as just:
we flip L<=m fair coins, and we want to bound the chance of having
epsilon*m/2 more heads than tails or vice versa. Using Hoeffding, this
is at most 2e^{-epsilon^2 m / 2}. Set this to delta/(2C[2m]) (so that
Pr[B] <= delta/2) and we're done.
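For reference, a sketch comparing the exact two-sided tail for L fair coins
against the textbook form of Hoeffding's inequality,
Pr[|Heads - L/2| >= a] <= 2e^{-2a^2/L} (the lecture keeps its constant
bookkeeping looser than this):

```python
from math import comb, exp

def two_sided_tail(L, a):
    """Exact Pr[|Heads - L/2| >= a] for L fair coin flips."""
    lo, hi = L / 2 - a, L / 2 + a
    return sum(comb(L, j) for j in range(L + 1) if j <= lo or j >= hi) / 2 ** L

def hoeffding_bound(L, a):
    """Standard two-sided Hoeffding bound for L fair coins."""
    return 2 * exp(-2 * a * a / L)

# Hypothetical values: L = m = 100 coins, deviation a = epsilon*m/4 with
# epsilon = 0.2 (heads exceeding tails by epsilon*m/2).
L, a = 100, 5
print(two_sided_tail(L, a) <= hoeffding_bound(L, a))  # prints True
```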
==========================================================================
To get Theorem 3, use: C[2m] <= ((2m d)) <= (2em/d)^d. Then just some
rearranging...
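Since m appears on both sides of the bound, one can solve for it by simple
iteration. A sketch (hypothetical values epsilon = 0.1, delta = 0.05, d = 10;
not from the lecture) using Theorem 2's bound with C[2m] <= (2em/d)^d:

```python
from math import log2, e, ceil

def sample_bound(epsilon, delta, d):
    """Find m satisfying m > (2/epsilon)[log2(2*C[2m]) + log2(1/delta)]
    with the Sauer bound C[2m] <= (2em/d)^d, by iterating to a fixed point."""
    m = max(d, 1)
    for _ in range(100):
        rhs = (2 / epsilon) * (log2(2 * (2 * e * m / d) ** d) + log2(1 / delta))
        new_m = ceil(rhs) + 1
        if new_m <= m:       # m now exceeds the right-hand side: done
            return m
        m = new_m
    return m

m = sample_bound(0.1, 0.05, 10)
# Sanity check: the returned m really satisfies the inequality.
assert m > (2 / 0.1) * (log2(2 * (2 * e * m / 10) ** 10) + log2(1 / 0.05))
print(m)
```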
===========================================================================
More notes: It's pretty clear we can replace C[2m] with a "high
probability" bound with respect to the distribution D. I.e. a number
such that with high probability, a random set S of 2m points drawn
from D wouldn't have more than this many labelings. With
extra work, you can reduce it further to the *expected* number of
labelings of a random set S of 2m points. In fact, there's been a lot
of work on reducing it even further to "the number of possible
labelings of your actual sample". See, e.g., the paper "Sample Based
Generalization Bounds".