15-859(B) Machine Learning Theory 02/15/06
* Chernoff/Hoeffding recap
* VC-dimension
========================================================================
[See Slides]
In the rest of class we'll prove Theorem 4 and then Theorem 1. Next
time we'll do 2 (and 3). Lots of proofs, so slow me down if it gets
confusing....
Theorem 4: there exists a distribution of examples and target concepts
under which any algorithm requires Omega(VCdim/epsilon) examples.
Consider d = VC-dim(C) points that can be shattered and a target
that is random on these. Define the distribution D like this:
1-4*epsilon on one point, 4*epsilon/(d-1) on each of the rest. Let's
first argue informally: in m examples we expect only a 4*epsilon
fraction to land on the rare points. For m = (d-1)/(8*epsilon), that's
(d-1)/2 rare examples, so at least half of the d-1 rare points are
likely still unseen. So there's at least a (4*epsilon)*(1/2) chance
that the next example is a new rare point, and if so we have a 1/2
chance of making a mistake (since the target is random on these points).
More formally, Pr(error) >= Pr(example is in rare set)*Pr(unseen|rare)*(1/2)
= (4*epsilon)*[(1 - 4*epsilon/(d-1))^m]*(1/2)
>= (4*epsilon)*[1 - m*(4*epsilon/(d-1))]*(1/2)
>= (4*epsilon)*(1/2)*(1/2) = epsilon, using m <= (d-1)/(8*epsilon).
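To sanity-check the chain above numerically, here's a quick sketch (not from the lecture; the particular (epsilon, d) pairs are illustrative assumptions). For any m <= (d-1)/(8*epsilon), the exact middle expression should already be >= epsilon:

```python
# Numerical check of the Theorem 4 lower bound -- a sketch, with
# illustrative (epsilon, d) values.

def error_lower_bound(epsilon, d, m):
    """Pr(error) >= Pr(rare) * Pr(unseen | rare) * 1/2, computed exactly."""
    return 4 * epsilon * (1 - 4 * epsilon / (d - 1)) ** m * 0.5

for epsilon, d in [(0.01, 101), (0.05, 21), (0.1, 11)]:
    m = int((d - 1) / (8 * epsilon))     # the sample size in the argument
    lb = error_lower_bound(epsilon, d, m)
    assert lb >= epsilon, (epsilon, d, m, lb)
    print(f"eps={epsilon}, d={d}, m={m}: lower bound {lb:.4f} >= eps")
```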
Proof of Theorem 1: Theorem says we can bound the number of splittings
using the VC-dimension. In particular, we'll prove that C[m] <= ((m
d)), where d = VCdim(C) and we define ((m d)) = # ways of choosing d
or fewer items out of m. I.e., (m 0) + (m 1) + (m 2) + ... + (m d).
note: ((m d)) = ((m-1 d)) + ((m-1 d-1)). Can anyone see why?
Proof: to choose d or fewer out of m, you either choose the
first element and then d-1 or fewer out of the remaining m-1, or you
don't, and then choose d or fewer out of the remaining m-1.
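To see the recurrence concretely, here's a small sketch (the tested range of m is arbitrary) that computes ((m d)) directly and checks the identity:

```python
from math import comb

def phi(m, d):
    """((m d)) = number of ways of choosing d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

# Check the Pascal-style recurrence ((m d)) = ((m-1 d)) + ((m-1 d-1)).
for m in range(1, 20):
    for d in range(1, m + 1):
        assert phi(m, d) == phi(m - 1, d) + phi(m - 1, d - 1)
print("recurrence holds for all m < 20")
```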
Before proving the theorem, let's look at an example.
Look at C = "intervals [a,b] of the real line" (so, VCdim = 2).
Say we have m-1 points, named 1,2,3,...,m-1 and we now add a new point
m at the end. How does # splittings increase? if we look at the old
splittings, some of them, like [2,5], are committed: they have to
label m negative. But some, like [2,m-1], can be extended in both
ways to m. So the new number of splittings equals the old number,
which is at most ((m-1 d)) by induction, plus the number of old
splittings that can be extended in two ways. But the splittings that
can be extended correspond to "final subintervals" [a, m-1] of the
previous set, and these form a class with VC-dim 1 (or d-1 in general).
So there are at most ((m-1 d-1)) of these, by induction.
The general proof follows exactly these lines.
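As a sanity check on the intervals example, here's a sketch (not from the lecture) that enumerates all splittings of m points by intervals. In fact intervals meet the ((m 2)) bound exactly: C(m,2) + m nonempty intervals, plus the all-negative labeling:

```python
from math import comb

def phi(m, d):
    """((m d)) = number of ways of choosing d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

def interval_splittings(m):
    """All labelings of points 1..m realizable by some interval [a, b]
    (plus the empty interval, which labels every point negative)."""
    labelings = {(False,) * m}                    # empty interval
    for a in range(1, m + 1):
        for b in range(a, m + 1):
            labelings.add(tuple(a <= x <= b for x in range(1, m + 1)))
    return labelings

# Intervals have VCdim 2 and realize exactly ((m 2)) splittings.
for m in range(1, 10):
    assert len(interval_splittings(m)) == phi(m, 2)
print("intervals realize exactly ((m 2)) splittings for m < 10")
```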
Proof of theorem 1:
Let's say we have a set S of m examples. We want to bound number of
ways of splitting S using concepts in C. Define C[S] = set of
different ways of splitting (partitioning) S by concepts in C. Think
of these as concepts defined only on S.
Pick some x in S. So, we know that C[S - {x}] has at most ((m-1 d))
partitions by induction.
How many more splittings does S have than S - {x}? As before, the
increase is exactly the number of pairs in C[S] that differ only on x,
since these are the pairs that collapse into one when you remove x.
For each pair, let's focus on the one that labels x negative.
Let C' = {c in C[S] that label x negative for which there is another
concept in C[S] that is identical to c except it labels x positive}.
(these are concepts that are only defined on S)
Claim: VCDIM(C) >= VCDIM(C') + 1.
Proof: Pick set S' of pts in S that are shattered by C'. S' doesn't
include x since all functions in C' label x negative. Now, S' U {x}
is shattered by C by how C' was defined.
This means VCDIM(C') <= d-1, so |C'| <= ((m-1 d-1)) by induction. So
|C[S]| <= ((m-1 d)) + ((m-1 d-1)) = ((m d)), and we are done.
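The theorem can be checked by brute force on small classes. Here's a sketch (the random class and seed are arbitrary assumptions) that computes the VC-dimension of a set of labelings directly and verifies |C[S]| <= ((m d)):

```python
from itertools import combinations
from math import comb
import random

def phi(m, d):
    """((m d)) = number of ways of choosing d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

def vcdim(labelings, m):
    """Brute-force VC-dimension of a set of labelings of points 0..m-1."""
    best = 0
    for d in range(1, m + 1):
        shattered = False
        for S in combinations(range(m), d):
            projections = {tuple(lab[i] for i in S) for lab in labelings}
            if len(projections) == 2 ** d:        # S is shattered
                shattered = True
                break
        if not shattered:          # subsets of shattered sets are
            break                  # shattered, so we can stop here
        best = d
    return best

random.seed(0)
m = 8
# An arbitrary concept class: up to 25 random labelings of m points.
C = {tuple(random.choice((0, 1)) for _ in range(m)) for _ in range(25)}
d = vcdim(C, m)
assert len(C) <= phi(m, d)                        # the theorem's bound
print(f"|C[S]| = {len(C)}, VCdim = {d}, ((m d)) = {phi(m, d)}")
```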
===================just in case there's time...==========================
Proof of Theorem 2:
(Book has a very nice proof. Will follow it, but simplify further by
not introducing notation. Will use the term "bad concept" to mean one
with error > epsilon.)
Given m examples, define A = event there exists a bad consistent hyp in C.
We want to show Pr[A] is low.
Define event B as follows. B = event that in 2m examples, there exists
a concept in C that is consistent on the first m but has error at
least epsilon/2 on the second m.
Claim: Pr[A] < 2*Pr[B]. So, if Pr[B] is low then Pr[A] is low.
Pf: Say we pick 2m examples. Let A = event there is a bad
consistent hyp on first m. Then Pr[B|A] > 1/2 by Chernoff or std dev.
(the bad h has true error > epsilon so unlikely it will have
observed error < epsilon/2). This means that Pr[B] >= Pr[A]/2.
QED
(could have actually defined B with epsilon, not epsilon/2, since
for epsilon<1/2, median of binomial is <= mean)
Now, show Pr[B] is low. (Note: no longer need to talk about "bad"
hyps. Reason: for really good hyps, unlikely to satisfy the second part)
Re-write event B:
Draw set S of 2m examples, and then split randomly into S1, S2 of
equal size. B = event that exists hyp in C consistent on S1 but errs on
at least epsilon*m/2 points in S2.
How to bound this? There are LOTS of concepts in C. But we'll use
the fact that there aren't so many splittings of S using concepts in
C. So, fix some set S of 2m examples (it can be arbitrary): there are
at most C[2m] splittings of S by concepts in C.
Now, fix some splitting h that disagrees with the target on at
least epsilon*m/2 points in S. (this is a candidate for causing event
B to occur)
Want to show that Pr[h is consistent on S1] is small.
(Probability is over the splitting of S into S1 and S2)
Then will apply union bound over the at most C[2m] <= ((2m d)) such splittings.
Lemma: Suppose h disagrees with target on at least L points
in S. Then, Pr[h is consistent on S1] <= 1/2^L.
Intuition: Suppose we split S by putting each point randomly into S1
or S2 (i.e., didn't require equal size), then Pr[all the L points fall
into S2] = 1/2^L. Turns out that the experiment we care about is
only more helpful to us. E.g., if L = m+1 then we get Pr[...] = 0.
Proof: Pr[all L in S2] = (m L)/(2m L). Reason: instead of fixing
the L points and randomly splitting into equal-sized S1, S2, we can
equivalently think of it as fixing the split and then picking the L
points randomly. Now just expand: (m L)/(2m L) is a product of L
factors (m-i)/(2m-i), each at most 1/2, so it is <= 1/2^L.
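A quick numerical check of the lemma (a sketch; the tested range of m is arbitrary), computing (m L)/(2m L) exactly and comparing against 1/2^L:

```python
from math import comb

def pr_all_in_S2(m, L):
    """(m L)/(2m L): chance all L disagreement points land in S2 when a
    set of 2m points is split uniformly into equal halves S1, S2."""
    return comb(m, L) / comb(2 * m, L)

# Each of the L factors (m-i)/(2m-i) is at most 1/2, so the product
# is at most 1/2^L; for L > m the probability is exactly 0.
for m in range(1, 15):
    for L in range(0, 2 * m + 1):
        assert pr_all_in_S2(m, L) <= 0.5 ** L
print("verified (m L)/(2m L) <= 1/2^L for all m < 15")
```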
Now, we apply union bound:
Pr[B] <= C[2m] * (1/2)^{epsilon*m/2}
want < delta/2.
That's it. Gives us Theorem 2.
--------------------------
To get Theorem 3, use: C[2m] <= ((2m d)) <= (2em/d)^d. Then just some
rearranging...
Set m = (4d/epsilon)*log_2(1/epsilon) + 2/epsilon*log_2(2/delta)
and assume the left term is larger than the right (so m <=
(8d/epsilon)*log_2(1/epsilon)) and that epsilon is small.
Since epsilon*m/2 = 2d*log_2(1/epsilon) + log_2(2/delta), we can
rewrite (1/2)^{epsilon*m/2} exactly as:
epsilon^{2d} * delta/2
So the whole bound becomes at most:
(2e * (8d/epsilon)*log_2(1/epsilon) * epsilon^2 / d)^d * delta/2
= (16e * epsilon * log_2(1/epsilon))^d * delta/2
< delta/2 for small enough epsilon, since epsilon*log_2(1/epsilon) -> 0.
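Putting the pieces together numerically -- a sketch, where the (epsilon, delta, d) triples are illustrative assumptions (epsilon has to be fairly small before the "small enough epsilon" step kicks in):

```python
from math import ceil, comb, log2

def phi(m, d):
    """((m d)) = number of ways of choosing d or fewer items out of m."""
    return sum(comb(m, i) for i in range(d + 1))

def sample_size(epsilon, delta, d):
    """m = (4d/eps)*log_2(1/eps) + (2/eps)*log_2(2/delta), rounded up."""
    return ceil((4 * d / epsilon) * log2(1 / epsilon)
                + (2 / epsilon) * log2(2 / delta))

# With this m, the union bound C[2m]*(1/2)^{eps*m/2} should be < delta/2.
for epsilon, delta, d in [(0.001, 0.05, 3), (0.001, 0.01, 5)]:
    m = sample_size(epsilon, delta, d)
    bound = phi(2 * m, d) * 0.5 ** (epsilon * m / 2)
    assert bound < delta / 2, (epsilon, delta, d, m, bound)
    print(f"eps={epsilon}, delta={delta}, d={d}: m={m}, bound {bound:.2e}")
```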