15-859(A) Machine Learning Theory                                02/05/04

* VC-dimension: proving the main result
========================================================================
[show online DL algorithm]
[discuss implication of result proved in problem #3 of hwk 2]

Where we are right now: see slides.

Theorem 2: For any class C and distribution D, if the number of examples
seen m satisfies:

    m > (2/epsilon)[log_2(2C[2m]) + log_2(1/delta)],

then with probability (1-delta), all bad (error > epsilon) hypotheses in
C are inconsistent with the data.

Proof: (The book has a very nice proof. We will follow it, but simplify
further by not introducing notation. We will use the term "bad concept"
to mean one with error > epsilon.)

Given m examples, define A = event that there exists a bad consistent
hypothesis in C. We want to show Pr[A] is low. Define event B as follows:

  B = event that in 2m examples, there exists a concept in C that is
      consistent on the first m but makes mistakes on at least an
      epsilon/2 fraction of the second m.

Claim: Pr[A] <= 2*Pr[B]. So, if Pr[B] is low then Pr[A] is low.

Pf: Say we pick 2m examples. Let A = event that there is a bad consistent
hypothesis on the first m. Then Pr[B|A] >= 1/2 by Chernoff or a standard-
deviation argument (a bad h has true error > epsilon, so it is unlikely
to have observed error < epsilon/2 on the second m). This means that
Pr[B] >= Pr[B|A]*Pr[A] >= Pr[A]/2. QED

(We could actually have defined B with epsilon rather than epsilon/2,
since the median of a binomial is within 1 of its mean.)

Now we show Pr[B] is low. (Note: we no longer need to talk about "bad"
hypotheses. Reason: really good hypotheses are unlikely to satisfy the
second part of B.)

Re-write event B: Draw a set S of 2m examples, and then split it randomly
into S1, S2 of equal size.

  B = event that there exists a hypothesis in C consistent on S1 but
      erring on at least epsilon*m/2 points of S2.

How do we bound this? There are LOTS of concepts in C. But we will use
the fact that there are not so many splittings of S using concepts in C.
So, fix some set S (it can be arbitrary); there are at most C[2m]
splittings.

Now, fix some splitting h that disagrees with the target on at least
epsilon*m/2 points of S. (This is a candidate for causing event B to
occur.) We want to show that Pr[h is consistent on S1] is small, where
the probability is over the splitting of S into S1 and S2. Then we will
apply a union bound over the at most C[2m] <= ((2m d)) such splittings
(where ((2m d)) denotes the number of ways to choose at most d of the 2m
points).

Lemma: Suppose h disagrees with the target on at least L points of S.
Then Pr[h is consistent on S1] <= 1/2^L.

Intuition: Suppose we split S by putting each point randomly into S1 or
S2 (i.e., we did not require equal sizes); then Pr[all L points fall
into S2] = 1/2^L. It turns out that the experiment we actually care
about is only more helpful to us. E.g., if L = m+1, then we get
Pr[...] = 0.

Proof: Pr[all L in S2] = (m choose L)/(2m choose L).
Reason: instead of fixing the L points and randomly splitting S into
equal-sized S1, S2, we can equivalently think of it as fixing the split
and then picking the L points randomly. Now just expand:

  (m choose L)/(2m choose L)
    = [m/(2m)] * [(m-1)/(2m-1)] * ... * [(m-L+1)/(2m-L+1)]
    <= 1/2^L,

since each factor is at most 1/2. QED

Now we apply the union bound:

  Pr[B] <= C[2m] * (1/2)^{epsilon*m/2},

and we want this to be < delta/2. That's it: solving for m gives us
Theorem 2.
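[Aside, not from the lecture: a quick numerical sanity check of the
lemma. The exact probability is (m choose L)/(2m choose L), and the
claim is that this is at most 1/2^L. A minimal Python sketch, assuming
Python 3.8+ for math.comb:]

    # Check the lemma: Pr[all L disagreement points land in S2]
    #   = C(m,L)/C(2m,L) <= 1/2^L.
    from math import comb

    def prob_all_in_s2(m, L):
        """Probability that all L disagreement points fall in S2 when S
        (of size 2m) is split uniformly into equal halves S1, S2."""
        # comb(m, L) is 0 when L > m, matching the L = m+1 case above.
        return comb(m, L) / comb(2 * m, L)

    for m in (10, 50, 200):
        for L in (1, 5, m // 2, m, m + 1):
            exact = prob_all_in_s2(m, L)
            bound = 0.5 ** L
            assert exact <= bound
            print(f"m={m:4d} L={L:4d}  exact={exact:.3e}  1/2^L={bound:.3e}")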
==========================================================================

How about a uniform-convergence version of Theorem 2?

Theorem 2': For any class C and distribution D, if:

    m > (2/epsilon^2)[ln(2C[2m]) + ln(1/delta)],

then with probability 1-delta, all h in C have empirical error within
epsilon of their true error.

We just need to redo the proof using Hoeffding. Draw 2m examples.

  A = event that on the 1st m, there exists a hypothesis in C whose
      empirical and true error differ by at least epsilon. (This is
      what we want to bound for the theorem.)

  B = event that there exists a concept in C whose empirical error on
      the 1st half differs from its empirical error on the 2nd half by
      at least epsilon/2.

Pr[B|A] >= 1/2, so Pr[A] <= 2*Pr[B].

Now we show Pr[B] is low. As before, let's first pick S, then randomly
decide what is the 1st half and what is the 2nd half. Once S is
determined, there are only C[2m] h's we need to worry about. Now fix
some h, making L mistakes overall on S. The bad event is that at least
L/2 + epsilon*m/2 of them fall into S1 (or S2). We can assume L <= m,
since otherwise we can focus on the non-mistakes instead.

Let's pretend we were putting each point iid into S1 or S2 (rather than
splitting them up exactly 50/50). This can only increase the chance of
the bad event. (Informal proof: the actual distribution can be thought
of as going through the points in S one at a time, where if we currently
have k1 slots left in S1 and k2 slots left in S2, then the next point is
put into S1 with probability k1/(k1+k2) and into S2 with probability
k2/(k1+k2). This means that if we just watch the mistakes, then whenever
more have appeared in S1 than in S2, the next one is more likely to go
to S2, and vice versa. So this only makes the bad event less likely than
in the iid game.)

So we can think of it as just: we flip L <= m fair coins, and we want to
bound the chance that the number of heads deviates from L/2 by at least
epsilon*m/2. Using Hoeffding, this is at most 2e^{-epsilon^2 m/2} (using
L <= m). Set this to delta/C[2m], apply the union bound over the C[2m]
splittings, and we're done.

==========================================================================

To get Theorem 3, use:

  C[2m] <= ((2m d)) <= (2em/d)^d.

Then it is just some rearranging. Set

  m = (4d/epsilon)*log_2(1/epsilon) + (2/epsilon)*log_2(2/delta),

assume the left term is larger than the right term (so that m <=
(8d/epsilon)*log_2(1/epsilon)), and assume epsilon is small. Rewrite
(1/2)^{epsilon*m/2} as:

  (1/2)^{epsilon*m/2} = epsilon^{2d} * delta/2.

So the whole bound C[2m] * (1/2)^{epsilon*m/2} becomes:

  (2e * (8d/epsilon)*log_2(1/epsilon) * epsilon^2 / d)^d * delta/2
    = (16e * log_2(1/epsilon) * epsilon)^d * delta/2
    < delta/2   for small enough epsilon.
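[Aside, not from the lecture: the claim in the proof of Theorem 2' that
the exact 50/50 split makes the bad event no more likely than an iid
coin-flip split can be checked by simulation. A Python sketch; parameter
values here are illustrative only:]

    # Compare Pr[bad event] under an exact 50/50 split (hypergeometric)
    # vs. an iid split (binomial). The proof claims exact <= iid.
    import random

    def mistakes_in_s1_exact(m, L):
        """# of the L mistakes landing in S1 under an exact split of 2m points."""
        slots = [1] * m + [0] * m      # 1 = point goes to S1, 0 = to S2
        random.shuffle(slots)
        return sum(slots[:L])          # the mistakes occupy the first L positions

    def mistakes_in_s1_iid(L):
        """# of the L mistakes landing in S1 when each point flips a fair coin."""
        return sum(random.random() < 0.5 for _ in range(L))

    def bad_freq(trial, threshold, runs=100_000):
        # One-sided version of the bad event (the S2 side is symmetric).
        return sum(trial() >= threshold for _ in range(runs)) / runs

    m, L, eps = 50, 50, 0.3
    threshold = L / 2 + eps * m / 2    # the cutoff from the proof of Theorem 2'
    print("exact split:", bad_freq(lambda: mistakes_in_s1_exact(m, L), threshold))
    print("iid split:  ", bad_freq(lambda: mistakes_in_s1_iid(L), threshold))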
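[Also not from the lecture: to make the rearranging for Theorem 3
concrete, the smallest m satisfying the Theorem 2 condition, with
Sauer's bound C[2m] <= (2em/d)^d plugged in, can be found by simple
iteration. A Python sketch; the function names are mine:]

    # Find the smallest m with m > (2/eps)*(log2(2*C[2m]) + log2(1/delta)),
    # bounding the growth function by C[2m] <= (2em/d)^d.
    from math import e, log2

    def growth_bound(m, d):
        return (2 * e * m / d) ** d

    def sample_size(eps, delta, d):
        # The right-hand side grows only logarithmically in m, so
        # fixed-point iteration converges quickly.
        m = max(d, 10)
        for _ in range(100):
            rhs = (2 / eps) * (log2(2 * growth_bound(m, d)) + log2(1 / delta))
            if m > rhs:
                return m
            m = int(rhs) + 1
        return m

    # Roughly on the order of (d/eps)*log(1/eps), as Theorem 3 predicts.
    print(sample_size(eps=0.1, delta=0.05, d=10))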