Topics in Machine Learning Theory                              09/17/14

VC-dimension and Uniform Convergence

* Sauer's lemma
* Proving the main result

========================================================================

First: going over the preliminaries for uniform convergence and
VC-dimension in the slides.

Plan for the rest of today: two very nice proofs.
(1) Sauer's lemma: the number of different ways of labeling m points
    using functions in class C is at most {m \choose \leq d}, where
    d = VCdim(C).
(2) The uniform convergence bound, where we (roughly) replace |C| with
    C[2m].

Given a concept class C and a set of examples S, let C[S] be the set of
possible labelings of S using concepts in C.  Define
C[m] = max_{S:|S|=m} |C[S]|.

SAUER'S LEMMA: C[m] <= {m \choose \leq d}, where d = VCdim(C) and we
define {m \choose \leq d} = the number of ways of choosing d or fewer
items out of m.

Note: {m \choose \leq d} = {m-1 \choose \leq d} + {m-1 \choose \leq d-1}.
Proof: to choose d or fewer items out of m, either you choose the first
element and then d-1 or fewer out of the remaining m-1, or you don't,
and then you choose d or fewer out of the remaining m-1.

PROOF OF SAUER'S LEMMA:

Say we have a set S of m examples and we want to bound |C[S]|.  Pick
some x in S.  By induction, C[S - {x}] has at most {m-1 \choose \leq d}
distinct labelings.  How many more labelings of S are there than of
S - {x}?

|C[S]| - |C[S - {x}]| = the number of pairs in C[S] that differ only on
x, since these are exactly the labelings that collapse together when we
remove x.  For each such pair, let's focus on the one that labels x
positive.  Let C' = {c in C[S] that label x positive and for which
there is another concept in C[S] identical to c except that it labels x
negative}.  (Remember, these are concepts defined only on S.)

Claim: VCdim(C) >= VCdim(C') + 1.
Proof: Pick a set S' of points in S that is shattered by C'.  S' does
not include x, since all functions in C' label x positive.  Now,
S' U {x} is shattered by C because of how C' was defined: every c in C'
has a partner in C[S] that flips only the label of x, so every labeling
of S' U {x} is achievable.

This means VCdim(C') <= d-1, so by induction
|C'| <= {m-1 \choose \leq d-1} (view C' as a class over the m-1 points
of S - {x}; this loses nothing since every concept in C' labels x the
same way).  Putting the pieces together,
  |C[S]| = |C[S - {x}]| + |C'|
        <= {m-1 \choose \leq d} + {m-1 \choose \leq d-1}
         = {m \choose \leq d}.   QED

------------------------------------------------------------------------

THEOREM 1: For any class C and distribution D, if we draw a sample S
from D of size
    m > (2/epsilon)[log_2(2 C[2m]) + log_2(1/delta)],
then with probability at least 1-delta, all h in C with
err_D(h) > epsilon have err_S(h) > 0.

Proof: Given a set S of m examples, define
  A = event that there exists h in C with err_D(h) > epsilon but
      err_S(h) = 0.
We want to show Pr[A] is low.

Now consider drawing *two* sets S, S' of m examples each.  Let A be
defined as before (on S), and define
  B = event that there exists a concept h in C with err_S(h) = 0 but
      err_{S'}(h) >= epsilon/2.

Claim: Pr[A]/2 <= Pr[B].  So if Pr[B] is low then Pr[A] is low.
Pf: Pr[B] >= Pr[B|A] Pr[A].  Conditioned on event A (pick some such h),
Pr[B|A] > 1/2 by Chernoff since m > 8/epsilon: the bad h has true error
> epsilon, so it is unlikely to have observed error < epsilon/2 on S'.
This means Pr[B] >= Pr[A]/2.  QED

Now we show Pr[B] is low.  (Note: we no longer need to talk about "bad"
hypotheses.  Reason: really good hypotheses are unlikely to satisfy the
second part of B.)

To do this, consider a related event: draw S = {x1, x2, ..., xm} and
S' = {x1', x2', ..., xm'}, and now create sets T, T' using the
following procedure Swap:

  For each i, flip a fair coin:
   - If heads, put xi in T and put xi' in T'.
   - If tails, put xi' in T and put xi in T'.

Claim: (T, T') has the same distribution as (S, S').
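A minimal sketch of this Swap procedure in Python (the function name
swap, the list representation of the samples, and the use of Python's
random module are just illustrative choices, not part of the notes):

    import random

    def swap(S, S_prime, rng=random):
        """Independently exchange each aligned pair (xi, xi') with probability 1/2."""
        T, T_prime = [], []
        for xi, xi_prime in zip(S, S_prime):
            if rng.random() < 0.5:      # heads: keep the pair in place
                T.append(xi)
                T_prime.append(xi_prime)
            else:                       # tails: exchange xi and xi'
                T.append(xi_prime)
                T_prime.append(xi)
        return T, T_prime

Since S and S' are i.i.d. draws from the same distribution D,
exchanging the i-th coordinates leaves the joint distribution
unchanged, which is why (T, T') is distributed exactly like (S, S').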
So it is equivalent to bound the probability of
  B_T = event that there exists a concept h in C with err_T(h) = 0 but
        err_{T'}(h) >= epsilon/2.

What's the point of all this?  Instead of Pr_{S,S'}[B] we now have
Pr_{S,S',swap}[B_T].  We will show this is small by showing that for
*all* S, S', Pr_{swap}[B_T] is small.  In particular, the key here is
that even if there are infinitely many concepts in C, once we have
drawn S, S' the number of different labelings we have to worry about is
at most C[2m], and we will argue that with high probability (over the
randomness in "swap") none of them will hurt us.

Now, fix S, S' and fix some labeling h.
 - If, for any i, h makes a mistake on *both* xi and xi', then
   Pr_{swap}(err_T(h) = 0) = 0.
 - If h makes a mistake on fewer than epsilon*m/2 points of S U S',
   then Pr_{swap}(err_{T'}(h) >= epsilon/2) = 0.
 - Otherwise, the mistakes sit on at least epsilon*m/2 distinct indices
   i, and for err_T(h) = 0 every one of them must be swapped into T',
   each independently with probability 1/2.  So
   Pr_{swap}(err_T(h) = 0 AND err_{T'}(h) >= epsilon/2) <= 2^{-epsilon*m/2}.

Now apply the union bound: Pr[B] <= C[2m] * 2^{-epsilon*m/2}.  Set this
to delta/2 and rearrange.  That's it: this gives us Theorem 1.

========================================================================

How about an agnostic/uniform-convergence version of Theorem 1?

Theorem 1': For any class C and distribution D, if
    m > (8/epsilon^2)[ln(4 C[2m]) + ln(1/delta)],
then with probability at least 1-delta, all h in C have
|err_D(h) - err_S(h)| < epsilon.

We just need to redo the proof using Hoeffding.  Draw 2m examples and
define:
  A = event that on the 1st m examples, there exists a hypothesis in C
      whose empirical and true error differ by at least epsilon.  (This
      is what we want to bound for the theorem.)
  B = event that there exists a concept in C whose empirical error on
      the 1st half differs from its empirical error on the 2nd half by
      at least epsilon/2.

Pr[B|A] >= 1/2 (given A, fix such an h; with probability at least 1/2
its empirical error on the 2nd half is within epsilon/2 of its true
error, and then the two halves differ by at least epsilon/2), so
Pr[A] <= 2 Pr[B].

Now we show Pr[B] is low.  As before, pick S, S' and run the random
procedure Swap to construct T, T'.  We show that for *any* S, S',
  Pr_{swap}(there exists h in C with |err_T(h) - err_{T'}(h)| > epsilon/2)
is small.  Again, there are at most C[2m] labelings of S U S', so fix
one such h.  Think of each index i such that exactly one of h(xi),
h(xi') is correct as a "coin".  (Indices where h is right on both or
wrong on both contribute equally to T and T' no matter how the coin
lands, so they do not affect the difference.)  We are asking: if we
flip m' <= m coins, what is Pr(|#heads - #tails| > epsilon*m/2)?  Write
this as being off from expectation by more than
epsilon*m/4 = (1/4)(epsilon*m/m')m'.  By Hoeffding, the probability is
at most 2e^{-(epsilon*m/m')^2 m'/8}, and this is always
<= 2e^{-epsilon^2 m/8}.  Now multiply by C[2m], set to delta/2, and
rearrange.

========================================================================

To get Cor 3, use
    C[2m] <= {2m \choose \leq d} <= (2em/d)^d.
Then just some rearranging... (a small numerical sketch of this
rearranging appears at the end of these notes).

========================================================================

More notes:

It's pretty clear we can replace C[2m] with a "high probability" bound
with respect to the distribution D, i.e., a number such that with high
probability a random set S, S' of 2m points drawn from D wouldn't have
more than this many labelings.  With extra work, you can reduce it
further to the *expected* number of labelings of a random set of 2m
points.  In fact, there has been a lot of work on reducing it even
further, to the number of possible labelings of your actual sample.
See, e.g., the paper "Sample Based Generalization Bounds".
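To make the "just some rearranging" for Cor 3 concrete, here is a
minimal numerical sketch in Python (the function name sample_size and
the example values of d, epsilon, delta are purely illustrative): it
plugs the bound C[2m] <= (2em/d)^d into Theorem 1 and searches for the
smallest m satisfying the resulting implicit inequality.

    from math import e, log2

    def sample_size(eps, delta, d):
        """Smallest integer m with m > (2/eps)*[log2(2*C2m) + log2(1/delta)],
        where C2m is replaced by the Sauer bound (2*e*m/d)**d."""
        m = d  # start at m = d so that the (2em/d)^d form of Sauer's bound applies
        # log2(2*(2em/d)^d) = 1 + d*log2(2em/d), written this way to avoid overflow
        while m <= (2.0 / eps) * (1.0 + d * log2(2.0 * e * m / d) + log2(1.0 / delta)):
            m += 1
        return m

    # Illustrative values: d = 10, eps = 0.1, delta = 0.05 gives m around two thousand.
    print(sample_size(0.1, 0.05, 10))

Because the right-hand side grows only logarithmically in m, the loop
always terminates, and the resulting m scales roughly like
(d/epsilon) log(1/epsilon) + (1/epsilon) log(1/delta).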