15-859(B) Machine Learning Theory            01/27/14

* Chernoff/Heoffding bounds and uniform convergence
* VC-dimension
========================================================================
[this is for the last 1/4 or so of the lecture]

In the rest of class we'll prove theorem 4 and then theorem 2.  Next
time do 1 (and 3). Lot of proofs, so slow me down if gets confusing....

Theorem 4: For any alg A, there exists a distrib D and target f in C
such that if A sees < (VCdim(C)-1)/(8*epsilon) examples, its expected
error is > epsilon.

PROOF: Consider d = VC-dim(C) points that can be shattered.  Define
  the distribution D that puts prob 1-4*epsilon on one point, and
  prob 4*epsilon/(d-1) on each of the rest (call those the rare points).
  Pick a random labeling of the d points as the target.  Let's first 
  argue informally: in m examples we expect only 4*epsilon*m to be
  rare ones.  This is (d-1)/2.  So, even if all went to different rare
  ones, we still have (d-1)/2 left, and a 2*epsilon chance that the
  next example is new.  Since, the target is random, this implies at
  least an epsilon chance of a mistake.


  More formally, Pr(error) >= Pr(example is rare )*Pr(unseen|rare)*1/2
                           == (4epsilon)*[(1 - 4epsilon/(d-1))^m]*(1/2)
                           >  (4epsilon)*[1 - m(4epsilon/(d-1))]*(1/2)
			   >= (4epsilon)*(1/2)*(1/2) == epsilon


Now let's do Thm 2 (SAUER'S LEMMA). This says we can bound the number
of splittings using the VC-dimension.  In particular, we'll prove C[m] <=
(m <=d), where d = VCdim(C) and we define (m <=d) = # ways of choosing
d or fewer items out of m. I.e., (m 0) + (m 1) + (m 2) + ... + (m d).

Note: (m <=d) = (m-1 <=d) + (m-1 <=d-1).   Can anyone see why?

	Proof: to choose d or fewer out of m you either choose the
	first element and then d-1 or fewer out of the remaining m-1 or you
	don't, and then choose d or fewer out of the remaining m.

Before proving the theorem, let's look at an example.

Look at C = "intervals [a,b] of the real line" (so, VCdim = 2).

Say we have m points, named 1,2,3,...,m in order, and we now delete
point m.  How does # splittings go down?  Answer is that some pairs,
like [a,m] and [a,m-1] collapse to become the same splitting.
So, the number of splittings of the m points equals the number of
splittings of the m-1, which is at most (m-1 <=d) by induction, plus
the number of pairs that collapsed.  But, these ones that collapsed
correspond to "final subintervals" of the previous set, which is a
class with VC-dim 1 (or d-1 in general).  So, at most (m-1 <=d-1) of
these, by induction.

Proof is exactly on these lines.

PROOF OF THEOREM 2: Let's say we have a set S of m examples.  We want
to bound number of ways of splitting S using concepts in C.  Define
C[S] = set of different ways of splitting (partitioning) S by concepts
in C.  Think of these as concepts defined only on S.

Pick some x in S.  So, we know that C[S - {x}] has at most (m-1 <=d)
partitions by induction.

How are there more splittings of S than S - {x}?  As before, the
increase is exactly the number of pairs in C[S] that differ only on x,
since these are the ones that collapse when you remove x.

For each pair, let's focus on the one that labels x negative.
Let C' = {c in C[S] that label x negative for which there is another
concept in C[S] that is identical to c except it labels x positive}.

(these are concepts that are only defined on S)

Claim: VCDIM(C) >= VCDIM(C') + 1.
Proof: Pick set S' of pts in S that are shattered by C'. S' doesn't
include x since all functions in C' label x negative.  Now, S' U {x}
is shattered by C because of how C' was defined.

This means VCDIM(C') <= d-1 and we are done.