15-859(B) Machine Learning Theory            02/10/14

* Tighter sample complexity bounds: Rademacher bounds
* A great tool: McDiarmid's inequality
===========================================================================
Boosting recap - view as dot-game: we put prob distrib on columns, and
then adversary gives a row with dots in at most 40% of the entries by
probability mass.  What we want is to adjust the probability
distribution to ensure than nearly all the *columns* have dots in less
than half the entries.   

===========================================================================
Let's recap our VC-dimension bounds:  Let C[m] = max # ways of
splitting m examples using concepts in C.

THEOREM 1: For any class C, distrib D, if the number of examples seen
m satisfies: m > (2/epsilon)[lg(2C[2m]) + log_2(1/delta)], then
with prob (1-delta), all bad (error > epsilon) hypotheses in C are
inconsistent with data.

THEOREM 2: For any class C, distrib D, if the number of examples seen
m satisfies: m > (8/epsilon^2)[ln(2C[2m]) + ln(1/delta)]
then with prob 1-delta, all h in C have |err_D(h)-err_S(h)| < epsilon.

SAUER'S LEMMA: C[m] < m^d, where d = VCdim(C).  So can use Sauer's
lemma to convert to bounds in terms of VC dimension.  Resulting bound
is O((1/epsilon^2)[d*ln(1/epsilon) + ln(1/delta)]).


---------------------------------------------------------------------
Motivation and plan
===================

These bounds are nice but have two drawbacks that we'd like to
address:

1. computability/estimability: say we have a hypothesis class C
   that we don't understand very well.  It might be hard to compute or
   estimate C[m], or even |C[S]| for our sample S.

2. tightness.  Our bounds have two sources of loss.  One is that
   we did a union bound over the splittings of the double-sample S,
   which is overly pessimistic if many of the splittings are very
   similar to each other.  A second is that we did worst-case over S,
   whereas we would rather do expected case over S, or even just have a
   bound that depends on our actual training set.  We will be able to
   address both of these, at least in the context of THM 2.

In particular, we will show the following remarkable fact.  Suppose
you replaced the true labels with random coin flips and then found the
lowest-empirical error function in the class for these.  (Imagine we
have a great optimizer that, given any set of labels, will find the h
in C that minimizes empirical error with respect to those labels).
Say it gives error 45%.  That means that we are overfitting by 5% wrt
the "coin flip function" since every hypothesis has true error 1/2 wrt
that.  Take that gap (5%), double it, and the result (plus some
low-order term) is an upper bound on how much any h in C is
overfitting wrt the true target f.  This is pretty remarkable since up
to low-order terms we're within a factor of 2 of the best possible
target-independent bound (since we do indeed overfit by 5% wrt the
coin-flip function).  And there do exist cases where you indeed have
to lose the factor of 2.  E.g., consider the class of all boolean
functions over {0,1}^n, with a target that is all negative (describe).

One caveat: This is along the "THEOREM 2" style - e.g., best you could
hope for is overfitting of 1/sqrt(m).  In the realizable case where
you just care about the true error of functions whose empirical error
is zero, these bounds could be quite a bit worse, since then you get
overfitting on the order of (complexity term)/m.

Note: rather than writing m as a function of epsilon, let's write
epsilon as a function of m.  In particular, we can write Thm 2 as: For
any class C and distribution D, whp all h in C satisfy 
    err_D(h) <= err_S(h) + sqrt{8ln(2C[2m]/delta) / m}.
(And we bound in the other direction as well, but let's just focus on
this direction - i.e., how much can we be overfitting the sample).

----------------------------------------------------------------------

Rademacher averages
===================
For a given set of data S = {x_1, ..., x_m} and class of functions F,
define the "empirical Rademacher complexity" of F as:

    R_S(F) = E_sigma [ max_{h in F} (1/m)(sum_i sigma_i * h(x_i)) ]

where sigma = (sigma_1, ..., sigma_m) is a random {-1,1} labeling.

I.e., if you pick a random labeling of the points in S, on average how
well correlated is the most-correlated h in F to it?  Note: h:X->{-1,1},
so sigma_i * h(x_i) = 1 if they agree and -1 if they disagree.  Note
that "correlation" = "agreement" - "disagreement" so error 45% means
correlation of 10%.

We then define the (distributional) Rademacher complexity of F as:

    R_D(F) = E_S R_S(F).


We will then prove the following theorem:

THEOREM 3: For any class C, distrib D, if we see m examples then with
probability at least 1-delta every h in C satisfies

    err_D(h) <= err_S(h) + R_D(C) + sqrt(ln(2/delta)/2m).
             <= err_S(h) + R_S(C) + 3*sqrt(ln(2/delta)/2m).

----------------------------------------------------------------------

Relating Rademacher and VC: We can see that this is never much worse
than our VC bounds.  In particular, let's consider how big can R_S(C)
be?  If you fix some h and then pick a random labeling, Hoeffding
bounds say the probability the correlation is more than 2epsilon is at
most e^{-2*m*epsilon^2}. So setting this to delta/C[m] and doing a
union bound, we have at most a delta chance that any of the splittings
in F have correlation more than 2*sqrt(ln(C[m]/delta)/(2m)), so the
expected max correlation R_S(C) can't be much higher than that.  On
the other hand, it could be a lot lower if many of the splittings are
very similar to each other.  OK, now on to the proof of Theorem 3...

----------------------------------------------------------------------

First, a useful tool we will use is McDiarmid's inequality.
This generalizes Hoeffding bounds to the case where, rather than
considering the *average* of a set of independent RVs, we are
considering some other function of them.  Specifically,

McDiarmid's inequality
======================
Say x_1, ..., x_m are independent RVs, and phi(x_1,...,x_m) is some
real-valued function.  Assume that phi satisfies the Lipschitz
condition that if x_i is changed, phi can change by at most c_i.  Then
McDiarmid's inequality states that:

      Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 / sum_i c_i^2}

so if all c_i <= 1/m (like if phi = "average" and x_i \in {0,1}), we get:

      Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 * m}

just like Hoeffding.

Proof of THEOREM 3
==================
STEP 1: The first step of the proof is to simplify the quantity we
care about.  Specifically, let's define 
    MAXGAP(S) = max_{h in C} [err_D(h) - err_S(h)].

We want to show that with probability at least 1-delta, MAXGAP(S) is
at most some epsilon. 

As a first step, we can use McDiarmid to say that whp, MAXGAP(S) will
be close to its expectation.  In particular, the examples x_j are
independent RVs and MAXGAP can change by at most 1/m if any individual
x_j in S is replaced (because the gap for any specific h can change by
at most 1/m).  So, using this as our "phi", with probability at least
1 - delta/2, we get:
     MAXGAP(S) <= E_S [MAXGAP(S)] + sqrt(ln(2/delta)/(2m)).

So, to prove the first (main) line of THEOREM 3, we just need to show
that  E_S[ MAXGAP(S) ] <= R_D(C).

In fact, the second line of the theorem follows immediately from the
first line plus an application of McDiarmid to the random variable
R_S(C) [how much can changing a training example change R_S(C)?].  So,
really we just need to prove the first line.

STEP 2: The next step is to do a double-sample argument just like we
did with VC-dimension.

Specifically, let's rewrite err_D(h) as E_{S'} [err_{S'}(h)], where S'
is a (new) set of m points drawn from D (this is the "ghost sample").
So, we can rewrite E_S[MAXGAP(S)] as: 

    E_S [ max_{h in C} [ E_{S'} [ err_{S'}(h) - err_{S}(h) ] ] ]

 <= E_{S,S'} [ max_{h in C} [ err_{S'}(h) - err_{S}(h) ] ]
          (i.e., get to pick h after seeing both S *and* S')

If we let S = {x_1,...,x_m} and let S' = {x_1', ..., x_m'} then we can
rewrite this as:

    E_{S,S'} [ max_{h in C} [ sum_i [err_{x_i'}(h) - err_{x_i}(h)]/m] ]

where I'm using err_{x}(h) to denote the loss of h on x (1 if h makes
a mistake and 0 if h gets it right).

Now, as in the VC proof, let's imagine that for each index i, we flip
a coin to decide whether to swap x_i and x_i' or not before taking the
max.  This doesn't affect the expectation since everything is iid.
So, letting sigma_i in {-1,1} at random, we can rewrite our quantity as:

    E_{S,S',sigma}[max_{h in C}[sum_i sigma_i[err_{x_i'}(h) - err_{x_i}(h)]/m]]


 <= E_{S',sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i'}(h)/m] ] -
    E_{S,sigma}  [ min_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ] 
    (i.e., gap is only larger if allow the two h's to differ)

  = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ]
    (by symmetry, since sigma_i is random {-1,1})

Now, we're just about done.  The quantity here is a lot like the
Rademacher complexity of C except instead of looking at the
dot-product <sigma,h> (viewing each as a vector of m entries), we are
looking at <sigma, loss(h)> where loss(h) is the vector of losses of
h.  In addition we have an extra factor of 2.

To translate, suppose that in the definition of Rademacher complexity
we replace <sigma,h> with <sigma * f, h> where "*" is component-wise
multiplication and f is the target function.  That is, instead of
viewing sigma as a random labeling, we view it as a random
perturbation of the target.  Notice that "sigma * f" has the same
distribution as sigma, so this doesn't change the definition.  Now,
<sigma * f, h> = <sigma, f * h>.  This is nice because f * h is the
vector whose entry i is 1 if h(x_i)=f(x_i) and equals -1 if h(x_i) !=
f(x_i).  In other words, f*h(x) = 1 - 2err_x(h).

So, R_D(C) = E_{S,sigma} [max_{h in C} [sum_i sigma_i[1-2err_{x_i}(h)]/m]]
           
           = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m]]
as we wanted.

So, this gives us what we want.
-----------------------------------------------------------------------