15-859(B) Machine Learning Theory            02/17/10

* Tighter sample complexity bounds: Rademacher bounds
* A great tool: McDiarmid's inequality
===========================================================================
Let's recap our VC-dimension bounds:  Let C[m] = max # ways of
splitting m examples using concepts in C.

THEOREM 1: For any class C, distrib D, if the number of examples seen
m satisfies: m > (2/epsilon)[lg(2C[2m]) + log_2(1/delta)], then
with prob (1-delta), all bad (error > epsilon) hypotheses in C are
inconsistent with data.

THEOREM 2: For any class C, distrib D, if the number of examples seen
m satisfies: m > (8/epsilon^2)[ln(2C[2m]) + ln(1/delta)]
then with prob 1-delta, all h in C have |err_D(h)-err_S(h)| < epsilon.

SAUER'S LEMMA: C[m] < m^d, where d = VCdim(C).  So can use Sauer's
lemma to convert to bounds in terms of VC dimension.  Resulting bound
is O((1/epsilon^2)[d*ln(1/epsilon) + ln(1/delta)]).


---------------------------------------------------------------------
Motivation and plan
===================

These bounds are nice but have two drawbacks that we'd like to
address:

1. computability/estimability: say we have a hypothesis class C
   that we don't understand very well.  It might be hard to compute or
   estimate C[m], or even |C[S]| for our sample S.

   Say we have a great optimizer that, given any set of labels, will
   find the h in C that minimizes empirical error with respect to
   those labels.  Then, one approach we could imagine trying is to
   pick random labels for the points in S and see how often we can
   achieve them.  This will estimate C[m]/2^m.  Unfortunately, this
   quantity will be extremely small for the values of m of interest.
   In particular, we care about the case where C[m] is approximately
   2^{epsilon*m} so we would have to repeat this an exponential number
   of times.  Instead, an easier thing to do would be to pick random
   labels and see, on average, how close to them we can get: how much
   better than 50/50.  This we could estimate.  And if you think about
   it, if you can on average get empirical error of, say, 40% on
   random labels, it means you must be overfitting by at least 10%.
   Rademacher bounds will be exactly in terms of that quantity.

2. tightness.  Our bounds have two sources of loss.  One is that
   we did a union bound over the splittings of the double-sample S,
   which is overly pessimistic if many of the splittings are very
   similar to each other.  A second is that we did worst-case over S,
   whereas we would rather do expected case over S, or even just have a
   bound that depends on our actual training set.  We will be able to
   address both of these.


Note: we are going to be extending THEOREM 2.  Also, rather than
writing m as a function of epsilon, let's write epsilon as a function
of m.  In particular, we can write Thm 2 as: For any class C and
distribution D, whp all h in C satisfy 
    err_D(h) <= err_S(h) + sqrt{2ln(2C[2m]/delta) / m}.
(And we bound in the other direction as well, but let's just focus on
this direction - i.e., how much can we be overfitting the sample).

----------------------------------------------------------------------

Rademacher averages
===================
For a given set of data S = {x_1, ..., x_m} and class of functions F,
define the "empirical Rademacher complexity" of F as:

    R_S(F) = E_sigma [ max_{h in F} (1/m)(sum_i sigma_i * h(x_i)) ]

where sigma = (sigma_1, ..., sigma_m) is a random {-1,1} labeling.

I.e., if you pick a random labeling of the points in S, on average how
well correlated is the most-correlated h in F to it?

We then define the (distributional) Rademacher complexity of F as:

    R_D(F) = E_S R_S(F).


We will then prove the following theorem:

THEOREM 3: For any class C, distrib D, if we see m examples then with
probability at least 1-delta every h in C satisfies

    err_D(h) <= err_S(h) + R_D(C) + sqrt(ln(2/delta)/2m).
             <= err_S(h) + R_S(C) + 3*sqrt(ln(2/delta)/2m).

----------------------------------------------------------------------

Relating Rademacher and VC: We can see that this is never much worse
than our VC bounds.  In particular, let's consider how big can R_S(C)
be?  If you fix some h and then pick a random labeling, Hoeffding
bounds say the probability the correlation is more than 2epsilon is at
most e^{-2*m*epsilon^2}. So setting this to delta/C[m] and doing a
union bound, we have at most a delta chance that any of the splittings
in F have correlation more than 2*sqrt(ln(C[m]/delta)/(2m)), so the
expected max correlation R_S(C) can't be much higher than that.  On
the other hand, it could be a lot lower if many of the splittings are
very similar to each other.  OK, now on to the proof of Theorem 3...

----------------------------------------------------------------------

First, a useful tool we will use is McDiarmid's inequality.
This generalizes Hoeffding bounds to the case where, rather than
considering the *average* of a set of independent RVs, we are
considering some other function of them.  Specifically,

McDiarmid's inequality
======================
Say x_1, ..., x_m are independent RVs, and phi(x_1,...,x_m) is some
real-valued function.  Assume that phi satisfies the Lipschitz
condition that if x_i is changed, phi can change by at most c_i.  Then
McDiarmid's inequality states that:

      Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 / sum_i c_i^2}

so if all c_i <= 1/m (like if phi = "average" and x_i \in {0,1}), we get:

      Pr[phi(x) > E[phi(x)] + epsilon] <= e^{-2*epsilon^2 * m}

just like Hoeffding.

Proof of THEOREM 3
==================
STEP 1: The first step of the proof is to simplify the quantity we
care about.  Specifically, let's define 
    MAXGAP(S) = max_{h in C} [err_D(h) - err_S(h)].

We want to show that with probability at least 1-delta, MAXGAP(S) is
at most some epsilon. 

As a first step, we can use McDiarmid to say that whp, MAXGAP(S) will
be close to its expectation.  In particular, the examples x_j are
independent RVs and MAXGAP can change by at most 1/m if any individual
x_j in S is replaced (because the gap for any specific h can change by
at most 1/m).  So, using this as our "phi", with probability at least
1 - delta/2, we get:
     MAXGAP(S) <= E_S [MAXGAP(S)] + sqrt(ln(2/delta)/(2m)).

So, to prove the first (main) line of THEOREM 3, we just need to show
that  E_S[ MAXGAP(S) ] <= R_D(C).

In fact, the second line of the theorem follows immediately from the
first line plus an application of McDiarmid to the random variable
R_S(C) [how much can changing a training example change R_S(C)?].  So,
really we just need to prove the first line.

STEP 2: The next step is to do a double-sample argument just like we
did with VC-dimension.

Specifically, let's rewrite err_D(h) as E_{S'} [err_{S'}(h)], where S'
is a (new) set of m points drawn from D (this is the "ghost sample").
So, we can rewrite E_S[MAXGAP(S)] as: 

    E_S [ max_{h in C} [ E_{S'} [ err_{S'}(h) - err_{S}(h) ] ] ]

 <= E_{S,S'} [ max_{h in C} [ err_{S'}(h) - err_{S}(h) ] ]
          (i.e., get to pick h after seeing both S *and* S')

If we let S = {x_1,...,x_m} and let S' = {x_1', ..., x_m'} then we can
rewrite this as:

    E_{S,S'} [ max_{h in C} [ sum_i [err_{x_i'}(h) - err_{x_i}(h)]/m] ]

where I'm using err_{x}(h) to denote the loss of h on x (1 if h makes
a mistake and 0 if h gets it right).

Now, as in the VC proof, let's imagine that for each index i, we flip
a coin to decide whether to swap x_i and x_i' or not before taking the
max.  This doesn't affect the expectation since everything is iid.
So, letting sigma_i in {-1,1} at random, we can rewrite our quantity as:

    E_{S,S',sigma}[max_{h in C}[sum_i sigma_i[err_{x_i'}(h) - err_{x_i}(h)]/m]]


 <= E_{S',sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i'}(h)/m] ] -
    E_{S,sigma}  [ min_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ] 
    (i.e., gap is only larger if allow the two h's to differ)

  = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ]
    (by symmetry, since sigma_i is random {-1,1})

Now, we're just about done.  The quantity here is a lot like the
Rademacher complexity of C except instead of looking at the
dot-product <sigma,h> (viewing each as a vector of m entries), we are
looking at <sigma, loss(h)> where loss(h) is the vector of losses of
h.  In addition we have an extra factor of 2.

To translate, suppose that in the definition of Rademacher complexity
we replace <sigma,h> with <sigma * f, h> where "*" is component-wise
multiplication and f is the target function.  That is, instead of
viewing sigma as a random labeling, we view it as a random
perturbation of the target.  Notice that "sigma * f" has the same
distribution as sigma, so this doesn't change the definition.  Now,
<sigma * f, h> = <sigma, f * h>.  This is nice because f * h is the
vector whose entry i is 1 if h(x_i)=f(x_i) and equals -1 if h(x_i) !=
f(x_i).  In other words, f*h(x) = 1 - 2err_x(h).

So, R_D(C) = E_{S,sigma} [max_{h in C} [sum_i sigma_i[1-2err_{x_i}(h)]/m]]
           
           = 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m]]
as we wanted.

So, this gives us what we want.
-----------------------------------------------------------------------