15-859(B) Machine Learning Theory 02/20/08
* Tighter sample complexity bounds: Rademacher bounds
* A great tool: McDiarmid's inequality
===========================================================================
Let's recap our VC-dimension bounds: Let C[m] = max # ways of
splitting m examples using concepts in C.
THEOREM 1: For any class C, distrib D, if the number of examples seen
m satisfies: m > (2/epsilon)[lg(2C[2m]) + log_2(1/delta)], then
with prob (1-delta), all bad (error > epsilon) hypotheses in C are
inconsistent with data.
THEOREM 2: For any class C, distrib D, if the number of examples seen
m satisfies: m > (2/epsilon^2)[ln(2C[2m]) + ln(1/\delta)]
then with prob 1-delta, all h in C have |err_D(h)-err_S(h)| < epsilon.
SAUER'S LEMMA: C[m] < m^d, where d = VCdim(C). [so can use Sauer's
lemma to convert to bounds in terms of VCdim(C)]
---------------------------------------------------------------------
Motivation and plan
===================
These bounds are nice but have two drawbacks that we'd like to
address:
1. computability/estimability: say we have a hypothesis class C
that we don't understand very well. It might be hard to compute or
estimate C[m], or even |C[S]| for our sample S.
Say we have a great optimizer that, given any set of labels, will
find the h in C that minimizes empirical error with respect to
those labels. Then, one approach we could imagine trying is to
pick random labels for the points in S and see how often we can
achieve them. This will estimate C[m]/2^m. Unfortunately, this
quantity will be extremely small for the values of m of interest,
so we would have to repeat this an extremely large number of times.
Instead, an easier thing to do would be to pick random labels and
see, on average, how close to them we can get: how much better than
50/50. Rademacher bounds will be exactly in terms of that quantity.
2. tightness. Our bounds have two sources of loss. One is that
we did a union bound over the splittings of the double-sample S,
which is overly pessimistic if many of the splittings are very
similar to each other. A second is that we did worst-case over S,
whereas we would rather do expected case over S, or even just have a
bound that depends on our actual training set. We will be able to
address both of these.
Note: we are going to be extending THEOREM 2. Also, rather than
writing m as a function of epsilon, let's write epsilon as a function
of m. In particular, we can write Thm 2 as: For any class C and
distribution D, whp all h in C satisfy
err_D(h) <= err_S(h) + sqrt{2ln(2C[2m]/delta) / m}.
(And we bound in the other direction as well, but this direction is
the more important one, becuase your main worry is that you are
overfitting your sample, and the sqrt{...} quantity is a bound on how
much you can be overfitting.
----------------------------------------------------------------------
Rademacher averages
===================
For a given set of data S = {x_1, ..., x_m} and class of functions F,
define the "empirical Rademacher complexity" of F as:
R_S(F) = E_sigma [ max_{h in F} (1/m)(sum_i sigma_i * h(x_i)) ]
where sigma = (sigma_1, ..., sigma_m) is a random {-1,1} labeling.
I.e., if you pick a random labeling of the points in S, on average how
well correlated is the most-correlated h in F to it?
We then define the (distributional) Rademacher complexity of F as:
R_D(F) = E_S R_S(F).
We will then prove the following theorem:
THEOREM 3: For any class C, distrib D, if we see m examples then with
probability at least 1-delta every h in C satisfies
err_D(h) <= err_S(h) + R_D(C) + sqrt(ln(2/delta)/2m).
<= err_S(h) + R_S(C) + 3*sqrt(ln(2/delta)/2m).
----------------------------------------------------------------------
First, a useful tool we will use is McDiarmid's inequality.
This generalizes Hoeffding bounds to the case where, rather than
considering the *average* of a set of independent RVs, we are
considering some other function of them. Specifically,
McDiarmid's inequality
======================
Say x_1, ..., x_m are independent RVs, and phi(x_1,...,x_m) is some
real-valued function. Assume that phi satisfies the Lipschitz
condition that if x_i is changed, phi can change by at most c_i. Then
McDiarmid's inequality states that:
Pr[phi(x) > E(phi(x) + epsilon] <= e^{-2*epsilon^2 / sum_i c_i^2}
so if all c_i <= 1/m (like in the case that phi = "average"), we get:
Pr[phi(x) > E(phi(x) + epsilon] <= e^{-2*epsilon^2 * m}
just like Hoeffding.
Proof of THEOREM 3
==================
STEP 1: The first step of the proof is to simplify the quantity we
care about. Specifically, let's define
MAXGAP(S) = max_{h in C} [err_D(h) - err_S(h)].
We want to show that with probability at least 1-delta, MAXGAP(S) is
at most some epsilon.
As a first step, we can use McDiarmid to say that whp, MAXGAP(S) will
be close to its expectation. In particular, the x_j are independent
RVs and MAXGAP can change by at most 1/m if any individual x_j in S is
replaced (because the gap for any specific h can change by at most
1/m). So, using this as our "phi", with probability at least 1 -
delta/2, we get:
MAXGAP(S) <= E_S [MAXGAP(S)] + sqrt(ln(2/\delta)/2m).
So, to prove the first (main) line of THEOREM 3, we just need to show
that E_S[ MAXGAP(S) ] <= R_D(C).
In fact, the second line of the theorem follows immediately from the
first line plus an application of McDiarmid to the random variable
R_S(C), so really we just need to prove the first line.
STEP 2: The next step is to do a double-sample argument much like we
did with VC-dimension, but with a little twist in the "symmetrization"
step.
Specifically, let's rewrite err_D(h) as E_{S'} [err_{S'}(h)], where S'
is a (new) set of m points drawn from D. So, we can rewrite
E_S[MAXGAP(S)] as:
E_S [ max_{h in C} [ E_{S'} [ err_{S'}(h) - err_{S}(h) ] ] ]
<= E_{S,S'} [ max_{h in C} [ err_{S'}(h) - err_{S}(h) ] ]
If we let S = {x_1,...,x_m} and let S' = {x_1', ..., x_m'} then we can
rewrite this as:
E_{S,S'} [ max_{h in C} [ sum_i [err_{x_i'}(h) - err_{x_i}(h)]/m] ]
where I'm using err_{x}(h) to denote the loss of h on x (1 if h makes
a mistake and 0 if h gets it right).
Now, the twist in the symmetrization step is let's imagine that for
each index i, we flip a coin to decide whether to swap x_i and x_i' or
not. This doesn't affect the expectation since everything is iid.
So, letting sigma_i in {-1,1} at random, we can rewrite our quantity as:
E_{S,S',sigma}[max_{h in C}[sum_i sigma_i[err_{x_i'}(h) - err_{x_i}(h)]/m]]
<= E_{S',sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i'}(h)/m] ] -
E_{S,sigma} [ min_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ]
(i.e., gap is only larger if allow the two h's to differ)
= 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m] ]
(by symmetry, since sigma_i is random {-1,1})
Now, we're just about done. The quantity here is a lot like the
Rademacher complexity of C except instead of looking at
(viewing each as a vector of m entries), we are looking at
where loss(h) is the vector of losses of h. In addition we have an
extra factor of 2.
To translate, suppose that in the definition of Rademacher complexity
we replace with where "*" is component-wise
multiplication. That is, instead of viewing sigma as a random
labeling, we view it as a random perturbation of the target. Notice
that "sigma * f" has the same distribution as sigma, so this doesn't
change the definition. Now, = . This is
nice because f * h is the vector whose entry i is 1 if h(x_i)=f(x_i)
and equals -1 if h(x_i) != f(x_i). In other words, f*h(x) = 1 - 2err_x(h).
So, R_D(C) = E_{S,sigma} [max_{h in C} [sum_i sigma_i[1-2err_{x_i}(h)]/m]]
= 2*E_{S,sigma} [ max_{h in C} [sum_i sigma_i * err_{x_i}(h)/m]]
as we wanted.
So, this gives us what we want.
-----------------------------------------------------------------------