Topics in Machine Learning Theory 09/24/14
Boosting:
- A great practical algorithm
- A great theoretical result about basic definitions in the PAC model
- A surprising connection between topics in online and distributional learning
========================================================================
Let's start by recalling the PAC model.
Def 1: Alg A efficiently PAC-learns class C if for any c in C, any
distribution D, any epsilon, delta > 0, with probability 1-delta, A
runs in poly-time and poly-samples and produces a poly-time
evaluatable hyp. h with error at most epsilon.
Def 2: Let's say that alg A WEAK-LEARNS class C if for any c in C, any
distribution D, THERE EXIST gamma, tau > 1/poly(n) s.t. A
achieves error at most 1/2 - gamma with prob at least tau.
I.e., with some noticeable probability you get noticeable correlation.
[using "gamma" to denote the "gap"]
Question: suppose we defined PAC model this way, does this change the
notion of what is learnable and what is not?
Answer: it doesn't. Given an algorithm that satisfies Def 2, we can
"boost" it to an algorithm that satisfies Def 1. This was the
weak->strong learning result of Rob Schapire, later turned into the
very practical algorithm AdaBoost by Yoav Freund and Rob Schapire.
Note: we can handle tau->delta easily: just repeat (1/tau)log(1/delta)
times and whp at least one was successful. Then draw fresh data and
use to pick the good one. The big issue is the gamma->epsilon. In
the discussion below, we'll ignore delta and assume that each time, we
get a hypothesis of error <= 1/2 - gamma.
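The tau->delta repetition trick can be sketched in code. This is a toy simulation with made-up numbers: each entry of `outcomes` stands in for one run of the weak learner, recorded as its error on fresh validation data (0.3 = a successful run, 0.6 = a failed one).

```python
import math

def amplify_confidence(candidate_errors, tau, delta):
    """Repeat the weak learner ceil((1/tau)*ln(1/delta)) times and keep
    the hypothesis with the lowest error on fresh validation data."""
    repeats = math.ceil((1.0 / tau) * math.log(1.0 / delta))
    runs = candidate_errors[:repeats]
    return min(runs), repeats

# Illustrative numbers: the weak learner succeeds (validation error 0.3)
# on one run in four, i.e., tau ~ 0.25; we ask for delta = 0.01.
outcomes = [0.6, 0.6, 0.3, 0.6] * 10
best, repeats = amplify_confidence(outcomes, tau=0.2, delta=0.01)
print(repeats, best)  # 24 repeats; the successful 0.3 run is selected
```

With probability at least 1 - delta some repeat succeeds, and the validation step picks it out; this is why only the gamma->epsilon direction needs real work.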
Boosting: discussion
--------------------
We're going to have to use the fact that algorithm A works (in this
weak way) over *every* distribution. If you look at the "distribution
specific" version of this question, the answer is NO. E.g., say your
target function was "x is positive if its first bit equals 1 or if x
satisfies some hard crypto property (e.g., viewed as an integer, x is
a quadratic residue mod some large N)". Then, it's easy to get error
25% over uniform random examples: you predict positive if x1=1 and
whatever you want otherwise. But you can't do any better. The
problem is that there is this hard core to the target function.
Boosting shows this situation is in a sense universal. If you have
some arbitrary learning algorithm that supposedly works well, we'll
view it as a black box and either boost up its accuracy as much as we
like, or else we'll find a distribution D where that algorithm gets
error > 1/2 - gamma.
As a practical matter, the algorithm we give for boosting can be
thought of as one for creating good "challenge distributions".
Basically, the booster will propose a series of challenge distributions
that A will either succeed or fail on. Each success will produce an
improvement in our overall prediction rule. A failure means A wasn't
weak-learner over all distributions. So it's a great practical tool
esp if you have a bunch of different learning procedures to try.
An easy case: algorithms that know when they don't know
-------------------------------------------------------
Just to get a feel of a simple case, imagine our weak-learning
algorithm produces a hypothesis that on any given example x either
makes a prediction or says "not sure". Say it is always correct when
it makes a prediction, and furthermore it says "not sure" at
most a 1-epsilon' fraction of the time (it's trivial to produce a
hypothesis that always says "not sure"!).
Here we can do boosting by creating a decision list. We first
run A on D to find h_1, and put as top rule in the DL. We run A on D
subject to the constraint that h_1 says "not sure" (i.e., when A asks
for an example, we repeatedly draw from D until we get one where h_1
says "not sure") and get h_2. We then run A on D conditioned on both
h_1 and h_2 outputting "not sure" and get h_3. Each new rule chops
off at least an epsilon' fraction of the remaining probability mass,
so we will get down to error epsilon after only O((1/eps')log(1/eps))
runs.
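A minimal sketch of this decision-list construction. The weak learner here is hypothetical: it "knows" the correct labels on the first half of whatever sample it is shown and says None ("not sure") elsewhere, so each round chops off half the remaining mass.

```python
def weak_learn(sample):
    """Hypothetical weak learner that knows when it doesn't know: it
    correctly labels the first half of the sample it sees and says
    None ("not sure") on everything else."""
    known = {x for x, y in sample[: max(1, len(sample) // 2)]}
    labels = dict(sample)
    return lambda x: labels[x] if x in known else None

def boost_decision_list(examples, target_err):
    """Build a decision list: repeatedly run the weak learner on the
    examples where all previous hypotheses said "not sure"."""
    rules, remaining = [], list(examples)
    while len(remaining) > target_err * len(examples):
        h = weak_learn(remaining)
        rules.append(h)
        remaining = [(x, y) for x, y in remaining if h(x) is None]
    def predict(x):
        for h in rules:
            out = h(x)
            if out is not None:
                return out
        return 1  # arbitrary default rule at the end of the list
    return predict, rules

examples = [(i, i % 2) for i in range(64)]
predict, rules = boost_decision_list(examples, target_err=0.05)
errors = sum(predict(x) != y for x, y in examples)
print(len(rules), errors)  # 5 rules; only defaulted examples can err
```

Only the examples that fall through to the default rule can be misclassified, and the loop stops once their mass is below target_err.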
Basic idea here: focus on data where previous hyp had trouble. Force
next one to learn something *new*.
The general case
================
Now let's go to the general case of boosting weak learners. I am
going to describe AdaBoost, which is the most popular boosting
algorithm (though there are now many variations).
It will be most convenient to draw a sample of data S and then do all
our work over distributions defined on S. So, let's assume that A's
hypotheses are from a class C with C[m] = O(m^d). Our final
rule will be from a more complicated class H, but with H[m] =
m^O((1/gamma)^2*log(1/epsilon)*d). So we will start by drawing S
sufficiently large to get uniform convergence over H, and now we can
just worry about our performance on S.
Before giving the algorithm, let's just get a feel for what is going
on. The first step is going to be to just run A on the uniform
distribution over S, getting some hypothesis h_1 with error rate at
most 1/2 - gamma on S.
Now, we want a new distribution. One idea would be to zero-out the
examples that h_1 got right and use a challenge distribution that is
uniform over h_1's mistakes. But that's a bad idea: think about what
A might do in order to do well on this distribution while giving us no
new information. (In particular, you should think of A as being
controlled by an *adversary* who is restricted only by the requirement
that its hypotheses must have error at most 1/2 - gamma.)
Instead, we are just going to down-weight the examples that h_1 got
correct, to make their total weight be the same as the weight of the
examples h_1 got incorrect. Now, h_1 is just 50/50. In fact, this
is exactly what we are going to do each round: down-weight the
examples the previous h_t got correct to make the previous h_t 50/50.
Notice that this looks a *lot* like the Randomized Weighted Majority
algorithm, where the examples are the "experts", and h_t being correct
on x_i is like expert i making a mistake at time t. Weird, huh? In
fact, the analysis will be very similar.
Before doing the algorithm and analysis, here is *why* the algorithm
and analysis are so similar to RWM.
Consider the following 2-player zero-sum game. We have one row for
each example in S, and one column for each h in C. (At most C[m]
cols). If h(x) is correct then that entry has value 1 for the
column-player, else value 0. Our promise is that for any distribution
on rows, there exists a column (and moreover, A can find it) with
expected payoff at least 1/2 + gamma. By the minimax theorem, this
means there must exist a *distribution* q on columns such that for
*every* row, its expected payoff is at least 1/2+gamma for the column
player. That means that the majority-vote over these h's, weighted by
their probability under q, gets *every* example right (by an L_1
margin of gamma). Now, when we did RWM, we viewed it as an algorithm
for finding an approximately minimax-optimal strategy for the row
player, but if you play it against a best-response oracle (i.e., A),
then the empirical distribution over columns played is an
approximately minimax-optimal strategy for the column player too.
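To see this connection concretely, here is a tiny made-up game matrix (rows = examples, columns = hypotheses, entry 1 if the hypothesis is correct on the example) where the row player runs RWM against a best-response column oracle; the empirical distribution over columns played converges to a minimax-optimal column strategy:

```python
# Toy zero-sum game: M[i][j] = 1 if hypothesis j is correct on example i.
# Row 2 is "easy" (every hypothesis gets it right), so a minimax-optimal
# column strategy must mix columns 0 and 1 equally to cover rows 0 and 1.
M = [[1, 0],
     [0, 1],
     [1, 1]]
rows, cols = len(M), len(M[0])

w = [1.0] * rows          # RWM weights on the examples (rows)
eta = 0.1
col_counts = [0] * cols
T = 2000
for _ in range(T):
    W = sum(w)
    p = [wi / W for wi in w]
    # best-response oracle: column with highest expected payoff under p
    j = max(range(cols),
            key=lambda c: sum(p[i] * M[i][c] for i in range(rows)))
    col_counts[j] += 1
    # RWM update: down-weight rows the chosen column got right --
    # exactly boosting's "down-weight the examples h_t got correct"
    for i in range(rows):
        if M[i][j] == 1:
            w[i] *= (1 - eta)
q = [c / T for c in col_counts]
print(q)  # close to the minimax-optimal column mix [0.5, 0.5]
```

The RWM update on rows is literally the boosting update on examples, which is the punchline of the analogy.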
OK, now let's do the algorithm and analysis!
========================================================================
Adaboost
--------
0. Draw large sample of size n that will be sufficient by Occam/VC arguments.
1. Start with weight w_i = 1 on each example x_i.
The associated prob dist D_1 has P(x_i) = w_i/W, where W = sum(w_i).
2. For t=1 to T do:
Feed D_t to weak algorithm, get back hypothesis h_t
with error rate 1/2 - gamma_t on D_t.
Let beta_t = (1/2 - gamma_t)/(1/2 + gamma_t).
Now update D: Multiply w_i by beta_t if h_t(x_i) is correct.
(notice: h_t now has 50% error on the new distribution D_{t+1})
3. Final hypothesis: let alpha_t = ln(1/beta_t). Output a majority
vote weighted by the alphas. I.e., h(x) = sign(sum_t alpha_t * h_t(x))
if we view labels as +/-1.
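The steps above can be sketched in code. This is a minimal version assuming a toy weak learner (single-bit "stumps", i.e., a bit or its negation) and the majority-of-3-bits target; since every example disagrees with at most one of its three bits, some stump has error at most 1/3 under any distribution, so gamma_t >= 1/6.

```python
import math

def adaboost(examples, weak_learn, T):
    """AdaBoost as above: labels are +/-1, every weight starts at 1,
    and examples h_t got correct are multiplied by beta_t."""
    w = [1.0] * len(examples)
    hyps, alphas = [], []
    for _ in range(T):
        W = sum(w)
        dist = [wi / W for wi in w]
        h = weak_learn(examples, dist)
        err = sum(d for d, (x, y) in zip(dist, examples) if h(x) != y)
        gamma = 0.5 - err
        beta = (0.5 - gamma) / (0.5 + gamma)
        hyps.append(h)
        alphas.append(math.log(1.0 / beta))
        for i, (x, y) in enumerate(examples):
            if h(x) == y:          # down-weight correct examples
                w[i] *= beta
    def final(x):  # majority vote weighted by the alphas
        return 1 if sum(a * h(x) for a, h in zip(alphas, hyps)) > 0 else -1
    return final

# Hypothetical weak learner: the best single-bit stump under dist.
def stump_learner(examples, dist):
    candidates = [(lambda i, s: lambda x: s * (2 * x[i] - 1))(i, s)
                  for i in range(3) for s in (+1, -1)]
    return min(candidates,
               key=lambda h: sum(d for d, (x, y) in zip(dist, examples)
                                 if h(x) != y))

bits = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
examples = [(x, 1 if sum(x) >= 2 else -1) for x in bits]
h = adaboost(examples, stump_learner, T=50)
train_err = sum(h(x) != y for x, y in examples) / len(examples)
print(train_err)  # 0.0: the weighted vote fits the whole sample
```

With gamma_t >= 1/6, the claim below guarantees training error under e^{-2*50/36} < 1/8, i.e., zero mistakes on these 8 points.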
==========
What could we hope for? Suppose all gamma_t = gamma and consider a
super-easy case where all errors were *independent* (each h_t, on any
given example, flips a coin of bias 1/2-gamma to decide whether to give
the right answer or not). What would be the error rate of the maj
vote? Hoeffding bounds say it would be at most e^{-2*T*gamma^2}. We
will get exactly this.
Claim: training error(h) < e^{-2*sum_t[ (gamma_t)^2 ]}. So, if all
gamma_t = gamma, we get e^{-2T gamma^2}.
So, this actually means as a byproduct we are proving Hoeffding bounds....
(since giving independent errors is one option for A).
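A quick Monte Carlo sanity check of the independent-errors case (the numbers are arbitrary): each of T hypotheses is independently correct with probability 1/2 + gamma, and the majority vote's error rate should come in under e^{-2*T*gamma^2}.

```python
import math, random

random.seed(1)
T, gamma, trials = 200, 0.1, 2000
bound = math.exp(-2 * T * gamma ** 2)   # e^{-4}, about 0.018

wrong = 0
for _ in range(trials):
    correct_votes = sum(random.random() < 0.5 + gamma for _ in range(T))
    if correct_votes <= T / 2:          # majority vote wrong (or tied)
        wrong += 1
maj_err = wrong / trials
print(maj_err <= bound)  # True: empirical error is well under the bound
```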
ANALYSIS
========
To simplify things, let's assume all gamma_t's are the same. Call
it gamma. Then beta is fixed, and the final hypothesis is just a
straight majority vote over the h_t's.
Let's now analyze as in randomized WM, by putting upper and lower bounds
on the final weight. For the upper bound, suppose we currently have a total
weight of W. Since a (1/2+gamma) fraction of W gets multiplied by beta,
after this step that portion has weight only (1/2 - gamma)*W, so the total
weight is 2*(1/2-gamma)*W. Since the total weight starts at n, this means
that:
W_final <= n*(2(1/2-gamma))^T
Now, let's look at the lower bound. To do the lower bound, notice
that any example on which our majority vote makes a mistake must
have weight at least beta^{T/2}: each weight starts at 1, and if our
majority vote is wrong on an example then its weight was multiplied by
beta *at most* T/2 times. So, if
in the end our error rate is epsilon (so we make a mistake on epsilon*n
points), then
W_final >= beta^{T/2} * epsilon * n
Combining our upper and lower bounds gives us:
epsilon <= [4*(1/2 - gamma)^2 / beta]^{T/2}
= [4*(1/2 - gamma)*(1/2 + gamma)]^{T/2}
= [1 - 4*gamma^2]^{T/2}.
Now we use our favorite inequality: 1-x <= e^{-x}
epsilon <= e^{-2T*gamma^2}. QED
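As a numeric check of the bookkeeping (illustrative numbers): if every round multiplies a (1/2+gamma) weight-fraction by beta, the total weight tracks n*(1-2*gamma)^T exactly, and the combined error bound sits below e^{-2*T*gamma^2}.

```python
import math

n, T, gamma = 1000.0, 30, 0.1
beta = (0.5 - gamma) / (0.5 + gamma)

W = n
for _ in range(T):
    # a (1/2+gamma) fraction of the weight is on correct examples;
    # that portion gets multiplied by beta, the rest is untouched
    W = (0.5 + gamma) * W * beta + (0.5 - gamma) * W

upper = n * (2 * (0.5 - gamma)) ** T          # the W_final upper bound
eps_bound = (1 - 4 * gamma ** 2) ** (T / 2)   # combined upper/lower bound
print(abs(W - upper) < 1e-9 * upper,
      eps_bound <= math.exp(-2 * T * gamma ** 2))
```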
=======================
What about true error versus empirical error? What we really have at
this point is a bound on empirical error. To have a bound on true
error we need our sample to be large enough so that not only do we get
uniform convergence over the hypothesis space C used by A, but also we
get uniform convergence over "majority votes of O(1/gamma^2 *
log(1/epsilon)) hypotheses in C". But notice that if there are C[m]
ways of splitting m points using functions in C, then there are at
most {C[m] \choose k} ways of splitting those points using majorities of
size k. So, the sample size bound just needs to increase by a
multiple of k.
Margin analysis
===============
Actually, one thing people noticed in practice is that if you keep
running Adaboost past this limit, you don't end up overfitting.
There's a nice way of looking at this in terms of "L_1 margins".
Let's define g = alpha_1 h_1 + ... + alpha_T h_T, so our final
hypothesis h(x) is sign(g(x)). Let's scale the alpha_t's so they sum
to 1. Now, define the L_1 margin of g (or h) on an example x of label y
to be y*g(x). (We're calling it the "L_1 margin" because we have
scaled g to have L_1 length equal to 1)
If you look carefully at our analysis of adaboost, it ends up showing
not only that there can't be too many examples where the majority vote
is wrong (negative margin) but in fact there can't be too many where
the majority vote is barely right. In fact if you keep running the
algorithm, the fraction of examples with margin < 0.99*gamma will go to 0
(because those examples have too high a weight compared to the upper
bound of n*(1 - 2*gamma)^T).
What does this mean? One thing about having margin gamma is that even
if T is huge, you can whp produce the exact same prediction that h
does by just viewing g as a probability distribution and sampling from
it k = (1/gamma^2)*log(1/epsilon) times, and then taking majority vote
over the sample. So, as long as we have enough data so that these
"small majority votes" aren't overfitting, then h can't be overfitting
by much either. Let's do this more formally. Let h' be the
randomized prediction function that given a new example x predicts the
majority vote over k random draws from distribution g. Then the true
error rates must satisfy err(h') >= err(h)/2 since if h(x) is
incorrect, this means that each random draw from g has at least a 50%
chance of giving the wrong answer on x, so h' has at least a 50%
chance of being incorrect. But we argued that the empirical error of
h' is low [because if h has margin gamma on x then whp h'(x)=h(x)].
Furthermore we have enough data so that whp *all* small majority-vote
functions have empirical error close to their true error, which
implies the empirical error of h' is close to its true error.
Therefore the true error of h must be low too.
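The sampling step of this argument can be checked numerically. In this toy setting we only model a single example x on which h has margin gamma, so each draw from g is correct with probability 0.5 + gamma; the sampled majority should reproduce h's prediction with probability at least 1 - epsilon.

```python
import math, random

random.seed(2)
gamma, eps = 0.1, 0.05
k = math.ceil((1.0 / gamma ** 2) * math.log(1.0 / eps))  # 300 draws

def sampled_majority_agrees():
    # each draw from g is correct on x with probability 0.5 + gamma
    votes = sum(random.random() < 0.5 + gamma for _ in range(k))
    return votes > k / 2

agree = sum(sampled_majority_agrees() for _ in range(1000)) / 1000
print(agree >= 1 - eps)  # True: sampled vote almost always matches h
```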