15-859(B) Machine Learning Theory            02/13/12

* Boosting, contd:  Adaboost
========================================================================
Basic setup:

  Suppose we have an algorithm A that claims on every distribution to
  be able to get error at most p = 1/2 - gamma.  (let's ignore the
  delta parameter). How can we use this as a black box to get error
  epsilon?

  Or, as a practical matter: say we have a learning algorithm that's
  good at finding rules-of-thumb.  How can we use it to produce a
  really good classifier? (Or output a "challenge dataset" where any
  progress will help us do better?) 

Last time we looked at one method for reducing error (Rob Schapire's
original approach).  This involved calling A 3 times on specially
reweighted distributions and taking majority vote.  If hypotheses
produced by A have error p=1/2 - gamma, then the result has error 
3p^2 - 2p^3 = 1/2 - gamma[3/2 - 2*gamma^2].  (so the new gamma is bigger)

We then calculated the depth of tree needed to get to error epsilon,
which was O(log(1/gamma) + loglog(1/epsilon)).

So, the *size* of tree needed (3^depth) is poly(1/gamma, log(1/epsilon)).

==============================================

Adaboost idea: Instead of doing majorities of majorities, let's just
keep repeating step 2 of the previous scheme, each time reweighting
the examples so that the previous hypothesis gets error exactly 1/2.
Then take a big vote of all the hypotheses, weighted in some way by
how good they were on the distributions they came from. 

E.g., a typical way this might work is say the initial data is 80%
negative.  So the first hypothesis h1 might be "all negative" and we
would then reweight so the data is 50/50.  Then the next hypothesis
might be something simple, such as that some variable xi is correlated
with the target, and we would reweight so that xi is now 50/50, and so
on. Not obvious this kind of thing will work since maybe you could get
into a cycle (after reweighting so that h3 is 50/50, the algorithm
goes back and gives us h1!) but it will come out of the proof.

What could we hope for?  Suppose we want to get error epsilon, and
algorithm A always gave us hypotheses of error 1/2 - gamma where the
errors were *independent*.  How many calls would we need so that the
majority vote had error epsilon?  I.e., if we pick a random example,
we want at most an epsilon probability of the vote being wrong.  This
is like flipping a coin of bias 1/2 - gamma and we want at most an
epsilon chance of more heads than tails.  Hoeffding bounds say that
want want the number of flips T to satisfy e^{-2*T*gamma^2} = epsilon,
so the number of flips needed is O((1/gamma)^2 log(1/epsilon)).

We are going to be able to achieve this, even though A is not
necessarily so nice.  In fact, as a byproduct, this argument actually
gives a *proof* of this case of Hoeffding bounds (since giving
independent errors is one option for A).

One more point: the algorithm will in fact look a lot like the
randomized weighted majority algorithm, except the examples are
playing the role of the "experts", the weak hypotheses are playing the
role of the examples, and right is wrong, (don't worry - we'll figure it
out after we see the algorithm(!).

Adaboost
--------
0. Draw large sample of size n that will be sufficient by Occam/VC arguments.

1. Start with weight w_i = 1 on each example x_i.  
   The associated prob dist D has P(x_i) = w_i/W, where W = sum(w_i).

2. For t=1 to T do:

        Feed D to weak algorithm, get back hypothesis h_t.   [Draw picture]

	Let beta_t = (error rate)/(1 - error rate) of h_t on D

        Now update D:  Multiply w_i by beta_t if h_t(x_i) is correct.
   
        (notice: h_t now has 50% error on the new distribution)

3. Final hypothesis: let alpha_t = ln(1/beta_t).  Output a majority
   vote weighted by the alphas.  I.e., h(x) = sign(sum_t alpha_t * h_t(x)) 
   if we view labels as +/-1. 


Claim: training error(hl) < e^{-2*sum_t[ (gamma_t)^2 ]}

   where 1/2 - gamma_t = error rate of h_t on D_t.

So, if each time we have error at most 1/2 - gamma, we run it T times,
we get e^{-2T gamma^2}.  If our goal is to achieve error epsilon, then
we need to run this O((1/gamma)^2 * log(1/epsilon)) times.
(Notice similarity to Hoeffding formula)


ANALYSIS
========

To simplify things, let's assume all gamma_t's are the same.  Call
it gamma.  Then beta is fixed, and the final hypothesis is just a
straight majority vote over the h_t's.

Some notation:
     error = 1/2 - gamma  (error of each weak hypothesis)
     beta = error/(1-error) = (1/2 - gamma)/(1/2 + gamma)
     T = number of steps we run this procedure

Let's now analyze as in randomized WM, by putting upper and lower bounds
on the final weight.  For the upper bound, suppose we currently have a total
weight of W.  Since a (1-error) fraction of W gets multiplied by beta,
after this step that portion has weight only error*W, so the total
weight is 2*error*W.  Since the total weight starts at n, this means
that:

   W_final <= n*(2*error)^T  =  n*(1 - 2*gamma)^T

Now, let's look at the lower bound.  To do the lower bound, notice
that any example on which our majority vote makes a mistake must
have weight at least beta^{T/2}, since if our majority vote is wrong
then we could have multiplied it by beta *at most* T/2 times.  So, if
in the end our error rate is epsilon (so we make a mistake on epsilon*n
points), then

	W_final >= beta^{T/2} * epsilon * n

Combining our upper and lower bounds gives us:

        epsilon <= [4*error^2 / beta]^{T/2}

                 = [4*(1/2 - gamma)*(1/2 + gamma)]^{T/2}

                 = [1 - 4*gamma^2]^{T/2}.

Now we use our favorite inequality: 1-x <= e^{-x}

        epsilon <= e^{-2T*gamma^2}.   QED

=======================
So, just to summarize, if we had T hypotheses of *independent* error
1/2 - gamma, then Hoeffding bounds give us this bound on the error
rate of their majority vote.  What we are able to do with Adaboost is
achieve this error with an arbitrary weak learning algorithm, that
just guarantees that to give error at most 1/2 - gamma on whatever
distribution it's given.

What about true error versus empirical error?  What we really have at
this point is a bound on empirical error.  To have a bound on true
error we need our sample to be large enough so that not only do we get
uniform convergence over the hypothesis space H used by A, but also
we get uniform convergence over "majority votes of O(1/gamma^2 *
log(1/epsilon)) hypotheses in H".  To get a feel for this, let
MAJ_k(H) be the functions you can get by taking majorities of k
functions in H.  If there are H[m] ways of splitting m points using
functions in H, then there are at most {H[m] \choose k} ways of
splitting functions using majorities.  So, the sample size bound just
needs to increase by a multiple of k.

Margin analysis
===============
Actually, one thing people noticed in practice is that if you keep
running Adaboost past this limit, you don't end up overfitting.
There's a nice way of looking at this in terms of "L_1 margins".  

Let's define g = alpha_1 h_1 + ... + alpha_T h_T, so our final
hypothesis h(x) is sign(g(x)).  Let's scale the alpha_t's so they sum
to 1.  Now, define the margin of g (or h) on an example x of label y
to be y*g(x).  

If you look carefully at our analysis of adaboost, it ends up showing
not only that there can't be too many examples where the majority vote
is wrong (negative margin) but in fact there can't be too many where
the majority vote is barely right.  In fact if you keep running the
algorithm, the fraction of examples with margin < gamma will go to 0
(because those examples have too high a weight compared to the upper
bound of (1 - 2*gamma)^T.

What does this mean?  One thing about having margin gamma is that even
if T is huge, you can whp produce the exact same prediction that h
does by just viewing g as a probability distribution and sampling from
it k = (1/gamma^2)*log(1/epsilon) times, and then taking majority vote
over the sample.  So, as long as we have enough data so that these
"small majority votes" aren't overfitting, then h can't be overfitting
by much either.  Let's do this more formally.  Let h' be the
randomized prediction function that given a new example x predicts the
majority vote over k random draws from distribution g.  Then the true
error rates must satisfy err(h') >= err(h)/2 since if h(x) is
incorrect, this means that each random draw from g has at least a 50%
chance of giving the wrong answer on x, so h' has at least a 50%
chance of being incorrect.  But we argued that the empirical error of
h' is low [because if h has margin gamma on x then whp h'(x)=h(x)].
Furthermore we have enough data so that whp *all* small majority-vote
functions have empirical error close to their true error, which
implies the empirical error of h' is close to its true error.
Therefore the true error of h must be low too.