15-859(B) Machine Learning Theory 03/01/06
* Boosting, contd: Adaboost.
========================================================================
Basic setup:
Suppose we have an algorithm A that claims it can achieve error at
most p = 1/2 - gamma on every distribution (let's ignore the delta
parameter). How can we use this as a black box to get error
epsilon?
Or, as a practical matter: say we have a learning algorithm that's
good at finding rules-of-thumb. How can we use it to produce a
really good classifier?
Last time we looked at one method for reducing error (Rob Schapire's
original approach). This involved calling A 3 times on specially
reweighted distributions and taking majority vote. If hypotheses
produced by A have error p, then the result has error 3p^2 - 2p^3.
One thing we didn't finish last time: Assuming alg A produced
hypotheses of error p = 1/2 - gamma, how deep of a tree do we need to
get error rate down to epsilon? We can split this into two parts.
First, what depth is needed to get to error rate 1/4? If we define
bias = accuracy-error, we can rewrite our formula as:
new bias = old bias + 2(old bias)(old accuracy)(old error).
For gamma < 1/4 (i.e., while the error rate is still above 1/4), we
get: new bias >= (old bias)(1 + 3/8).
Initial bias is 2*gamma, so we only need depth O(log(1/gamma)) to
bring the error rate down to 1/4.
Now, once our error rate p is < 1/4, we can use the fact that after each
iteration, we get error rate at most 3p^2. Since p_new <= 3p^2 means
3*p_new <= (3p)^2, the error rate after k iterations is at most
(3p)^{2^k}. So, we only need O(loglog(1/epsilon)) iterations.
So, the *size* of the tree (which is 3^depth) is poly(1/gamma, log(1/epsilon)).
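As a quick numerical sanity check (this code is not from the lecture), we can
iterate the error recursion p -> 3p^2 - 2p^3 and count how many levels of the
tree are needed; the depth stays small, matching the
O(log(1/gamma) + loglog(1/epsilon)) bound:

```python
def depth_needed(gamma, epsilon):
    """Levels of 3-way majority needed to push error 1/2 - gamma below epsilon."""
    p = 0.5 - gamma
    depth = 0
    while p > epsilon:
        p = 3 * p**2 - 2 * p**3  # error of a majority vote over three copies
        depth += 1
    return depth
```

For example, with gamma = 0.1 only a handful of levels suffice even for
epsilon = 10^{-6}, and squaring the precision to 10^{-12} adds just one
or two more levels, illustrating the loglog(1/epsilon) behavior.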
==============================================
Adaboost idea: Let's just keep repeating step 2 of the previous
scheme, each time reweighting the examples so that the previous
hypothesis gets error exactly 1/2. Then take a big vote of all the
hypotheses, weighted in some way by how good they were on the
distributions they came from. In fact, this will look a lot like the
randomized weighted majority algorithm, except the examples are
playing the role of the "experts".
What could we hope for? Suppose we want to get error epsilon, and A
always gave us hypotheses of error 1/2 - gamma where the errors were
*independent*. How many calls would we need so that the majority vote
had error epsilon? I.e., if we pick a random example, we want at most an
epsilon probability of the vote being wrong. This is like flipping a
coin of bias 1/2 - gamma and we want at most an epsilon chance of more
heads than tails. Hoeffding bounds say that we want the number of
flips T to satisfy e^{-2*T*gamma^2} = epsilon, so the number of flips
needed is O((1/gamma)^2 log(1/epsilon)).
We are going to be able to achieve this, even though A is not
necessarily so nice. In fact, as a byproduct, we'll see that this
argument gives us a *proof* of Hoeffding bounds.
Adaboost
--------
First draw a large sample that will be sufficient by Occam/VC arguments.
We use Weighted-majority, but in a peculiar way. Experts will correspond
to examples. Weighting of experts will be the current distribution.
"trials" will correspond to the weak hypotheses.
Specifically, the algorithm is:
0. Draw n examples.
1. Start with weight w_i = 1 on each example x_i.
The associated prob dist D has P(x_i) = w_i/W, where W = sum(w_i).
2. For t=1 to T do:
Feed D to weak algorithm, get back hypothesis h_t.
Let beta_t = (error rate)/(1 - error rate) of h_t on D
Now update D: Multiply w_i by beta_t if h_t(x_i) is correct.
(notice: h_t now has 50% error on the new distribution)
3. Final hypothesis: say "positive" if sum_t[ alpha_t * h_t(x) ] > 1/2
where alpha_t = ln(1/beta_t) / sum_{t'}[ ln(1/beta_{t'}) ]
Claim: training error(h_final) < e^{-2*sum_t[ (gamma_t)^2 ]}
where 1/2 - gamma_t = error rate of h_t on D_t.
So, if each time we have error at most 1/2 - gamma and we run for T
rounds, we get training error at most e^{-2T gamma^2}. If our goal is
to achieve error epsilon, then we need to run this
O((1/gamma)^2 * log(1/epsilon)) times.
(Notice similarity to Hoeffding formula)
ANALYSIS
========
To simplify things, let's assume all gamma_t's are the same. Call
it gamma. Then beta is fixed, and the final hypothesis is just a
straight majority vote over the h_t's.
Some notation:
error = 1/2 - gamma (error of each weak hypothesis)
accuracy = 1 - error = 1/2 + gamma
beta = error/accuracy
T = number of steps we run this procedure
Let's now analyze as in randomized WM, by putting upper and lower bounds
on the final weight. For the upper bound, we use the fact that the
weight starts at "n", and then on each step, we multiply an "accuracy"
fraction of the weight by beta. So, we have (the inequality is really
an equality if all weak hyps have error exactly 1/2 - gamma):
W_final <= n*(1 - (1-beta)*accuracy)^T.
Just simplifying this a little,
1 - (1-beta)*accuracy = 1 - accuracy + error = 2*error.
So we have:
W_final <= n*(2*error)^T
Now, let's look at the lower bound. To do the lower bound, notice
that any example on which our final hypothesis makes a mistake must
have weight at least beta^{T/2}, since if our majority vote is wrong
then we could have multiplied it by beta *at most* T/2 times. So, if
in the end our error rate is epsilon (so we make a mistake on epsilon*n
points), then
W_final >= beta^{T/2} * epsilon * n
Combining our upper and lower bounds gives us:
epsilon <= [(2*error)^2 / beta]^{T/2}
= [4*accuracy*error]^{T/2}
= [1 - 4*gamma^2]^{T/2}.
Now we use our favorite inequality: 1-x <= e^{-x}
epsilon <= e^{-2T*gamma^2}. QED
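A small numeric check (not part of the lecture) that the algebra above is
consistent: combining the weight bounds really does collapse to
(1 - 4*gamma^2)^{T/2}, which is at most e^{-2*T*gamma^2}:

```python
import math

gamma = 0.1                                  # hypothetical value
error, accuracy = 0.5 - gamma, 0.5 + gamma
beta = error / accuracy

for T in (2, 10, 50):
    combined = (2 * error)**T / beta**(T / 2)    # (upper bound)/(lower bound)
    closed_form = (1 - 4 * gamma**2)**(T / 2)    # = (4*accuracy*error)^{T/2}
    assert math.isclose(combined, closed_form)
    assert closed_form <= math.exp(-2 * T * gamma**2)   # 1-x <= e^{-x}
```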
=======================
So, just to summarize, if we had T hypotheses of *independent* error
1/2 - gamma, then Hoeffding bounds give us this bound on the error
rate of their majority vote. What we are able to do with Adaboost is
achieve this error with an arbitrary weak learning algorithm that
just guarantees error at most 1/2 - gamma on whatever distribution
it's given.
What about true error versus empirical error? What we really have now
is a bound on empirical error. We can then use the fact that we only
called A a small number of times to argue how many samples we need so
that this gives us good bounds on true error also. Actually, you can
also get a bound via a margin analysis.
Proof of (special case of) Hoeffding inequality
================================================
Our argument actually gives a proof that if you flip T coins of bias p
(probability p of heads), and S is the fraction that come up heads, then
 Pr(S > p + gamma) <= e^{-2*T*gamma^2}
for p = 1/2 - gamma.
Here's why. If we think formally about this experiment, there are 2^T
"elementary events" in the sample space, each corresponding to one of
the possible T-tuples of outcomes, and each having its own
probability. E.g., the one corresponding to all heads gets
probability p^T. Let's view them as the "examples" or "experts", and
view the probabilities as weights. So initially the total weight is 1.
If we take all those elementary events corresponding to the first coin
coming up tails and multiply their weights by beta, this doesn't
affect the bias of the 2nd coin because of independence. I.e., if you
add up the weights of the elementary events corresponding to the
second coin being heads, it will be a p fraction of the weight.
[e.g., draw out example with T=3]. Same with 3rd etc.
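Here is a small script spelling out the T=3 example (the bias value is a
hypothetical choice, not from the notes): multiplying the weight of every
elementary event where coin 1 came up tails by beta shrinks the total
weight to 2p but leaves coin 2's heads-fraction of the weight unchanged:

```python
from itertools import product

p = 0.3                        # heads probability (hypothetical value)
beta = p / (1 - p)
T = 3

# weight of each T-tuple of outcomes = its probability (1 means heads)
weights = {seq: p**sum(seq) * (1 - p)**(T - sum(seq))
           for seq in product((0, 1), repeat=T)}
assert abs(sum(weights.values()) - 1.0) < 1e-12  # total weight starts at 1

# multiply the weight of every event where coin 1 came up tails by beta
for seq in weights:
    if seq[0] == 0:
        weights[seq] *= beta

# total weight drops to p + (1-p)*beta = 2p, but by independence coin 2's
# heads mass is still a p fraction of the total
total = sum(weights.values())
heads2 = sum(w for seq, w in weights.items() if seq[1] == 1)
assert abs(total - 2 * p) < 1e-12
assert abs(heads2 / total - p) < 1e-12
```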
So, we can just use the same argument as above. We have:
W_final <= 1*(2p)^T
and also for each elementary event with more heads than tails, it got
multiplied by beta at most T/2 times, so its final weight is at most
beta^{T/2} times its initial weight. The total initial weight of
these events is Pr(S > p + gamma). The total final weight of these
events is at least beta^{T/2} times this. So,
W_final >= beta^{T/2} * Pr(S > p + gamma).
Solving as before, we get:
Pr(S > p + gamma) < (4p^2 / beta)^{T/2}
= (4p(1-p))^{T/2}
= (1 - 4*gamma^2)^{T/2}
< e^{-2T*gamma^2}
It's interesting to try to solve this using general values of p. I'm
not sure if one can get the exact constants right with this analysis,
but you get the right form of the bounds.