Notes for 15-681 Machine Learning, 10/24/96
Today's topic: Boosting
- theory
- an application
=============================================================================
THE IDEA:
In the PAC model, we proved PAC guarantees for learning simple things:
conjunctions, decision lists, rectangles, intervals,
linear separators (we didn't actually prove this one...)
In practice, we often want more expressive hypotheses, even though we
no longer can prove PAC-style convergence guarantees:
- decision trees (ID3)
- Neural Nets (Backprop)
(It's not just that we can't prove it -- in fact the algorithms
don't guarantee PAC convergence.  E.g., can you think of a target
concept representable as a small decision tree, and a distribution on
examples, where ID3 will fail, in the sense that it needs
exponentially many examples to learn?)
Today: an idea to bridge the gap: BOOSTING.
+----------+
Distrib D on examples-->| alg A |---> hypothesis
+----------+
| |
| |
| V
| +----------+
+--------->| Boosting |
+----------+
| |
+---------+ +------------+
| OR |
V V
Bad distrib D' for A. Hypothesis with 99% accuracy over D
(A gets < 51% accuracy)
In other words, given any algorithm at all, we can use it either to
achieve a PAC-style guarantee for our problem, or to get a witness
that the algorithm doesn't PAC-learn.
Note: We will need to run A several times on independent sets of
examples, so in practice it's only useful when we have lots of data.
Corollary: Suppose there exists an algorithm A that PAC learns
concept class C, but only for epsilon=1/3 (i.e., for all c in C, all
distribs D, all delta>0, A achieves error < 1/3 in polynomial time and
samples). THEN, C is PAC learnable.
Proof: *ask*
In other words, the PAC model is robust: if we fix a specific epsilon
< 1/2 instead of saying "for all epsilon", then this doesn't change
the set of learnable things.
==========================================================================
How can we do this? Let's do an easy case first:
Suppose algorithm A has only one-sided error. Given a set of data, it
tries to find a hypothesis that gets all of the positives right and at
least 10% of the negatives right too. E.g., suppose the target concept
is a circle, with positives inside and negatives outside, and A looks
for a line that has all the positives on one side and at least 10% of
the negatives on the other.
How to boost?
Run A on the data. It finds a line L_1. Throw out the negatives
"outside" L_1 and then repeat on the data remaining. If A continues
to get at least 10% of the negatives remaining right, then after S
lines, the intersection of our positive regions gets all the positive
points right and gets at most (0.9)^S of the negative points wrong.
So, to get error epsilon, need to run O(log 1/epsilon) times.
(If A ever fails, then we've found a dataset that's hard for A.)
Are we overfitting?  Not too badly, since the number of degrees of
freedom (and the VC dimension) is proportional to S, which is only
logarithmic in 1/epsilon.  So the amount of data needed goes up, but
not by too much.
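The easy case above can be sketched in code.  This is a hedged toy
version: the target is the unit circle, the data are sampled uniformly
from a square, and the "weak learner" is a hypothetical stand-in that
keeps trying random tangent halfplanes until one cuts off at least 10%
of the remaining negatives (only the 10% requirement comes from the
discussion above; the rest is illustration).

```python
import math
import random

random.seed(0)

def inside_circle(p):
    """Target concept: positives are inside the unit circle."""
    return p[0] ** 2 + p[1] ** 2 <= 1.0

# Sample data from a uniform distribution on [-2, 2]^2.
points = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(2000)]
positives = [p for p in points if inside_circle(p)]
negatives = [p for p in points if not inside_circle(p)]

def weak_learner(neg):
    """Hypothetical one-sided weak learner: returns a halfplane
    u.x > 1 tangent to the circle (so it contains no positives)
    that cuts off at least 10% of the remaining negatives."""
    while True:
        theta = random.uniform(0, 2 * math.pi)
        u = (math.cos(theta), math.sin(theta))
        cut = [p for p in neg if u[0] * p[0] + u[1] * p[1] > 1.0]
        if len(cut) >= 0.1 * len(neg):
            return u

# Boosting loop: throw out the negatives the new line gets right,
# and repeat on the negatives remaining until none are left.
lines, remaining = [], list(negatives)
while remaining:
    u = weak_learner(remaining)
    lines.append(u)
    remaining = [p for p in remaining if u[0] * p[0] + u[1] * p[1] <= 1.0]

def h(p):
    """Final hypothesis: intersection of the positive sides of all lines."""
    return all(u[0] * p[0] + u[1] * p[1] <= 1.0 for u in lines)

print(len(lines))  # number of lines needed to clear all sampled negatives
```

On the sample itself, h gets every positive right and every negative
right; the O(log 1/epsilon) bound above says the number of lines
needed grows only logarithmically as we shrink the allowed error.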
==========================================================================
General case:
Suppose algorithm A claims to get at least 70% accuracy over every
probability distribution. Goal is to boost it up a bit (which we can
then apply recursively).
We start by running A on initial distribution D and getting hypothesis
h_1. Say its accuracy is 70%.
***Now we want a new distribution that forces A to give us more information.***
Now, filter D to create a distribution where h_1 behaves like random
guessing. Specifically, let CORRECT be the set of examples on which
h_1 predicts correctly and INCORRECT be the set on which h_1
predicts incorrectly. What we want is a distribution D_2 where
CORRECT and INCORRECT both have equal weight, but besides that is
just like D.
Notation: D[x] is probability of x under distribution D (or
the density of x if we're in a continuous setting). D[S] for a
set of examples S is the sum (or integral) of D over x in S.
Right now, D[CORRECT] = 0.7, D[INCORRECT] = 0.3. So, we want D_2
defined like this:
D_2[x] = 5/7 * D[x], for x in CORRECT
D_2[x] = 5/3 * D[x], for x in INCORRECT
But, all we have is access to D. How can we use D to create D_2?
One way: for each example desired, we flip a coin. If heads, we wait
until we see an example from D where h_1 is correct, and if tails we
wait until we see one where h_1 is incorrect.
Another way: We draw examples from D and place them into two queues,
one for CORRECT and one for INCORRECT, and then to get an example from
D_2 we choose a queue uniformly at random and select from it.
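The coin-flip filter can be sketched as follows.  Here draw_from_D and
h1_correct are hypothetical stand-ins, not anything from the lecture:
D is uniform on {0,...,9} and h_1 is correct on seven of the ten
values, matching the 70% figure above.

```python
import random

random.seed(1)

def draw_from_D():
    """Stand-in for the original distribution D: uniform on 0..9."""
    return random.randrange(10)

def h1_correct(x):
    """Stand-in for h_1: correct on 0..6, so 70% accuracy under D."""
    return x < 7

def draw_from_D2():
    """Rejection sampling: flip a fair coin; on heads wait for an
    example where h_1 is correct, on tails wait for one where it is
    incorrect.  Under D_2, CORRECT and INCORRECT each get weight 1/2,
    and within each set the examples keep their relative weights."""
    want_correct = random.random() < 0.5
    while True:
        x = draw_from_D()
        if h1_correct(x) == want_correct:
            return x

# Empirically, h_1 now behaves like random guessing.
sample = [draw_from_D2() for _ in range(20000)]
frac_correct = sum(h1_correct(x) for x in sample) / len(sample)
print(frac_correct)  # close to 1/2
```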
Now we run A to get a hypothesis h_2 with accuracy 70% on D_2.
Finally, we filter D only allowing examples x where h_1(x) != h_2(x).
We then get hypothesis h_3 with accuracy 70% on this distribution
D_3.
RESULTING HYPOTHESIS: if h_1 and h_2 agree, then go with that,
else go with h_3. I.e., majority vote.
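The combination rule is two lines of code; h1, h2, h3 below are
placeholder classifiers used only to check that the tie-breaking rule
really is a majority vote.

```python
def combined(h1, h2, h3, x):
    """If h_1 and h_2 agree, go with them; otherwise defer to h_3.
    Since the deciding vote only matters on a disagreement, this is
    exactly a majority vote of the three hypotheses."""
    a, b = h1(x), h2(x)
    return a if a == b else h3(x)

# Placeholder classifiers for a sanity check:
h1 = lambda x: x % 2 == 0
h2 = lambda x: x % 3 == 0
h3 = lambda x: x % 5 == 0
for x in range(30):
    votes = [h1(x), h2(x), h3(x)]
    assert combined(h1, h2, h3, x) == (sum(votes) >= 2)
```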
========================================================================
Analysis: What's the accuracy of the new hypothesis?
Easiest way is to draw a picture. Divide space into 4 regions:
R_1: h_1 is correct and h_2 is correct
R_2: h_1 is correct and h_2 is incorrect
R_3: h_1 is incorrect and h_2 is correct
R_4: h_1 is incorrect and h_2 is incorrect
We predict correctly on R_1, incorrectly on R_4 and get 70% correct on
R_2 and R_3. So,
accuracy = D[R_1] + 0.7*D[R_2] + 0.7*D[R_3]
We know that:
(1) D[R_1] + D[R_2] = 0.7
(2) D_2[R_1] + D_2[R_3] = 0.7
combining (2) with our definition of D_2 gives us:
(3) 5/7 * D[R_1] + 5/3 * D[R_3] = 0.7
now we just rewrite our equation for accuracy like this:
accuracy = 3/10*D[R_1] + 7/10*D[R_3] + 7/10*D[R_1] + 7/10*D[R_2]
The first two terms are the same proportions as eqn (3) and the second
two terms let us use eqn (2). So, we get
accuracy = 3/5 * (7/10)^2 + (7/10)^2 = 0.784
In general, if you work it out, you get:
new accuracy = 3*(old accuracy)^2 - 2*(old accuracy)^3
(think of 3/5 as 2*(1 - old accuracy))
This is always an increase, assuming our original accuracy was
strictly between 1/2 and 1.
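A quick numeric check of the formula, using the 70% example worked
above:

```python
def boosted(p):
    """New accuracy after one round of boosting: 3p^2 - 2p^3."""
    return 3 * p ** 2 - 2 * p ** 3

print(round(boosted(0.7), 3))  # 0.784, matching the calculation above

# Strictly better than p for 1/2 < p < 1, with fixed points at the ends:
assert all(boosted(p) > p for p in [0.55, 0.7, 0.9, 0.99])
assert boosted(0.5) == 0.5 and boosted(1.0) == 1.0
```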
Note: in getting D_2, we might have tried: filter D to only look at
examples where h_1 predicts incorrectly and try to learn over that.
This DOESN'T work. Why?
Details: If our hypotheses h_i did better than 70% accuracy, then it's not
hard to see that this method does even better. As we get accuracies close
to 1, then it gets harder to simulate the distribution D_2, but this extra
time is linear in 1/epsilon so it's OK.
==========================================================================
Application to recognizing handwritten digits: see slides.
==========================================================================
Refs:
Robert Schapire, "The strength of weak learnability", Machine
Learning, 5(2):197-227, 1990.
Yoav Freund, "Boosting a weak learning algorithm by majority",
Information and Computation, 121(2):256-285, 1995.
Yoav Freund and Robert Schapire, "Experiments with a new boosting
algorithm", ICML '96.