15-859(A) Machine Learning Theory 02/10/04
* Weak vs Strong learning. Boosting
========================================================================
Today: take a look at some basic questions about the PAC model. This
will lead us into our next topic: boosting.
We defined the PAC model like this: (prediction version: we assume the
target is in class C, but learner can output any poly-time evaluatable
hypothesis).
Def1: Alg A PAC-learns concept class C if for any c in C, any
distribution D, any epsilon, delta > 0, with probability 1-delta, A
produces a poly-time evaluatable hypothesis h with error at most
epsilon.
Goal is to do this with # examples and running time polynomial in
relevant parameters.
ISSUE #1: DO WE NEED TO SAY "FOR ALL DELTA"?
===========================================
What if we change the definition to say delta = 1/2? Or, even delta =
1 - 1/poly(n)? I.e., we only require *some* noticeable chance of success.
Def2: same as Def1 but replace "any delta > 0" with "some delta <
1-1/poly(n)".
If C is learnable using Def2, is it also learnable using Def1? Yes.
Proof:
Say we're given alg A that works using Def2 with chance of
success 1/n^c. We'll run it n^c * ln(3/delta) times, with epsilon/4
as error parameter.
Pr(all failed) <= (1 - 1/n^c)^{n^c * ln(3/delta)}
<= e^{-ln(3/delta)}
= delta/3.
Now, draw a new sample of m points and test the hypotheses on it,
picking one whose observed error is < epsilon/2. Use Chernoff bounds:
Pr[sum > 2*(expectation)] <= e^{-(expectation)/3}
Pr[sum < (expectation)/2] <= e^{-(expectation)/8}
=>Pr(the good hyp looks bad) < e^{-m*epsilon/12}.
=>Pr(a given bad hyp looks good) < e^{-m*epsilon/8}
Want the first quantity to be less than delta/3, and the second to be
less than delta/(3*(# of hypotheses)) = delta/(3n^c * ln(3/delta))
[this looks messy, but the log of its reciprocal is just O(log(n/delta))].
Solve: m = O((1/epsilon)*log(n/delta)).
So, if A ran in polynomial running time, then new algorithm does too.
Main blowup is in the 1st part: a factor of O(n^c * log(1/delta)).
So, delta not so important. Can fix to 1/2 or even 1 - 1/n^c and still
get same notion of "learnability".
[Another way to do this argument: run it once, then test on test set.
If good halt. If not, repeat. Then argue that whp don't have to
repeat too many times.]
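As a sketch, the run-it-many-times-and-validate scheme might look like the
following (hypothetical names: `run_A` stands in for algorithm A, and we
assume we can draw fresh labeled examples for validation):

```python
import random

def amplify(run_A, draw_example, target, repeats, m):
    """Boost the confidence of a learner that only sometimes succeeds:
    run it many times, then keep the hypothesis with the smallest
    observed error on a fresh validation sample of m points."""
    candidates = [run_A() for _ in range(repeats)]
    validation = [draw_example() for _ in range(m)]
    def observed_error(h):
        return sum(h(x) != target(x) for x in validation) / m
    return min(candidates, key=observed_error)
```

With repeats = n^c * ln(3/delta) and m = O((1/epsilon)*log(n/delta)), the
Chernoff calculation above says the winner has true error at most epsilon
with probability at least 1 - delta.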
ISSUE #2: DO WE NEED TO SAY "FOR ALL EPSILON"?
=============================================
Def 3: Let's say that alg A WEAK-LEARNS class C if for all c in C, all
distributions D, THERE EXISTS epsilon, delta > 1/poly(n) s.t. A
achieves error 1/2 - epsilon with prob at least delta.
I.e., with some probability you get noticeable correlation.
Question: suppose we defined the PAC model this way; does this change
the notion of what is learnable and what is not?
Answer: it doesn't. Given an algorithm that satisfies Def3, we can
"boost" it to an algorithm that satisfies Def1.
First of all, we can handle the delta by testing on a test set, and if
it failed, we repeat. So, in the following, we'll ignore delta and
assume that each time, we get a hypothesis of error <= 1/2 - epsilon.
Boosting: preliminaries
-----------------------
1. We're going to have to use the fact that algorithm A works (in
this weak way) over *every* distribution. If you look at the
"distribution specific" version of this question, the answer is NO.
E.g., say your target function was "x is positive if x1 = 1 or if x
satisfies some hard crypto property (e.g., viewed as an integer, x
is a quadratic residue mod some large N)". Then, it's easy to get
error 25% over uniform random examples: you predict positive if
x1=1 and whatever you want otherwise. But you can't do any better.
The problem is that there is this hard core to the target function.
Boosting shows this situation is in a sense universal. If you have
some arbitrary learning algorithm that supposedly works well, we'll
view it as a black box and either boost up its accuracy as much as we
like, or else we'll find a distribution D where that algorithm gets
error > 1/2 - epsilon.
2. It will make things a little simpler to imagine we start by
drawing some large sample S, and then add an assumption that A is
choosing hypotheses from some class of limited VC-dimension d, so
that for an appropriate-size sample we can assume empirical error
is close to true error for A's hypotheses. That way we can just
talk about drawing one large sample, and doing various things with
the sample.
An easy case: one-sided error
-----------------------------
Just to get a feel for a simple case, imagine our weak-learning
algorithm has the property that, given a sample S, it either gets all
the + points correct and at least a 1/k fraction of the - points
correct, or vice versa.
Example: imagine we're learning a convex region in 2-D. Natural
thing is to see if we can find a line with all the + points on one
side, and some reasonable fraction of - points on the other. If
target is a polygon with k sides, then by trying all O(m^2) lines
defined by pairs of points in our set, we can get all + points right
and at least 1/k of - points right too.
In this case, can view weak hypotheses as either making a prediction
or saying "not sure". (E.g., in above example, hyps are predicting -
if point is on one side of line and saying "not sure" on the other).
Then we can do boosting by creating a decision list. We first run A
to find h_1, and put as top rule in the DL. Let S_2 be points in S
where h_1 said "not sure". Then run A again on S_2 and so on.
Finally, when the leftover sample is pure enough, we just make a
default prediction. In the above example, all the h_i would predict
negative when they make a prediction, so this is the same as ANDing
them all together.
Since each step crosses off at least a 1/k fraction of one of the
classes, we only need to run O(k * log(1/epsilon)) times to get error
down to epsilon.
Note: in practice, you'd stop when the new rules are no longer
statistically significant given the sample. Really, in practice you
don't know in advance if your weak learner will succeed or fail: you
just keep running on the new datasets produced until you get to a
state where your weak learner no longer has anything to contribute.
Basic idea here: focus on data where previous hyp had trouble. Force
next one to learn something *new*.
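As a sketch, the decision-list construction might look like the following
(assuming the caller supplies the weak learner `find_rule`, which returns
an abstaining rule: a function mapping a point to a label, or to None for
"not sure"):

```python
def boost_one_sided(S, label, find_rule, max_rounds=1000):
    """Build a decision list from abstaining weak rules: each round,
    run the weak learner on the still-unhandled points, put its rule
    next in the list, and recurse on the points where it abstained."""
    rules = []
    remaining = list(S)
    for _ in range(max_rounds):
        # Stop once the leftover sample is pure (or empty).
        if len({label(x) for x in remaining}) <= 1:
            break
        h = find_rule(remaining)           # x -> 0, 1, or None
        rules.append(h)
        remaining = [x for x in remaining if h(x) is None]
    default = label(remaining[0]) if remaining else 0
    def hypothesis(x):
        for h in rules:
            p = h(x)
            if p is not None:
                return p
        return default
    return hypothesis
```

In the convex-region example above, each rule predicts negative on one
side of a line and abstains on the other, so the resulting decision list
is the AND described earlier.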
The general case: 2-sided error
===============================
Now let's go to the general case of boosting hypotheses that make
2-sided error. We'll first talk about Schapire's original method (the
one described in the textbook) and then in the next class we'll talk
about Adaboost (by Freund and Schapire) that does this more
efficiently and gives a more practical algorithm in the end. Both
have the same basic principle: feed in a series of distributions (or
weighted datasets) based on the hyps so far, done in a way that forces
the alg to give us some *new* info.
To reduce notation, let's fix some weak error rate. Assume that we
have an algorithm A that over any distribution will get at least 70%
accuracy with high probability. Say we want to boost it up a bit
(which we can then apply recursively).
We start by running A on S, and getting hypothesis h_1. Say its
accuracy is 70%. Now we want a new distribution.
One try: only look at examples in S where h_1 predicts incorrectly and
try to learn over that. This DOESN'T work. Why? [One problem: on
that set, the complement of h_1 is 100% accurate, so the weak learner
can succeed by just returning the opposite of h_1, which gives us no
new information.]
Instead: we want to create a distribution where h_1 behaves like random
guessing. Specifically, let S_C be the set of examples in S on
which h_1 predicts correctly and S_I be the set on which h_1
predicts incorrectly. What we want is a distribution D_2 where
S_C and S_I both have equal weight. We can do that by creating a
non-uniform distribution over S, where each example in S_C gets
probability 1/(2|S_C|) and each example in S_I gets probability
1/(2|S_I|).
Another way to think of this: let D_1[x] be the original probability
on x (i.e., 1/|S|). Then to create D_2 we reduced the probability on
S_C from 70% to 50%, so if x in S_C then D_2[x] = (5/7)D_1[x]. We
increased the probability on S_I from 30% to 50% so if x in S_I then
D_2[x] = (5/3)D_1[x].
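A sketch of this reweighting step (here `correct(x)` is assumed to tell us
whether h_1 classified x correctly):

```python
def make_D2(S, correct):
    """Construct D_2: h_1's correct set S_C and incorrect set S_I each
    get total probability 1/2, so h_1 is no better than random
    guessing under D_2."""
    S_C = [x for x in S if correct(x)]
    S_I = [x for x in S if not correct(x)]
    return {x: 1.0 / (2 * len(S_C)) if correct(x) else 1.0 / (2 * len(S_I))
            for x in S}
```

E.g., with |S| = 10 and h_1 correct on 7 points, each correct point's
weight drops from 1/10 to 1/14 (a factor of 5/7) and each incorrect
point's weight rises to 1/6 (a factor of 5/3), matching the computation
above.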
Now we get a hypothesis h_2 with accuracy 70% on D_2.
Finally, let's feed into A the examples in S where h_1(x) != h_2(x).
We then get hypothesis h_3 with accuracy 70% over this set.
Then we put the three together by taking majority vote. Or, can think
of it this way: if h_1 and h_2 agree, then go with that, else go with
h_3.
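As code, the combination rule might look like:

```python
def majority_vote(h1, h2, h3):
    """Majority of three hypotheses; equivalently: if h1 and h2 agree,
    go with them, otherwise defer to h3."""
    def h(x):
        a, b = h1(x), h2(x)
        return a if a == b else h3(x)
    return h
```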
Analysis of original scheme
---------------------------
What's the accuracy of the new hypothesis?
Easiest way is to draw a picture. Divide S into 4 regions: R_1 in
which h_1 = c and h_2 = c, R_2 in which h_1 = c and h_2 != c, R_3 in
which h_1 != c and h_2 != c, and R_4 in which h_1 != c and h_2 = c.
We predict correctly on R_1, incorrectly on R_3 and get 70% correct on
R_2 U R_4. Let's say that D_2[R_2] = gamma. By the fact that h_2 has
30% error on D_2 we know that D_2[R_3] = 0.3 - gamma. By definition of
D_2 we know D_2[R_1] = 1/2 - gamma, and D_2[R_4] = 0.2 + gamma.
Working back to our original uniform distribution over S, we get:
D_1[R_1] = 7/5 * (1/2 - gamma)
and
D_1[R_3] = 3/5 * (3/10 - gamma).
We can now work out our error rate over S as:
Pr[fail] = D_1[R_3] + (3/10)(1 - D_1[R_1] - D_1[R_3])
= (7/10)D_1[R_3] + (3/10)(1 - D_1[R_1])
= (7/10)[3/5 * (3/10 - gamma)] + (3/10)[3/10 + (7/5)gamma]
[the (21/50)gamma terms cancel]
= (3/10)^2[(7/5) + 1]
which is exactly 0.216.
More generally, if we work this through, we get:
error <- 3*error^2 - 2*error^3.
This is always a decrease assuming that our original error was
strictly between 0 and 1/2.
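A quick numeric sketch of the recurrence:

```python
def boosted_error(e, rounds):
    """Iterate the recurrence e <- 3e^2 - 2e^3 (one recursive level
    of the boosting construction per round)."""
    for _ in range(rounds):
        e = 3 * e**2 - 2 * e**3
    return e
```

Starting from error 0.3 this gives 0.216, then about 0.12, then about
0.04, heading rapidly toward 0.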
Note: if you had 3 hypotheses of error rate p with *independent*
errors, then the error rate of the majority vote would be p^3 +
3p^2(1-p) = 3p^2 - 2p^3.
This is exactly the bound we showed for Schapire's algorithm. So in
a sense, it's really the best we could hope for from 3 calls. And
what the reweighting is doing is, in a sense, forcing the hyps to be
as useful as if their errors were independent.
Or can look at it this way. Let bias = accuracy - error = 1 - 2*error.
Then new bias = (old bias) + 2(old bias)(old accuracy)(old error)
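As a sanity check (a hypothetical helper, not part of the lecture), the
bias form of the recurrence agrees numerically with error <- 3*error^2 -
2*error^3:

```python
def bias_check(e):
    """Compare 1 - 2*(new error) against
    (old bias) + 2*(old bias)*(old accuracy)*(old error)."""
    bias = 1 - 2 * e
    new_error = 3 * e**2 - 2 * e**3
    return (1 - 2 * new_error) - (bias + 2 * bias * (1 - e) * e)
```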