15-859(B) Machine Learning Theory 02/13/07
* Basic definitional questions in the PAC model
* Weak vs Strong learning. Boosting
========================================================================
Today: take a look at some basic questions about the PAC model. This
will lead us into our next topic: boosting.
We defined the notion of PAC-learnability like this (prediction
version where not restricting form of hypothesis):
Defn1: Alg A PAC-learns concept class C if for any c in C, any
distribution D, and any epsilon, delta > 0, with probability at least
1-delta, A produces a poly-time evaluatable hypothesis h with error at
most epsilon.
Say C is PAC-learnable if exists A that does this with # examples and
running time polynomial in relevant parameters.
ISSUE #1: DO WE NEED TO SAY "FOR ALL DELTA"?
===========================================
What if we change the definition to say delta = 1/2? Or, even change
"for any delta > 0, succeeds with probability at least 1-delta"
to
"for some delta' > 0, succeeds with probability at least delta'"
where we will just require delta' >= 1/poly(n). I.e., there is *some*
noticeable chance it succeeds. Let's call this "Defn2".
If C is learnable using Defn2, is it also learnable using Defn1? Yes.
Proof:
Say we're given alg A that works using Def2 with success probability
delta'. Idea: just run it and test the hypothesis h produced on a
test set: if h is good then halt and if not, repeat. How many times
do we need to repeat so that with probability at least 1-delta/2, we
had at least one success?
Answer: N = (1/delta')log(2/delta) is sufficient.
Now, let's look at how much test data we need to verify the
hypotheses. Suppose when we run A, we use error parameter epsilon/2.
So, we just need to distinguish "<= epsilon/2" versus "> epsilon".
So, it boils down in the worst case to needing to estimate the error
rates of N hypotheses, each Theta(epsilon), to within a multiplicative
+/- 25%, with at most a delta/2 chance of failure overall. Plugging
into Chernoff bounds, we find that O((1/epsilon) log(N/delta)) test
examples are sufficient.
So, delta not so important. Can fix to 1/2 or even 1 - 1/n^c and still
get same notion of "learnability".
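The hypothesis-testing argument above can be sketched in code. This is a
minimal illustrative sketch; the interface (boost_confidence, eval_error,
the early-stopping threshold) is an assumption for illustration, not from
the notes:

```python
import math

def boost_confidence(weak_pac_learner, eval_error, epsilon, delta, delta_prime):
    """Turn an alg that succeeds with probability only delta' (Defn2)
    into one that succeeds with probability 1 - delta (Defn1).

    weak_pac_learner(eps): with probability >= delta_prime, returns a
        hypothesis of error <= eps.
    eval_error(h, n): estimates h's error on n fresh test examples.
    """
    # Enough runs that at least one succeeds w.p. >= 1 - delta/2.
    runs = math.ceil((1.0 / delta_prime) * math.log(2.0 / delta))
    # Test-set size from the Chernoff argument: enough to tell
    # error <= epsilon/2 from error > epsilon for all `runs` hypotheses.
    test_size = math.ceil((16.0 / epsilon) * math.log(2.0 * runs / delta))
    best_h, best_err = None, float("inf")
    for _ in range(runs):
        h = weak_pac_learner(epsilon / 2.0)      # aim for error epsilon/2
        err = eval_error(h, test_size)
        if err < best_err:
            best_h, best_err = h, err
        if best_err <= 0.75 * epsilon:           # clearly on the good side
            return best_h
    return best_h
```

Note the constant 16 in test_size is just one choice of Chernoff constant;
the point is only the O((1/epsilon) log(N/delta)) dependence.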
ISSUE #2: DO WE NEED TO SAY "FOR ALL EPSILON"?
=============================================
Def 3: Let's say that alg A WEAK-LEARNS class C if for all c in C, all
distributions D, THERE EXISTS epsilon, delta > 1/poly(n) s.t. A
achieves error at most 1/2 - epsilon with prob at least delta.
I.e., with some noticeable probability you get noticeable correlation.
Question: suppose we defined the PAC model this way. Does this change
the notion of what is learnable and what is not?
Answer: it doesn't. Given an algorithm that satisfies Def3, we can
"boost" it to an algorithm that satisfies Def1.
Note: we can handle delta as before. So, we'll ignore delta and
assume that each time, we get a hypothesis of error <= 1/2 - epsilon.
Boosting: preliminaries
-----------------------
1. We're going to have to use the fact that algorithm A works (in
this weak way) over *every* distribution. If you look at the
"distribution specific" version of this question, the answer is NO.
E.g., say your target function was "x is positive if its first bit
equals 1 or if x satisfies some hard crypto property (e.g., viewed
as an integer, x is a quadratic residue mod some large N)". Then,
it's easy to get error 25% over uniform random examples: you
predict positive if x1=1 and whatever you want otherwise. But you
can't do any better. The problem is that there is this hard core
to the target function.
Boosting shows this situation is in a sense universal. If you have
some arbitrary learning algorithm that supposedly works well, we'll
view it as a black box and either boost up its accuracy as much as we
like, or else we'll find a distribution D where that algorithm gets
error > 1/2 - epsilon.
2. It will make things a little simpler to imagine we start by
drawing some large sample S, and then do everything over this
sample. Let's assume S is big enough so that we have uniform
convergence over the hypothesis class that A is using so we can
roughly equate performance on S with performance on D. We'll actually
need it to be a bit bigger because we will be considering majority
votes over functions returned by A.
An easy case: one-sided error
-----------------------------
Just to get a feel of a simple case, imagine our weak-learning
algorithm has the property that given a sample S it either gets all
the + points correct and at least an epsilon fraction of the - points
correct, or vice versa. In this case we can view weak hypotheses as
either making a prediction or saying "not sure", and we are asking
that they not say "not sure" too often.
Example: imagine our data lies in 2-D. Natural thing is to see if
we can find a line with one side pure (all + or all -). Let's say
our learning problem is such that we can always find a line with at
least an epsilon fraction of remaining points from one of the
classes on the pure side.
Then we can do boosting by creating a decision list. We first run A
to find h_1, and put it as the top rule in the DL. Let S_2 be points in S
where h_1 said "not sure". Then run A again on S_2 and so on.
Finally, when the leftover sample is pure enough, we just make a
default prediction.
Since each step crosses off at least an epsilon fraction of one of the
classes, we only need to run O((1/epsilon) log(1/epsilon')) times to
get error down to epsilon'.
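The decision-list construction above can be sketched as follows. The
weak-learner interface here (a function returning a pure rule plus its
label, or None when it has nothing left to contribute) is illustrative,
not from the notes:

```python
from collections import Counter

def boost_one_sided(weak_learner, S):
    """Decision-list boosting for a one-sided weak learner (sketch).

    weak_learner(sample) returns (rule, label), where rule(x) is True
    on a pure subset of `sample` (all of class `label`) covering at
    least an epsilon fraction of one class, or None if stuck.
    S is a list of (x, y) labeled examples.
    """
    dlist, remaining = [], list(S)
    while remaining and len({y for _, y in remaining}) > 1:
        step = weak_learner(remaining)
        if step is None:
            break
        rule, label = step
        dlist.append((rule, label))
        # keep only the points the new rule says "not sure" about
        remaining = [(x, y) for x, y in remaining if not rule(x)]
    # default prediction: majority label of whatever is left
    default = (Counter(y for _, y in remaining).most_common(1)[0][0]
               if remaining else None)
    def predict(x):
        for rule, label in dlist:
            if rule(x):
                return label
        return default
    return predict
```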
Note: in practice, you'd stop when the new rules are no longer
statistically significant given the sample. Really, in practice you
don't know in advance if your weak learner will succeed or fail: you
just keep running on the new datasets produced until you get to a
state where your weak learner no longer has anything to contribute.
Basic idea here: focus on data where previous hyp had trouble. Force
next one to learn something *new*.
The general case: 2-sided error
===============================
Now let's go to the general case of boosting hypotheses that make
2-sided error. We'll first talk about Schapire's original method (the
one described in the textbook) and then in the next class we'll talk
about Adaboost (by Freund and Schapire) that does this more
efficiently and gives a more practical algorithm in the end. Both
have the same basic principle: feed in a series of distributions (or
weighted datasets) based on the hyps so far. Do this in a way that
forces the alg to give us some *new* info.
To reduce notation, let's fix some weak error rate p. Assume that we
have an algorithm A that over any distribution will produce a hyp of
error at most p with high probability. Say we want to boost it up a bit
(which we can then apply recursively).
We start by running A on the uniform distribution over S, and getting
hypothesis h_1 of error rate at most p. Now we want a new distribution.
One try: only look at examples in S where h_1 predicts incorrectly and
try to learn over that. This DOESN'T work. Why? (E.g., on that set
the complement of h_1 has error 0, so A could satisfy its guarantee by
returning NOT(h_1), which tells us nothing new, and there's no good
way to combine the two.)
Instead: we want to create a distribution where h_1 behaves like random
guessing. Specifically, let S_C be the set of examples in S on
which h_1 predicts correctly and S_I be the set on which h_1
predicts incorrectly. What we want is a distribution D_2 where
S_C and S_I both have equal weight. We can do that by creating a
non-uniform distribution over S, where each example in S_C gets
probability 1/(2|S_C|) and each example in S_I gets probability
1/(2|S_I|).
Let's assume the error rate of h_1 was exactly p, for convenience. So, |S_C|
= (1-p)|S| and |S_I| = p|S|. Let D_1[x] be the original probability
on x (i.e., 1/|S|). Then if x in S_C, D_2[x] = 1/(2|S_C|) = D_1[x]/(2(1-p)).
If x in S_I then D_2[x] = D_1[x]/(2p).
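This reweighting is easy to write down. A quick illustrative sketch
(function name and list-based representation are assumptions, not from
the notes):

```python
def reweight(weights, correct):
    """Given current example weights D_1 (summing to 1) and a boolean
    list saying where h_1 predicts correctly, return D_2 putting half
    its total mass on S_C and half on S_I."""
    w_correct = sum(w for w, c in zip(weights, correct) if c)
    w_incorrect = 1.0 - w_correct
    # x in S_C: D_2[x] = D_1[x] / (2(1-p));  x in S_I: D_2[x] = D_1[x] / (2p)
    return [w / (2 * w_correct) if c else w / (2 * w_incorrect)
            for w, c in zip(weights, correct)]
```

E.g., with 4 equally weighted points and h_1 wrong on one of them
(p = 1/4), each correct point gets weight 1/6 and the incorrect one
gets weight 1/2.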
Now we get a hypothesis h_2 with accuracy 1-p on D_2.
Finally, let's feed into A the examples in S where h_1(x) != h_2(x).
(i.e., the uniform distrib over those). We then get hypothesis h_3
with accuracy 1-p over this set.
Then we put the three together by taking majority vote. Or, can think
of it this way: if h_1(x)=h_2(x), then go with that, else go with h_3.
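That these two views of the combined predictor agree can be checked
exhaustively over {0,1} predictions (illustrative code, not from the
notes):

```python
from itertools import product

def combined(h1, h2, h3):
    """If h_1 and h_2 agree, go with them; else defer to h_3."""
    return h1 if h1 == h2 else h3

def majority(h1, h2, h3):
    """Straight majority vote over three {0,1} predictions."""
    return 1 if h1 + h2 + h3 >= 2 else 0

# the two are the same function on all 8 prediction triples:
for a, b, c in product([0, 1], repeat=3):
    assert combined(a, b, c) == majority(a, b, c)
```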
What is the best we could reasonably hope for? Suppose h_1, h_2, h_3
were all independent predictors of error rate p. Then the error rate
of the majority vote would be p^3 + 3p^2(1-p) = 3p^2 - 2p^3. E.g., if
p=1/4 then this drops to 5/32. We will show the above scheme in fact
meets this bound. Then can apply recursively to drive error down even
further.
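A quick numeric check of the independent-predictors benchmark
(illustrative code, not from the notes): the majority errs when all
three predictors err, or exactly two do.

```python
def independent_majority_error(p):
    """Error of a majority vote over three independent predictors,
    each wrong with probability p."""
    return p**3 + 3 * p**2 * (1 - p)   # = 3p^2 - 2p^3
```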
Analysis of this scheme
---------------------------
What's the accuracy of the new hypothesis?
Easiest way is to draw a picture. Divide S into 4 regions: S_CC where
both h1 and h2 are correct, S_CI where h1 is correct and h2 is
incorrect, S_IC which is the reverse, and S_II where both are
incorrect. Let's use p_CC, p_CI, p_IC, p_II as their respective
probabilities.
We predict correctly on S_CC, incorrectly on S_II and get 1-p fraction
correct on S_CI U S_IC. So, our error rate is p(p_CI + p_IC) + p_II.
Let's define gamma = D_2[S_CI].
So, p_CI = 2(1-p)gamma.
By defn of h_2 (its error on D_2 is p, and it errs exactly on S_CI and
S_II), D_2[S_II] = p - gamma.
So, p_II = 2p(p-gamma).
By construction of D_2, D_2[S_I] = 1/2, so D_2[S_IC] = 1/2 - (p-gamma).
So, p_IC = 2p(1/2 - (p-gamma)).
Putting this all together, our error rate is:
p(2(1-p)gamma + 2p(1/2 - p + gamma)) + 2p(p-gamma)
= 3p^2 - 2p^3. QED
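The algebra can be double-checked numerically; note the gamma terms
cancel, so the answer is the same for every legal gamma in [0, p]
(illustrative code, not from the notes):

```python
def boosted_error(p, gamma):
    """Error of the combined hypothesis in terms of the weak error p
    and gamma = D_2[S_CI], using the region masses derived above."""
    p_CI = 2 * (1 - p) * gamma
    p_II = 2 * p * (p - gamma)
    p_IC = 2 * p * (0.5 - (p - gamma))
    return p * (p_CI + p_IC) + p_II
```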
So, what the reweighting is doing is in a sense forcing the hyps to be
as useful as if independent.
Or can look at it this way. Let bias = accuracy - error = 1 - 2*error.
Then new bias = (old bias) + 2(old bias)(old accuracy)(old error)
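A quick check that this bias recurrence agrees with the error formula
3p^2 - 2p^3 (illustrative code, not from the notes):

```python
def new_bias(b):
    """One round of the three-hypothesis scheme, in bias terms:
    bias = 1 - 2*error, so accuracy = (1+b)/2 and error = (1-b)/2."""
    return b + 2 * b * ((1 + b) / 2) * ((1 - b) / 2)
```

E.g., p = 1/4 gives old bias 1/2 and new bias 1 - 2(5/32) = 11/16.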