15-859(B) Machine Learning Theory 02/03/14
* Recap of growth-function upper-bounds
* Basic definitional questions in the PAC model
* Weak vs Strong learning. Boosting
========================================================================
[Start with recap from last time -- see previous notes]
Today: take a look at some basic questions about the PAC model. This
will lead us into our next topic: boosting.
We defined the notion of PAC-learnability like this (prediction
version where not restricting form of hypothesis):
Defn1: Alg A PAC-learns concept class C if for any target in C, any
distribution D, any epsilon, delta > 0, with probability 1-delta, A
produces a poly-time evaluatable hypothesis h with error at most
epsilon.
Say C is PAC-learnable if exists A that does this with # examples and
running time polynomial in relevant parameters.
ISSUE #1: DO WE NEED TO SAY "FOR ALL DELTA"?
===========================================
On your homework assignment, you will show that if we change the definition
to say delta = 1/2, then this doesn't affect what is learnable. In
fact, could even change
"for any delta > 0, succeeds with probability at least 1-delta"
to
"for some delta' > 0, succeeds with probability at least delta'"
Basically, given an algorithm satisfying this second definition, we
could just run it N = (1/delta')log(2/delta) times on fresh data each
time, so that whp at least one was successful, and then test the
hypotheses on a new test set of size O((1/epsilon) log(N/delta)).
So, delta not so important. Can fix to 1/2 or even 1 - 1/n^c and still
get same notion of polynomial "learnability".
ISSUE #2: DO WE NEED TO SAY "FOR ALL EPSILON"?
=============================================
Def 2: Let's say that alg A WEAK-LEARNS class C if for all c in C, all
distributions D, THERE EXISTS gamma, delta' > 1/poly(n) s.t. A
achieves error at most 1/2 - gamma with prob at least delta'.
I.e., with some noticeable probability you get noticeable correlation.
[using "gamma" to denote the "gap"]
Question: suppose we defined PAC model this way, does this change the
notion of what is learnable and what is not?
Answer, it doesn't. Given an algorithm that satisfies Def2, can
"boost" it to an algorithm that satisfies Def1.
Note: we can handle delta as before. So, we'll ignore delta and
assume that each time, we get a hypothesis of error <= 1/2 - gamma.
Boosting: preliminaries
-----------------------
We're going to have to use the fact that algorithm A works (in this
weak way) over *every* distribution. If you look at the "distribution
specific" version of this question, the answer is NO. E.g., say your
target function was "x is positive if its first bit equals 1 or if x
satisfies some hard crypto property (e.g., viewed as an integer, x is
a quadratic residue mod some large N). Then, it's easy to get error
25% over uniform random examples: you predict positive if x1=1 and
whatever you want otherwise. But you can't do any better. The
problem is that there is this hard core to the target function.
Boosting shows this situation is in a sense universal. If you have
some arbitrary learning algorithm that supposedly works well, we'll
view it as a black box and either boost up its accuracy as much as we
like, or else we'll find a distribution D where that algorithm gets
error > 1/2 - gamma.
An easy case: algorithms that know when they don't know
-------------------------------------------------------
Just to get a feel of a simple case, imagine our weak-learning
algorithm produces a hypothesis that on any given example x either
makes a prediction or says "not sure". Say it achieves the standard
PAC guarantee on the prediction region (when it makes a prediction it
has error rate at most epsilon), and furthermore it says "not sure" at
most a 1-epsilon' fraction of the time (it's trivial to produce a
hypothesis that always says "not sure"). So, if we replaced "not
sure" with flipping a coin, the error rate would be at most
1/2(1-epsilon') + epsilon*epsilon' <= 1/2 - gamma for
gamma=epsilon'(1/2-epsilon). But we have more information since the
algorithm "knows when it doesn't know".
Here we can do boosting by creating a decision list. We first
run A on D to find h_1, and put as top rule in the DL. We run A on D
subject to the constraint that h_1 says "not sure" (i.e., when A asks
for an example, we repeatedly draw from D until we get one where h_1
says "not sure") and get h_2. We then run A on D conditioned on both
h_1 and h_2 outputting "not sure" and get h_3. Say our goal is to
achieve error 2*epsilon. Then we can stop when the probability mass
on the unknown region reaches epsilon. Since each run chops off an epsilon'
fraction of the distribution remaining, after N runs, the probability
mass left is at most (1-epsilon')^N which we can set to epsilon.
Putting this together, only need to run O((1/epsilon') log(1/epsilon))
times to get error down to 2*epsilon.
Basic idea here: focus on data where previous hyp had trouble. Force
next one to learn something *new*.
The general case: 2-sided error
===============================
Now let's go to the general case of boosting hypotheses that make
2-sided error. We'll first talk about Schapire's original method (the
one described in the textbook) and then in the next class we'll talk
about Adaboost (by Freund and Schapire) that does this more
efficiently and gives a much more practical algorithm in the end.
Both have the same basic principle: feed in a series of distributions
(or weighted datasets) based on the hyps so far. Do in a way that
forces alg to give us some *new* info.
To reduce notation, let's fix some weak error rate p. Assume that we
have an algorithm A that over any distribution will produce a hyp of
error at most p with high probability. In fact, let's for convenience
imagine that A always produces hypotheses of error rate *exactly* p
(this will just simplify notation...). Say we want to boost it up a
bit, which we can then apply recursively.
We start by running A on the original distribution D and getting a
hypothesis h_1 of error rate at most p. Now we want a new distribution.
One try: Filter D to only let through examples where h_1 predicts
incorrectly and try to learn over that. This DOESN'T work. Why?
Instead: we filter to create a distribution where h_1 behaves like random
guessing. Specifically, let S_C be the set of examples on
which h_1 predicts correctly and S_I be the set on which h_1
predicts incorrectly. What we want is a distribution D_2 where
S_C and S_I both have equal weight, but besides that is just like
D in that Pr_{D_2}(x | x in S_C) = Pr_{D}(x | x in S_C)
and the same for S_I. For example, we can construct D_2 by
flipping a coin and if heads we wait until we see an
example where h_1 is correct, and if tails we wait until we
see one where h_1 is incorrect. Or we can draw examples from D
and place them into two queues, one for S_C and one for S_I, and
then to get an example from D_2 we randomly choose a queue to select
from.
Let's write D_2[x] as a function of D_1[x] (this is the probability
associated with x, or the density function if we're in a continuous
space). To create D_2 we reduced the probability on S_C from 1-p to
0.5, so if x in S_C then D_2[x] = (0.5/(1-p))D[x]. We increased the
probability on S_I from p to 0.5 so if x in S_I then D_2[x] =
(0.5/p)D[x].
Now we get a hypothesis h_2 with error p on D_2.
Finally, we filter D only allowing examples x where h_1(x) != h_2(x).
We then get hypothesis h_3 with error p on this distribution D_3.
Then we put the three together by taking majority vote. Or, can think
of it this way: if h_1(x)=h_2(x), then go with that, else go with h_3.
What is the best we could reasonably hope for? Suppose h_1, h_2, h_3
were all independent predictors of error rate p. Then the error rate
of the majority vote would be p^3 + 3p^2(1-p) = 3p^2 - 2p^3. E.g., if
p=1/4 then this drops to 5/32. We will show the above scheme in fact
meets this bound exactly. Then we can apply recursively to drive the
error down even further.
Analysis of this scheme
---------------------------
What's the accuracy of the new hypothesis?
Easiest way is to draw a picture. Divide space into 4 regions: S_CC where
both h1 and h2 are correct, S_CI where h1 is correct and h2 is
incorrect, S_IC which is the reverse, and S_II where both are
incorrect. Let's use p_CC, p_CI, p_IC, p_II as their respective
probabilities under D.
We predict correctly on S_CC, incorrectly on S_II and get 1-p fraction
correct on S_CI U S_IC. So, our error rate is p(p_CI + p_IC) + p_II.
Let's define alpha = D_2[S_CI].
So, p_CI = 2(1-p)alpha.
By defn of h2, D_2[S_II] = p-alpha.
So, p_II = 2p(p-alpha).
By construction of D_2, D_2[S_IC] = 1/2 - (p-alpha).
So, p_IC = 2p(1/2 - (p-alpha)).
Putting this all together, our error rate is:
p(2(1-p)alpha + 2p(1/2 - p + alpha))) + 2p(p-alpha)
= 3p^2 - 2p^3. QED
So, what the reweighting is doing is in a sense forcing the hyps to be
as useful as if they were independent.
==============================================
If we rewrite p as p = 1/2 - gamma (so gamma is the "edge" over random
guessing) we get a new error of 1/2 - gamma[3/2 - 2*gamma^2]. (so you
can see the new edge is bigger)
Now, if we want to drive the error down to epsilon we can do this
recursively, creating a ternary tree. Let's look at how large a tree
this requires. Will just do a rough calculation since will soon look
at a more efficient method next time. First, for gamma < 1/4,
plugging into the above we get that at each iteration new-gamma >=
old-gamma*(1 + 3/8). So, this means that to reach error rate 1/4 we
just need O(log 1/gamma) levels of recursion.
Now, once our error rate p is < 1/4, we can use the fact that after each
iteration, we get error rate at most 3p^2. error rate after k
iterations at most 3^{2^k-1}*(1/4)^{2^k} < (3/4)^{2^k}. So, to get
error rate down to epsilon only need O(loglog(1/epsilon)) more iterations.
So, the depth of the tree is O(log(1/gamma) + loglog(1/epsilon)).
This means the *size* of the tree (3^depth) is poly(1/gamma, log(1/epsilon)).
Lastly, filtering blows up sample size by additional O(1/epsilon) factor.