15-859(A) Machine Learning Theory 01/27/04
* PAC model
* some basic results.
* Occam's razor
* relations between PAC and MB [if time]
========================================================================
One problem with the mistake-bound model is that you can't answer
questions like: "how many examples do you need to converge to a good
hypothesis?" That's because the MB model doesn't make any assumptions
about the order of examples. Of course, this is a strength of the
model too! To address this kind of question, we now go on to the PAC
model.
The PAC model
-------------
PAC model is a model for "batch" learning: you train your algorithm on
some fixed set of data, and then deploy it in the real world. To do
this, we'll need to have some relation between the training set and
the world.
The basic idea of the PAC model is to assume that examples are being
provided from a fixed (but perhaps unknown) distribution over the
instance space. The assumption of a fixed distribution gives us hope
that what we learn based on some training data will carry over to new
test data we haven't seen yet. Also, it provides us a well-defined
notion of the error of a hypothesis with respect to target concept.
Definition: Given a distribution D over examples, the *error* of a
hypothesis h with respect to a target concept c is Pr_{x in D}[h(x) != c(x)].
Suppose we are seeing examples labeled according to some target
concept in class C. What kind of guarantee could we hope to make?
Two issues:
1) can't hope to get target exactly since distribution might place
low weight on some part of instance space. So, instead
goal is to approximate the target.
2) can't necessarily guarantee to approximate target since we might be
unlucky and get a data sample that doesn't accurately reflect the
distribution.
So, our goal will be to get an approximation with high probability.
-> Probably Approximately Correct (PAC) model.
Setting: We have a button we can press: gives us an example from D,
labeled according to target function c. Goal is after not too many
button-presses (and not too much running time) to produce a hypothesis
that (whp) is close to the target.
Definition:
An algorithm A PAC-learns concept class C by hypothesis class H
if for any c in C, any distribution D over the instance space, any
epsilon, delta > 0, the algorithm with probability 1-delta produces a
hypothesis h in H that has error at most epsilon. Furthermore, we
would like the total number of examples and running time to be
poly(n, 1/epsilon, 1/delta, size(c)).
[size(c)= # bits it takes to describe c in whatever description
language we are using.]
epsilon is usually called the accuracy parameter and delta is called
the confidence parameter. A hypothesis with error at most epsilon is
often called ``epsilon-good.''
This definition allows us to make statements such as: ``the class of
k-term DNF formulas is learnable by the hypothesis class of k-CNF
formulas.''
Remark 1: If we require H = C, then this is sometimes called
``proper PAC learning''. If we allow H to be the class of
polynomial time programs (i.e., we don't care what representation the
learner uses so long as it can predict well) then this is typically
called ``PAC prediction''. I will usually say: ``concept class C is
PAC-learnable'' to mean that C is learnable in the PAC-prediction
sense.
Remark 2: One nice extension of this model is instead of requiring the
error of h be <= epsilon, to just require that the error be at most
1/2 - epsilon. This is called *weak-learning* and we will talk more
about this later.
Remark 3: Another nice extension is to the case where H is not
necessarily a superset of C. In this case, let epsilon_H be the least
possible error using hypotheses from H. Now, we relax the goal to
having the error of h be at most epsilon + epsilon_H. If we let C be
the set of all concepts (and we remove ``size(c)'' from the set of
parameters we are allowed to be polynomial in), then this is often
called the *agnostic* model of learning: we simply want to find the
(approximately) best h in H we can, without any prior assumptions on
the target concept. This is like what we got with Weighted-Majority
for H = {single variable concepts}. A lot harder to get this
guarantee efficiently for interesting H's.
A simple example
----------------
Let's show that the simple algorithm for learning Decision Lists in the
consistency model is a PAC learning algorithm for this class. There
are many ways of showing this fact. Here is one particularly nice way.
1. Consider some specific ``bad'' DL whose error is at least epsilon.
The probability that this bad DL is consistent with m examples drawn
from D is at most (1-epsilon)^m.
2. How many different DLs are there over {0,1}^n? Hard to count
exactly, but what's a reasonable upper bound? Remember that each rule
in the DL looks like:
"if then <+ or ->"
- How about (4n)!. Or can be a little tighter with n! * 4^n by
using the fact that you never need to have two rules with the same
variable and can always assume the DL ends with "else -".
3. (1) and (2) imply that given m examples drawn from D, the
probability there *exists* a bad DL consistent with all of them is at
most (4n)!(1-epsilon)^m. The "4n!" looks like trouble, but notice
that the other term drops exponentially with m. So, lets set this to
<= delta and solve for m. This will imply that with probability
1-delta there are *no* consistent DLs whose error is more than
epsilon. Since our algorithm finds a consistent DL whenever one
exists (and we are assuming one exists here), this means that with
probability at least 1-delta our algorithm finds a hypothesis with
error at most epsilon.
We want: (1-epsilon)^m <= delta/(4n)!
Now, (1-epsilon)^m <= e^{-epsilon * m}. That's because
(1-epsilon)^{1/epsilon} <= 1/e [e.g., (1/2)^2, (99/100)^100]
[equivalently, ln(1-epsilon) <= -epsilon]
so, it suffices to have:
(1/e)^{epsilon*m} <= delta/(4n)!, or equivalently,
e^{epsilon*m} >= (4n)! / delta.
Taking logs,
m >= (1/epsilon)[ln(4n!) + ln(1/delta)].
Notice that even though 4n! is a big number, ln(4n!) is just O(n log
n). Since this is polynomial in our parameters (and the algorithm runs
in poly time) we have a PAC-learning algorithm. qed
Relating Consistency and the PAC model
---------------------------------------
Let's go over analysis.
1. prob a bad hypothesis is consistent with m examples is at most
(1-epsilon)^m
2. So, prob exists a bad consistent hypothesis is at most
|H|(1-epsilon)^m
3. Set to delta, solve to get number examples needed at most
1/epsilon[ln(|H|) + ln(1/delta)]
This gives us a way of relating consistency with PAC model.
Theorem: Let A be an algorithm that learns class C by H in the consistency
model (i.e., it finds a consistent h in H whenever one exists).
Then A needs only (1/epsilon)(ln|H| + ln(1/delta)) examples to learn C
(by H) in the PAC model. Therefore, A is a PAC-learning algorithm so
long as ln|H| is polynomial in size(c), and n.
The above quantity is polynomial for conjunctions, k-CNF, and
k-DNF. It is not polynomial for general DNF. Explains why our
"memorize the positive examples seen" DNF algorithm didn't feel right,
but we felt more confident in the conjunctions algorithm.
=============================================================
Occam's razor
=============
Here's a nice way to look at this in terms of description languages.
Say you have some arbitrary description language for hypotheses --
the only requirement is that at most 2^s concepts can have size < s
(in bits).
Theorem: For any description language, given $m$ examples, with
probability > 1-delta there are no bad (error > epsilon) consistent
hypotheses of size less than s, so long as:
m > 1/epsilon [s*ln(2) + ln(1/delta)]
or
s < 1/ln(2) [epsilon*m - ln(1/delta)]
Proof: just replacing |H| with 2^s.
Interesting relation to philosophical notion of "Occam's razor".
The following statement is attributed to William of Occam ( 1320 AD):
``Entities should not be multiplied unnecessarily'' (in Latin)
which is generally interpreted to mean "don't make unnecessarily
complex explanations".
In our context, can see one justification: simple consistent
explanations are trustworthy. Why? Because there aren't too many of
them. On the other hand, there are lots of complex explanations so
it's easier to get one of them to fit the data just by chance.
Also, another nice thing about this theorem is it's very convenient
for proving PAC-learning. E.g., how many bits does it take to
describe a Decision List? Can do with O(n log n) bits by using O(log
n) bits per rule. So log|H| = O(n log n). The point here is that
thinking in terms of "how many bits would it take me to describe one?"
is often a convenient way of counting how many there are.
-------------------------------------------------------
One potentially subtle issue about the above theorem. We can fairly
interpret the theorem as saying "if m is sufficiently large, whp all
consistent simple hypotheses will be good" (where "simple" means size
< s). But, we can't say "whp all consistent simple hypotheses will be
good *conditioned* on the event that a simple hypothesis is consistent
with the data". Because maybe there is no good simple rule. E.g., a
bad experiment would be "get some data, see if it's consistent with a
short disjunction; if not then throw out the data and keep repeating
until by luck the data happens to be so unrepresentative as to have a
consistent short disjunction."
--------------------------------------------------------
Here is an interesting extension of the above theorem.
Theorem: Suppose we have an algorithm that continues to request
examples until it finds a hypothesis such that (1) the hypothesis is
consistent with the data, and (2) the size s of the hypothesis satisfies
s <= 1/(ln 2) * [epsilon * m - ln(m(m+1)/delta)]
where m is there number of examples it has seen so far. Then, with
probability 1 - \delta, this algorithm will never output a
hypothesis with error > epsilon.
Proof: Consider a fixed value of m. The chance of any small,
``bad'' hypotheses being consistent with the data so far is only
delta/(m(m+1)), by our previous analysis. Since
sum 1/m(m+1) = 1,
m
the probability that we'll ever output a bad hypothesis is less than delta.
---------------------------------------------------------
What's nice about this is it gives a justification for using Occam's
razor, even if we each have different notions of what's simpler than
what. No matter what our description language is, so long as we
follow the above guidelines for how much data is needed to justify a
hypothesis, whp we will never be fooled. Of course, we can still talk
of some description languages being more useful than others: a useful
one will have a good hypothesis that can be described with a small
number of bits.
=============================================================================
[probably won't get to this today]
Relating PAC and Mistake-Bound models:
PAC model should be easier than MB model since we are restricting
examples to be coming from a distribution. Can make this formal:
Theorem: Say we have an algorithm for learning C in MB model with some
mistake-bound M. Then we can PAC-learn using a training set of size
O(M/epsilon * log(M/delta)).
Proof: (think of Winnow algorithm)
Assume MB alg is "conservative". Run it on the data set.
Look at sequence of hypotheses produced: h_1, h_2,...
For each one, if consistent with following 1/epsilon * log(M/delta)
examples, then stop.
If h_i has error > epsilon, the chance we stopped was at most delta/M.
So there's at most a delta chance we are fooled by *any* of the hypotheses.
You can actually get a better bound of
O(M/epsilon + (1/epsilon)log(1/delta))
using the fact that *most* bad hypotheses will not take fully
1/epsilon*log(1/delta) time to be discovered. We'll get back to this
once we talk about Chernoff bounds.