15-681 Machine Learning.  Lecture 9/5
o Recap: The PAC model, intervals.
o PAC and consistency, conjunctions
o decision lists
o Occam's razor
=========================================================================
Note:  web page has all handouts, plus lecture notes, slides, etc.
Note:  See Terri in Wean 4117 if want to reg or reg'd late. 

***The PAC model***

Assume seeing data from some probability distribution D over instance
space X.  Assume labeled by some target concept c in C.

Defn: Learner learns class C (by H) in PAC model if for any target
concept c in C, any distribution D, any epsilon,delta > 0:
	with probability > 1-delta it finds a hypothesis h in H
	with error < epsilon.

Allowed: Time, examples poly(n,size(c),1/epsilon,1/delta).


Example from last time: learning an initial subinterval.

C = { [0,a]: a > 0}
target concept is some specific interval.

        _________________        ____       _____
  _____/                 \______/    \     /     \
 /                                    \___/       \____
|-----------------|----------|-------------------------|
              a_epsilon      a

Algorithm:  produce hypothesis [0,b], where b is the largest positive
example seen.

Analyze in PAC model.

Clearly poly time per example.  Question: how many examples?

Idea:  define a_epsilon as the number less than *a* such that the interval
[a_epsilon, a] has probability epsilon.  Notice that we succeed
exactly when we have seen some example in this range.

Say we see M examples.

Want Prob(failure) < delta.

     Prob(seen no examples in [a_epsilon, a]) < delta.

     (1-epsilon)^M  < delta.  

	[Note: for a given example, it's in range with prob epsilon,
	so not in range with prob 1-epsilon.  Want the AND of M independent
	events, so multiply them to get probability.]

     Solve for M.  We want:

	M ln(1-epsilon) < ln(delta), or  -M ln(1-epsilon) > ln(1/delta)

	M > ln(1/delta) / [-ln(1-epsilon)]

	Use: -ln(1-x) = x + x^2/2 + x^3/3..., so suffices to have:

	M > (1/epsilon) ln(1/delta).
     This is polynomial in 1/epsilon and 1/delta.

If we see this many examples, w.h.p we will have seen one in the
desired range (even though we don't know what the range is) and
therefore we will have succeeded.

Let's step back a bit.

Notes: 
1. H=C is saying we want to learn anything we can represent.

   Could make H = poly-time programs.  (i.e., we don't care what
   representation the learner uses so long as it can predict well) then
   this is typically called ``PAC prediction''. 

2. Could make H smaller than C.  define epsilon_H = error of best
   hypothesis in H.  Then define "good" as error only epsilon_H + epsilon.
   E.g., set C = all functions.  Goal is to find best h in H without
   assumptions.   

   E.g: say we don't assume anything about our data in the 1-d example
   above.  We might still want to find the "approximately best"
   subinterval.  In fact you can do that by taking a bunch of data,
   looking at interval [0,b] for all positive examples b, and taking the
   one with least error.  But, it takes a more sophisticated analysis to
   prove that this works.

3. Model is robust.  Say we fix epsilon=1/4, delta = 1/4.  Get same
   notion of learnability.  


***Conjunctions***

Remember last time we were asking "why should consistency mean
anything about predicting well on new examples?"

Let's look at our FIND-S algorithm for finding the most specific
conjunction, and try to prove that it succeeds in the PAC model.  In
fact, will show that any algorithm that finds a consistent conjunction
(and has polynomial running time per example) works.  Why find the
most specific?  Because it's something we can do efficiently.  Finding
the most general (i.e., smallest) conjunction is NP-hard.

Here's a nice way to prove it.

1. Consider some specific ``bad'' conjunction whose error is at least
epsilon.  The probability that this bad conjunction is consistent
with M examples drawn from D is at most (1-epsilon)^M.

2. How many conjunctions are there over n variables: 3^n.

3. (1) and (2) imply that given M examples drawn from D, the
probability THERE EXISTS a bad conjunction consistent with all of them
is at most: 3^n(1-epsilon)^M. 

	Why?

	Prob(A or B) <= Prob(A) + Prob(B).

Let's think about this: Suppose that M is sufficiently large so that 

	3^n(1-epsilon)^M < delta.

That means that with probability (1-delta), all BAD conjunctions are
INCONSISTENT.  Therefore, our algorithm (which finds a consistent one)
succeeds.

Solving for M just like before:

	M ln(1-epsilon) + n ln(3) < ln(delta)

	M > (1/epsilon)[n ln(3) + ln(1/delta)]

We're polynomial in 1/epsilon, 1/delta, and n.


***PAC and CONSISTENCY***

Let's generalize from this one example....


THEOREM: Say we have an efficient (fast enough running time) algorithm
for learning class C in the consistency model.  Then we can learn C in
the PAC model, and we need at most

	(1/epsilon)[ ln|C| + ln(1/delta) ]
examples.


-> In PAC setting, consistency DOES imply predicting well on new data,
so long as class of concepts is restricted.  Gives an answer to why
the conjunction alg seemed more like it was learning than the DNF alg
that basically memorized the data. If you let C = DNF formulas, then
|C| = 2^2^n, so ln(C) is exponential in n.  This is saying you aren't
guaranteed success until you see (roughly) all the examples.


***Another example: Decision lists***

Example: If x_1 then positive else if x_3 then negative else if x_5 then
positive else negative.

How many decision lists are there?  Hard to count exactly but how
about an upper bound:

	2n (or 4n if allow negations) different rules.  So at most
	(4n)! different lists.  Sounds like a lot, but we get to take
	the log, so we're OK in PAC model.

How about finding a consistent one?

Is there a consistent decision list?

	1  0  0  1  1       +
	0  1  1  0  0       -
	1  1  1  0  0       +
	0  0  0  1  0       -
	1  1  0  1  1       +
	1  0  0  0  1       -

Algorithm: build from top down, putting down any consistent rule you
can that classifies at least 1 example.

Works because you never get stuck.  Look at examples you haven't
classified so far.  Each is classified by some rule in target
function.  Top one of these is a good legal rule to put down next.


***OCCAM's RAZOR***

William of Occam (approx 1320 AD) said:

	``Entities should not be multiplied unnecessarily'' (in Latin)

There are a bunch of ways you could interpret this but the simplest
one is: all things equal, prefer simpler explanations.

Let's try to justify this in the PAC model.

Let's write down all explanations from simplest on up:

	h_1, h_2, h_3, h_4,...

Let's define "size" or "complexity" of explanation h_i as number of
bits it takes me to tell it to you.  So, there are at most 2^s
hypotheses of size < s.  (Since there are at most 2^s different
strings of length < s).

Remember we were arguing that for a hypothesis class H, w.h.p. there are
no bad concepts consistent with the data so long as 

	|H|*(1-epsilon)^M < delta, or,  |H| < delta*e^{epsilon*M}

Let's apply to our setting.  What is the chance that there is a bad
consistent hypothesis of size < s?  Replace |H| by 2^s.  w.h.p. there
are no bad consistent hypotheses with size less than

	s < 1/ln(2) (epsilon*M - ln(1/delta))

What does this mean?

Can think of this as saying that simple (small size) hypotheses are
"safe" in the sense that it's unlikely one of them is bad and yet
looks consistent with the data.  And, the reason is that there aren't
that many simple hypotheses.  Therefore, Occam's razor is a good idea.

Really saying that ANY bias is fine (doesn't say that MY notion of
simpler is better than YOUR notion of simpler).  The only requirement
is that you specify the bias ahead of time.  Of course, there is no
guarantee that there EXISTS a consistent small hypothesis.  Our
statement could be satisfied by all small hypotheses being
inconsistent.  So, a good ordering is one in which you put the things
you think are more likely to be true first, to increase the chances
that there will in fact be a small consistent hypothesis.


**To end up**

Our bound 1/epsilon(ln(|C| + ln(1/delta)) is an UPPER BOUND.  It's a
guarantee, but you probably in practice will need many fewer examples.
In practice, you would train on some data, and then test on some new
data to see if in fact your error was small enough.  Where is
overcounting coming from:

1. assuming all bad hypotheses are sitting at just above epsilon
error.  In practice a lot of bad hypotheses are really bad and get
thrown out much sooner.

2. using prob(A or B) < prob(A) + prob(B).  If hypotheses are
correlated, then fact that one is inconsistent means the other is
probably inconsistent too.   E.g., there are an infinite number of
intervals [0,a] and yet we got a nice bound by a direct analysis.