Lecture notes for 2/2/98

* PAC model
* MB ==> PAC
* Occam's razor
-------------------------

Last time we introduced the PAC model.
 - assuming data is coming from a fixed distribution D.  This will let
   us make statistical statements about relation between performance on
   a training set and performance on new unseen data.

 - Like in MB model, want to make guarantees like "if data is
   consistent with a concept in C then the alg does well"

 - Instead of a mistake-bound, we'll ask: how much data do we need to
   guarantee that with prob > 1-delta, the alg finds a hypothesis with
   error < epsilon?

 - Also, more general question: if alg is using hypothesis class H,
   then how much data do we need to see so that all hypotheses have
   observed error near to true error?  Or, maybe just that the hypothesis
   with *best* observed error should have nearly best true error, etc.

 - In general, try for algs with running time poly(1/eps,1/delta, n, size(c))

----------------------------------------------

PAC model should be easier than MB model since restricting examples to
be coming from a distribution.  Can make this formal:

Theorem: Say we have an algorithm for learning C in MB model with some
mistake-bound M.  Then, in PAC model, can get hyp of error < epsilon,
(with confidence 1-delta) given a training set of size O(M/epsilon *
log(M/delta)). 

Proof:   (think of Winnow algorithm)

   Assume MB alg is "conservative".  Run it on the data set.

   Look at sequence of hypotheses produced: h_1, h_2,...

   For each one, if consistent with following X examples, then stop.

   What's X?
   if mistake bound M is known:   1/epsilon * log(M/delta).  
	If hyp had true error greater than epsilon, then with prob 1-delta/M
	it would have been discovered to be bad.
   So, the chance that *any* of our hypotheses fools us is at most 1-delta.

   What if M is not known?  In this case, for the ith hypothesis,
	can use: 1/epsilon * log(2i^2/delta).
   Chance we're ever fooled is sum_i delta/2i^2 < delta.

   Or, as Kamal pointed out, can run MB alg over the data you have and
   then pick the hypothesis that lasted the longest.  This is nicer
   since it doesn't put "epsilon" and "delta" into the algorithm itself.

Turns out, can get a better bound of O(1/epsilon [M + log(1/delta)])
using the fact that *most* bad hypotheses will not take fully
1/epsilon*log(1/delta) time to be discovered.  Roughly, the problem
is: we're flipping a coin of bias epsilon, and want to know how long
we need to flip so that with prob 1-delta we've seen M heads.  To get
this analysis to work, you use one portion of data to generate the
hypotheses and then test them on a separate portion of data.  Maybe
will get back to later.

--------------------------

Last time: if you see 

	m = 1/epsilon[ln(|H|) + ln(1/delta)]

examples, then whp all hyps in H of true error >= epsilon have
observed error greater than zero (at least 1 mistake). 

So, if we can solve consistency problem for H efficiently, then we can
learn H in PAC model, so long as ln(|H|) is reasonable (e.g., OK for
conjunctions, not OK for DNF).

What about decision lists?  How many are there?

----------------------------------------------

What if we want to talk about hypothesis spaces that have more and
less complex hypotheses in them?

Def: A {\bf legal size function} on concepts is one such
that at most $2^s$ concepts have size $s$.  E.g., size == description
length in bits, using your favorite language.

Theorem: Given $m$ examples, with probability > 1-delta there
are no bad (error > epsilon) consistent hypotheses of size less
than s, so long as:
	m >  1/epsilon [s*ln(2) + ln(1/delta)]
or
	s < 1/ln(2) [epsilon*m - ln(1/delta)]

Proof: follows simply by noticing that the total number of hypotheses
of size less than $s$ is at most $2^s$, and plugging in our earlier
bound.

Interesting relation to philosophical notion of "Occam's razor".  

The following statement is attributed to William of Occam ( 1320 AD):
	``Entities should not be multiplied unnecessarily'' (in Latin)
which is generally interpreted to mean "don't make unnecessarily
complex explanations".

In our context, can see one justification:  simple consistent
explanations are trustworthy.  Why?  Because there aren't too many of
them.  On the other hand, there are lots of complex explanations so
it's easier to get one of them to fit the data just by chance.

-------------------------------------------------------

A couple tricky issues regarding the above simple theorems:

  Suppose I repeatedly perform this experiment:  For some favorite
hypothesis space H (or hypothesis representation) I pick 10,000
random examples from some database and observe the frequencies of the
following events: 

	a. There is no consistent simple hypothesis.

	b. There is a consistent simple hypothesis, and it has low
		true error.

	c. There is a consistent simple hypothesis with high true error.

  Question: is Prob(c) < Prob(b)?
    I.e., are simple hypotheses good if we condition on the event that
  simple consistent hypotheses exist?  Answer: no

  What does the theorem say?  Theorem says Prob(c) is small.  But,
	Prob(b) could be even smaller or zero.

---------------------------------------------------------

Here is an interesting extension of the above theorem.

Theorem: Suppose we have an algorithm that continues to request
examples until it finds a hypothesis such that (1) the hypothesis is
consistent with the data, and (2) the size s of the hypothesis satisfies

	s <= 1/(ln 2) * [epsilon * m - ln(2m^2/delta)]

where m is there number of examples it has seen so far.  Then, with
probability 1 - \delta, this algorithm will never output a
hypothesis with error > epsilon.


Proof: Consider a fixed value of m.  The chance of any small,
``bad'' hypotheses being consistent with the data so far is only \delta' =
\frac{\delta}{2m^2}, by our previous analysis.  Since

	sum delta/(2m^2)    <=  delta
         m

the probability that we'll ever output a bad hypothesis is less than delta.

=============================================================================