Notes for 2/4/98

[current material is in the book.]

Last time:
  "Occam's razor theorem":  simple hypotheses are trustworthy since
not too many of them. Another way to think about it: if you can find a
simple enough hypothesis, then you can be confident with less data.

One nice way of stating this is given as Theorem 2.1 in the book:

Theorem: Suppose alg A, given m examples labeled according to c in C
finds a consistent hypothesis h in H such that
	size(h) < (n*size(c))^alpha * m^beta
where beta < 1.  Then A PAC learns C by H.

Proof is just that since RHS grows less than linear with m, it won't
be too long until size(h) is sufficiently smaller than m to apply our
theorem.

--> compression implies learning

Example:  learning a monotone disjunction with few relevant variables.
Goal: use fact that target has few relevants to get away with less data.

Standard alg, get m = 1/epsilon[log(2^n) + log(1/delta)], or about n/epsilon

We saw one way to take advantage of having few relevant variables using
Winnow, which had linear threshold hypotheses.  Analyzed in MB model.
Can convert to PAC using conversion described last time.

Another way: "Greedy set cover" algorithm.  Given a sample of m data
points, throw out all inconsistent variables.  But, instead of just
returning the OR of all of them, repeatedly do this:

  - pick the variable that covers the *most* positive examples, and
	put into h.

  - throw out the (positive) examples covered.

Idea: if we're lucky, we'll have found a consistent OR function with
not too many variables, so we don't need as much data.

Theorem: if the data is consistent with a disjunction of r variables,
then we will find a disjunction of at most r*log(m) variables.

Proof: each time we throw out at least a 1/r fraction of the positive
examples remaining.  So, after t steps, the number remaining is at
most m(1-1/r)^t.  This is less than 1 for 

	t = r*log(m).

To get a PAC guarantee, just need all bad "small" disjunctions to be
inconsistent.  One way to say this: let's say we describe a disjunction by
listing the variables in it.  Takes log(n) bits to describe a
variable.  So, size(h) = (# vars in h)*log(n).  
So, to learn in PAC model, suffices to see m examples for 

	m > 1/epsilon [r*log(m)*log(n)*ln(2) + ln(1/delta)]

This is a mess to calculate exactly but ignoring delta and loglog
terms this comes out to  

	O(1/epsilon [r*log(n)*log(r/epsilon)])


Is having too *many* choices of hypothesis necessarily bad?
----------------------------------------------------------

One way to look at our argument:  if H has N hypotheses of error
epsilon, then
	E[ # consistent ] = |H|(1-eps)^m. 
Now we note that if expected number of events is small (say 0.1) then  
Prob(at least 1 event occurs) must be small too.

Can we say that if expectation is large, then prob(at least 1) is large too?

Answer: it depends.  This is often not true.  Could be tiny
chance of (at least 1) but when it occurs you get a lot.  E.g, if you buy
100 lottery tickets then expected[#dollars won] > 1 but prob[>0] is
small.

BUT it is true when events are INDEPENDENT.  If INDEP, then
expectation = 1 implies reasonable prob of at least 1. 

E.g., say the events that the h in H are consistent with sample are
independent.  Then, if
	m = 1/epsilon * ln(N),
then for a fixed h in H, the probability it is consistent with data
is (1-epsilon)^m, which is roughly 1/N.  So, expectation is approx 1, and

	Pr[at least 1 consistent] ~ 1 - (1 - 1/N)^N ~  1 - 1/e

So, we have a chance about 1 - 1/e of being fooled.


VC dimension
------------
To get a better handle on our question, we need a way of measuring how
many "really different" hypotheses there are under consideration.
VC dimension, and the "effective number of hypotheses" are a way of
getting at this.  

First, another motivating problem:

Consider problem of learning an initial subinterval of [0,1].
Example is a real number in [0,1].
C = {[0,a] : a <= 1}

Here's an algorithm: pick the interval [0,b] where b is the largest
positive example seen so far.

Question: how much data does this need to achieve PAC guarantee?

Answer: consider the region [a',a] of probability measure espilon. If
we see an example inside that region, then we're done. Prob we fail
in m examples is (1-epsilon)^m.  ==> 1/epsilon * log(1/delta) are
sufficient. 


We're doing a lot better than formula that had log(|C|) in it. Why??
--> Only one "degree of freedom".  How much is a degree of freedom
worth?  Are they all worth the same?

--> Look at "effective number" of concepts as a function of number of
data points seen.  Only grows linearly....

Turns out, can use this "effective number" in place of actual number.
But, it's not obvious. 


Define: C[m] = max number of ways to split m points using concepts in C.
(book calls this PI_C(m))

Theorem: If number of examples seen m satisfies

	m > 2/epsilon * [log_2(2C[2m]) + log_2(1/delta)]

then whp (1-delta), all bad (error > epsilon) hypotheses in C are
inconsistent with data.

C[m] is sometimes hard to calculate exactly, but turns out can get a
pretty tight bound on it using something called "VC-dimension".

==> go to slides