Notes for 1/28/98

Today: 
 - learning with string-valued features (the infinite attribute model).
 - Intro to the PAC model.

String-valued features and the IA model
---------------------------------------
--> See on-line learning survey paper handed out.

The discussion so far has focused on learning over the instance space
{0,1}^n. I.e., examples have Boolean-valued attributes.
Another common setting is one in which the attributes are
string-valued; that is, X = (\Sigma^*)^n.  For instance, one
attribute might represent an object's color, another its texture, etc.
If the number of choices for each attribute is small, we can just
convert this to the Boolean case, for instance by letting ``x_1 =
red'' be a Boolean variable that is either true or false in any
given example.  However, if the number of choices for an attribute is
large or is unknown apriori, this conversion may blow up the number of
variables.  

Question: how does this change the learning problem?

Sometimes, this is called the Infinite Attribute model.  Here, we have
an infinite number of boolean features but each example has only n
that are "on".  Example described as the set of the on-features.  Goal
is to be polynomial in n.

Example: learning an OR function.  "x_1 = orange OR x_2 = smooth".

   - list-and-cross-off is not so easy.
   
   Idea: start off saying negative.  See 1st positive 
   	(red, smooth, large ,Wean4615) 
   and use this to initialize hypothesis. Note that at least one of these
   is important.
   
   What to do when make a mistake on a negative?
   
   What to do when make a mistake on a positive?
   
   Say there are r terms in the target.  How many mistakes at most?

Can also use Winnow.  Idea:

   Say we know r.  solve for smallest N >= n + 3r(log N + 1) + 2.  

	(e.g., N = n*r^2 is sufficient if r and n are large, since RHS
	 gives us O(n + rlog(nr)) < rn^2)

   Define y_1, ..., y_N as N boolean features.  What we'll do is
   assign these to terms like "x_1 = orange" as they come up in
   examples that winnow makes a mistake on.

   Say we see example: (red, smooth, large ,Wean4615).  Then we'll
   *temporarily* assign y_1 = "x_1 = red", etc.  Give winnow the
   example:
		1 1 1 1 0 0 ... 0

   If we make a mistake, we make this assignment permanent.  Otherwise
   we undo it (we forget we ever saw the example).

   Claim: What winnow sees is consistent with an OR of r boolean features.

   So, by its mistake bound and the defn of N, we won't ever run out
   of y's to assign.

   # mistakes at most 3r(log N + 1) + 2, which is O(rlog(nr)).


Note: decision list alg fails.  Not known if it is possible to learn
DLs in MB model in this setting.  (E.g., "if x_1 = orange then
positive, else if x_2 = rough then negative else ...")

-----------------------------------------------------------------

The PAC model
-------------
PAC model is a model for "batch" learning:  train algorithm on some
fixed set of data, and then test on the real world.  To do this, we'll
need to have some relation between the training set and the world.

The basic idea of the PAC model is to assume that examples are being
provided from a fixed (but perhaps unknown) distribution over the
instance space.   The assumption of a fixed distribution gives us hope
that what we learn based on some training data will carry over to new
test data we haven't seen yet.  Also, it provides us a well-defined
notion of the error of a hypothesis with respect to target concept.

Difference with MB model: MB got away without making distribution
assumptions by its online nature.  PAC is a little messier since
introducing this distribution, but allows us to get at important
questions like "how much data do I need to see to be confident of my
generalizations based on that data?"


Definition: Given a example distribution D, the *error* of a hypothesis h
with respect to a target concept c is Pr_{x in D}[h(x) != c(x)].

Suppose we are seeing examples labeled according to some target
concept in class C.  What kind of guarantee could we hope to make?

Two issues: 
 1) can't hope to get target exactly since distribution might place
	low weight on some part of instance space.  So, instead 
	goal is to approximate the target.
 2) can't necessarily guarantee to approximate target since might be
	unlucky and get a data sample that doesn't accurately reflect the
	distribution. 

So, goal is to get an approximation with high probability.

-> Probably Approximately Correct (PAC) model.

Definition:
An algorithm A PAC-learns concept class C by hypothesis class H
if for any c in C, any distribution D over the instance space, any
epsilon, delta >  0, A needs at most poly (n, 1/epsilon, 1/delta,
size(c) examples and running time to produce a hypothesis h in H that,
with probability 1-delta, has error at most epsilon.

The quantity epsilon is usually called the accuracy parameter and
delta is called the confidence parameter.  A hypothesis with error
at most epsilon is often called ``epsilon-good.''

This definition allows us to make statements such as: ``the class of
k-term DNF formulas is learnable by the hypothesis class of k-CNF
formulas.'' 

Remark 1:  If we require H = C, then this is typically called
``proper PAC learning''.  If we allow H to be the class of
polynomial time programs (i.e., we don't care what representation the
learner uses so long as it can predict well) then this is typically
called ``PAC prediction''.   I will usually say: ``concept class C is
PAC-learnable'' to mean that C is learnable in the PAC-prediction
sense.

Remark 2:  One nice extension of this model is instead of
requiring the error of h by at most epsilon to just require that
the error be at most 1/2 - epsilon.  This is called *weak-learning*
and we will talk more about this later. 

Remark 3: Another nice extension is to the case where H is
not necessarily a superset of C.  In this case, let epsilon_H be
the least possible error using hypotheses from H.  Now, we relax the
goal to having the error of h be at most epsilon + epsilon_H.
If we let C be the set of all concepts (and we remove ``size(c)''
from the set of parameters we are allowed to be polynomial in), then
this is often called the *agnostic* model of learning: we simply
want to find the (approximately) best h in H we can, without any
prior assumptions on the target concept.  This is like what we got
with Weighted-Majority for H = {single variable concepts}.  A lot
harder to get this guarantee efficiently for interesting H's.  

A simple example
----------------

Let's show that the simple algorithm for learning conjunctions in the
consistency model is a PAC learning algorithm for this class.  There
are many ways of showing this fact.  Here is one particularly nice way.

* Consider some specific ``bad'' conjunction whose error is at least
epsilon.  The probability that this bad conjunction is consistent
with m examples drawn from D is at most (1-epsilon)^m.

* Notice that there are (only) 3^n conjunctions over n variables. 

* (1) and (2) imply that given m examples drawn from D, the
probability there *exists* a bad conjunction consistent with all of
them is at most 3^n(1-epsilon)^m.  Suppose that m is sufficiently
large so that this quantity is at most delta.  That means that with
probability (1-delta) there are *no* consistent conjunctions whose
error is more than epsilon.  Since our algorithm finds a consistent
conjunction whenever one exists (and we are assuming one exists here),
this means that with probability at least 1-delta our algorithm finds
a hypothesis with error at most epsilon.

*  The final step is to calculate the value m needed to make
sure that it is polynomial in our parameters (from the description of
the algorithm, it is clear that its running time is polynomial so long
as m is).  We can do this as follows:

Want: 
	3^n(1-epsilon)^m <= delta.  
This is equivalent to
	m ln(1-epsilon) + nln(3) <= ln(delta)

which is equivalent to 
	m >= [nln(3) + ln(1/delta)]/[-ln(1 - epsilon)]

which (simplifying using -ln(1 - x) = x + x^2/2 + x^3/3 + ... > x) is
implied by

	m >= (1/epsilon)[nln(3) + ln(1/delta)].

Since this is polynomial in our parameters, we have a PAC-learning
algorithm. qed 


Relating Consistency and  the PAC model
---------------------------------------

Let's go over analysis.
1. prob a bad hypothesis is consistent with m examples is at most
	(1-epsilon)^m
2. So, prob exists a bad consistent hypothesis is at most
	|H|(1-epsilon)^m
3. Set to delta, solve to get number examples needed at most
	1/epsilon[ln(|H|) + ln(1/delta)]

This gives us a way of relating consistency with PAC model.

Theorem: Let A be an algorithm that learns class C by H in the consistency
model (i.e., it finds a consistent h in H whenever one exists).
Then A needs only (1/epsilon)(ln|H| + ln(1/delta)) examples to learn C
(by H) in the PAC model.  Therefore, A is a PAC-learning algorithm so
long as this quantity is polynomial in 1/epsilon, 1/delta, size(c), and n.

The above quantity is polynomial for conjunctions, k-CNF, and
k-DNF.  It is not polynomial for general DNF.  Explains why our DNF
algorithm didn't feel right, but we felt more confident in the
conjunctions algorithm.