Notes for Jan 14, 1998

Last time we discussed the "consistency model". We noticed some problems with it:

 problem: it is not clear that a positive result says anything about the
   ability to predict well on new examples.

 problem (or peculiarity): a negative result doesn't seem related to the
   inherent predictability of the concept class. It is more about the ease
   of finding an optimal representation.

The latter is sometimes what you want, though. E.g., neural nets: powerful
and useful; we train them with a local optimization rule (backprop /
gradient descent), which often finds a local minimum (a solution with
locally the smallest error) that is not a global minimum.

Basic question we might ask: can we come up with a more closed-form
solution, or a poly-time algorithm, to find the global optimum? To do this
we would need to solve the consistency problem. For basically any
interesting neural net, consistency turns out to be NP-hard ==> achieving
this guarantee is unlikely.

We'll give a proof for a simple network: an AND of two linear threshold
functions. I.e., a typical concept looks like:

  (v_1*x_1 + ... + v_n*x_n > v_0)  AND  (w_1*x_1 + ... + w_n*x_n > w_0)

So, given a set of data, the goal is to set the v_i's and w_i's to fit the
data, if possible.

Theorem: the consistency problem is NP-hard for this concept class.

Proof: reduce from set-splitting. In set-splitting you have n points and a
collection of subsets S_1, S_2, ..., S_m. Your goal is to color the points
red or blue so that no set is monochromatic. (Also called hypergraph
2-coloring.)

What we want to do is: given a set-splitting problem, construct a set of
data that has a consistent net if and only if the set-splitting problem
was solvable.

Idea: work in n-dimensional space. The origin is a positive example, and
there is one negative example at each neighbor of the origin (each unit
vector e_i), corresponding to the n points in the set-splitting problem.
Note that every negative example must be separated from the origin by at
least one of the two planes.

We want to set things up so that (A) given a nnet solution, the points
separated from the origin by plane 1 can be colored red, those separated
by plane 2 can be colored blue (and those separated by both can be colored
either way), and (B) given a set-splitting solution, we can find two
planes. Ideas?

Basically, for each set S_j we want to make sure its points can't all be
on the far side of the same plane, while still allowing every other split.
So, put a positive example for each S_j at the location with 1's in the
coordinates of the elements of the set.

nnet ==> split: color point i red if e_i is separated from the origin by
plane 1 and blue if by plane 2 (either color if both). To see that no set
is monochromatic, look at a fixed set S_j: say all of its points were
colored red, i.e., cut off by plane 1, defined by v . x > v_0. Since the
origin is positive we have 0 > v_0, and each i in S_j has v_i <= v_0, so
the positive example for S_j has sum_{i in S_j} v_i <= |S_j|*v_0 <= v_0,
meaning it too would be cut off by plane 1. Contradiction.

split ==> nnet: the red plane is w_1*x_1 + ... + w_n*x_n > -1/2, where
w_i = -1 if i is colored red and w_i = +n if i is colored blue (the blue
plane is symmetric). The origin and each S_j point satisfy both
inequalities (each S_j contains a blue point, so its sum is at least
n - (n-1) = 1), while each red e_i fails the red plane and each blue e_i
fails the blue plane.
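To make the construction concrete, here is a small Python sketch (my own
illustration, not from the lecture) of the "split ==> nnet" direction: it
builds the labeled examples from a set-splitting instance and checks that
the two planes defined above classify them correctly. The function names
and the tiny example instance are just placeholders.

    def build_dataset(n, sets):
        """Labeled examples from a set-splitting instance on points 0..n-1."""
        examples = [([0] * n, 1)]                       # the origin is positive
        for i in range(n):                              # each neighbor e_i of the origin is negative
            e = [0] * n
            e[i] = 1
            examples.append((e, 0))
        for S in sets:                                  # one positive example per set S_j,
            x = [1 if i in S else 0 for i in range(n)]  # with 1's on the elements of S_j
            examples.append((x, 1))
        return examples

    def planes_from_coloring(n, red):
        """Two halfspaces (w, threshold); the concept is the AND of (w.x > threshold)."""
        w_red  = [-1 if i in red else n for i in range(n)]  # cuts off the red e_i's
        w_blue = [n if i in red else -1 for i in range(n)]  # cuts off the blue e_i's
        return (w_red, -0.5), (w_blue, -0.5)

    def net(x, planes):
        return int(all(sum(wi * xi for wi, xi in zip(w, x)) > t for w, t in planes))

    # Tiny instance: points {0,1,2}, sets {0,1} and {1,2}; coloring red = {1} splits both sets.
    n, sets = 3, [{0, 1}, {1, 2}]
    planes = planes_from_coloring(n, red={1})
    assert all(net(x, planes) == label for x, label in build_dataset(n, sets))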
Mistake-bound model
-------------------

The consistency model had the problem that a positive result didn't
necessarily imply finding a rule with good "predictive power" in any
sense. The mistake-bound model addresses this problem by explicitly
modeling learning as an on-line process.

In the MB model, learning proceeds in stages. In each stage:

 1. The learner gets an unlabeled example.
 2. The learner predicts its classification.
 3. The learner is told the correct answer.

Goal: bound the total number of mistakes.

Definition: An algorithm A learns C in the mistake-bound model if, for any
sequence of examples consistent with some concept in C, the total number
of mistakes ever made by A is bounded by some polynomial in n and s, where

  n = the size of an example,
  s = the size (description length) of the simplest concept in C
      consistent with the data so far.

(s is only important if C contains concepts with description length more
than polynomial in n, like DNF formulas or decision trees.)

Definition: C is (efficiently) learnable in the mistake-bound model if
there exists an algorithm A that learns C, where the running time of A per
stage is also polynomial in n and s.

Conjunctions:

  Start: h = x_1 not(x_1) x_2 not(x_2) ... x_n not(x_n), the conjunction
    of all 2n literals.
  Mistake on a positive example: remove from h all literals not satisfied
    by the example.
  Mistake on a negative example: say "no consistent conjunction."

  Guarantee: {literals in target} is contained within {literals in h}, so
  a mistake on a negative example really does mean no conjunction is
  consistent.
  Max # of mistakes: n+1.
  (A code sketch of this algorithm appears at the end of these notes.)

--------------------------

Disjunctions: same thing, just reverse the roles of positive and negative.

--------------------------

Lower bound of n:

We've seen an algorithm that makes at most n+1 mistakes. In fact, *no*
algorithm can guarantee a mistake bound of less than n. This is easiest to
see with disjunctions. Imagine seeing the following sequence of examples:

  1 0 0 0 0 0
  0 1 0 0 0 0
  0 0 1 0 0 0
  ...
  0 0 0 0 0 1

*Any* labeling of these is consistent with some concept in C (the
disjunction of the variables corresponding to the examples labeled
positive), so an adversary can declare every prediction wrong: for any
algorithm, there exists a sequence of examples forcing it to make n
mistakes.

[[This in some ways foreshadows the notion of "VC-dimension" that we will
get to later...]]

Relation to consistency model
-----------------------------

Being able to solve the consistency problem doesn't necessarily give an
algorithm for the mistake-bound model. But the other direction IS true, in
the following sense:

Theorem: if class C is learnable in the MB model by a poly-time algorithm
A that uses hypotheses in C, then C is learnable in the consistency model.

Proof: Given a set S of data, run the algorithm through the set over and
over until it makes a pass with no mistakes. Let's assume the algorithm is
"conservative", meaning that it only changes its hypothesis when it makes
a mistake. Then, if it makes a pass with no mistakes, we have a consistent
hypothesis. On the other hand, if it keeps making mistakes and exceeds its
mistake bound, then we know there is no consistent hypothesis.

What about "non-conservative" algorithms?
 --> we can reset the algorithm's state after each example on which it
     makes no mistake, or
 --> we can just use its current hypothesis to check whether any example
     in S is inconsistent with it, and if so feed that example in.

So in some cases learning in the MB model can be a bit trickier than
learning in the consistency model; here is an interesting example.
(Note: using cryptographic assumptions (like "factoring is hard") one can
construct weird concept classes that *are* learnable in the consistency
model but not in the MB model. We'll get to this later when we talk about
the relations between learning and cryptography.)

Decision lists
--------------

Say we have the variables "rainy?", "warm?", "have-jacket?", "snowy?", and
we want to give a rule for when you like to go for a walk. Say you don't
like to go for a walk if it's raining. Otherwise, you like to go for a
walk if it's warm, or if it's snowy so long as you have a jacket. We could
describe this as a decision list like this:

  if rainy then no
  else if warm then yes
  else if not(have-jacket) then no
  else if snowy then yes
  else no.

(continue next time....)
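Referring back to the conjunctions algorithm in the mistake-bound section,
here is a short Python sketch (my own code, not from the notes; the
representation of a literal as an (index, sign) pair and the small example
target are my choices).

    def learn_conjunction(examples, n):
        """Mistake-bound learner for conjunctions over x_0..x_{n-1}.
        examples: on-line sequence of (x, label), x a 0/1 tuple, label 0/1.
        A literal (i, True) means x_i, (i, False) means not(x_i)."""
        h = {(i, s) for i in range(n) for s in (True, False)}  # start with all 2n literals
        mistakes = 0
        for x, label in examples:
            predict = int(all(x[i] == 1 if sign else x[i] == 0 for i, sign in h))
            if predict == label:
                continue
            mistakes += 1
            if label == 1:
                # mistake on a positive: drop every literal the example does not satisfy
                h = {(i, sign) for i, sign in h if (x[i] == 1) == sign}
            else:
                # mistake on a negative: impossible if the data is consistent with
                # some conjunction, since the target's literals always stay in h
                raise ValueError("no consistent conjunction")
        return h, mistakes

    # Example: target is x_1 AND not(x_3), i.e. literals (0, True) and (2, False).
    data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
    h, m = learn_conjunction(data, n=3)
    assert {(0, True), (2, False)} <= h and m <= 4   # target literals are kept; few mistakes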
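And here is the walk rule from the decision-list example above written out
in code (again my own sketch; representing a decision list as an ordered
list of (condition, output) pairs with a default at the end is one natural
choice, not something the notes specify).

    def evaluate(decision_list, default, example):
        """Return the output of the first rule whose condition holds."""
        for condition, output in decision_list:
            if condition(example):
                return output
        return default

    walk_rule = [
        (lambda e: e["rainy"], "no"),
        (lambda e: e["warm"], "yes"),
        (lambda e: not e["have-jacket"], "no"),
        (lambda e: e["snowy"], "yes"),
    ]

    day = {"rainy": False, "warm": False, "have-jacket": True, "snowy": True}
    print(evaluate(walk_rule, default="no", example=day))   # -> yes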