Notes for Jan 19, 1998

Handouts: Minsky-Paper "Perceptrons" chapter.  Littlestone paper. 

Admin: These handouts are related to today's topic and also topics for
next few lectures.  On web page: putting my typed notes and also
related readings that you might want to look at. -- e.g., paper that
describes the neural-net NP-hardness proof. 

-------------------------------------------------------------------------------
To finish up from last time: algorithm for Decision Lists in the
consistency model:

	A DL is a list of if-then rules where each condition is a
	literal and each consequent is + or -. 

	(Equivalent to a decision tree with just one long path)

Let's find algorithm to learn DLs in the consistency model.  Extending
to Mistake-bound model is on the homework.

An example:
	1 0 0 1 1   + 
	0 1 1 0 0   -
	1 1 1 0 0   +
	0 0 0 1 0   -
	1 1 0 1 1   +
	1 0 0 0 1   -

Algorithm:
 1. Find some rule consistent with the current set of examples that
   applies to at least one of them. If no such rule exists, halt with
   failure. 

 2. Put the rule at the bottom of the hypothesis.

 3. Throw out those examples classified by the hypothesis so far.

 4. If there are any examples left, repeat.

It is clear that this algorithm runs in polynomial time, since
each iteration removes at least one example, and each iteration can be
performed in poly time.

To prove: if there exists a consistent list, this algorithm will find one. 

Proof: If there is a decision list L consistent with remaining data,
then there is at least one choice for step 1 (the highest rule in L
satisfied by at least one example).  Furthermore, if there was a
consistent DL at the beginning, then there must exist a DL consistent
with the remaining data --- namely, the original DL.  ta da. (qed)


-------------------------------------------------------------------------------

Last time: ended showing that alg for MB model could be used directly
to solve consistency problem is hypotheses belong to class C.

An example where not so easy to go the other way:
 * each example is a number in range: 0 to 2^n-1.  (think of as binary).
 * C = { [0,a] : a < 2^n}
	(i.e., C = initial sub-intervals).

 (e.g., imagine that you are observing some device that fails when it
  gets too hot, and each day you write down current temperature and
  whether device is working right or not)

Simple alg for consistency model: choose concept [0,b] where b is the
largest positive example seen.  Then verify that no negative example
is < b.

- Does this work for MB model?
- What would work for MB model?

	use current hypothesis as [0,b] where b is halfway between
	largest positive and smallest negative example.  Cuts down unknown
	region in half on every mistake.   At most n mistakes.

Halving algorithm
-----------------
Suppose we don't care about computation time.  Can we generalize the
above idea to learn an *arbitrary* concept class C making at most
log(|C|) mistakes, where |C| is the number of concepts in C?

Halving algorithm: predict using majority vote over all concepts in C
consistent with past data.

E.g., there are 3^n conjunctions, so this makes at most O(n) mistakes.

(later, we'll see a nice noise-tolerant version of this)

Question: is halving algorithm optimal?  Say you have unlimited
computation time, does this have the best possible mistake bound?


On-line learning as a 2-person game
-----------------------------------
If we ignore the issue of computational efficiency. we can think of
learning in the mistake-bound model as a 2-person game.  In
particular, for a deterministic algorithm, the game goes like this:
1. The adversary selects some example x, which splits C into
  two sets: C_0(x) is the set of concepts that label x negative, and
  C_1(x) is the set of concepts that label x positive.

2. The algorithm gets to pick one of C_0(x) and C_1(x) to throw away
  (by predicting postive and making a mistake it throws out C_1(x), or
   by predicting negative and making a mistake it throws out C_2(x))

3. If only one concept left, then done  (view two semantically
   equivalent concepts as the same).  Else, go back to step 1.

The value of this game (to the adversary) is the number of rounds the
game is played.

Looking at the problem this way, we can see a couple of things. First,
there is a well-defined notion of optimal strategies for each player
and a well-defined number OPT(C) that is the optimal mistake bound for
concept class C.  
What is optimal strategy? 
	Given an example x, we ``just'' calculate OPT(C_0(x)) and
	OPT(C_1(x)) (by applying this idea recursively), and throw out
	whichever set has larger mistake bound.

Another way to look at it is that OPT(C) is defined recursively as:

	If |C| = 1 then OPT(C) = 0.

	Else OPT(C) = 1 + max( min( OPT(C_0(x)), OPT(C_1(x)) ) )
	                   x

This algorithm is often called the "Standard optimal algorithm" (more
discussion in Littlestone paper).   Notice that this is {\em not}
necessarily the same thing as the Halving Algorithm.

	
The Perceptron Algorithm
------------------------
Now, I want to describe one of the oldest algorithms (from 60's or
earlier) in machine learning, called perceptron algorithm. 


Perceptron Algorithm: Algorithm for learning a linear threshold
function.  For simplicity, use threshold of 0, so looking at functions
like:  w_1x_1 + w_2x_2 + ... + w_nx_n > 0.   (write this as w \cdot x > 0) 

(can simulate a nonzero threshold with "dummy" input x_0 that is always 1)


Guarantee is that if data is consistent with a linear threshold
function W, then number of mistakes is at most (1/delta)^2, where 

	delta =   min  |W \cdot x|/(|W|*|x|)
                   x

(delta is a measure of the amount of wiggle room available: if we
normalize all examples to have length==1, then it is the maximum, out
of all consistent planes, of the minimum distance of an example to
that plane)

Algorithm:  
	Start with all-zeroes weight vector w.

	Predict positive iff w \cdot x > 0.

	Mistake on positive:  w <- w + x
	Mistake on negative:  w <- w - x

	(if some examples have really large feature values, then
	can normalize examples to length 1.  I.e., add or subtract
	x/|x| instead of adding or subtracting x).

Example: let's say we have examples:
			0 0 +
			1 0 -
			1 1 +
and we cycle through.  Let's add dummy x_0 = 1 to all examples.
Get sequence of weight vectors:
			 0  0  0
			 1  0  0
			 0 -1  0
                         1  0  1
			 0 -1  1
			 1  0  2
			 0 -1  2
			 1 -1  2
		         
Proof of convergence:  First, for simplicity, let's assume target
plane doesn't pass through any examples.  Also, for simplicity, let's
replace negative point (x_0, x_1,..., x_n) by positive point (-x_0,
..., -x_n).  Also, normalize so all examples have length 1.

let w^* be a unit vector representing the target plane.  
	delta = min_x (w^* \cdot x)

let w be our hypothesis plane.
Look at: (w \cdot w^*) / |w|

if we add x to w, then:
	numerator goes up by x \cdot w^* >= delta.
	(denominator)^2 becomes (w+x)^2 = w^2 + x^2 + 2(w \cdot x)
			     	        < w^2 + 1
		because w \cdot x is negative.

Since we start with w = all zeroes, after m mistakes the numerator is
at least delta*m, and the denominator is at most sqrt(m).  

Numerator cannot be greater than denominator since the quantity we're
looking at is the cosine of the angle between w and w^*, so

	delta*m <= \sqrt{m}, so m <= 1/delta^2.

QED