15-859(A) Machine Learning Theory                                01/23/06

* Overview of main learning models
* Mistake-bound model
   - relation to consistency model
   - Halving and Standard Optimal algorithms
=======================================================================

Overview of main learning models
--------------------------------
The consistency model had the problem that there was nothing in it
about being able to predict well on new data.  In this course, we will
be looking at a number of models that address this in different ways:

- In the distributional (PAC) model, we assume data comes from some
  fixed probability distribution over the instance space, labeled by an
  unknown target function.  We assume training data is drawn from this
  distribution and then ask questions like: how much data do I need to
  see so that if I do well over it, I can expect to do well on new
  points also drawn from this distribution?

- We will also look at "active learning" models where the algorithm can
  choose which points it wants to have labeled.  (2 main versions: one
  where the algorithm can construct new examples out of whole cloth,
  and one where it can just select from an existing pool (like web
  pages or documents).)

- In the other direction (making things harder) we can look at the case
  where an adversary selects the order in which examples are presented
  (i.e., our algorithm should work for any order).  This is the model
  we are going to start with.  In many ways it's the cleanest to think
  about, and it also yields some very nice algorithms.

Mistake-bound model
-------------------
In the MB model, learning proceeds in stages.  In each stage:
 1. The learner gets an unlabeled example.
 2. The learner predicts its classification.
 3. The learner is told the correct label.
Goal: bound the total number of mistakes.

DEFINITION: Algorithm A has mistake-bound M for learning class C if A
makes at most M mistakes on any sequence of examples that is consistent
with a function in C.  [later we'll talk about extending to when there's
a "mostly consistent" f in C]

Notice that since we're not assuming anything about how examples are
selected, we can't necessarily talk about "how much data do I need to
converge".  E.g., maybe we end up seeing the same example over and over
again and don't learn anything new.  But that's OK, because then we
(hopefully) won't be making mistakes either.  Later on, when we talk
about distributional models, we will be able to convert mistake bounds
into bounds on how much data we need to converge.  But the mistake-bound
setting is nice because it's very clean -- no probabilities.  So this is
the toughest model for positive results.

We'll think of A as being good if its mistake bound is polynomial in
n = the size of the examples, and s = the description length of the
smallest consistent function (for classes like DNF and Decision Trees
that can contain very complicated functions).  [We are thinking of a
"concept class" as both a set of functions and a description language
for them.]

DEFINITION: C is *learnable* in the mistake-bound model if there exists
an algorithm with mistake bound poly(n,s), and furthermore whose running
time per stage is poly(n,s).

Question: why do we talk about a *class* of functions as being learnable
or not, rather than just single functions?

----------------Examples------------------------------------------

Monotone conjunctions:
 - Start: h = x_1 ^ x_2 ^ x_3 ^ ... ^ x_n.  (conjunction of everything)
 - Mistake on a positive example: remove from h all variables not
   satisfied by that example.

Guarantee: {variables in h} contains {variables in target}.  This
implies no mistakes on negatives (if the data is consistent with a
conjunction).  Each mistake removes at least 1 variable, which implies
that # mistakes <= n.  (Let's say the AND of no variables is the
"all positive" function.)
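As a concrete illustration (not part of the original notes), here is a
minimal Python sketch of the elimination algorithm above; examples are
assumed to be 0/1 vectors, labels are True/False, and the function name
is just for illustration.

    # Sketch of the mistake-bound learner for monotone conjunctions.
    # Hypothesis = set of variable indices currently in the conjunction;
    # we start with all of them and only ever delete (on mistakes on positives).

    def learn_monotone_conjunction(n, stream):
        """stream yields (example, label) pairs; example is a 0/1 list of length n."""
        h = set(range(n))                              # h = x_1 ^ ... ^ x_n to start
        mistakes = 0
        for x, label in stream:
            prediction = all(x[i] == 1 for i in h)     # AND of the variables still in h
            if prediction != label:
                mistakes += 1
                # With consistent data, a mistake can only happen on a positive example:
                # drop every variable that this positive example does not satisfy.
                h -= {i for i in h if x[i] == 0}
        return h, mistakes

    # Example run: target is x_0 ^ x_2 over n = 4 variables.
    data = [([1, 1, 1, 0], True), ([1, 0, 0, 1], False), ([1, 0, 1, 1], True)]
    print(learn_monotone_conjunction(4, data))         # h shrinks toward {0, 2}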
Disjunctions: same thing, just reverse the roles of +/- and 1/0.

--------------------------
Lower bound of n:

We've seen an algorithm that makes at most n mistakes.  In fact, *no*
deterministic algorithm can guarantee a mistake bound less than n.
Easiest to see with disjunctions: imagine seeing the following sequence
of examples (the n unit vectors):

    1 0 0 ... 0
    0 1 0 ... 0
    0 0 1 ... 0
        ...
    0 0 0 ... 1

*Any* labeling of these is consistent with some concept in C, so for any
deterministic algorithm there exists a labeling forcing it to make n
mistakes.  (And if we imagine labeling them with a random OR-function,
we can see that even a randomized algorithm has to make n/2 mistakes in
expectation.)  [this in some ways foreshadows the notion of
"VC-dimension" that we will get to later...]

Relation to consistency model
-----------------------------
Theorem: if class C is learnable in the MB model by an algorithm A that
uses hypotheses in C, then C is learnable in the consistency model.

Proof: Given a set S of data, run the algorithm through the set over and
over until it makes a pass with no mistakes.  Let's assume the algorithm
is "conservative", meaning that it only changes its hypothesis when it
makes a mistake.  Then, if it makes a pass with no mistakes, we have a
consistent hypothesis.  On the other hand, if it keeps making mistakes
and exceeds its mistake bound, then we know there is no consistent
hypothesis.  Poly mistake bound + poly time per example ==> poly time
overall.

What about "non-conservative" algorithms?
 --> can roll the algorithm's state back after any example on which it
     did not make a mistake (making it effectively conservative),
 --> or, can just use its current hypothesis to check whether there is
     any inconsistent example in S, and if so feed that example in.

Note: the other direction is not necessarily so easy, if we want our
algorithm to be efficient.  You can construct weird concept classes
where the consistency problem can be solved easily, but achieving a
polynomial mistake bound requires being able to invert a one-way
function (e.g., like factoring).  Even for natural concept classes,
going in the other direction is not necessarily very obvious.  E.g., we
saw last time how to learn decision lists in the consistency model.  On
your hwk you will show how to do that in the MB model.

One more example:
 * Each example is a number in the range 0 to 2^n - 1 (think of it as
   written in binary).
 * C = { [0,a] : a < 2^n }  (i.e., C = initial sub-intervals).
 (E.g., imagine that you are observing some device that fails when it
 gets too hot, and each day you write down the current temperature and
 whether the device is working right or not.)

Simple alg for the consistency model: choose concept [0,b] where b is
the largest positive example seen.  Then verify that no negative example
is < b.
 - Does this work for the MB model?
 - What would work for the MB model?  Use as the current hypothesis
   [0,b] where b is halfway between the largest positive example and the
   smallest negative example seen so far.  This cuts the unknown region
   in half on every mistake, so it makes at most n mistakes.

========================================================================
Halving algorithm
-----------------
If we don't care about computation time, we can generalize the above
algorithm to learn an *arbitrary* finite concept class C with at most
lg(|C|) mistakes.

HALVING ALGORITHM: predict using majority vote over all concepts in C
consistent with the past data.

Each mistake of the halving algorithm cuts down the number of available
concepts by at least half, so it makes at most lg|C| mistakes.  E.g.,
there are 2^n monotone conjunctions, so this makes at most n mistakes.
Notice that it takes lg|C| bits to describe a concept in C (if you use
the same number of bits per concept), so we are making at most 1 mistake
per bit.
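Here is a minimal sketch (again not from the notes) of the halving
algorithm for a finite class given explicitly as a list of functions;
the names and the example class are just for illustration.

    # Sketch of the halving algorithm: keep the "version space" of concepts
    # consistent so far, predict by majority vote, and discard the losers
    # whenever the label is revealed.  Runs in time |C| per example, so this
    # is only practical when the class is small enough to enumerate.

    def halving_learner(concepts, stream):
        """concepts: list of functions x -> bool; stream yields (x, label) pairs."""
        version_space = list(concepts)
        mistakes = 0
        for x, label in stream:
            votes_pos = sum(1 for c in version_space if c(x))
            prediction = votes_pos * 2 >= len(version_space)   # majority vote (ties -> +)
            if prediction != label:
                mistakes += 1
            # Keep only the concepts consistent with the revealed label.
            version_space = [c for c in version_space if c(x) == label]
        return version_space, mistakes

    # Example: C = initial intervals [0,a] for a in {0,...,15}; target a = 9.
    concepts = [(lambda x, a=a: x <= a) for a in range(16)]
    stream = [(12, False), (3, True), (9, True), (10, False)]
    print(halving_learner(concepts, stream))   # at most lg(16) = 4 mistakes

Whenever the majority vote is wrong, more than half of the remaining
concepts were wrong and get filtered out, which is exactly where the
lg|C| bound comes from.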
What about concept classes with both simple and complex functions?
E.g., decision trees or DNF formulas or C-programs.  Let's define C_b to
be the set of functions in C that take < b bits to write down; so
|C_b| < 2^b.  Then we can do a "guess and double" on b: run the halving
algorithm on C_b, and if it exceeds its mistake bound of lg|C_b| < b
(which means the target is not in C_b), double b and start over.  If the
target takes s bits, say 2^j <= s < 2^{j+1}, then the guesses used are
1, 2, 4, ..., 2^{j+1} <= 2s, so the total number of mistakes is at most
1 + 2 + 4 + ... + 2s < 4s.

So, this means that if we remove the running-time restriction in the
definition of learnability, then *everything* is learnable in the MB
model.

Question: What if we had a "prior" p on the different functions in C?
Can we make at most lg(1/p_f) mistakes, where f is the target function?

Ans: Sure, just do a weighted vote, cutting down the available
probability mass by at least half on every mistake.

This actually gives a better algorithm for the case of functions of
different sizes.  Just give each function f a weight of 1/2^size(f).
If our description language is a prefix code (e.g., like a Huffman code:
think of a binary tree with concepts at the leaves) then the sum of all
the weights is <= 1 to start.  If the target has size s, then the number
of mistakes is at most lg(2^s) = s.  So, we got rid of the "4".

Theorem: if computation time is no object, then we can learn any concept
class C with # mistakes <= # bits in the description of the target
function.

---- a few asides ----
Question: what if the target function really was picked according to our
prior?  What can we say about our *expected* number of mistakes?

Ans: E[M] <= sum_f [p_f * lg(1/p_f)] = the *entropy* of the
distribution p.

What's also interesting is that our algorithm is now equivalent to the
"Bayes optimal" strategy: at each step we are predicting the most likely
outcome given our info so far, under the assumption that the target was
picked from p.  So this gives two motivations for the same algorithm:
it's the correct thing to do if our prior is right (Bayes motivation),
but it's also a good thing to do if we are just hoping our prior is
reasonable (mistake-bound motivation).

One more aside: what if the true distribution is p, but we thought it
was q?  What does our bound look like now?
Ans: sum_f [p_f * lg(1/q_f)].  The difference between this quantity and
the entropy of p is called the KL divergence between p and q: KL(p||q).
It is not symmetric.

=======================================================================
Question: is the halving algorithm optimal?  Say you have unlimited
computation time; does it have the best possible mistake bound?  What do
you think?

Turns out the answer is no.  You can think of the halving algorithm as
saying "I want to go with the prediction such that, if I am wrong, I cut
down the number of functions left by as much as possible".  But what you
*really* want is to go with the prediction that, if you are wrong, cuts
down the *mistake bound* of the set of functions left over as much as
possible.  E.g., it's possible for the MB of a set of functions C to be
much *less* than log|C|.  Examples?

So, if MB(C) is the optimal mistake bound for C (the best mistake bound
achievable by a deterministic algorithm), and the current example splits
C into C+ and C-, then you predict + if MB(C+) > MB(C-), else predict -.
In fact, for deterministic algorithms, you can view this as a 2-player
alternating-move game between the algorithm and the adversary.  This
algorithm is often called the "Standard Optimal Algorithm" (more
discussion in the Littlestone paper).  This is *not* necessarily the
same thing as the Halving Algorithm (see hwk problem #5).
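To make the game-tree recursion concrete, here is a small sketch (not
from the notes) that computes the optimal mistake bound MB(C) for a
finite class over a finite instance space and uses it to predict as
described above.  It is exponentially slow and purely illustrative;
concepts are assumed to be given as tuples of their labels on the
instance space.

    from functools import lru_cache

    # Sketch of the Standard Optimal Algorithm for a small finite class C over
    # a finite instance space X (a list of example indices).  Concept c labels
    # example i as c[i].  mb(S) is the optimal mistake bound of the set S: the
    # adversary picks the example, and the learner predicts so that a mistake
    # leaves it with the side having the smaller bound.

    def MB(C, X):
        @lru_cache(maxsize=None)
        def mb(S):
            if len(S) <= 1:
                return 0
            best = 0
            for i in X:
                pos = frozenset(c for c in S if c[i])
                neg = frozenset(c for c in S if not c[i])
                if pos and neg:   # adversary can force a mistake on example i
                    best = max(best, 1 + min(mb(pos), mb(neg)))
            return best
        return mb(frozenset(C))

    def soa_predict(S, X, i):
        """Predict the label of example i given the consistent set S."""
        pos = frozenset(c for c in S if c[i])
        neg = frozenset(c for c in S if not c[i])
        if not pos or not neg:          # all remaining concepts agree on i
            return bool(pos)
        # Predict + exactly when MB(C+) > MB(C-): if we're wrong, we are left
        # with the side that has the smaller optimal mistake bound.
        return MB(pos, X) > MB(neg, X)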
========================================================================
What if there isn't any perfect target function in C?

One thing we might hope to do is perform comparably to the *best*
function in C in hindsight.  In a few lectures, we'll see a really neat
algorithm for doing this that is like a flexible version of the halving
algorithm: instead of throwing out the inconsistent functions, you just
give them a reduced vote, and predict probabilistically.  By doing this
right we'll be able to get a bound that looks like

    log(|C|) * (1/epsilon) + (1 + epsilon)*M_best,

where M_best is the number of mistakes of the best function in C in
hindsight.  This then has a number of nice connections to game theory.

First, though, we're going to look at some very nice efficient
algorithms, including one that gets at the question: how can we design
learning algorithms that handle large numbers of irrelevant features?
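As a preview of the reduced-vote idea just described (the actual
algorithm and its analysis come in a later lecture), here is a minimal
sketch; the penalty parameter beta, the name reduced_vote_predictor,
and the specific update are my own illustration, not the notes'
definition.

    import random

    # Preview sketch of the "reduced vote" idea: keep a weight per concept,
    # multiply a concept's weight by beta whenever it is wrong, and predict 1
    # with probability equal to the weighted fraction of concepts voting 1.
    # (How to set beta, and the resulting bound, are covered in a later lecture.)

    def reduced_vote_predictor(concepts, stream, beta=0.5):
        weights = [1.0] * len(concepts)
        mistakes = 0
        for x, label in stream:
            total = sum(weights)
            w_pos = sum(w for c, w in zip(concepts, weights) if c(x))
            prediction = random.random() < w_pos / total   # probabilistic prediction
            if prediction != label:
                mistakes += 1
            # Penalize (rather than discard) every concept that got this one wrong.
            weights = [w * beta if c(x) != label else w
                       for c, w in zip(concepts, weights)]
        return mistakes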