15-859(A) Machine Learning Theory                                01/15/04

* Mistake-bound model
   - simple results
   - relation to consistency model
   - Halving and Standard Optimal algorithm

=======================================================================

Mistake-bound model
-------------------
The consistency model had the problem that there was nothing in it about
being able to predict well on new data.  The Mistake-Bound model addresses
this problem by explicitly modeling learning as an on-line process.

In the MB model, learning is in stages.  In each stage:
  1. Learner gets unlabeled example.
  2. Learner predicts classification.
  3. Learner is told correct label.

Goal: bound the total number of mistakes.

DEFINITION: Algorithm A has mistake-bound M for learning class C if A makes
at most M mistakes on any sequence of examples that is consistent with a
function in C.  [Later we'll talk about extending this to the case where
there is only a "mostly consistent" f in C.]

Notice that since we're not assuming anything about how examples are
selected, we can't necessarily talk about "how much data do I need to
converge".  E.g., maybe we end up seeing the same example over and over
again and don't learn anything new.  But that's OK, because then we
(hopefully) won't be making mistakes either.  Later on, when we talk about
models that assume data is coming from some fixed probability distribution,
we will be able to convert mistake bounds into bounds on how much data we
need to converge.  But the mistake-bound setting is nice because it's very
clean -- no probabilities.

We'll think of A as being good if its mistake bound is polynomial in n =
the size of the examples, and s = the description length of the smallest
consistent function (for classes like DNF and Decision Trees that can have
very complicated functions in them).  [We are thinking of a "concept class"
as both a set of functions and a description language for them.]

DEFINITION: C is *learnable* in the mistake-bound model if there exists an
algorithm with mistake bound poly(n,s), and furthermore whose running time
per stage is poly(n,s).

Question: why do we talk about a *class* of functions as being learnable or
not, rather than just single functions?

----------------examples------------------------------------------

Conjunctions:

  Start: h = x_1 \Not{x}_1 x_2 \Not{x}_2 ... x_n \Not{x}_n.
  Mistake on positive: remove from h all literals not satisfied.

  Guarantee: {literals in h} contains {literals in target}.  This implies
  no mistakes on negatives (if the data is consistent with a conjunction).
  The first mistake removes n literals, and each subsequent mistake removes
  at least 1, which implies that # mistakes <= n+1.  [A code sketch of this
  algorithm appears at the end of this section.]

--------------------------

Disjunctions: same thing, just reverse positive and negative.

--------------------------

Lower bound of n:

We've seen an algorithm that makes at most n+1 mistakes.  In fact, *no*
deterministic algorithm can guarantee a mistake bound less than n.  Easiest
to see with disjunctions.  Imagine seeing the following sequence of
examples:

  1 0 0 ... 0
  0 1 0 ... 0
  0 0 1 ... 0
      ...
  0 0 0 ... 1

*Any* labeling of these is consistent with some concept in C, so for any
algorithm there exists a sequence of examples forcing it to make n
mistakes.  (And if we imagine labeling them with a random OR-function, we
can see that even a randomized algorithm has to make n/2 mistakes in
expectation.)  [This in some ways foreshadows the notion of "VC-dimension"
that we will get to later...]
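For concreteness, here is a minimal Python sketch (not part of the original
notes) of the conjunction learner above.  Representing a literal as a pair
(i, neg) -- variable x_i, negated if neg is True -- and the function names
are just illustrative choices.

    # Mistake-bound learner for conjunctions over n boolean variables.
    def satisfied(lit, x):
        i, neg = lit
        return (not x[i]) if neg else x[i]

    def learn_conjunction(n, examples):
        # Start with h = conjunction of all 2n literals.
        h = {(i, neg) for i in range(n) for neg in (False, True)}
        mistakes = 0
        for x, label in examples:          # x is a tuple of n booleans
            pred = all(satisfied(lit, x) for lit in h)
            if label and not pred:
                # Mistake on a positive example: delete every literal
                # that x does not satisfy.
                h = {lit for lit in h if satisfied(lit, x)}
                mistakes += 1
            elif pred and not label:
                # Cannot happen if the data really is consistent with
                # some conjunction.
                mistakes += 1
        return h, mistakes

    # E.g., with target x_1 AND not-x_3 (0-indexed literals (0,False), (2,True)):
    #   learn_conjunction(3, [((True, True, False), True),
    #                         ((True, False, True), False)])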
Relation to consistency model
-----------------------------
Theorem: if class C is learnable in the MB model by an algorithm A that
uses hypotheses in C, then C is learnable in the consistency model.

Proof: Given a set S of data, run the algorithm through the set over and
over until it makes a pass with no mistakes.  Let's assume the algorithm is
"conservative", meaning that it only changes its hypothesis when it makes a
mistake.  Then, if it makes a pass with no mistakes, we have a consistent
hypothesis.  On the other hand, if it keeps making mistakes and exceeds its
mistake bound, then we know there is no consistent hypothesis.  Poly
mistake bound + poly time per example ==> poly time overall.  [A code
sketch of this reduction appears below.]

What about "non-conservative" algorithms?
  --> We can make the algorithm conservative by resetting its state to what
      it was before, whenever it sees an example it predicts correctly.
  --> Or, we can just use its current hypothesis to check whether there is
      any inconsistent example, and if so feed that example in.

Note: the other direction is not necessarily so easy, if we want our
algorithm to be efficient.  You can construct weird concept classes where
the consistency problem can be solved easily, but achieving a polynomial
mistake bound requires being able to invert a one-way function (e.g., like
factoring).  Even for natural concept classes, going in the other direction
is not necessarily very obvious.  E.g., we saw last time how to learn
decision lists in the consistency model.  On your hwk you will show how to
do that in the MB model.

One more example:
  * Each example is a number in the range 0 to 2^n - 1 (think of it as
    written in binary).
  * C = { [0,a] : a < 2^n }  (i.e., C = initial sub-intervals).

(E.g., imagine that you are observing some device that fails when it gets
too hot, and each day you write down the current temperature and whether
the device is working right or not.)

Simple algorithm for the consistency model: choose the concept [0,b] where
b is the largest positive example seen.  Then verify that no negative
example is < b.
  - Does this work for the MB model?
  - What would work for the MB model?  Use as the current hypothesis [0,b]
    where b is halfway between the largest positive and smallest negative
    examples seen so far.  This cuts the unknown region in half on every
    mistake, so we make at most n mistakes.  [A sketch of this algorithm
    also appears below.]

========================================================================

Halving algorithm
-----------------
If we don't care about computation time, we can generalize the above
algorithm to learn an *arbitrary* concept class C with at most lg(|C|)
mistakes.

HALVING ALGORITHM: predict using a majority vote over all concepts in C
consistent with the past data.

Each mistake of the halving algorithm cuts the number of remaining
consistent concepts in half (or more).  So, it makes at most lg|C|
mistakes.  E.g., there are 3^n conjunctions, so this makes at most O(n)
mistakes.  Notice that it takes lg|C| bits to describe a concept in C (if
you use the same number of bits per concept), so we are making at most 1
mistake per bit.  [A brute-force sketch of the halving algorithm appears
below as well.]

What about concept classes with both simple and complex functions?  E.g.,
decision trees or DNF formulas or C programs.  Let's define C_b to be the
set of functions in C that take < b bits to write down, so |C_b| < 2^b.
Then we can do a "guess and double" on b: run the halving algorithm on C_b,
and whenever it uses up its lg|C_b| < b allowed mistakes, double b and
restart.  If the target takes s bits, the largest value of b we ever need
is at most 2s, so the total number of mistakes is at most
1 + 2 + 4 + ... + 2s < 4s.

So, this means that if we remove the running-time restriction in the
definition of learnability, then *everything* is learnable in the MB model.
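Here is a minimal Python sketch (not in the original notes) of the
reduction in the proof above.  It assumes a conservative on-line learner
exposed through hypothetical predict(x) / update(x, label) methods and a
known mistake bound M; those names are just for illustration.

    # Turn a conservative mistake-bound learner into a consistency-model
    # algorithm: loop over the data set until a mistake-free pass, or give
    # up once the learner has exceeded its mistake bound.
    def find_consistent(learner, S, M):
        mistakes = 0
        while True:
            clean_pass = True
            for x, label in S:
                if learner.predict(x) != label:
                    learner.update(x, label)   # conservative update
                    mistakes += 1
                    clean_pass = False
                    if mistakes > M:
                        return None    # no function in C is consistent with S
            if clean_pass:
                return learner         # current hypothesis is consistent with S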
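A sketch (also not from the notes) of the mistake-bound algorithm for
initial sub-intervals; the variable names lo/hi are just illustrative.

    # Learn C = { [0,a] : a < 2^n } with at most about n mistakes by keeping
    # the hypothesis threshold halfway between the largest positive and
    # smallest negative examples seen so far.
    def learn_initial_interval(n, examples):
        lo = -1           # largest positive example seen so far (none yet)
        hi = 2 ** n       # smallest negative example seen so far (none yet)
        mistakes = 0
        for x, label in examples:     # x is an integer in [0, 2^n - 1]
            b = (lo + hi) // 2        # current hypothesis is [0, b]
            pred = (x <= b)
            if pred != label:
                mistakes += 1
            # Shrink the unknown region using the revealed label.
            if label:
                lo = max(lo, x)
            else:
                hi = min(hi, x)
        return mistakes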
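A brute-force sketch of the halving algorithm itself (not from the notes,
and of course ignoring computation time); here a concept is just any Python
function mapping an example to True/False.

    # Halving algorithm: keep the version space (all concepts consistent
    # with the data so far) and predict by majority vote.  Makes at most
    # lg|C| mistakes, at a cost of time proportional to |C| per stage.
    def halving(concepts, examples):
        version_space = list(concepts)
        mistakes = 0
        for x, label in examples:
            votes_pos = sum(1 for c in version_space if c(x))
            pred = (2 * votes_pos >= len(version_space))   # majority vote
            if pred != label:
                mistakes += 1
            # Discard every concept inconsistent with the revealed label.
            version_space = [c for c in version_space if c(x) == label]
        return mistakes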
Question: What if we had a "prior" p on the different functions in C?  Can
we make at most lg(1/p_f) mistakes, where f is the target function?

Ans: Sure, just do a weighted vote, cutting the available probability mass
down by half (at least) on every mistake.

This actually gives a better algorithm for the case of functions of
different sizes.  Just give each function f a weight of 1/2^size(f).  If
our description language is a prefix code (e.g., like a Huffman code: think
of a binary tree with concepts at the leaves), then the sum of all the
weights is <= 1 to start [why?].  If the target has size s, then the number
of mistakes is at most lg(2^s) = s.  So, we got rid of the "4".

Theorem: if computation time is no object, then we can learn any concept
class C with # mistakes <= # bits in the description of the target
function.  [A sketch of this weighted vote appears at the end of this
section.]

---- a few asides ----

Question: what if the target function really was picked according to our
prior?  What can we say about our *expected* number of mistakes?

Ans: E[M] <= sum_f [ p_f * lg(1/p_f) ] = the *entropy* of the distribution p.

What's also interesting is that our algorithm is now equivalent to the
"Bayes optimal" strategy: at each step we are predicting the most likely
outcome given our information so far, under the assumption that the target
was picked from p.  So this gives two motivations for the same algorithm:
it's the correct thing to do if our prior is right (Bayes motivation), but
it's also a good thing to do if we are just hoping our prior is reasonable
(mistake-bound motivation).

One more aside: what if the true distribution is p, but we thought it was
q?  What does our bound look like now?

Ans: sum_f [ p_f * lg(1/q_f) ].  The difference between these two
quantities is called the KL divergence between p and q: KL(p||q).  It is
not symmetric.

=======================================================================

Question: is the halving algorithm optimal?  Say you have unlimited
computation time: does it have the best possible mistake bound?
Surprisingly, the answer is no.

On-line learning as a 2-person game
-----------------------------------
If we ignore the issue of computational efficiency, we can think of
learning in the mistake-bound model as a 2-person game.  In particular, for
a deterministic algorithm, the game goes like this:

  1. The opponent selects some example x, which splits C into two sets:
     C_0(x), the set of concepts that label x negative, and C_1(x), the set
     of concepts that label x positive.
  2. The algorithm gets to pick one of C_0(x) and C_1(x) to throw away (by
     predicting negative and making a mistake it throws out C_0(x), or by
     predicting positive and making a mistake it throws out C_1(x)).

Keep going until only one concept is left (view two semantically equivalent
concepts as the same).  The value of this game (to the opponent) is the
number of rounds the game is played.

Looking at the problem this way, we can see a couple of things.  First,
there is a well-defined notion of an optimal strategy for each player, and
second, there is a well-defined number OPT(C) that is the optimal mistake
bound for concept class C.

What is the optimal strategy?  Given an example x, we ``just'' calculate
OPT(C_0(x)) and OPT(C_1(x)) (by applying this idea recursively), and throw
out whichever set has the larger mistake bound.  Another way to look at it
is that OPT(C) is defined recursively as:

  If |C| = 1 then OPT(C) = 0.
  Else OPT(C) = 1 + max_x [ min( OPT(C_0(x)), OPT(C_1(x)) ) ].

This algorithm is often called the "Standard Optimal Algorithm" (there is
more discussion in the Littlestone paper).  This is *not* necessarily the
same thing as the Halving Algorithm (see hwk problem #5).  [A brute-force
sketch of this recursion appears at the end of this section.]
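For concreteness, here is a sketch (not from the notes) of the weighted
vote; 'concepts' is assumed to be a list of (concept, weight) pairs with
total weight <= 1, e.g., weight 1/2^size(f) under a prefix-free description
language.

    # Weighted halving: predict with the weighted majority of the concepts
    # still consistent with the data.  When the prediction is wrong, at
    # least half of the remaining weight is removed, and the target's
    # weight p_f survives forever, so # mistakes <= lg(1/p_f).
    def weighted_halving(concepts, examples):
        alive = list(concepts)              # (concept, weight) pairs
        mistakes = 0
        for x, label in examples:
            w_pos = sum(w for c, w in alive if c(x))
            w_neg = sum(w for c, w in alive if not c(x))
            pred = (w_pos >= w_neg)         # weighted-majority prediction
            if pred != label:
                mistakes += 1
            # Keep only concepts that agree with the revealed label.
            alive = [(c, w) for c, w in alive if c(x) == label]
        return mistakes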
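And a brute-force sketch (again not from the notes) of the OPT(C)
recursion; a concept is represented here as its tuple of labels over a
small finite example space range(m), so this takes exponential time and is
purely for illustration.

    from functools import lru_cache

    # OPT(C) = 1 + max over splitting examples x of
    #              min( OPT(C_0(x)), OPT(C_1(x)) ),
    # and OPT = 0 once no example splits the remaining concepts.
    def opt(C, m):
        @lru_cache(maxsize=None)
        def rec(S):
            best = 0
            for x in range(m):
                C0 = frozenset(c for c in S if not c[x])  # label x negative
                C1 = frozenset(c for c in S if c[x])      # label x positive
                if C0 and C1:                             # x splits S
                    best = max(best, 1 + min(rec(C0), rec(C1)))
            return best
        return rec(frozenset(C))

    # E.g., opt({(0,0,0), (1,0,0), (1,1,0), (1,1,1)}, 3) returns 2.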
-----

Another thing to think about: what if there isn't any perfect target
function in C?  One thing we might hope to do is perform comparably to the
*best* function in C in hindsight.  Next class we'll see a really neat
algorithm for doing this that is like a flexible version of the halving
algorithm: instead of throwing out the inconsistent functions, you just
give them a reduced vote.  By doing this right we'll be able to get a bound
that looks like:

  log(|C|) * (1/epsilon) + (1 + epsilon) * M_best,

where M_best is the number of mistakes of the best function in C in
hindsight.