Notes for 15-681 Machine Learning, 10/24/96

Today's topic: Boosting
   - theory
   - an application

=============================================================================

THE IDEA:

In the PAC model, we proved PAC guarantees for learning simple things:
conjunctions, decision lists, rectangles, intervals, linear separators
(we didn't actually prove this last one...).

In practice, we often want more expressive hypotheses, even though we can
no longer prove PAC-style convergence guarantees:
   - decision trees (ID3)
   - neural nets (Backprop)

(It's not just that we can't prove it -- in fact, the algorithms don't
guarantee PAC convergence. E.g., can you think of a target concept
representable as a small decision tree, and a distribution on examples,
where ID3 fails, in the sense that it needs exponentially many examples
to learn?)

Today: an idea to bridge the gap: BOOSTING.

                               +----------+
   Distrib D on examples ----> |  alg A   | ---> hypothesis
                               +----------+
            |                       |
            |                       V
            |                  +----------+
            +----------------> | Boosting |
                               +----------+
                                    |
                   +----------------+----------------+
                   |               OR                |
                   V                                 V
      Bad distrib D' for A              Hypothesis with 99%
      (A gets < 51% accuracy)           accuracy over D

In other words, given any algorithm at all, we can use it either to
achieve a PAC-style guarantee for our problem, or else to get a witness
that the algorithm doesn't PAC-learn.

Note: We will need to run A several times on independent sets of examples,
so in practice boosting is only useful when we have lots of data.

Corollary: Suppose there exists an algorithm A that PAC-learns concept
class C, but only for epsilon = 1/3 (i.e., for all c in C, all
distributions D, and all delta > 0, A achieves error < 1/3 in polynomial
time and samples). THEN C is PAC-learnable. Proof: *ask*

In other words, the PAC model is robust: if we fix a specific
epsilon < 1/2 instead of saying "for all epsilon", this doesn't change
the set of learnable things.

==========================================================================

How can we do this? Let's do an easy case first.

Suppose algorithm A has only one-sided error: given a set of data, it
tries to find a hypothesis that gets all of the positives right and at
least 10% of the negatives right too. E.g., suppose the target concept is
a circle, with positives inside and negatives outside, and A looks for a
line that has all the positives on one side and at least 10% of the
negatives on the other.

How to boost? Run A on the data. It finds a line L_1. Throw out the
negatives "outside" L_1 and repeat on the data remaining. If A continues
to get at least 10% of the remaining negatives right, then after S lines
the intersection of our positive regions gets all the positive points
right and gets at most a (0.9)^S fraction of the negative points wrong.
So, to get error epsilon, we only need to run A O(log 1/epsilon) times.
(If A ever fails, then we've found a dataset that's hard for A.)

Are we overfitting? Not too badly, since the number of degrees of freedom
(and the VC dimension) is proportional to S, which is only logarithmic in
1/epsilon. So the amount of data needed goes up, but not by too much.
A sketch of this loop appears below.
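Here is a minimal Python sketch of that loop. It is not code from the
lecture: "weak_learner" is a hypothetical stand-in for algorithm A (it
must return a hypothesis h that is True on every positive and False on
at least 10% of the remaining negatives, or None if it can't).

    import math

    def boost_one_sided(weak_learner, positives, negatives, epsilon):
        """Return hypotheses whose AND errs on < epsilon of the negatives."""
        hypotheses = []
        # Need (0.9)^S <= epsilon, i.e. S >= log(1/epsilon) / log(1/0.9).
        rounds = math.ceil(math.log(1 / epsilon) / math.log(1 / 0.9))
        for _ in range(rounds):
            h = weak_learner(positives, negatives)
            if h is None:
                raise RuntimeError("found a dataset that is hard for A")
            hypotheses.append(h)
            # Keep only the negatives h still gets wrong (h says True on
            # them); A just got rid of at least 10% of the rest.
            negatives = [x for x in negatives if h(x)]
            if not negatives:
                break
        return hypotheses

    def predict(hypotheses, x):
        # Final hypothesis: the intersection (AND) of all the h_i.
        return all(h(x) for h in hypotheses)

E.g., for epsilon = 0.01 this makes at most 44 calls to A, since
(0.9)^44 < 0.01.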
==========================================================================

General case:

Suppose algorithm A claims to get at least 70% accuracy over every
probability distribution. The goal is to boost this up a bit (and we can
then apply the construction recursively).

We start by running A on the initial distribution D and getting a
hypothesis h_1. Say its accuracy is exactly 70%.

***Now we want a new distribution that forces A to give us more
information.*** So, filter D to create a distribution on which h_1
behaves like random guessing. Specifically, let CORRECT be the set of
examples on which h_1 predicts correctly and INCORRECT be the set on
which h_1 predicts incorrectly. What we want is a distribution D_2 in
which CORRECT and INCORRECT have equal weight, but which otherwise looks
just like D.

Notation: D[x] is the probability of x under distribution D (or the
density of x if we're in a continuous setting). D[S], for a set of
examples S, is the sum (or integral) of D over x in S.

Right now, D[CORRECT] = 0.7 and D[INCORRECT] = 0.3. So we want D_2
defined like this:

    D_2[x] = 5/7 * D[x], for x in CORRECT
    D_2[x] = 5/3 * D[x], for x in INCORRECT

But all we have is access to D. How can we use D to create D_2? One way:
for each example desired, flip a coin. If heads, wait until we see an
example from D on which h_1 is correct; if tails, wait until we see one
on which h_1 is incorrect. Another way: draw examples from D and place
them into two queues, one for CORRECT and one for INCORRECT; to get an
example from D_2, randomly choose a queue to select from.

Now we run A to get a hypothesis h_2 with accuracy 70% on D_2. Finally,
we filter D, allowing through only examples x where h_1(x) != h_2(x), and
run A to get a hypothesis h_3 with accuracy 70% on this distribution D_3.

RESULTING HYPOTHESIS: if h_1 and h_2 agree, then go with that, else go
with h_3. I.e., majority vote. (A sketch of the whole construction
appears at the end of this section.)

========================================================================

Analysis: What's the accuracy of the new hypothesis? The easiest way is
to draw a picture. Divide the space into 4 regions:

    R_1: h_1 is correct and h_2 is correct
    R_2: h_1 is correct and h_2 is incorrect
    R_3: h_1 is incorrect and h_2 is correct
    R_4: h_1 is incorrect and h_2 is incorrect

We predict correctly on R_1, incorrectly on R_4, and get 70% correct on
R_2 and R_3. So,

    accuracy = D[R_1] + 0.7*D[R_2] + 0.7*D[R_3]

We know that:

    (1) D[R_1] + D[R_2] = 0.7
    (2) D_2[R_1] + D_2[R_3] = 0.7

Combining (2) with our definition of D_2 gives us:

    (3) 5/7 * D[R_1] + 5/3 * D[R_3] = 0.7

Now we just rewrite our equation for accuracy like this:

    accuracy = 3/10*D[R_1] + 7/10*D[R_3] + 7/10*D[R_1] + 7/10*D[R_2]

The first two terms are in the same proportions as eqn (3) (multiply (3)
through by 21/50), and the last two terms let us use eqn (1). So we get

    accuracy = 3/5 * (7/10)^2 + (7/10)^2 = 0.784

In general, if you work it out, you get:

    new accuracy = 3*(old accuracy)^2 - 2*(old accuracy)^3

(think of 3/5 as 2*(1 - old accuracy)). This is always an improvement,
assuming our original accuracy was strictly between 1/2 and 1.

Note: in getting D_2, we might have tried: filter D to look only at
examples where h_1 predicts incorrectly, and try to learn over that. This
DOESN'T work. Why? (Hint: A could just return the complement of h_1,
which is 100% accurate on that filtered distribution but tells us nothing
new.)

Details: If our hypotheses h_i did better than 70% accuracy, then it's
not hard to see that this method does even better. As the accuracies get
close to 1, it gets harder to simulate the distribution D_2, but this
extra time is linear in 1/epsilon, so it's OK.
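Here is a minimal Python sketch of the whole construction. Again, it is
not the lecture's code: "draw" is a hypothetical stand-in for example
access to D (returning one labeled pair (x, y)), and "weak_learn" stands
in for algorithm A (taking a list of labeled examples and returning a
hypothesis).

    import random

    def draw_from_D2(draw, h1):
        """Coin-flip filter: give CORRECT and INCORRECT equal weight."""
        want_correct = random.random() < 0.5
        while True:
            x, y = draw()                     # one labeled example from D
            if (h1(x) == y) == want_correct:  # on the side the coin chose?
                return x, y

    def draw_from_D3(draw, h1, h2):
        """D_3: condition D on the event h_1(x) != h_2(x)."""
        while True:
            x, y = draw()
            if h1(x) != h2(x):
                return x, y

    def boost_once(draw, weak_learn, m):
        h1 = weak_learn([draw() for _ in range(m)])
        h2 = weak_learn([draw_from_D2(draw, h1) for _ in range(m)])
        h3 = weak_learn([draw_from_D3(draw, h1, h2) for _ in range(m)])
        # Majority vote: if h_1 and h_2 agree, go with them; else h_3.
        def h(x):
            return h1(x) if h1(x) == h2(x) else h3(x)
        return h

    # Numerical check of the recurrence p -> 3p^2 - 2p^3:
    for p in (0.55, 0.7, 0.9):
        print(p, "->", 3 * p**2 - 2 * p**3)
    # 0.55 -> 0.57475, 0.7 -> 0.784, 0.9 -> 0.972

Note that draw_from_D2 loops longer as h_1's accuracy approaches 1 --
this is the simulation cost mentioned in the Details paragraph above.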
==========================================================================

Application to recognizing handwritten digits: see slides.

==========================================================================

Refs:

Robert Schapire, "The strength of weak learnability", Machine Learning,
5(2):197-227, 1990.

Yoav Freund, "Boosting a weak learning algorithm by majority",
Information and Computation, 121(2):256-285, 1995.

Yoav Freund and Robert Schapire, "Experiments with a new boosting
algorithm", Proceedings of the Thirteenth International Conference on
Machine Learning (ICML '96), 1996.