Notes for Feb 16, 1998

* Boosting.
==============================================================================

Def: Let's say that alg A WEAK-LEARNS class C if for all c in C, all
distributions D, THERE EXISTS epsilon, delta > 1/poly(n,s) s.t. A
achieves error 1/2 - epsilon with prob at least delta.

I.e., with some probability you get noticeable correlation. 

Question: supposed we defined PAC model this way, does this change the
notion of what is learnable and what is not?  

View as algorithm: have a black box (neural net) that can get some
fraction correct, and you want to use it to either find a disrib
that's inherently hard for that algorithm, or else to boost up your
accuracy. 

Dist specific: can't do it.  Imagine uniform dist. concept = x1 ^ (hard).  
Can get 75% right, but hard to do better.

--> Use the fact that it must get weak learning over ALL
distributions.


An easy case
------------
Suppose our weak-learning algorithm only makes one-sided error.  For
instance, it gets all of the negatives right and at least 1/poly of
the positives right. (E.g., when learning a disjunction, greedy alg
picks a variable with this property).  Then, boosting is easy:

We first find h_1 that gets at least 1/poly of the positives right
over distribution D. We then define D_2 to be D restricted to examples
x such that h_1(x) = 0 (i.e., throw out the positives we got right) and
learn over that, etc. 

Our final hypothesis is the OR of all the h_i(x)'s.  

For boosting hypotheses that make 2-sided error, though, it's
a little more tricky.....

Boosting: [Shapire '90]  (a more detailed description of this is in the book)
-----------------------
Suppose we have an algorithm A that over any distribution will get at
least 70% accuracy with high probability.  Say we want to boost it up
a bit (which we can then apply recursively). 

We start by running A on initial distribution D_1 = D and getting
hypothesis h_1.  Say its accuracy is 70%.  Now we want a new
distribution.

One try: filter D to only look at examples where h_1 predicts
incorrectly and try to learn over that.  This DOESN'T work.  Why?

Instead: we filter to create a distribution where h_1 behaves like random
guessing.  Specifically, let S_C be the set of examples on
which h_1 predicts correctly and S_I be the set on which h_1
predicts incorrectly.  What we want is a distribution D_2 where
S_C and S_I both have equal weight, but besides that is just like
D_1 in that Pr_{D_2}(x | x in S_C) = Pr_{D_1}(x | x in S_C)
and the same for S_I.  For example, we can construct D_2 by
flipping a coin and if heads we wait until we see an  
example from D_1 where h_1 is correct, and if tails we wait until we
see one where h_1 is incorrect.  Or we can draw examples from D_1
and place them into two queues, one for S_C and one for S_I, and
then to get an example from D_2 we randomly choose a queue to select
from.   

Let's write D_2[x] as a function of D_1[x] (this is the
probability associated with x, or the density function if we're in a
continuous space).  To create D_2 we reduced the probability on
S_C from 70% to 50%, so if x in S_C then D_2[x] =
(5/7)D_1[x].  We increased the probability on S_I from 30% to 50%
so if x in S_I then D_2[x] = (5/3)D_1[x].

Now we get a hypothesis h_2 with accuracy 70% on D_2.

Finally, we filter D only allowing examples x where h_1(x) != h_2(x).
We then get hypothesis h_3 with accuracy 70% on this distribution D_3.


Resulting hypothesis: if h_1 and h_2 agree, then go with that, else go
with h_3.  I.e., majority vote. 

Analysis
--------
What's the accuracy of the new hypothesis?

Easiest way is to draw a picture.  Divide space into 4 regions: R_1 in
which h_1 = c and h_2 = c, R_2 in which h_1 = c and h_2 != c, R_3 in
which h_1 != c and h_2 != c, and R_4 in which h_1 != c and h_2 = c.

We predict correctly on R_1, incorrectly on R_3 and get 70%
correct on R_2 U R_4.  Let's say that D_2[R_2] = gamma.  By
the fact that h_2 has 30% error on D_2 we know that D_2[R_3] =
0.3 - gamma, and by definition of D_2 we know that D_2[R_1 U
R_2] = D[R_3 U R_4] = 1/2, so
  D[R_1] = 7/5 * (1/2 - gamma)
and
  D[R_3] = 3/5 * (3/10 - gamma).

We can now work out our failure probability as:

Pr_D[fail]  =  D[R_3] + (3/10)(1 - D[R_1] - D[R_3]) 
            = (7/10)D[R_3] + (3/10)(1 - D[R_1])
            = (7/10)[3/5 * (3/10 - gamma)] + (3/10)[3/10 + (7/5)gamma]
            = (3/10)^2[3 - 3/5]
which is approx 0.216

More generally, if we work this through, we get:
	error <- error^2(3 - 2*error)
This is always a decrease assuming that our original error was
strictly between 0 and 1/2.

============================================================================

Better algorithm called "AdaBoost" by Freund and Schapire.  Idea: use
Randomized-WM algorithm, but in a strange way.

Setup: first draw large sample that will be sufficient by Occam arguments.
Will then have distributions over this fixed sample.  E.g., we could
have done it this way above, by just re-weighting the points.

- Use Weighted-majority, but in peculiar way.  Experts will correspond
to examples.  Weighting of experts will be the current distribution.
"trials" will correspond to the weak hypotheses.

Specifically, the algorithm is:

0. Draw n examples.

1. Start with weight w_i = 1 on each example x_i.  

   The associated prob dist D has P(x_i) = w_i/W, where W = sum(w_i).

2. Feed distribution to weak algorithm, get back hypothesis h_t.  Now update:

	Multiply w_i by beta_t if h(x_i) = c(x_i).

   beta_t = (1/2 - gamma_t)/(1/2 + gamma_t),

   where 1/2-gamma_t is the error of the hypothesis h_t over distrib D_t.

3. Final hypothesis: say "positive" if sum_t[ alpha_t * h_t(x) ] > 1/2

   where alpha_t = ln(1/beta_t) / sum_s[ ln(1/beta_s) ]

Claim: training error(h_final) < e^{-2*sum_t[ (gamma_t)^2 ]}

(So, if each time we have error at most 1/2 - gamma, and our goal is
to achieve error epsilon, then we need to run this 
O((1/gamma)^2 * log(1/epsilon)) times.)

E.g., say all have error < 25%, so gamma = 0.25.  Say we want error
1%.  Solve: T = 8*ln(100), which is about 37.


Need to multiply number of samples from VC bounds by T to have
training error be guaranteed to be close to test error (whp).

Let's analyze a simpler version:  use fixed beta = 1/2 - 2*gamma.
Final hypothesis is just majority vote over the h_t's.


Analysis
--------

Let's recall our analysis of RWM.  We did this by looking at the total
weight W_final at the end of all the trials.  If F_t is the fraction
of weight that got multiplied by beta on trial t, then an upper
bound on W_final is
	W_final <= n*(1-(1-beta)F_1)*(1-(1-beta)F_2)*...

so, taking logs and using ln(1-x) > -x, we got:

	ln(W_final) <= ln(n) - (1-beta) * sum[ F_t ]

Also, if the best expert made m mistakes, then we got a lower bound:

	ln(W_final) >= ln(beta^m).

These two inequalities were useful because sum[F_t] was equal to our
expected number of mistakes, so we could then solve for that.


In our current setting, F_t is the fraction of the current distribution that
the current hypothesis got *correct*.  So, F_t > 1/2 + gamma, so

	ln(W_final) <= ln(n) - (1-beta)*T*(1/2 + gamma)

Also, we can lower bound W_final by noticing that any example on which
our final hypothesis makes a mistake must have weight at least
beta^{T/2}, since if our majority vote is wrong then we could have
multiplied it by beta *at most* T/2 times.  So, if in the end our
error is E (i.e., we make E*n mistakes), then 
	W_final >= beta^{T/2} * E * n
or
	ln(W_final) >= T/2 * ln(beta) + ln(E) + ln(n)

Using the upper and lower bounds on ln(W_final) together, we can cross
off the ln(n) term and we get:

	T/2 * ln(beta) + ln(E) <= -(1-beta)*T*(1/2 + gamma)

or

	ln(E) <= -T*(ln(beta)/2 + (1-beta)(1/2 + gamma) 

now, using beta = 1-2*gamma, we get:
	
	ln(E) <= -T*(ln(1-2*gamma)/2 + gamma + 2*gamma^2)

now using ln(1-x) is approx -x - x^2/2 (this is cheating a bit...) we get

	ln(E) <= -T*gamma^2

This is the form of inequality we wanted.