15-681 Machine Learning.  Lecture 9/19
(1/2 of a lecture)

* combining predictors/experts/hypotheses with the weighted majority
  algorithm

=======================================================================

So far, we've been talking about situations where we are assuming
implicitly or explicitly that there is a static distribution.
	E.g., 	statistical tests,
		training set, pruning set, testing set.  (E.g., DTs)
		PAC model.  (also made assumption that there was a
			     correct answer)

Sometimes, this is not such a great assumption.  E.g., calendar.

Suppose we throw all those assumptions out.  No distribution.  No
correct answer.  What can we say?

To do this, we'll move to a different paradigm.  Instead of train and
then test, let's talk about on-line learning. 

The problem of "predicting from expert advice":

	Each day we're given the predictions of n experts on whether
	or not it will rain.

	Then, we want to listen to their advice and then predict for
	ourselves whether or not it will rain.

	Then, we find out.

Our goal: not make too many more mistakes than that of the best expert.

Example:
	        predictions       answer

		Y  Y  Y  N     -->  Y
                Y  N  N  Y     -->  Y
                Y  N  N  N     -->  N
                N  Y  N  Y     -->  N

Say we assumed a fixed distribution.  E.g., each expert has a fixed
probability of predicting correctly on each day.  Then, we could just
say: "wait a while, and then go with the best".  Use our statistical
tests.  But, what if we'd rather not wait, or we don't want to assume
a fixed distribution?

Some ways of thinking about this:

(1) Each expert is a hypothesis.  "Batch" method would be to test on
data and then select one.  Instead, lets use all of them, put them
into some "master" algorithm, and try to guarantee we will do nearly
as well as the best of them.  E.g., Tom's graph of size of DT versus
performance.  Which one to choose?  Instead, lets keep all of them and
combine their responses. 

(2) Think of each expert as a feature.  What is the concept class?
We're trying to do nearly as well as the best "single variable
concept".  Really easy in PAC model (only n hypotheses) but now we are
removing the PAC assumptions?

Suggested algorithm type?


Weighted Majority algorithm:
	Assign each expert a weight of 1 to start.

	Predict according to a (weighted) majority vote.

	Penalize experts that make a mistake by cutting their weight
	in half. 

Let's do the above example:
	                         we predict   correct answer
   weights       1   1   1   1
   predictions   Y   Y   Y   N       Y             Y
   weights       1   1   1  .5
   predictions   Y   N   N   Y       N             Y
   weights       1  .5  .5  .5
   predictions   Y   N   N   N       N             N
   weights      .5  .5  .5  .5
   predictions   N   Y   N   Y     either          N
   weights      .5  .25 .5  .25

Analysis:  We don't do too much worse than the best expert.

  Let W = the total weight of all the experts.  So, initially W=n.

  If the algorithm makes a mistake, this means that at least half of
  the total weight of experts predicted incorrectly.  So, by how much
  does W decrease (at least)?

  After M mistakes, W is at most....

		W < n*(3/4)^M

  Say the best expert has made m mistakes.  What is its weight w?

		w = (1/2)^m

  W >= w.  Why?

		n*(3/4)^M > (1/2)^m

		(1/n)*(4/3)^M < 2^m

		-lg(n) + M*lg(4/3) < m

                 M < 2.41*(m + lg(n)).

  So, we make at most 2.4 times as many mistakes as the best expert,
  plus some additive factor.

In practice, the algorithm will do much better than this.  Why?  Where
have we made pessimistic assumptions?

	* Sometimes weight will decrease when we DONT make a mistake.
        * Sometimes weight will go down by MORE than 1/4.
        * If several experts are pretty good, then W will be a fair
          bit larger than w.

In fact, can modify algorithm so that even in the worst case, the
"2.41" becomes nearly 1.  In expts I've seen, these modifications don't
reduce error, but they do make the algorithm usable in more general
settings...

Next time:  
        * More discussion & some variations.
	* Some applications