15-681 Machine Learning. Lecture 9/19
(1/2 of a lecture)
* combining predictors/experts/hypotheses with the weighted majority
algorithm
=======================================================================
So far, we've been talking about situations where we are assuming
implicitly or explicitly that there is a static distribution.
E.g., statistical tests,
training set, pruning set, testing set. (E.g., DTs)
PAC model. (also made assumption that there was a
correct answer)
Sometimes, this is not such a great assumption. E.g., calendar.
Suppose we throw all those assumptions out. No distribution. No
correct answer. What can we say?
To do this, we'll move to a different paradigm. Instead of train and
then test, let's talk about on-line learning.
The problem of "predicting from expert advice":
Each day we're given the predictions of n experts on whether
or not it will rain.
Then, we want to listen to their advice and then predict for
ourselves whether or not it will rain.
Then, we find out.
Our goal: not to make too many more mistakes than the best expert.
Example:
predictions answer
Y Y Y N --> Y
Y N N Y --> Y
Y N N N --> N
N Y N Y --> N
Say we assumed a fixed distribution. E.g., each expert has a fixed
probability of predicting correctly on each day. Then, we could just
say: "wait a while, and then go with the best". Use our statistical
tests. But, what if we'd rather not wait, or we don't want to assume
a fixed distribution?
Some ways of thinking about this:
(1) Each expert is a hypothesis. "Batch" method would be to test on
data and then select one. Instead, let's use all of them, put them
into some "master" algorithm, and try to guarantee we will do nearly
as well as the best of them. E.g., Tom's graph of size of DT versus
performance. Which one to choose? Instead, let's keep all of them and
combine their responses.
(2) Think of each expert as a feature. What is the concept class?
We're trying to do nearly as well as the best "single variable
concept". Really easy in the PAC model (only n hypotheses); but now
we are removing the PAC assumptions.
Suggested algorithm type?
Weighted Majority algorithm:
Assign each expert a weight of 1 to start.
Predict according to a (weighted) majority vote.
Penalize experts that make a mistake by cutting their weight
in half.
Let's do the above example:
                                        we predict   correct answer
weights       1    1    1    1
predictions   Y    Y    Y    N              Y              Y
weights       1    1    1   .5
predictions   Y    N    N    Y              N              Y
weights       1   .5   .5   .5
predictions   Y    N    N    N              N              N
weights      .5   .5   .5   .5
predictions   N    Y    N    Y            either           N
weights      .5  .25   .5  .25
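The hand simulation above can be reproduced with a short sketch (Python; the
lecture gives no code, so all names here are my own):

```python
def weighted_majority(expert_predictions, answers):
    """Run the Weighted Majority algorithm.

    expert_predictions[t][i] is expert i's prediction ('Y'/'N') on day t;
    answers[t] is the true outcome. Returns (our predictions, final weights).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    our_predictions = []
    for preds, answer in zip(expert_predictions, answers):
        # Weighted vote: compare the total weight behind 'Y' vs 'N'.
        y_weight = sum(w for w, p in zip(weights, preds) if p == 'Y')
        n_weight = sum(weights) - y_weight
        guess = 'Y' if y_weight > n_weight else 'N'   # ties go to 'N' here
        our_predictions.append(guess)
        # Penalize: halve the weight of every expert that was wrong.
        weights = [w / 2 if p != answer else w
                   for w, p in zip(weights, preds)]
    return our_predictions, weights

# The four days from the example above:
days = [list("YYYN"), list("YNNY"), list("YNNN"), list("NYNY")]
answers = list("YYNN")
guesses, final = weighted_majority(days, answers)
print(guesses)   # ['Y', 'N', 'N', 'N'] -- the last day's tie resolved to 'N'
print(final)     # [0.5, 0.25, 0.5, 0.25]
```

Note the "either" on the last day: the vote is tied at .5 + .5 per side, so any
tie-breaking rule is consistent with the algorithm; this sketch picks 'N'.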
Analysis: We don't do too much worse than the best expert.
Let W = the total weight of all the experts. So, initially W=n.
If the algorithm makes a mistake, this means that at least half of
the total weight of experts predicted incorrectly. So, by how much
does W decrease (at least)? The incorrect half (or more) of the weight
gets cut in half, so W drops to at most 3/4 of its previous value.
After M mistakes, W is at most:
W <= n*(3/4)^M
Say the best expert has made m mistakes. What is its weight w?
w = (1/2)^m
W >= w. Why? Because w is just one of the nonnegative terms in the
sum that makes up W. Combining the two bounds:
n*(3/4)^M >= (1/2)^m
(1/n)*(4/3)^M <= 2^m
-lg(n) + M*lg(4/3) <= m
M <= 2.41*(m + lg(n)).
So, we make at most about 2.41 times as many mistakes as the best expert,
plus an additive lg(n) factor.
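As a sanity check (mine, not part of the lecture), the bound can be verified
numerically on random prediction sequences:

```python
import math
import random

def wm_mistakes(expert_preds, answers):
    """Run Weighted Majority; return (our mistakes M, best expert's mistakes m)."""
    n = len(expert_preds[0])
    weights = [1.0] * n
    M = 0
    expert_m = [0] * n
    for preds, answer in zip(expert_preds, answers):
        y_w = sum(w for w, p in zip(weights, preds) if p == 'Y')
        guess = 'Y' if y_w > sum(weights) - y_w else 'N'
        if guess != answer:
            M += 1
        for i, p in enumerate(preds):
            if p != answer:
                expert_m[i] += 1
                weights[i] /= 2
    return M, min(expert_m)

random.seed(0)
c = 1 / math.log2(4 / 3)               # the "2.41" constant
n, days, violations = 5, 60, 0
for _ in range(200):
    preds = [[random.choice('YN') for _ in range(n)] for _ in range(days)]
    answers = [random.choice('YN') for _ in range(days)]
    M, m = wm_mistakes(preds, answers)
    if M > c * (m + math.log2(n)):
        violations += 1
print(violations)   # 0: M <= 2.41*(m + lg n) held in every trial
```

The check also illustrates the "pessimistic" point below: on random data, M is
typically far under the bound.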
In practice, the algorithm will do much better than this. Why? Where
have we made pessimistic assumptions?
* Sometimes weight will decrease when we DON'T make a mistake.
* Sometimes weight will go down by MORE than 1/4.
* If several experts are pretty good, then W will be a fair
bit larger than w.
In fact, the algorithm can be modified so that even in the worst case the
"2.41" becomes nearly 1. In experiments I've seen, these modifications
don't reduce error, but they do make the algorithm usable in more general
settings...
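The modification alluded to is, I believe, along the lines of Randomized
Weighted Majority (Littlestone & Warmuth): predict randomly in proportion
to the weights, and penalize wrong experts by a gentler factor beta < 1
instead of 1/2. The expected-mistake bound becomes
(m*ln(1/beta) + ln(n)) / (1 - beta), whose coefficient on m approaches 1
as beta -> 1. A sketch under those assumptions (the parameter beta and all
names are mine):

```python
import math

def rwm_expected_mistakes(expert_preds, answers, beta=0.9):
    """Randomized Weighted Majority sketch: predict 'Y' with probability
    proportional to the weight behind 'Y', and multiply each wrong
    expert's weight by beta rather than 1/2.

    Returns the expected number of our mistakes, computed exactly as the
    sum over days of the fraction of weight on the wrong prediction.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    expected = 0.0
    for preds, answer in zip(expert_preds, answers):
        total = sum(weights)
        # P(we predict wrong today) = fraction of weight that is wrong.
        wrong = sum(w for w, p in zip(weights, preds) if p != answer)
        expected += wrong / total
        weights = [w * beta if p != answer else w
                   for w, p in zip(weights, preds)]
    return expected

# On the four days from the earlier example, the best expert makes m = 1 mistake.
days = [list("YYYN"), list("YNNY"), list("YNNN"), list("NYNY")]
answers = list("YYNN")
beta = 0.9
e = rwm_expected_mistakes(days, answers, beta)
bound = (1 * math.log(1 / beta) + math.log(4)) / (1 - beta)
print(e <= bound)   # True
```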
Next time:
* More discussion & some variations.
* Some applications