15-681 Machine Learning. Lecture 9/19 (1/2 of a lecture) * combining predictors/experts/hypotheses with the weighted majority algorithm ======================================================================= So far, we've been talking about situations where we are assuming implicitly or explicitly that there is a static distribution. E.g., statistical tests, training set, pruning set, testing set. (E.g., DTs) PAC model. (also made assumption that there was a correct answer) Sometimes, this is not such a great assumption. E.g., calendar. Suppose we throw all those assumptions out. No distribution. No correct answer. What can we say? To do this, we'll move to a different paradigm. Instead of train and then test, let's talk about on-line learning. The problem of "predicting from expert advice": Each day we're given the predictions of n experts on whether or not it will rain. Then, we want to listen to their advice and then predict for ourselves whether or not it will rain. Then, we find out. Our goal: not make too many more mistakes than that of the best expert. Example: predictions answer Y Y Y N --> Y Y N N Y --> Y Y N N N --> N N Y N Y --> N Say we assumed a fixed distribution. E.g., each expert has a fixed probability of predicting correctly on each day. Then, we could just say: "wait a while, and then go with the best". Use our statistical tests. But, what if we'd rather not wait, or we don't want to assume a fixed distribution? Some ways of thinking about this: (1) Each expert is a hypothesis. "Batch" method would be to test on data and then select one. Instead, lets use all of them, put them into some "master" algorithm, and try to guarantee we will do nearly as well as the best of them. E.g., Tom's graph of size of DT versus performance. Which one to choose? Instead, lets keep all of them and combine their responses. (2) Think of each expert as a feature. What is the concept class? We're trying to do nearly as well as the best "single variable concept". Really easy in PAC model (only n hypotheses) but now we are removing the PAC assumptions? Suggested algorithm type? Weighted Majority algorithm: Assign each expert a weight of 1 to start. Predict according to a (weighted) majority vote. Penalize experts that make a mistake by cutting their weight in half. Let's do the above example: we predict correct answer weights 1 1 1 1 predictions Y Y Y N Y Y weights 1 1 1 .5 predictions Y N N Y N Y weights 1 .5 .5 .5 predictions Y N N N N N weights .5 .5 .5 .5 predictions N Y N Y either N weights .5 .25 .5 .25 Analysis: We don't do too much worse than the best expert. Let W = the total weight of all the experts. So, initially W=n. If the algorithm makes a mistake, this means that at least half of the total weight of experts predicted incorrectly. So, by how much does W decrease (at least)? After M mistakes, W is at most.... W < n*(3/4)^M Say the best expert has made m mistakes. What is its weight w? w = (1/2)^m W >= w. Why? n*(3/4)^M > (1/2)^m (1/n)*(4/3)^M < 2^m -lg(n) + M*lg(4/3) < m M < 2.41*(m + lg(n)). So, we make at most 2.4 times as many mistakes as the best expert, plus some additive factor. In practice, the algorithm will do much better than this. Why? Where have we made pessimistic assumptions? * Sometimes weight will decrease when we DONT make a mistake. * Sometimes weight will go down by MORE than 1/4. * If several experts are pretty good, then W will be a fair bit larger than w. In fact, can modify algorithm so that even in the worst case, the "2.41" becomes nearly 1. In expts I've seen, these modifications don't reduce error, but they do make the algorithm usable in more general settings... Next time: * More discussion & some variations. * Some applications