15-681 Machine Learning.  Lecture 9/24

Recap: End of last time, talking about on-line learning and the problem
of predicting from expert advice.

- Goal: be not too much worse than the best expert.  Could just sample,
  take statistics, and pick the best, but...
  -- we don't want to wait.
  -- what if there is no fixed distribution?

If one expert is doing much better than the rest, then we want to go
with it.  If several are good, then majority voting sounds best.

Nice algorithm that achieves both: the Weighted Majority Algorithm.

  Assign each expert a weight of 1 to start.
  Predict according to a (weighted) majority vote.
  Penalize experts that make a mistake by multiplying their weights by 1/2.

Proved: # mistakes M < 2.41*(m + lg(n)), where n = number of experts and
m = # mistakes of the best expert so far.

Uses:
- an alternative to selecting among several hypotheses.
- can combine several learning algorithms.
- the lg(n) means you can add in lots of experts / hypotheses /
  learning algorithms without much penalty.

Today: applications and variations.  The algorithm is very flexible.

***Calendar application***

Motivation:
1. many of ID3's rules are simple.
2. lots of features available.
3. time dependence (get to this later).

Idea:
- Create one expert for every pair of features.  So, have 561 experts.
- Each expert performs a simple lookup-table rule.  Given a pair of
  values for its two features, the expert predicts the most common
  outcome out of the last 4 times (say) that that pair of values
  appeared.

  E.g., say we're predicting day of week.  If the features are "seminar
  type" and "names of attendees", and the values are nil (meaning it's
  not a seminar) and Caruana (meaning it's a meeting with Rich Caruana),
  then we look at the last 4 meetings with Rich Caruana and take the
  most common day.

- Note: in doing this, we have moved the "time dependence" inside the
  experts.  This gives us more flexibility in dealing with the
  recency/frequency tradeoff.
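One of these lookup-table experts could be sketched as follows (a minimal sketch; the class and method names are mine, not from the actual calendar system):

```python
from collections import Counter, defaultdict, deque

class PairExpert:
    """One expert per pair of features.  For each pair of values it has
    seen, it remembers only the last k outcomes (k = 4 in the lecture)
    and predicts the most common one -- so the "time dependence" lives
    inside the expert."""

    def __init__(self, feat_a, feat_b, k=4):
        self.feat_a, self.feat_b = feat_a, feat_b
        # Maps a (value_a, value_b) pair to a bounded history of outcomes.
        self.history = defaultdict(lambda: deque(maxlen=k))

    def predict(self, example, default=None):
        outcomes = self.history[(example[self.feat_a], example[self.feat_b])]
        if not outcomes:
            return default          # never seen this value pair
        return Counter(outcomes).most_common(1)[0][0]

    def update(self, example, outcome):
        self.history[(example[self.feat_a], example[self.feat_b])].append(outcome)
```

So the expert for ("seminar type", "names of attendees") with values (nil, Caruana) looks up the last 4 meetings with Rich Caruana, no matter how long ago they occurred.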
  E.g., even if the last meeting with person X was 200 days ago, we can
  still remember it.

  As a general rule of thumb, the simpler the learning algorithm (e.g.,
  a lookup table is simpler than a decision tree is simpler than a
  neural net), the more flexible you can make it for issues like this.

Results: about 5-10% better accuracy than ID3.  On the higher end when
using all the features (not just the ones selected for ID3).

============================================================================

The algorithm is built for the goal of: "not too much worse than the
best expert so far".  What if one expert is best for a while, then that
expert starts doing badly and another becomes best?  E.g., some weather
predictors do better in fall, some in winter, etc.  (What might be some
other situations like that?)  Might want the stronger goal: "not too
much worse than the best expert LATELY".

1. Notice that the weight on an expert is a function of the total
   number of mistakes it has made so far, so it doesn't distinguish
   recent versus old mistakes.  Any ideas on modifying the algorithm?

   Here's one: modify the algorithm to say that if weight w_i < W/(10n),
   then don't cut it in half.  (The number 10 is arbitrary; any number
   greater than 2 would be fine.)

   Claim: say you have a long sequence of examples.  In any contiguous
   block, the total number of mistakes is not too much worse than the
   best expert IN THAT BLOCK.

   Let's analyze it.  Let W_init = total weight at the start of the
   block.  How much does W drop when a mistake is made?
   - at least W/2 is on the portion that made a mistake.
   - at most W/10 of that is on the portion that can't be halved
     (at most n experts, each with weight below W/(10n)).
   So, at least 4/10 of the weight gets halved
   ==> remove at least 1/5 of the total weight
   ==> after M mistakes, total weight < W_init*(4/5)^M.

   The weight on the best expert starts at at least W_init/(20n), since
   halving never takes a weight below W/(20n).  So, if it makes m
   mistakes in the block, its weight ends at at least:
   W_init/(20n) * (1/2)^m.

   So,  W_init/(20n) * (1/2)^m  <  W_init*(4/5)^M
   ==>  (5/4)^M  <  2^m * 20n
   ==>  M  <  3.11 * [m + lg(20n)]

2. Going back to the original algorithm, the constant 2.41 is kind of
   big.
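The weight-floor variant analyzed above can be sketched as follows (a minimal sketch for binary 0/1 predictions; the function names are mine):

```python
def wm_predict(weights, predictions):
    """Deterministic weighted-majority vote over 0/1 predictions."""
    weight_for_1 = sum(w for w, p in zip(weights, predictions) if p == 1)
    return 1 if weight_for_1 >= sum(weights) / 2 else 0

def wm_update_with_floor(weights, predictions, outcome):
    """Halve each mistaken expert's weight, EXCEPT experts whose weight
    has already fallen below W/(10n).  As a result no weight ever falls
    below W/(20n), so an expert that starts doing well again can regain
    influence quickly."""
    W, n = sum(weights), len(weights)
    floor = W / (10 * n)
    return [w if p == outcome or w < floor else w / 2
            for w, p in zip(weights, predictions)]
```

With two experts where one is always wrong, the loser's weight bottoms out near W/(20n) instead of shrinking toward zero -- exactly the property the block analysis uses.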
   As we pointed out last time, in practice it does much better.  But
   here is a modification that reduces that worst-case bound.  (In
   experiments I've seen, these modifications don't affect error, but
   they do make the algorithm usable in more general settings...)

   Modification 1: replace 1/2 by beta.
   Modification 2: use the weights as probabilities.

   I.e., the algorithm is:

     Assign each expert a weight of 1 to start.
     Predict outcome X with probability proportional to the total
     weight on that outcome.  (Equivalent to: select expert i with
     probability w_i/W.)
     Penalize experts that make a mistake by multiplying their weight
     by beta.

   Intuitively, why is the worst case better?  [When the weights are
   split nearly 50/50, the deterministic algorithm is guaranteed to
   make the mistake; the randomized one is wrong only about half the
   time.]

   New bound: the expected number of mistakes M is bounded by:

            m ln(1/beta) + ln(n)
     M  <=  --------------------
                 (1 - beta)

   E.g., beta = 1/2 gives M <= 1.39m + 2 ln(n);
         beta = 3/4 gives M <= 1.15m + 4 ln(n).

Application: say the time between seeing an example and making your
prediction must be short, but you have more time before the next
example arrives.  Now you only need to examine one expert to make a
prediction (though you still need to examine all of them to update the
weights).

Application: can use in more general settings where the experts are
suggesting "strategies" or "actions" that can't be combined so well.
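The randomized version and its bound can be sketched as follows (a sketch under the binary-outcome setting above; function names are mine):

```python
import math
import random

def rwm_predict(weights, predictions, rng=random):
    """Select expert i with probability w_i / W and follow it.
    (Equivalent to predicting each outcome with probability
    proportional to the total weight voting for it.)"""
    r = rng.random() * sum(weights)
    for w, p in zip(weights, predictions):
        r -= w
        if r <= 0:
            return p
    return predictions[-1]

def rwm_update(weights, predictions, outcome, beta=0.5):
    """Multiply each mistaken expert's weight by beta."""
    return [w * beta if p != outcome else w
            for w, p in zip(weights, predictions)]

def rwm_bound(m, n, beta):
    """Bound on the expected number of mistakes:
    (m*ln(1/beta) + ln(n)) / (1 - beta)."""
    return (m * math.log(1 / beta) + math.log(n)) / (1 - beta)
```

Note the coefficient on m is ln(1/beta)/(1-beta), which tends to 1 as beta -> 1: by tuning beta you can get arbitrarily close to the best expert, at the cost of a larger coefficient on the ln(n) term.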