Notes for 1/26/98 Admin: did the extra copies of the book come in yet? Note: book doesn't talk about material we've been doing so far, but we'll be switching topics soon to material that's in the book. -------------------------------------------------------------------------- -- recap from last time about Randomized WM: - algorithm: use weights as probabilities. Update multiplicatively. - we found this had a nice bound: M <= (m*ln(1/beta) + ln(n)) / (1 - beta) As time goes on, the first term will be increasing and the second is static (it's an additive constant), so want to adjust beta with time. For instance, you can get that in T trials, M <= m + ln(n) + O(sqrt(T ln n)) One way of looking at this, if we divide both sides by T, is that our average loss per trial is approaching that of the best expert at a rate like 1/sqrt(T). Here is a neat example for the case of 2 experts. Imagine someone was flipping a coin of bias p. You want to predict outcome. Best prediction is 0 if p<1/2 or 1 if p>1/2. prob of making a mistake is min(p, 1-p). If you didn't know p, you could take majority vote of all outcomes you'd seen so far. In the limit, your prob of mistake would approach this. Now, what if you remove assumption bits are coming from a coin. Just some guy saying H, T, T, T, H, .... Can you have a strategy where, if in hindsight there were 30% heads, you made a mistake only a bit more than 30% of the time? This is a special case of the 2-player game setting zero-sum 2-player games (aka matrix games) ------------------------------------------- Here is a game: A and B each hide either a nickel or a dime behind their backs. Then they reveal their coins. If match, then A gets both. Else, B gets both. This is an example of a 2-person zero-sum game. Zero-sum means that what A wins = what B loses, and vice versa. Is this a fair game? (who would you rather be?) What's a good strategy? This kind of game is often called a MATRIX GAME: Payoff matrix (payoff to A) 5 -5 -10 10 row-player picks a row and column-player picks a column, and you get payoff. "Pure strategy" -> deterministic choice of row(column) "Mixed strategy" -> probabilistic choice. For a given (pure or mixed) strategy S_A of A and S_B of B, the value V(S_A, S_B) of the game to A is A's expected winning per round, given that both players follow this strategy. E.g., if S_A is "A picks randomly" and S_B is "B always hides a nickel", then V() = -2.5 cents. For a given S_A, what is min V(S_A, S_B)? S_B Can think of this as the value of the game if A uses S_A, and announces this to B, who can then use this information to pick his best strategy S_B. In fact, if A announces his strategy, then B can pick a pure (non-randomized) strategy. This is like the problem of selecting stocks when you know the future -- there's no reason to diversify. For a given S_B, what does max V(S_A, S_B) mean? S_A Can think of this as the value of the game if B uses S_B, and announces this to A, who can then use this information to pick his best strategy S_A. Suppose that someone has to announce their strategy first. Which would you rather be? Minimax theorem: max min V(S_A, S_B) = min max V(S_A, S_B) S_A S_B S_B S_A This is the value of the game V. Easy direction to see is the <= To get the >= direction, suppose for contradiction that the LHS was V-epsilon. Now, lets run our on-line algorithm. On every play, our opponent knows our randomized strategy so we should not be able to make more than V-epsilon on average per game. But, in hindsight, there must be a fixed row that would have achieved value at least V. This contradicts the theorem. ---------------------------------------------- Turns out this kind of thing has been proved over and over in different areas in the past (though improvements/simplifications over time) Hannan '57. Banos '68 Megiddo '80 learning theory papers in '90s Problem in game setting is that the weight on expert/strategy i depends on just the total loss of expert i. E.g., if I'm playing against this algorithm, and I say "rock" 100 times in a row, then it will do great, but the weight on paper will be 1, the weight on rock will be 2^(-50), and the weight on scissors will be 2^(-100). So, if I now say scissors, it will keep saying paper for a long time (about 50 times). Then I can do paper and rock for a while since it is not going to say scissors for a long time. So, to be good in practice, we'd like to be more adaptive. Can you think of ways to do that? -- could refuse to lower weights that are already really low -- when you remove some amount x from the total weight W, you could put some of it back evenly among the experts. (Homework question on analyzing this in more detail -- do easier case of deterministic alg on hwk, but same idea applies to randomized alg too). (A potential project: might be fun to try various algs on people for simple game (or not so simple game), say with 100 reps, and see how they do.) Winnow Algorithm ---------------- Now turn to an alg that is something like a combination of the WM alg and the Perceptron alg. Look at in this context: Say we're trying to learn an OR function. "This image is a picture of a person if it has this feature or that feature or that feature". We saw an algorithm: "list all features and cross off bad ones on negative examples" that can make n mistakes. But, what if most features are irrelevant. What if the target is an OR of r relevant features where r is a lot smaller than n. Can we get a better bound in that case? Winnow will give us a bound of O(r log n) mistakes. Algorithm:(a simple version) \begin{enumerate} \item Initialize the weights $w_1, \ldots, w_n$ of the variables to 1. \item Given an example $x = \{x_1, \ldots, x_n\}$, output 1 if $$w_1x_1 + w_2x_2 + \ldots + w_nx_n \geq n$$ and output 0 otherwise. \item If the algorithm makes a mistake: \begin{enumerate} \item If the algorithm predicts negative on a positive example, then for each $x_i$ equal to 1, double the value of $w_i$. \item If the algorithm predicts positive on a negative example, then for each $x_i$ equal to 1, cut the value of $w_i$ in half. \end{enumerate} \item Goto 2. \end{enumerate} \end{description} \begin{theorem} The Winnow Algorithm learns the class of disjunctions in the Mistake Bound model, making at most $2 + 3r(1 + \lg n)$ mistakes when the target concept is a disjunction of $r$ variables. \end{theorem} {\em Proof. } Let us first bound the number of mistakes that will be made on positive examples. Any mistake made on a positive example must double at least one of the weights in the target function (the {\em relevant} weights), and a mistake made on a negative example will {\em not} halve any of these weights, by definition of a disjunction. Furthermore, each of these weights can be doubled at most $1 + \lg n$ times, since only weights that are less than $n$ can ever be doubled. Therefore, Winnow makes at most $r(1 + \lg n)$ mistakes on positive examples. Now we bound the number of mistakes made on negative examples. The total weight summed over all the variables is initially $n$. Each mistake made on a positive example increases the total weight by at most $n$ (since before doubling, we must have had $w_1x_1 + \ldots w_nx_n < n$). On the other hand, each mistake made on a negative example decreases the total weight by at least $n/2$ (since before halving, we must have had $w_1x_1 + \ldots + w_nx_n \geq n$). The total weight never drops below zero. Therefore, the number of mistakes made on negative examples is at most twice the number of mistakes made on positive examples, plus 2. That is, $2 + 2r(1 + \lg n)$. Adding this to the bound on the number of mistakes on positive examples yields the theorem. \qed Can also look at case where data is not completely consistent.