15-859(B) Machine Learning Theory                              01/24/02

Online learning contd
 * The Perceptron algorithm
 * The Winnow algorithm
 * Recap of results so far
=======================================================================

The Perceptron Algorithm
------------------------
One of the oldest algorithms used in machine learning (from the early
1960s) is an online algorithm for learning a linear threshold function,
called the Perceptron Algorithm.

For simplicity, we use a threshold of 0, so we are looking at functions
of the form:

      w_1*x_1 + w_2*x_2 + ... + w_n*x_n > 0.

(We can simulate a nonzero threshold with a "dummy" input x_0 that is
always 1.)  Also, let's scale all examples x to have Euclidean length 1,
since this doesn't affect which side of the plane they are on.

THEOREM: If the data is consistent with a linear threshold function
W.x > 0, where we scale W so that |W|=1, then the number of mistakes
is at most (1/delta)^2, where

      delta = min over examples x of |W.x|.

(I.e., every example has distance at least delta from the plane W.x=0.)

Algorithm: Start with the all-zeroes weight vector w.
   Predict positive iff w.x > 0.
   Mistake on positive: w <- w + x
   Mistake on negative: w <- w - x

This update seems reasonable: (w+x).x = w.x + 1 (since |x|=1), and
similarly (w-x).x = w.x - 1, so on each mistake we move w.x closer
(by 1) to the value we wanted.

PROOF OF THEOREM: We will track the two "magic" quantities w.W and |w|.

Claim 1: every time we make a mistake, w.W goes up by at least delta.
Proof: if x was a positive example, then we get (w+x).W = w.W + x.W
>= w.W + delta (by definition of delta).  Similarly, if x was a
negative example, we get (w-x).W = w.W - x.W >= w.W + delta (since
W.x <= -delta on negative examples).  So, after M mistakes,
w.W >= delta*M.

Claim 2: every time we make a mistake, |w|^2 goes up by at most 1.
Proof: if x was a positive example, we get |w+x|^2 = |w|^2 + 2(w.x)
+ |x|^2.  This is at most |w|^2 + 1 because |x|^2 = 1 and w.x <= 0
(remember, we made a mistake on x, so we predicted negative).  The
same holds (flipping signs) if x was negative but we predicted
positive.  So, after M mistakes, |w| <= sqrt(M).

Now we just use the fact that w.W <= |w|, since W is a unit vector.
(To put it another way, w.W/|w| is the cosine of the angle between w
and W, so it is at most 1.)  So, if M is the number of mistakes we
made,

      delta*M <= w.W <= |w| <= sqrt(M),

which gives M <= 1/delta^2.   QED

Discussion: in the worst case, delta can be exponentially small in n.
On the other hand, if we're lucky, delta might even be large compared
to 1/n.  This is called the "large margin" case.
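To make the update rule concrete, here is a minimal Python sketch of
the algorithm above.  The code and its names (dot, perceptron) are
illustrative, not part of the original notes; it assumes examples
arrive as unit-length vectors paired with +1/-1 labels.

   # Minimal sketch of the Perceptron update described above.
   # Assumes each example x is a list of floats scaled to Euclidean
   # length 1, and each label is +1 or -1.  Names are illustrative.

   def dot(u, v):
       return sum(ui * vi for ui, vi in zip(u, v))

   def perceptron(examples):
       """One pass over (x, label) pairs; returns (weights, #mistakes)."""
       n = len(examples[0][0])
       w = [0.0] * n                  # start with the all-zeroes vector
       mistakes = 0
       for x, label in examples:
           prediction = 1 if dot(w, x) > 0 else -1
           if prediction != label:
               mistakes += 1
               # Mistake on positive: w <- w + x; on negative: w <- w - x
               w = [wi + label * xi for wi, xi in zip(w, x)]
       return w, mistakes

By the theorem, if every example has distance at least delta from a
unit-length separator W, this loop makes at most 1/delta^2 mistakes,
no matter how many examples it sees.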
Winnow Algorithm
----------------
Now we turn to an algorithm that is something like a combination of
the Weighted Majority algorithm and the Perceptron algorithm.
Consider it in this context: say we're trying to learn an OR function.
We saw an algorithm ("list all features and cross off bad ones on
negative examples") that can make n mistakes.  But what if most
features are irrelevant?  What if the target is an OR of r relevant
features, where r is a lot smaller than n?  Can we get a better bound
in that case?  Winnow will give us a bound of O(r log n) mistakes.

Algorithm (a simple version):

1. Initialize the weights w_1, ..., w_n of the variables to 1.

2. Given an example x = (x_1, ..., x_n), output + if
   w_1*x_1 + w_2*x_2 + ... + w_n*x_n >= n, else output -.

3. If the algorithm makes a mistake:

   (a) If the algorithm predicts negative on a positive example, then
       for each x_i equal to 1, double the value of w_i.

   (b) If the algorithm predicts positive on a negative example, then
       for each x_i equal to 1, cut the value of w_i in half.

4. Repeat (go to 2).

(A short Python sketch of this procedure appears at the end of these
notes.)

THEOREM: The Winnow Algorithm learns the class of disjunctions in the
Mistake Bound model, making at most 2 + 3r(1 + lg n) mistakes when the
target concept is an OR of r variables.

PROOF: Let us first bound the number of mistakes that will be made on
positive examples.  Any mistake made on a positive example must double
at least one of the weights of variables in the target function (the
"relevant" weights), and a mistake made on a negative example will not
halve any of these weights, by definition of a disjunction.
Furthermore, each of these weights can be doubled at most 1 + lg(n)
times, since only weights that are less than n can ever be doubled.
Therefore, Winnow makes at most r(1 + lg(n)) mistakes on positive
examples.

Now we bound the number of mistakes made on negative examples.  The
total weight summed over all the variables is initially n.  Each
mistake made on a positive example increases the total weight by at
most n (since before doubling, we must have had
w_1*x_1 + ... + w_n*x_n < n).  On the other hand, each mistake made on
a negative example decreases the total weight by at least n/2 (since
before halving, we must have had w_1*x_1 + ... + w_n*x_n >= n).  The
total weight never drops below zero, so if M_pos and M_neg denote the
numbers of mistakes on positives and negatives, we need
n + n*M_pos - (n/2)*M_neg >= 0, i.e., M_neg <= 2 + 2*M_pos.
Therefore, the number of mistakes made on negative examples is at most
2 + 2r(1 + lg(n)).  Adding this to the bound on the number of mistakes
on positive examples yields the theorem.  QED

One can also look at the case where the data is not completely
consistent with any disjunction (see the general-m bound for Winnow in
the recap below).

Recap of results so far
-----------------------
Here is a list of the online learning results we have so far.
[n = # of variables, m = # of mistakes of the best function in the
class.  The Mistake Bound model assumes m=0.]

Efficient algorithms:

 - Conjunctions/disjunctions:
     Mistake bound n+1 (list-and-cross-off method).
     Mistake bound O(r log n) for targets with r relevant variables
       (Winnow).
     General m: Winnow makes O(rm + r log n) mistakes.

 - Single-variable functions (like f(x) = x_7):
     Mistake bound lg(n) (halving algorithm).
     General m: (1 + epsilon/2)*m + (1/epsilon)*ln(n)
       (randomized weighted majority).

 - Decision lists: mistake bound O(nL) for lists of size L
   (from homework).

 - Linear threshold functions: Perceptron makes at most 1/delta^2
   mistakes, where delta is the separation (margin).  1/delta can be
   exponentially large in n, or can be small even when n is large.

Information-theoretic results (exponential-time algorithms):

 - General concept class C: same bounds as the single-variable case,
   replacing "n" with "|C|".
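As promised above, here is a minimal Python sketch of the simple
version of Winnow.  Again, the code and its names are illustrative
rather than part of the original notes; it assumes examples are 0/1
vectors of length n with +1/-1 labels.

   # Minimal sketch of the simple version of Winnow described earlier.
   # Assumes each example x is a list of 0/1 values and each label is
   # +1 or -1.  Names are illustrative.

   def winnow(examples, n):
       """One pass over (x, label) pairs; returns (weights, #mistakes)."""
       w = [1.0] * n                      # step 1: all weights start at 1
       mistakes = 0
       for x, label in examples:
           total = sum(wi * xi for wi, xi in zip(w, x))
           prediction = 1 if total >= n else -1   # step 2: threshold of n
           if prediction != label:
               mistakes += 1
               for i in range(n):
                   if x[i] == 1:
                       if label == 1:     # step 3(a): missed a positive
                           w[i] *= 2.0
                       else:              # step 3(b): false positive
                           w[i] /= 2.0
       return w, mistakes

By the theorem, when the target is an OR of r of the n variables,
this makes at most 2 + 3r(1 + lg n) mistakes in total.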