15-859(A) Machine Learning Theory 01/30/06 * The Perceptron algorithm * Kernels and Margins ======================================================================= Last time we looked at the Winnow algorithm, which has a very nice mistake-bound for learning an OR-function, and we saw it also could be used to learn a linear separator. Today will look at a more classic algorithm, with a somewhat different guarantee. The Perceptron Algorithm ------------------------ One of the oldest algorithms used in machine learning (from early 60s) is an online algorithm for learning a linear threshold function called the Perceptron Algorithm. Perceptron Algorithm: For simplicity, use threshold of 0, so looking at functions like: w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0. (can simulate a nonzero threshold with "dummy" input x_0 that is always 1) THEOREM: If the data is consistent with a linear threshold function w^*.x > 0, where w^* is a unit-length vector, then the number of mistakes is at most (1/gamma)^2, where gamma = min |w^*.x| / ||x|| x (i.e., every example has distance at least gamma from the plane w^*.x=0) The parameter "gamma" is often called the "normalized margin" of w^*. Another way to view this quantity is it is the cosine of the angle between x and w^*. Perceptron Algorithm: 1. Start with all-zeroes weight vector w. Also, let's scale all examples x to have (Euclidean) length 1, since this doesn't affect which side of the plane they are on. 2. Given example x, predict positive iff w.x > 0. 3. On a mistake, update as follows: Mistake on positive: w <- w + x Mistake on negative: w <- w - x So, this seems reasonable. E.g., (w+x).x = w.x + 1, and similarly (w-x).x = w.x - 1, so we are moving closer (by 1) to the value we wanted. PROOF OF THEOREM: We're going to look at the magic quantities w.w^* and |w|. Claim 1: every time we make a mistake, w.w^* goes up by at least gamma. Proof: if x was a positive example, then we get (w+x).w^* = w.w^* + x.w^* >= w.w^* + gamma (by defn of gamma). Similarly, if x was a negative example, we get (w-x).w^* = w.w^* - x.w^* >= w.w^* + gamma. So, after M mistakes, w.w^* >= gamma*M. Claim 2: every time we make a mistake, |w|^2 goes up by at most 1. Proof: if x was a positive example, we get (w+x)^2 = w^2 + 2(w.x) + x^2. This is less than w^2 + 1 because w.x is negative (remember, we made a mistake on x). Same thing (flipping signs) if x was negative but we predicted positive. So, after M mistakes, |w|<=sqrt(M). Now, we just need to use the fact that w.w^* <= |w|, since w^* is a unit vector. (Or to put it another way, w.w^*/|w| is the cosine of the angle between w and w^*). So, if M is the number of mistakes we made, gamma*M <= \sqrt{M}, so M <= 1/gamma^2. QED Discussion ========== In the worst case, gamma can be exponentially small in n. On the other hand, if we're lucky and the data is well-separated, gamma might even be large compared to 1/n. (In fact, the latter is the more modern spin on things: namely, that in many natural cases, we would hope that there exists a large-margin separator.) In fact, one nice thing here is that the mistake-bound depends on just a purely geometric quantity: the amount of "wiggle-room" available for a solution and doesn't depend in any direct way on the number of features in the space. (Of course, as you add more features, the examples get longer and so when you scale them to be unit-length, the margins get smaller...) So, if data is separable by a large margin, then Perceptron is a good algorithm to use. What if there is no perfect separator? ===================================== What if only {\em most} of the data is separable by a large margin, or what if w^* is not perfect? We can see that the thing we need to look at is Claim 1. Claim 1 said that we make ``gamma amount of progress'' on every mistake. Now it's possible there will be mistakes where we make very little progress, or even negative progress. One thing we *can* do is bound the total number of mistakes we make in terms of the total distance we would have to move the points to make them actually separable by margin gamma. Let's call that TD_gamma. Then, we get that after M mistakes, w.w^* >= gamma*M - TD_gamma. So, combining with Claim 2, we get that sqrt(M) >= gamma*M - TD_gamma. We could solve the quadratic, but this implies, for instance, that M <= 1/gamma^2 + (2/gamma)TD_gamma. So, this is not too bad: we can't necessarily say that we're making only a small multiple of the number of mistakes that w^* is (in fact, the problem of finding an approximately-optimal separator is NP-hard), but we can say we're doing well in terms of the ``total distance'' parameter. Kernel functions ================ What if your data doesn't have a good linear separator? Here's a neat idea, called the "kernel trick". One thing we might like to do is map our data to a higher dimensional space, e.g., look at all products of pairs of features, in the hope that data will be linearly separable there. If we're lucky, will be separable by a large margin so we don't have to pay a lot in terms of mistakes. But this is going to a pain computationally. However, one thing we notice is that most learning algorithms only access data through performing dot-products (will get back to how to interpret algs like perceptron in this way in a minute). So, maybe we can perform our mapping in such a way that we have an efficient way of computing dot-products. This leads to idea of a kernel. A Kernel is a function K(x,y) such that for some mapping phi, K(x,y) = phi(x) . phi(y). Some examples: K(x,y) = (1 + )^d. K(x,y) = (1 + x_1*y_1)(1 + x_2*y_2)...(1 + x_n*y_n) [corresponds to mapping x,y to list of all products of subsets] Also string kernels [count how many substrings of length p two strings have in common] More generally, nice for the case where examples aren't so easy to map directly into R^n, but we have a reasonable notion of similarity we can encode in a kernel K. Neat fact: most learning algorithms can be run using kernels. E.g., perceptron algorithm: Initialize w = first positive example. If w.x < 0 on positive x, let w = w+x if w.x > 0 on negative x, let w = w-x. How to kernelize? notice that w is just a weighted sum of examples. E.g., w = x1 - x2 + x3, where x1,x2,x3 are examples we've seen so far. So to compute phi(w).phi(x), just do: K(x1,x)-K(x2,x)+K(x3,x). The examples that the hypothesis is written in terms of are called "support vectors". If you find the maximum margin separator for a given dataset, that is also something that can be written in terms of support vectors (not hard to see). That's the reason for the name "support vector machines" for the algorithm that takes a set of data and finds the maximum-margin separator.