15-859(A) Machine Learning Theory 01/30/06
* The Perceptron algorithm
* Kernels and Margins
=======================================================================
Last time we looked at the Winnow algorithm, which has a very nice
mistake-bound for learning an OR-function, and we saw it also could be
used to learn a linear separator. Today we'll look at a more classic
algorithm, with a somewhat different guarantee.
The Perceptron Algorithm
------------------------
One of the oldest algorithms used in machine learning (from early 60s)
is an online algorithm for learning a linear threshold function called
the Perceptron Algorithm.
Perceptron Algorithm: For simplicity, use threshold of 0, so looking
at functions like: w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0.
(can simulate a nonzero threshold with "dummy" input x_0 that is always 1)
THEOREM: If the data is consistent with a linear threshold function
w^*.x > 0, where w^* is a unit-length vector, then the number of
mistakes is at most (1/gamma)^2, where
gamma = min_x |w^*.x| / ||x||
(i.e., every example has distance at least gamma from the plane w^*.x=0)
The parameter "gamma" is often called the "normalized margin" of w^*.
Another way to view this quantity: it is the minimum, over examples x,
of the cosine (in absolute value) of the angle between x and w^*.
Perceptron Algorithm:
1. Start with all-zeroes weight vector w.
Also, let's scale all examples x to have (Euclidean) length 1, since
this doesn't affect which side of the plane they are on.
2. Given example x, predict positive iff w.x > 0.
3. On a mistake, update as follows:
Mistake on positive: w <- w + x
Mistake on negative: w <- w - x
So, this seems reasonable. E.g., (w+x).x = w.x + x.x = w.x + 1 since x
has unit length, and similarly (w-x).x = w.x - 1, so on each mistake we
move closer (by 1) to the value we wanted.
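The three steps above can be sketched in a few lines of Python (an
illustrative sketch; the function name and data format here are my own,
not from any library):

```python
# A minimal sketch of the Perceptron algorithm as described above
# (zero threshold, examples scaled to unit length).

def perceptron(examples):
    """examples: list of (x, label) pairs, label in {+1, -1}.
    Runs one pass and returns (weight vector, number of mistakes)."""
    n = len(examples[0][0])
    w = [0.0] * n                         # 1. start with the all-zeroes vector
    mistakes = 0
    for x, label in examples:
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]       # scale x to Euclidean length 1
        dot = sum(wi * xi for wi, xi in zip(w, x))
        prediction = 1 if dot > 0 else -1 # 2. predict positive iff w.x > 0
        if prediction != label:           # 3. on a mistake, add or subtract x
            mistakes += 1
            w = [wi + label * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

In an online setting you would of course process examples as they
arrive rather than in a stored list; the list is just for illustration.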
PROOF OF THEOREM: We're going to look at the magic quantities w.w^* and |w|.
Claim 1: every time we make a mistake, w.w^* goes up by at least
gamma. Proof: if x was a positive example, then we get (w+x).w^* =
w.w^* + x.w^* >= w.w^* + gamma (by defn of gamma). Similarly, if x was a
negative example, we get (w-x).w^* = w.w^* - x.w^* >= w.w^* + gamma. So,
after M mistakes, w.w^* >= gamma*M.
Claim 2: every time we make a mistake, |w|^2 goes up by at most
1. Proof: if x was a positive example, we get |w+x|^2 = |w|^2 + 2(w.x)
+ |x|^2. This is at most |w|^2 + 1 because |x|^2 = 1 and w.x <= 0
(remember, we made a mistake on x). The same holds (flipping signs) if x
was negative but we predicted positive. So, after M mistakes, |w| <= sqrt(M).
Now, we just need to use the fact that w.w^* <= |w|, since w^* is a unit
vector. (Or to put it another way, w.w^*/|w| is the cosine of the angle
between w and w^*). So, if M is the number of mistakes we made, gamma*M
<= sqrt(M), so M <= 1/gamma^2. QED
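The theorem can also be checked numerically on a toy dataset (an
illustration, not part of the proof — the data and the unit vector w*
below are made up): compute gamma, run Perceptron to convergence, and
confirm the mistake count is at most 1/gamma^2.

```python
# Numeric sanity check of the mistake bound M <= 1/gamma^2.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# toy data separable by the unit vector w* = (0.6, 0.8); labels = sign(w*.x)
w_star = [0.6, 0.8]
raw = [([1.0, 1.0], 1), ([2.0, 0.0], 1), ([-1.0, -2.0], -1), ([0.0, -1.0], -1)]
data = [([xi / dot(x, x) ** 0.5 for xi in x], y) for x, y in raw]  # unit length

gamma = min(abs(dot(w_star, x)) for x, _ in data)  # the normalized margin

# run Perceptron, cycling through the data until a full clean pass
w, mistakes, changed = [0.0, 0.0], 0, True
while changed:
    changed = False
    for x, y in data:
        if (1 if dot(w, x) > 0 else -1) != y:     # mistake
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
            changed = True

assert mistakes <= 1.0 / gamma ** 2   # the bound from the theorem
```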
Discussion
==========
In the worst case, gamma can be exponentially small in n. On the
other hand, if we're lucky and the data is well-separated, gamma might
even be large compared to 1/n. (In fact, the latter is the more
modern spin on things: namely, that in many natural cases, we would
hope that there exists a large-margin separator.) In fact, one nice
thing here is that the mistake-bound depends on just a purely
geometric quantity: the amount of "wiggle-room" available for a
solution and doesn't depend in any direct way on the number of
features in the space. (Of course, as you add more features, the
examples get longer and so when you scale them to be unit-length, the
margins get smaller...)
So, if data is separable by a large margin, then Perceptron is a good
algorithm to use.
What if there is no perfect separator?
=====================================
What if only *most* of the data is separable by a large margin, or
what if w^* is not perfect? We can see that the thing we need to
look at is Claim 1. Claim 1 said that we make "gamma amount of
progress" on every mistake. Now it's possible there will be mistakes
where we make very little progress, or even negative progress. One
thing we *can* do is bound the total number of mistakes we make in terms
of the total distance we would have to move the points to make them
actually separable by margin gamma. Let's call that TD_gamma.
Then, we get that after M mistakes, w.w^* >= gamma*M - TD_gamma. So,
combining with Claim 2, we get that sqrt(M) >= gamma*M - TD_gamma.
We could solve the quadratic, but this implies, for instance, that M
<= 1/gamma^2 + (2/gamma)TD_gamma.
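For completeness, solving the quadratic does give exactly this bound
(writing TD for TD_gamma):

```latex
% From sqrt(M) >= gamma*M - TD, substitute s = sqrt(M):
\gamma s^2 - s - TD \le 0
\quad\Longrightarrow\quad
s \le \frac{1 + \sqrt{1 + 4\gamma\,TD}}{2\gamma}.
% Square, and use (a+b)^2 <= 2a^2 + 2b^2:
M = s^2 \le \frac{2 + 2\,(1 + 4\gamma\,TD)}{4\gamma^2}
          = \frac{1}{\gamma^2} + \frac{2\,TD}{\gamma}.
```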
So, this is not too bad: we can't necessarily say that we're making only a
small multiple of the number of mistakes that w^* is (in fact, the
problem of finding an approximately-optimal separator is NP-hard), but
we can say we're doing well in terms of the "total distance" parameter.
Kernel functions
================
What if your data doesn't have a good linear separator? Here's a neat
idea, called the "kernel trick".
One thing we might like to do is map our data to a higher dimensional
space, e.g., look at all products of pairs of features, in the hope
that the data will be linearly separable there. If we're lucky, it
will be separable by a large margin so we don't have to pay a lot in
terms of mistakes. But this is going to be a pain computationally.
However, one
thing we notice is that most learning algorithms only access data
through performing dot-products (will get back to how to interpret
algs like perceptron in this way in a minute). So, maybe we can perform
our mapping in such a way that we have an efficient way of computing
dot-products. This leads to idea of a kernel.
A Kernel is a function K(x,y) such that for some mapping phi,
K(x,y) = phi(x) . phi(y).
Some examples:
K(x,y) = (1 + x.y)^d.
[the polynomial kernel: corresponds to mapping x to the list of
all products of up to d features, suitably scaled]
K(x,y) = (1 + x_1*y_1)(1 + x_2*y_2)...(1 + x_n*y_n)
[corresponds to mapping x,y to list of all products of subsets]
Also string kernels [count how many substrings of length p
two strings have in common]
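For the degree-2 polynomial kernel in two dimensions, the
correspondence K(x,y) = phi(x).phi(y) can be checked explicitly. The
map phi below is the standard explicit feature map for n = 2, d = 2
(the sqrt(2) factors are the scaling that makes the identity exact):

```python
# Check that K(x,y) = (1 + x.y)^2 equals phi(x).phi(y) for the explicit
# feature map phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2).

import math

def K(x, y):
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    r2 = math.sqrt(2)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, r2 * x[0] * x[1], x[1] ** 2]

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = K(x, y)                                        # kernel evaluation: O(n)
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))     # explicit map: O(n^2) features
assert abs(lhs - rhs) < 1e-9
```

The point of the trick is on the left: K is evaluated in time linear in
n, while the explicit feature space has ~n^2/2 coordinates (and ~n^d
for general d).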
More generally, nice for the case where examples aren't so easy to map
directly into R^n, but we have a reasonable notion of similarity we
can encode in a kernel K.
Neat fact: most learning algorithms can be run using kernels.
E.g., perceptron algorithm:
Initialize w = first positive example.
If w.x <= 0 on positive x, let w = w+x.
If w.x > 0 on negative x, let w = w-x.
How to kernelize? Notice that w is just a weighted sum of examples.
E.g., w = x1 - x2 + x3, where x1,x2,x3 are examples we've seen so far.
So to compute phi(w).phi(x), just do: K(x1,x)-K(x2,x)+K(x3,x).
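This representation can be sketched directly (an illustrative sketch;
the function names are my own, and the plain dot product stands in for
whatever kernel you would actually use):

```python
# Kernelized Perceptron: instead of storing w explicitly, store the
# mistake examples with +/- signs, and compute phi(w).phi(x) as a
# signed sum of kernel evaluations K(x_i, x).

def K(x, y):
    return sum(a * b for a, b in zip(x, y))   # placeholder: swap in any kernel

def kernel_perceptron_pass(examples):
    """One pass over (x, label) pairs; returns the list [(sign, example), ...]
    of mistake examples, i.e. the implicit representation of w."""
    support = []
    for x, label in examples:
        # phi(w).phi(x) = sum of sign_i * K(x_i, x) over stored mistakes
        score = sum(s * K(xi, x) for s, xi in support)
        if (1 if score > 0 else -1) != label:
            support.append((label, x))        # mistake: add +x or -x to w
    return support
```

Note that the running time per prediction grows with the number of
mistakes made so far, since each stored example contributes one kernel
evaluation.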
The examples that the hypothesis is written in terms of are called
"support vectors". If you find the maximum margin separator for a
given dataset, that is also something that can be written in terms of
support vectors (not hard to see). That's the reason for the name
"support vector machines" for the algorithm that takes a set of data
and finds the maximum-margin separator.