15-859(B) Machine Learning Theory 02/22/10
* SVMs
* More on Kernel functions
* Better MB->Batch conversion
* L_2 Margin bounds
=====================================================================
Learning linear separators
==========================
Learning linear separators is at the heart of a lot of machine
learning. Perceptron, Winnow, linear programming, SVMs (which we'll
talk about more today), and even Boosting can be thought of as
algorithms for learning them. Most of the function classes we've seen
(conjunctions, disjunctions, decision lists) are special cases of
linear separators, and with suitable definition of features they can
represent even more (e.g., k-decision lists, or with implicit spaces
using kernels, which we'll talk more about today too).
Start with some basic facts:
1. VC-dim of linear separators in n-dim space is n+1. So this says
that O(1/epsilon[n*log(1/epsilon)+log(1/delta)] examples are
sufficient for learning.
2. But, we've seen that if there's a large margin, can get away with less.
Let's assume target is w^* . x > 0, and has zero error. |w^*| = 1,
and all |x| <=1. Given a sample S, define the L_2 margin to be:
gamma = min_{x in S} |w^* . x|
For example, Perceptron algorithm makes at most 1/gamma^2 mistakes.
This means, using good "online->batch" conversions, we just need a
training set of size:
O((1/epsilon)[M + log(1/delta)])
M = mistake bound = 1/gamma^2.
Note that this could be better or worse than the dimension bound.
3. In fact, can show that whp *any* large-margin separator you can
find will generalize well from roughly this much data. So, this
motivates why SVMs find large margin separators. Will do this
argument today too.
4. Given data in R^n, labeled +,-, can find a consistent linear
separator if one exists using linear programming. Just set up
inequalities w.x >= 1 for positives and w.x <= -1 for negatives, and
solve for w. If there is no consistent separator, minimizing number
of errors is NP-hard (even getting 49% error if target is 99%-close to
a linear separator). Instead, one thing you *can* do efficiently is
minimize hinge-loss, as in SVMs (which we'll talk about below).
What is hinge-loss? Hinge-loss is just the amount by which the above
inequalities are violated. Looks like a hinge and is an upper-bound
on error = 0/1-loss. (draw picture). Can solve by putting in "slack
variables" into the LP, like this:
minimize: sum_{x in S} epsilon_x
subject to: l(x)(w.x) >= 1 - epsilon_x, for all x in S.
and epsilon_x >= 0 for all x in S.
Plan for today:
- Describe/discuss SVMs.
- More on properties of kernel functions
- sample complexity 1: online=>batch conversion
- sample complexity 2: uniform convergence for large margin
separators.
Support-Vector Machines
=======================
Support vector machines do convex optimization to find the maximum
margin separator, and more generally to optimize a given tradeoff
between margin and hinge-loss. Let's first do the easier case, where
we assume data is linearly separable and we want the separator of
maximum margin. Then we could write that as: minimize |w|^2 subject
to the constraint that l(x)*(w.x) >= 1 for all examples x in our
training set (l(x) is the label of x). This is a convex optimization
problem, so we can do it. Equivalently we could fix |w|^2 <= 1 and
maximize gamma s.t. l(x)(w.x) >= gamma, but people like to do it this
way, for reasons that will make more sense in a minute. Note that if
we set the RHS to 1, then 1/|w| is the margin and so |w|^2 = 1/gamma^2.
More generally, we want a tradeoff betwen margin and hinge-loss, so
what SVMs really do is (C is a given constant):
minimize: |w|^2 + C sum_{x in S} epsilon_x
subject to: l(x)(w.x) >= 1 - epsilon_x, for all x in S.
and epsilon_x >= 0 for all x in S.
Here is the motivation. What we *really* want is to minimize the true
error err(h), but what we observe is our empirical error err_S(h).
So, let's split err(h) into two parts: (1) err(h) - err_S(h), which
is the amount we're overfitting, and (2) err_S(h). One bound on part
(1) is approximately (1/gamma^2)/|S|, if err_S(h) is small (since our
sample complexity bound is approximately |S| = (1/gamma^2)(1/epsilon)).
So, this is 1/|S| times the first part of the objective function. The
second part of the optimization (for C=1) is our total hinge-loss,
summed over all the examples. Dividing by |S| we get our average
hinge-loss, which is an upper bound on err_S(h). So, an upper bound
on (2) is 1/|S| times the second part of the objective function.
So, the two parts of the objective function are upper bounds on the
two quantities we care about: overfitting and empirical error.
Properties of kernel functions
==============================
One nice thing about Perceptron and SVM is they can be kernelized.
Remember, a kernel is a function K(x,y) such that for some implicit
phi, K(x,y) = .
Here are some interesting kernels:
- K(x,y) = (1 + x_1 y_1)(1 + x_2 y_2)...(1 + x_n y_n)
[corresponds to mapping x,y to a list of all products of subsets]
- given two strings, count up how many substrings of length t the two
strings have in common, or can give more weight to longer substrings.
Intuitively, you want to choose a kernel that you think would be a
reasonable way to measure similarity for the problem at hand, though
the theoretical bounds we have been giving are in terms of the margin.
Next lecture we will look at ways to connect these two views. (Note:
ideal kernel: K(x,y)=1 if x,y have the same label, K(x,y)=-1 if they
have different label).
Often the easiest way to prove some function is a legal kernel is to
build it out of other legal kernels.
Claim 0: if K is a legal kernel then for constant c>0, cK is a legal
kernel. Why?
Claim 1: if K_1, K_2 are legal then K_1 + K_2 is legal. Why?
Claim 2: if K_1, K_2 are legal then K_1*K_2 is legal.
This one is trickier: define phi(x) as outer-product of phi_1(x)
and phi_2(x). [a matrix, but viewing as a vector].
Can verify by doing calculation of K(x,y).
So, this proves that things like (1 + )^d are legal kernels.
Can also show that e^{-||x-y||^2} is a legal kernel too.
Online=>batch
=============
Let's do this improved online->batch conversion formally, since we
didn't quite do it earlier in class.
Theorem: If have conservative alg with mistake-bound M, can use to get
PAC sample-complexity O((1/epsilon)[M + log(1/delta)])
Proof: To do this, we will split data into a ``training set'' S_1 of size
max[(4M/epsilon), (16/epsilon)*ln(1/delta)]
and a ``test set'' S_2 of size
(32/epsilon)*ln(M/delta)
We will run the algorithm on S_1 and test all hypotheses produced on S_2.
Claim 1: w.h.p., at least one hyp produced on S_1 has error < epsilon/2.
Proof (tricky!):
- If all are >= epsilon/2 then the expected number of mistakes >= 2M.
- By Chernoff, Pr[# mistakes <= M] <= e^{(-expect)/8} <= delta.
- View as game: after M mistakes, alg forced to reveal target. If
alg keeps giving bad hyps, then whp will be forced to do it.
Claim 2: w.h.p., best one on S_2 has error < epsilon.
Proof: Suffices to show that good one is likely to look better
than 3*epsilon/4, and all with true error > epsilon are likely to look
worse than 3*epsilon/4. Just apply Chernoff again to the set of M
hypotheses as in your homework.
More about margins
==================
- We've seen that having a large margin is a good thing because then
the perceptron algorithm will behave well. It turns out another
thing we can say is that whp, *any* separator with a large margin
over the data will have low error. This then motivates finding the
maximum margin separator, which is what is done in Support Vector
Machines. (Can also approximate it using Perceptron by doing an
update on examples that it just gets "barely correct".)
Sample complexity analysis
==========================
The sample complexity analysis is done in two steps.
First thing to show: what is the maximum number of points that can be
split in all possible ways by a separator of margin at least gamma?
a.k.a., "fat-shattering dimension". Ans: O(1/gamma^2).
Can anyone see a simple proof?
Proof: simple proof is just to consider perceptron algorithm. Suppose
gamma-fat-shattering dimension is d. Then can force perceptron alg to
make d mistakes, and yet still have a separator w^* of margin gamma.
But we know the number of mistakes is at most 1/gamma^2. So, that's it.
Second part: now want to apply this to get a sample-complexity bound.
Sauer's lemma still applies so seems like analysis we used for
VC-dimension should just go right through, but it's actually quite not
so easy. Plus there's one technical fact we'll need. Let's do the
analysis and will just give a citation for the technical fact we need.
Analysis: Draw 2m points from D. Want to show it is unlikely there
exists a separator that gets first half correct by margin gamma, but
has more than epsilon*m mistakes on the 2nd half. This then implies the
conclusion we want, by same reasoning as when we argued the VC bounds.
As in VC proof, will show stronger statement: for *any* sets S1, S2 of size
m, each, whp this is true over randomization of pairs {x_i, x_i'} into
T1, T2. (Let S = S1 \union S2). In VC argument, we said: fix some h
that makes at least epsilon*m mistakes. Said that Prob(all mistakes
are in T2) is at most 2^{-epsilon*m}. Then applied union bound over all
labelings of data using h in C. For us, it's tempting to say: let's
count the number of separators of S with margin gamma over
all of S. But this might be undercounting since what about
separators where h only has margin gamma on T1? Instead, we'll do the
following more complicated thing. First, let's assume all |x|=1. Now
let's group the separators together. Define h(x) = h.x but truncated
at +/- gamma. Let dist_S(h1,h2) to be max_{x in S}|h1(x) - h2(x)|. We want
a "gamma/2-cover": a set H of separators such that every other
separator is within gamma/2 of some separator in H. Claim is: there
exists an H that is not too large, as a function of fat-shattering
dimension [Alon et al]. Roughly you get |H| ~ (m/gamma^2)^(log(m)/gamma^2).
Now, for these guys, define "correct" as "correct by margin at least
gamma/2" and define "mistake" as "mistake OR correct by less than
gamma/2". Our standard VC argument shows that so long as m is large
compared to (1/epsilon)*log(|H|/delta), whp, none of these will get T1
all correct, and yet make > epsilon*m "mistakes" on T2. This then
implies (by defn of H) that whp *no* separator gets T1 correct by
margin >= gamma and has > epsilon*m real mistakes on T2.
log(|H|) is approximately log^2(m)/gamma^2, so in the end you get a
bound of m = O(1/epsilon [1/gamma^2 log^2(1/(gamma*epsilon)) + log(1/delta)]).
Luckiness functions
===================
Basic idea of margins was in essense to view some separators as
"simpler" than others, using margin as the notion of "simple". What
makes this different from our Occam bounds, is that the notion of
"simple" depends on the data. Basically, we have a data-dependent
ordering of functions such that if we're lucky and the the target has
low complexity in this ordering, then we don't need much training
data. More generally, things like this are called "luckiness
functions". If a function is a "legal notion of luckiness"
(basically, the ordering depends only on the data points and not their
labels, and not too many splits of data with small complexity) then
you can apply sample complexity bounds.