15-859(B) Machine Learning Theory 04/17/07
* Margins bounds and luckiness functions.
=====================================================================
Learning linear separators
==========================
Given data in R^n, labeled +,-. Want to find a linear separator. Can
do with a number of algorithms including linear programming. Notice
that this includes decision lists, so pretty much everything we've
seen can be transformed into a linear separator problem, with suitable
definition of features. (e.g., for k-decision lists)
Some things we've seen:
- VC-dim of linear separators in n-dim space is n+1. So this says
that O(1/epsilon[n*log(1/epsilon)+log(1/delta)] examples are
sufficient for learning.
- But, we've seen that if there's a large margin, can get away with less.
Let's assume target is w^* . x > 0, and has zero error. |w^*| = 1.
Given a sample S, define the margin to be:
gamma = min_{x in S} |w^* . x|/|x|
Perceptron algorithm makes at most 1/gamma^2 mistakes. This means,
using good "online->batch" conversions, we just need a training set of
size:
O(1/epsilon[M + log(1/delta)])
M = mistake bound = 1/gamma^2.
[explain how this conversion works: in first batch of data, whp at
least one of the h's is good (if flip coin of bias epsilon for
2M/epsilon times, it's very likely you'll get > M heads), and then use
2nd part of size (1/epsilon)*log(M/delta) to test the M hypotheses.]
- One view: perceptron alg is not so good, since gamma can be
exponentially small in n. (like in case of decision lists)
Alternative view: on the other hand, gamma can often be fairly large "the
large margin case", in which case this is a better bound: depends
only on the margin, not on the dimension.
More about margins
==================
- We've seen one reason why having a large margin is a good thing:
it allows the perceptron algorithm to work from a small sample. It
turns out another thing we can say is that whp, *any* separator with
a large margin over the data will have low error. This is what is
done in Support Vector Machines. (Can also approximate it using
Perceptron by doing an update on examples that it just gets "barely
correct".)
Sample complexity analysis
==========================
The sample complexity analysis is done in two steps.
First thing to show: what is the maximum number of points that can be
split in all possible ways by a separator of margin at least gamma?
a.k.a., "fat-shattering dimension". Ans: O(1/gamma^2).
Can anyone see a simple proof?
Proof: simple proof is just to consider perceptron algorithm. Suppose
gamma-fat-shattering dimension is d. Then can force perceptron alg to
make d mistakes, and yet still have a separator w^* of margin gamma.
But we know the number of mistakes is at most 1/gamma^2. So, that's it.
Second part: now want to apply this to get a sample-complexity bound.
Seems like analysis we used for VC-dimension should just go right
through, but it's actually not so clean. Let's do the analysis and
will just point out the one part that's messy, and give a citation for
the thing we need.
Analysis: Draw 2m points from D. Want to show it is unlikely there
exists a separator that gets first half correct by margin gamma, but
has more than epsilon*m mistakes on the 2nd half. This then implies the
conclusion we want, by same reasoning as when we argued the VC bounds.
As in VC proof, will show stronger statement: for *any* set S of size
2m, whp this is true over randomization of split into two pieces S1,S2
of size m. In VC argument, we said: fix some h that makes at least
epsilon*m mistakes. Showed that Prob(all mistakes are on 2nd half)
is at most 2^{epsilon*m). Then applied union bound over all
labelings of data using h in C. For us, it's tempting to say "let's
count the number of separators of S with margin gamma over all of S",
but this might be undercounting since what about separators where h
only has margin gamma on S1? Instead, we'll do the following more
complicated thing. First, let's assume all |x|=1. Now let's group
the separators together. Define h(x) = h.x but truncated at +/-
gamma. Let dist_S(h1,h2) to be max_{x in S}|h1(x) - h2(x)|. We want
a "gamma/2-cover": a set H of separators such that every other
separator is within gamma/2 of some separator in H. Claim is: there
exists an H that is not too large, as a function of fat-shattering
dimension [Alon et al]. Roughly you get |H| ~ (m/gamma^2)^(log(m)/gamma^2).
Now, for these guys, define "correct" as "correct by margin at least
gamma/2" and define "mistake" as "mistake OR correct by less than
gamma/2". Our standard VC argument shows that so long as m is large
compared to (1/epsilon)*log(|H|/delta), whp, none of these will get S1
all correct, and yet make > epsilon*m "mistakes" on S2. This then
implies (by defn of H) that whp *no* separator gets S1 correct by
margin >= gamma and has > epsilon*m real mistakes on S2.
log(|H|) is approximately log^2(m)/gamma^2, so in the end you get a
bound of m = O(1/epsilon [1/gamma^2 log^2(1/(gamma*epsilon)) + log(1/delta)]).
Luckiness functions
===================
Basic idea of margins was in essense to view some separators as
"simpler" than others, using margin as the notion of "simple". What
makes this different from our Occam bounds, is that the notion of
"simple" depends on the data. Basically, we have a data-dependent
ordering of functions such that if we're lucky and the the target has
low complexity in this ordering, then we don't need much training
data. More generally, things like this are called "luckiness
functions". If a function is a "legal notion of luckiness"
(basically, the ordering depends only on the data points and not their
labels, and not too many splits of data with small complexity) then
you can apply sample complexity bounds. We saw this when we discussed
Semi-Supervised Learning.
PAC-MDL bounds
==============
A nice way to view this is in terms of a game between two players A
and B. A and B both have same set of unlabeled data. A is then given
labels for a random subset of the data, and has to transmit a string
that will uncompress to a labeling of the entire data set. Claim is:
if can do this with not too many bits, then can be confident in labels
on the test set. Point is that communication language is allowed to
refer to the unlabeled data. E.g., for VC-bounds: since only C[2m]
ways of splitting whole bunch, A can just give index of the split she
is thinking of. Can also interpret things like "PAC-Bayes" bounds in
this way. One of your hwk problems is like this.