15-859(B) Machine Learning Theory 01/25/12
Online learning contd
* Online learning and game theory (from slides from last time)
* The Winnow algorithm for disjunctions
* Winnow for k-of-r functions and general LTFs in terms of L_1 margin
=======================================================================
First, recap last time and talk about some connections to game theory
(see slides)
WINNOW ALGORITHM
================
If you think about the problem of learning an OR-function, we
saw an algorithm: "list all features and cross off bad ones on
negative examples" that makes at most n mistakes. But, what if most
features are irrelevant? E.g., if representing a document as vector
indicating which words appear in it and which don't, then n is pretty
large! What if the target is an OR of r relevant features where r is
a lot smaller than n. Can we get a better bound in that case?
What could we do if computation time were no object? How many bits to
describe an OR of r variables, where r << n? Ans: O(r log n). So, in
principle, we'd like to obtain a bound like this.
Winnow will give us a bound of O(r log n) mistakes efficiently.
So, this means you only have a small penalty for "throwing lots of
features at the problem". In general, will say that an algorithm with
only polylog dependence on n is "attribute-efficient".
Winnow Algorithm: (basic version)
1. Initialize the weights w_1, ..., w_n of the variables to 1.
2. Given an example x = (x_1, ..., x_n), output + if
w_1x_1 + w_2x_2 + ... + w_nx_n >= n,
else output -.
3. If the algorithm makes a mistake:
(a) If the algorithm predicts negative on a positive example, then
for each x_i equal to 1, double the value of w_i.
(b) If the algorithm predicts positive on a negative example, then
for each x_i equal to 1, set w_i = 0.
4. repeat (goto 2)
THEOREM: The Winnow Algorithm learns the class of disjunctions in the
Mistake Bound model, making at most 1 + 2r(1 + lg n) mistakes when the
target concept is an OR of r variables.
PROOF: Let us first bound the number of mistakes that will be made on
positive examples. Any mistake made on a positive example must double
at least one of the weights in the target function (the *relevant*
weights), and a mistake made on a negative example will *not* zero
any of these weights, by definition of a disjunction. Furthermore,
each of these weights can be doubled at most 1 + lg(n) times, since
only weights that are less than n can ever be doubled. Therefore,
Winnow makes at most r(1 + lg(n)) mistakes on positive examples.
Now we bound the number of mistakes made on negative examples. The
total weight summed over all the variables is initially n. Each
mistake made on a positive example increases the total weight by at
most n (since before doubling, we must have had w_1x_1 + ... + w_nx_n
< n). On the other hand, each mistake made on a negative example
decreases the total weight by at least n (since before zeroing, we
must have had w_1x_1 + ... + w_nx_n >= n). The total weight never
drops below zero. Therefore, the number of mistakes made on negative
examples is at most the number of mistakes made on positive
examples, plus 1. That is, 1 + r(1 + lg(n)). Adding this to the
bound on the number of mistakes on positive examples yields the
theorem. QED.
Open question: is there an attribute-efficient algorithm for Decision
Lists? Want mistake bound poly(L, log n), where L is length of list.
--------------------------------------------------------------------------
Winnow for Majority Vote functions
===================================
How about learning a majority-vote (k of r) function: n variables
total, r are relevant, and an example is positive iff at least k are
on. (Classify as SPAM if at least k keywords are present).
Say we know k. Let epsilon = 1/(2k). Our algorithm will multiply by
1+epsilon for mistakes on positives, and divide by 1+epsilon for
mistakes on negatives.
Think of multiplying by 1+epsilon as putting a poker chip on the weight,
and dividing as removing the chip (can have neg chips).
Max number of chips on relevant weights <= r*log_{1+epsilon}(n) ~ (r/eps)ln(n)
M.on pos.:put >= k chips on rel weights. Total weight up by at most epsilon*n
M.on neg.:take <=k-1 off rel weights. total wt down by at least n(eps/(1+eps))
Use to create two inequalities in two unknowns:
k*(M_pos) - (k-1)*(M_neg) <= (r/eps)ln(n)
total weight >= 0. Or,
n + (M_pos)*n*epsilon >= (M_neg)*n*(epsilon/(1+epsilon))
Solve: Rewrite 2nd inequality as: M_neg <= (1+eps)M_pos + (1+eps)/eps.
Plug into first to get:
k*M_pos - (k-1)(1+eps)M_pos <= (r/eps)*ln(n) + (k-1)(1+eps)/eps
Now we have set eps = 1/2k so that (k-1)(1+eps) < k - 1/2, so we get:
(1/2)M_pos <= (r/eps)*ln(n) + (k-1)(1+eps)/eps
So M_pos = O(rk*log(n)), and same with M_neg.
What if don't know k??
If know r, then can guess and double. If dont, can guess and double r
too, but then cost is O(r^2*log(n)). So we have:
THEOREM: Winnow learns the class of "k of r functions" with at most
O(r^2 log n) mistakes.
----------------------------------------------------------------
Using Winnow for general LTFs
=============================
What if the target is a general linear threshold function of the form
w*_1 x_1 + ... + w*_n x_n >= w*_0?
Let's scale so that all weights are integers, and assume all are
non-negative (can do that by introducing new variables y_i = 1-x_i).
Let W = w*_1 + ... + w*_n.
If we know W (or say W is just an upper bound on true value) then we
can solve the problem like this: just repeat each variable W times.
We then have a "k=w*_0 out of r=W" problem, so we get a mistake-bound
of O(W^2 log(nW)).
Now, here's a cute thing: the above algorithm --- repeating each variable
W times --- does *exactly* the same thing if we had run the algorithm
without repeating each variable! (it's equivalent to initializing each
weight to W instead of 1 and using a threshold of nW instead of n,
which is the same as initializing weights to 1 and a threshold of n)
So, we really didn't have to do anything!
So, we get a good bound as a function of the L_1-size of the
solution. Can in fact remove W from log as well.
More generally can directly argue using the same basic analysis that
if the target vector w* is a general LTF such that w* \cdot x > c on
positives and w* \cdot x < c - alpha on negatives, then the mistake
bound is O((L_1(w*)/alpha)^2 log(n)). I.e., don't have to assume
integer weights. Can extend also to the case where features x_i aren't
just {0,1}. In that case, need to multiply L_1(w*) by L_\infty(X),
the maximum magnitude of any feature in the instance space.
[Will do this on hwk 2]
The quantity gamma = alpha/[L_1(w^*)L_infty(X)] is called the "L_1
margin" of the separator, and our bound is O((1/gamma^2)*log(n))
The perceptron algorithm, which we'll look at next time, has a mistake
bound of O(1/gamma^2) where gamma = alpha/[L_2(w*)L_2(X)]. If "n" is
large but most are irrelevant (i.e. target is sparse but examples are
dense), then Winnow is better because adding irrelevant features
increases L_2(X) but not L_infty(X). On the other hand, if the target
is dense and examples are sparse, then perceptron is better.