15-859(A) Machine Learning Theory            01/25/06

Online learning, contd.
* The Winnow algorithm
* Infinite-attribute model, string-valued features

=======================================================================
Recap
=====
- Last time we introduced the Mistake-Bound model, a clean model for
  online learning.  E.g., think of a stream of email messages,
  classifying into SPAM or NOT-SPAM, trying to minimize the number of
  mistakes.

- Gave a simple algorithm for learning an AND-function or OR-function.
  E.g., for an OR-function (document is labeled as SPAM if it has one of
  some list of keywords): start by listing all variables, predict using
  the OR of the variables still on the list, and throw out every
  variable that is on in a negative example.  Makes at most n
  mistakes.  Also saw that no algorithm can guarantee < n mistakes.

- Also talked about broader issues.


WINNOW ALGORITHM
================
If you think about the problem of learning an OR-function, we
saw an algorithm: "list all features and cross off bad ones on
negative examples" that makes at most n mistakes.  But, what if most
features are irrelevant?  E.g., if we represent a document as a vector
indicating which words appear in it and which don't, then n is pretty
large!  What if the target is an OR of r relevant features, where r is
a lot smaller than n?  Can we get a better bound in that case?

Winnow will give us a bound of O(r log n) mistakes.

So, this means you only pay a small penalty for "throwing lots of
features at the problem".  In general, we will say that an algorithm
with only polylog dependence on n is "attribute-efficient".


Winnow Algorithm: (basic version)

1. Initialize the weights w_1, ..., w_n of the variables to 1.

2. Given an example x = (x_1, ..., x_n), output + if

	w_1x_1 + w_2x_2 + ... + w_nx_n >= n,

   else output -.

3. If the algorithm makes a mistake:

  (a) If the algorithm predicts negative on a positive example, then
  for each x_i equal to 1, double the value of w_i.

  (b) If the algorithm predicts positive on a negative example, then
  for each x_i equal to 1, cut the value of w_i in half.

4. repeat (goto 2)
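
The four steps above can be sketched in Python as follows.  This is a
minimal illustration, not code from the lecture: the function names
(winnow_predict, winnow_update, run_winnow, mistake_bound) are our own,
labels are encoded as 1 for + and 0 for -, and examples are tuples of
0/1 values.

```python
import math


def winnow_predict(weights, x):
    """Step 2: output 1 (+) iff w_1x_1 + ... + w_nx_n >= n, else 0 (-)."""
    n = len(weights)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= n else 0


def winnow_update(weights, x, label):
    """Step 3: on a mistake, double or halve the weights of the on-variables.

    Modifies `weights` in place and returns True iff a mistake was made.
    """
    if winnow_predict(weights, x) == label:
        return False
    # (a) predicted - on a positive: double; (b) predicted + on a negative: halve.
    factor = 2.0 if label == 1 else 0.5
    for i, xi in enumerate(x):
        if xi == 1:
            weights[i] *= factor
    return True


def run_winnow(examples, n):
    """Steps 1 and 4: start all weights at 1 and process examples in order."""
    weights = [1.0] * n
    mistakes = 0
    for x, label in examples:
        if winnow_update(weights, x, label):
            mistakes += 1
    return weights, mistakes


def mistake_bound(n, r):
    """The bound 2 + 3r(1 + lg n) from the theorem in these notes."""
    return 2 + 3 * r * (1 + math.log2(n))
```

For instance, with all weights equal to 1 and n = 4, the example
(1,0,1,0) has weighted sum 2 < 4, so Winnow predicts -.  If the true
label is +, the weights become (2,1,2,1).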


THEOREM: The Winnow Algorithm learns the class of disjunctions in the
Mistake Bound model, making at most 2 + 3r(1 + lg n) mistakes when the
target concept is an OR of r variables.


PROOF: Let us first bound the number of mistakes that will be made on
positive examples.  Any mistake made on a positive example must double
at least one of the weights in the target function (the *relevant*
weights), and a mistake made on a negative example will *not* halve
any of these weights, by definition of a disjunction.  Furthermore,
each of these weights can be doubled at most 1 + lg(n) times, since
only weights that are less than n can ever be doubled.  Therefore,
Winnow makes at most r(1 + lg(n)) mistakes on positive examples.

Now we bound the number of mistakes made on negative examples.  The
total weight summed over all the variables is initially n.  Each
mistake made on a positive example increases the total weight by at
most n (since before doubling, we must have had w_1x_1 + ... + w_nx_n
< n).  On the other hand, each mistake made on a negative example
decreases the total weight by at least n/2 (since before halving, we
must have had w_1x_1 + ... + w_nx_n >= n).  The total weight never
drops below zero.  Therefore, the number of mistakes made on negative
examples is at most twice the number of mistakes made on positive
examples, plus 2.  That is, 2 + 2r(1 + lg(n)).  Adding this to the
bound on the number of mistakes on positive examples yields the
theorem.
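
The counting in the proof can be written out compactly.  Here P and Q
are our own shorthand (not used elsewhere in these notes) for the
number of mistakes on positive and negative examples, and TW is the
total weight after all mistakes.

```latex
\begin{align*}
0 \;\le\; \mathrm{TW}
  &\le n + n\,P - \tfrac{n}{2}\,Q
  && \text{(start at $n$; $+$ at most $n$ per positive mistake, $-$ at least $n/2$ per negative mistake)} \\
\Longrightarrow\quad Q &\le 2 + 2P \;\le\; 2 + 2r(1 + \lg n)
  && \text{(using $P \le r(1 + \lg n)$)} \\
P + Q &\le r(1 + \lg n) + 2 + 2r(1 + \lg n) \;=\; 2 + 3r(1 + \lg n).
\end{align*}
```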

Can also look at the case where the data is not completely consistent
with the target: a positive example satisfying none of the relevant
variables can cause the total weight to increase by at most n,
resulting in at most 2 additional mistakes on negatives needed to
bring it back down.  In the other direction, a negative example
satisfying t relevant variables can cause t relevant weights to be
halved, which could then require up to t more mistakes on positives
to fix (themselves causing up to 2t mistakes on negatives).

================================================

Winnow for Majority Vote functions
===================================

How about learning a majority-vote (k of r) function:  n variables
total, r are relevant, and an example is positive iff at least k are
on.  (Classify as SPAM if at least k keywords are present).

Say we know k.  Let epsilon = 1/(2k).  Our algorithm will multiply by
1+epsilon for mistakes on positives, and divide by 1+epsilon for
mistakes on negatives. 

Think of multiplying by 1+epsilon as putting a poker chip on the weight,
and dividing as removing a chip (a weight can hold a negative number
of chips).
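
The chip picture suggests one way to code this variant: store an
integer chip count per variable and read the weight off as
(1+epsilon)^chips.  The sketch below makes that assumption; the names
k_of_r_winnow and chip_cap are ours, the threshold is kept at n as in
basic Winnow, and labels are 1 for + and 0 for -.

```python
import math


def k_of_r_winnow(examples, n, k):
    """Winnow with factor 1 + epsilon, epsilon = 1/(2k), tracked as chips.

    Returns (chips, mistakes_on_positives, mistakes_on_negatives).
    """
    eps = 1.0 / (2 * k)
    chips = [0] * n  # weight of variable i is (1 + eps) ** chips[i]
    pos_mistakes = neg_mistakes = 0
    for x, label in examples:
        total = sum((1 + eps) ** c for c, xi in zip(chips, x) if xi)
        pred = 1 if total >= n else 0
        if pred == 0 and label == 1:
            # Mistake on a positive: put one chip on every on-variable.
            pos_mistakes += 1
            chips = [c + 1 if xi else c for c, xi in zip(chips, x)]
        elif pred == 1 and label == 0:
            # Mistake on a negative: take one chip off every on-variable.
            neg_mistakes += 1
            chips = [c - 1 if xi else c for c, xi in zip(chips, x)]
    return chips, pos_mistakes, neg_mistakes


def chip_cap(n, k):
    """log_{1+eps}(n) + 1: a weight only gains a chip while it is below n."""
    eps = 1.0 / (2 * k)
    return math.log(n) / math.log(1 + eps) + 1
```

Tracking chips directly makes the accounting in the analysis easy to
check on data: the net number of chips on the relevant variables is at
least k*(mistakes on positives) - (k-1)*(mistakes on negatives).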

Max number of chips on relevant weights <= r*log_{1+epsilon}(n).

Mistake on positive: puts >= k chips on relevant weights.
                     Total weight goes up by at most epsilon*n.
Mistake on negative: takes <= k-1 chips off relevant weights.
                     Total weight goes down by at least
                     n*(epsilon/(1+epsilon)).

Use to create two inequalities in two unknowns:

k*(M on pos) - (k-1)*(M on neg) <= r*log_{1+epsilon}(n)

total weight >= 0.  Or,
n + (M on pos)*n*epsilon >= (M on neg)*n*(epsilon/(1+epsilon))

Solve ... both are O(k*r*log(n)).
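
For completeness, here is one way to carry out the "Solve" step.  The
shorthand is ours: P and Q are the numbers of mistakes on positives and
negatives, and L = log_{1+epsilon}(n).  We also assume k <= r and
n >= 2, and use ln(1+epsilon) >= epsilon/2 for 0 < epsilon <= 1.

```latex
\begin{align*}
&\text{(chips)}  && kP - (k-1)Q \;\le\; rL \\
&\text{(weight)} && n + \epsilon n P \;\ge\; \tfrac{\epsilon}{1+\epsilon}\, n\, Q
  \;\Longrightarrow\;
  Q \;\le\; \tfrac{1+\epsilon}{\epsilon} + (1+\epsilon)P
    \;=\; (2k+1) + \tfrac{2k+1}{2k}\,P
  \qquad (\epsilon = \tfrac{1}{2k}) \\
&\text{(substitute)} &&
  \Bigl(k - \tfrac{(k-1)(2k+1)}{2k}\Bigr)P \;=\; \tfrac{k+1}{2k}\,P
  \;\le\; rL + (k-1)(2k+1) \\
&&& P \;\le\; \tfrac{2k}{k+1}\bigl(rL + (k-1)(2k+1)\bigr) \;<\; 2rL + 4k^2 \\
&\text{(size of $L$)} &&
  L \;=\; \tfrac{\ln n}{\ln(1+\epsilon)} \;\le\; \tfrac{2\ln n}{\epsilon} \;=\; 4k\ln n \\
&\text{(conclusion)} &&
  P \;=\; O(kr\log n + k^2) \;=\; O(kr\log n), \qquad
  Q \;\le\; (2k+1) + (1+\epsilon)P \;=\; O(kr\log n).
\end{align*}
```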

What if we don't know k?

If we know r, then we can guess k and double the guess.  If we don't,
we can guess and double r too, but then the cost is O(r^2*log(n)).


Using Winnow for general LTFs
=============================
What if the target is a general linear threshold function of the form
w_1 x_1 + ... + w_n x_n >= w_0?

Let's scale so that all weights are integers, and assume all are
non-negative (can do that by introducing new variables y_i = 1-x_i).

Let W = w_1 + ... + w_n.

If we know W (or, say, W is just an upper bound on the true value) then
we can solve the problem like this:  just repeat each variable W times.
We then have a "k=w_0 out of r=W" problem, so we get a mistake-bound
of O(W^2 log(nW)).

Now, here's a cute thing: the above algorithm (repeating each variable
W times) does *exactly* the same thing as if we had run the algorithm
without repeating each variable!  (It's equivalent to initializing each
weight to W instead of 1 and using a threshold of nW instead of n.)
So, we really didn't have to do anything!

So, we get a good bound as a function of the L_1-size of the solution.
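
The equivalence between "repeat each variable W times" and "initialize
each weight to W with threshold nW" can be checked on a small example.
The sketch below is ours, not from the lecture: winnow_predictions and
replicate are hypothetical helper names, and exact rational arithmetic
(fractions.Fraction) is used so that the two runs can be compared for
exact equality rather than up to rounding.

```python
from fractions import Fraction


def winnow_predictions(examples, n_vars, init_weight, threshold, factor):
    """Run multiplicative-update Winnow; return its list of predictions.

    On a mistake on a positive, on-weights are multiplied by `factor`;
    on a mistake on a negative, they are divided by `factor`.
    """
    w = [Fraction(init_weight)] * n_vars
    preds = []
    for x, label in examples:
        total = sum(wi for wi, xi in zip(w, x) if xi)
        pred = 1 if total >= threshold else 0
        preds.append(pred)
        if pred != label:
            step = factor if label == 1 else 1 / factor
            w = [wi * step if xi else wi for wi, xi in zip(w, x)]
    return preds


def replicate(x, W):
    """Repeat each coordinate of x W times, e.g. (1,0) -> (1,1,1,0,0,0) for W=3."""
    return tuple(xi for xi in x for _ in range(W))
```

Usage: for a sequence of labeled examples over n variables, compare
winnow_predictions(examples, n, W, n*W, factor) against the same call
on the replicated examples with n*W variables, initial weight 1, and
threshold n*W.  All W copies of a variable always receive the same
updates, so the two prediction lists coincide.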

================================================

String-valued features and the IA model
---------------------------------------

The discussion so far has focused on learning over the instance space
{0,1}^n. I.e., examples have n Boolean-valued attributes.
Another common setting is one in which the attributes are
string-valued; that is, X = (\Sigma^*)^n.  For instance, one
attribute might represent an object's color, another its texture, etc.
If the number of choices for each attribute is small, we can just
convert this to the Boolean case, for instance by letting ``x_1 =
red'' be a Boolean variable that is either true or false in any
given example.  However, if the number of choices for an attribute is
large or is unknown a priori, this conversion may blow up the number of
variables.  

Question: how does this change the learning problem?

Sometimes, this is called the Infinite Attribute model.  Here, we have
an infinite number of Boolean features, but each example has at most n
that are "on".  An example is described as the set of its on-features.
The goal is to be polynomial in n.  E.g., if we think of representing
email messages by the set of words in them, then here n is the maximum
number of words in a given email message, which might be much smaller
than the total size of the dictionary.

These two models (n string-valued features, or an example being just a
list of at most n strings) are basically equivalent.

Example: learning an OR function. 

   - list-and-cross-off is not a very good strategy anymore.
   
   Idea: start off saying negative.  See 1st positive 
        (buy xanax real cheap)
   and use this to initialize hypothesis. Note that at least one of these
   is important.
   
   What to do when we make a mistake on a negative?
   What to do when we make a mistake on a positive?
   Say there are r variables in the target.  How many mistakes at most?
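
   One possible answer to these questions, as a sketch (our own reading,
   not necessarily the solution intended in lecture): on a mistake on a
   negative, cross off every string in the example; on a mistake on a
   positive, add every string in the example that has not been crossed
   off.  The class name InfiniteAttributeOR and its fields are
   hypothetical.

```python
class InfiniteAttributeOR:
    """Learn an OR over string features; examples are sets of on-features."""

    def __init__(self):
        self.hypothesis = set()   # strings currently in our OR
        self.crossed_off = set()  # strings seen in a negative we got wrong
        self.mistakes = 0

    def predict(self, features):
        """Predict 1 iff some string in the example is in the hypothesis."""
        return 1 if self.hypothesis & set(features) else 0

    def update(self, features, label):
        """Predict, then correct the hypothesis on a mistake. Returns the prediction."""
        features = set(features)
        pred = self.predict(features)
        if pred == label:
            return pred
        self.mistakes += 1
        if label == 0:
            # Mistake on a negative: none of these strings can be relevant.
            self.hypothesis -= features
            self.crossed_off |= features
        else:
            # Mistake on a positive: some string here is relevant and is not
            # yet in the hypothesis, so add all strings not known to be bad.
            self.hypothesis |= features - self.crossed_off
        return pred
```

   Under this scheme a relevant string is never crossed off, so each
   mistake on a positive adds at least one new relevant string (at most
   r such mistakes, adding at most n strings each), and each mistake on
   a negative removes at least one irrelevant string for good.  That
   gives at most r + r(n-1) = rn mistakes.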

Can also use Winnow.  Idea:

   Say we know r.  solve for smallest N >= n*(MB of Winnow)
                                         = n*(3r(log N + 1) + 2).

        (e.g., N = r*n^2 is sufficient if r and n are large, since RHS
         gives us O(nr*log(nr)) < r*n^2)
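
   The smallest such N can also be found numerically.  A minimal
   sketch, assuming log means log base 2 as in the Winnow theorem; the
   function names winnow_bound and smallest_N are ours.

```python
import math


def winnow_bound(r, N):
    """Winnow's mistake bound 3r(log N + 1) + 2 over N Boolean features."""
    return 3 * r * (math.log2(N) + 1) + 2


def smallest_N(n, r):
    """Smallest integer N with N >= n * (3r(log N + 1) + 2).

    A linear scan is enough for a sketch: N - n*winnow_bound(r, N) is
    negative at N = 1, and once it starts increasing it keeps increasing,
    so the first N that satisfies the inequality is the smallest one.
    """
    N = 1
    while N < n * winnow_bound(r, N):
        N += 1
    return N
```

   For example, smallest_N(50, 5) can be compared against a closed-form
   guess such as r*n^2 = 12500 to see how much slack the guess leaves.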

   Define y_1, ..., y_N as N boolean features.  What we'll do is
   assign these to strings as they come up in examples that winnow
   makes a mistake on. 

   Say we see example: (buy xanax real cheap).  Then we'll
   *temporarily* assign y_1 = "buy", etc.  Give winnow the
   example:
                1 1 1 1 0 0 ... 0

   If we make a mistake, we make this assignment permanent.  Otherwise
   we undo it (we forget we ever saw the example).

   Claim: What winnow sees is consistent with an OR of r boolean features.

   So, by its mistake bound and the defn of N, we won't ever run out
   of y's to assign.

   # mistakes at most 3r(log N + 1) + 2, which is O(rlog(nr)).
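
   The reduction above can be sketched as a thin wrapper around Winnow.
   This is an illustration under our own naming (the class
   InfiniteAttributeWinnow and its fields are hypothetical); slots are
   numbered from 0 rather than 1, examples are sets of strings, and
   labels are 1 for + and 0 for -.

```python
class InfiniteAttributeWinnow:
    """Winnow over N Boolean slots y_0..y_{N-1}, assigned to strings lazily."""

    def __init__(self, N):
        self.N = N
        self.weights = [1.0] * N
        self.index_of = {}  # permanent assignments: string -> slot
        self.mistakes = 0

    def observe(self, words, label):
        """Predict on the set `words`, update on a mistake, return the prediction."""
        words = set(words)
        # Temporarily give the next free slots to strings without a permanent one.
        new_words = [w for w in sorted(words) if w not in self.index_of]
        next_free = len(self.index_of)
        if next_free + len(new_words) > self.N:
            raise RuntimeError("ran out of y's: N was chosen too small")
        temp = {w: next_free + j for j, w in enumerate(new_words)}
        active = [self.index_of[w] if w in self.index_of else temp[w] for w in words]

        pred = 1 if sum(self.weights[i] for i in active) >= self.N else 0
        if pred != label:
            # Mistake: keep the temporary assignment and do the Winnow update.
            self.mistakes += 1
            self.index_of.update(temp)
            factor = 2.0 if label == 1 else 0.5
            for i in active:
                self.weights[i] *= factor
        # No mistake: `temp` is simply dropped, i.e. we forget the example.
        return pred
```

   Since slots are made permanent only on mistakes, and each mistake
   uses at most n new slots, at most n*(number of mistakes) slots are
   ever assigned.  This is why N is chosen to be at least n times
   Winnow's mistake bound.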

In general: any algorithm that is Attribute-Efficient (or even has up to
an n^{1-epsilon} dependence on the total number of variables) can be
translated to this model.

Note: decision list alg (from your hwk) fails.  Not known if it is
possible to learn Decision Lists in this infinite-attribute setting,
with a mistake bound polynomial in n and the length of the target
list.