15-859(B) Machine Learning Theory            03/01/10
* Maximum entropy and maximum likelihood exponential model
* Connection to Winnow.

===========================================================================
- 2 different, natural, optimization criteria that have the same solution.
- turns out to do well in practice too: one of best algorithms for
  text classification problems.
- Winnow (balanced/multiclass version) can be viewed as a fast
  approximation.  Also related to SVMs

We'll be assuming that examples are given by n boolean features.
Also convenient to have "dummy feature" x_0 = 1 in every example.

High level
----------
First of all, we are going to be producing a hypothesis that, given x,
outputs a probability distribution over labels.  E.g., labels might be
{+,-} or might have a larger set of labels.  p_h(y|x) = the amount of
probability mass that h(x) assigns to label y.  Let S = data sample.
p_S(y|x) = 1 if label of x is y, and 0 otherwise.  [or if x is in the
sample multiple times with different labels, then p_S(y|x) is the
empirical distribution of these labels]

Second, we're not going to make assumptions about there being a
correct target function of some form --- we're just going to ask for
the hypothesis that maximizes a certain objective function.  Going to
be a convex optimization problem.

Here are 2 natural criteria that turn out to be equivalent.

1. Maximum entropy
==================
General principle for fitting a "maximally ignorant" prob dist given
known constraints. E.g., you roll a die many times and find that the
average value is 4 instead of 3.5.  Could ask for the maximum entropy
distribution subject to this constraint.  Entropy of a distribution p
over y is defined as:
          H(p) = sum_y p(y)*lg(1/p(y))
It's a measure of how spread out p is.  For instance, if we played 20
questions and I picked y from distribution p, and suppose we ignored
integrality issues so you could split the distribution 50/50 with
each question.  Then it would take you lg(1/p(y)) questions to guess a
given y, and H(p) would be the expected number of question asked.  [To
deal with integrality, imagine we change the game so that I pick k y's
iid from p and you need to identify the entire string.]

In the context of a probabilistic function over a dataset, the
standard measure used is the average entropy over the data. I.e.,

   H(p_h) = average_{x in S} sum_y p_h(y|x) lg[1/p_h(y|x)]

The maximum entropy p_h is the one that predicts y uniformly at
random, no matter what x is.  But, let's say we want h to agree in
certain ways with the sample.  E.g., if the sample is 30% positive,
then we want h (over x in S) to be 30% positive on average.  If the
sample is 45% positive given that x_i = 1, then we want h to agree
with that too.  Specifically, we want:

1.  for all y, sum_{x in S} p_h(y|x) = sum_{x in S} p_S(y|x)

2.  for all y, for all x_i,

      sum_{x in S:x_i=1} p_h(y|x) = sum_{x in S:x_i=1} p_S(y|x).

Note: it's easy to achieve these by just memorizing data: let p_h = p_S.
Note: can merge 1 into 2 because of dummy feature.

DEF: Let P be the set of probability distributions satisfying these
constraints.  Our objective: find the MAXIMUM ENTROPY h in P.

 [Note 1: usually in this area, a "feature" is property of the example
 and the label.  E.g., "x_i = 1 and the  label is +".  But I will use
 feature in the standard machine-learning sense.  It doesn't matter much.]

 [Note 2: this is a pretty strange objective.  In particular, it says
 nothing about what h should do on points x not in S!  Luckily, there
 will exist "reasonable" functions that optimize this objective.]

2. MAXIMUM LIKELIHOOD "EXPONENTIAL MODEL" or "LOGISTIC LINEAR SEPARATOR"
========================================================================

Here's a natural family (let's call it "Q") of probabilistic
hypotheses.  Let w_{iy} be a weight for feature i and label y.  Let
w_y be the vector for y.  Let p_h(y|x) be proportional to e^{w_y.x}.
I.e., p_h(y|x) = e^{w_y.x}/Z(x), where Z(x) = sum_y' e^{w_y'.x}.

Of all these, pick the maximum likelihood hypothesis.  I.e., we want
the h in Q that maximizes the probability of the actual labels on the
data.

If you just think of positive and negative examples, and let w =
w_{+}-w_{-}, you get p_h(+|x) = e^{w.x}/(1+e^{w.x}).  So you can think of
this as a linear separator with a prediction that is more confident as
x gets farther from the separator.  Which one maximizes likelihood?
We want to maximize the product of the probabilities of the observed
labels.  Equivalently, by taking -ln, we want to minimize a sum of
quantities we can view as penalties, where points near the separator
get a small penalty, and points on the wrong side of the separator get
a penalty that is roughly proportional to their distance to the plane [do it
out: p_h(+|x) = e^{w.x}/(1+e^{w.x}), so -ln(p_h(+|x)) = -w.x + ln(1+e^{w.x}) 
and draw picture]

I.e., this is multinomial logistic regression.

Interesting theorem: criteria 1 and 2 are duals.  p* in the
intersection of P and Q is the optimum solution to both.
[and finding the max likelihood p_h in Q is a convex optimization problem]

ANALYSIS of MAXENT and MAXIMUM-LIKELIHOOD
==========================================

First, for a general distribution p_h, what is the likelihood?
product_{x in S} p_h(label(x)|x).  Let's take the -log and divide by
|S|.  I.e., we want to find h in Q to:

    minimize average_{x in S} log(1/p_h(label(x)|x)).

Can write it like this:

    minimize average_{x in S} sum_y p_S(y|x) log[1/p_h(y|x)].


This has an interesting interpretation.  For two distributions p,q,
let's define the "cross entropy":

     H(p,q) = sum_y p(y)log[1/q(y)].

This is the average number of bits to encode y distributed according
to p, if we use the optimal code for q.  H(p,p) = H(p).
Clearly, H(p,q) >= H(p,p).

Let's extend this definition to average over {x in S} if p,q are
functions over x.  H(p,q) = average_{x in S} sum_y p(y|x)log[1/q(y|x)]

So, HERE IS THE KEY POINT:
 - for criteria (2) we want h in Q that minimizes H(p_S,p_h).

 - for criteria (1) we want h in P that maximizes H(p_h,p_h).


The rest of the analysis will follow from the following lemma:

LEMMA: let p, p' in P, q in Q.  Then H(p,q) = H(p',q).  

[I.e., if we optimize our code for q, then all distribs in P are
equally good for us.  We can call this quantity H(P,q), since it's the
same for all distributions in P].

PROOF: H(p,q) = 

  average_{x in S} sum_y p(y|x)*[log(Z(x)) - lg(e)*w_y.x]

Notice that Z doesn't depend on y, and sum_y p(y|x) = 1.  So, the Z
part just gives average_{x in S}[log(Z(x))] which doesn't depend on
p.  For the rest of it, let's factor out the lg(e) and what's left is:

    - average_{x in S} sum_y p(y|x)*[sum_i w_{yi}*x_i]

  = - sum_y sum_i w_{yi}*[average_{x in S} p(y|x) x_i]

Notice that for a given i,y, the inside part is non-zero only for x in
S such that x_i=1.  It really only depends on the average over {x in
S: x_i=1} of p(y|x).  By definition of P, this is the same for all p
in P: we can replace p with p_S.  So, it didn't matter which p in P we
were looking at.  QED.


OK, now let's use this to prove that if p* is in both P and Q then it's
the optimum solution to both criteria.

Maximizing likelihood over Q:
 I.e., we want to show that H(p_S, p*) <= H(p_S, q) for any q in Q.

 But notice that p_S is in P.  So, the LHS equals H(p*,p*) and the RHS
 equals H(p*,q), so by defn of H, LHS <= RHS.


Maximizing entropy over P:
 I.e., we want to show that H(p*,p*) >= H(p,p) for any p in P.

 Since p* is in Q, the LHS equals H(p,p*).  So again we have what we
 want by defn of H.

This proves the result.  One thing left is to argue that such a p*
exists: i.e., it is possible to satisfy the sample constraints of P
using a model of the expenential form Q.  Actually, we're in a little
trouble here: what if S has 100% positive examples, or all examples
with x_5=1 are negative?  The problem is there is no maximum value to
optimization (2).  We can handle this by either (a) replacing Q
with its closure \bar{Q}, or else (b) redefining p_S to imagine the
dataset had at least one positive and one negative example with all
1's in it (this is probably the better thing to do in practice).
Won't prove this part of the result but this is where the x_0 comes in. 

Also, want uniqueness: if h in \bar{Q} minimizes H(p_S,p_h) and p* is
in P intersect \bar{Q}, we want to argue that p_h = p*.  Can do this
using our main lemma.  H(p_S,p_h) = H(p*,p_h), but this minimum occurs
only at p_h = p*.

RELATION TO BALANCED/MULTICLASS-WINNOW
======================================
It turns out the balanced winnow algorithm approximates the maxent
constraints if you set the multiplier to 1+epsilon, 1/(1+epsilon).  In
particular, the number of pos->neg mistakes is approximately the
number of neg->pos mistakes (and this holds even if you focus on just
the examples with x_i=1) which means the number of times it predicts
positive is approximately the number of positives in the sequence (and
this holds even if you focus on just the examples with x_i=1).  Also,
it uses rules of the same form as Q (except it just outputs the
highest probability label rather than a probability distribution).  In
practice it seems to work like a somewhat less smooth version of
maxent but can be a lot faster.