In ``A maximum entropy approach to natural language processing''
(*Computational Linguistics* 22:1, March 1996), the appendix describes an
approach to computing the gain of a single feature *f*. This note elaborates on
the equations presented there; in particular, we show how to derive equations
(37) and (38).

**Background**

Recall the form of the model:

$$
p_\alpha(y \mid x) \;=\; \frac{q(y \mid x)\, e^{\alpha f(x,y)}}{Z_\alpha(x)} \tag{1}
$$

where

- $q(y \mid x)$ is some fixed distribution,
- $f(x,y)$ is an indicator function or ``feature,'' drawn from some large set of possible features,
- $\alpha$ is a real-valued weight corresponding to the feature *f*, and
- $Z_\alpha(x) \equiv \sum_y q(y \mid x)\, e^{\alpha f(x,y)}$ is a normalizing term, required to make $p_\alpha$ a probability distribution.
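
To make the definitions concrete, here is a minimal sketch in Python with NumPy (a choice of this note, not the paper); the arrays `q` and `feat` and the weight `alpha` are made-up toy values standing in for $q(y \mid x)$, $f(x,y)$, and $\alpha$.

```python
import numpy as np

# Toy setup: two contexts x, three outcomes y; all values are made up.
q = np.array([[0.5, 0.3, 0.2],      # q(y|x) for x = 0
              [0.1, 0.6, 0.3]])     # q(y|x) for x = 1
feat = np.array([[1.0, 0.0, 0.0],   # f(x, y): a binary feature
                 [0.0, 1.0, 0.0]])
alpha = 0.7                         # candidate weight for the feature

# Z_alpha(x) = sum_y q(y|x) exp(alpha f(x,y)), the normalizer in (1)
Z = (q * np.exp(alpha * feat)).sum(axis=1)

# p_alpha(y|x) = q(y|x) exp(alpha f(x,y)) / Z_alpha(x), equation (1)
p_alpha = q * np.exp(alpha * feat) / Z[:, None]

assert np.allclose(p_alpha.sum(axis=1), 1.0)  # each row is a distribution
```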

A typical measure of the quality of a model $p$ relative to some empirical distribution $\tilde p(x,y)$ is the log-likelihood:

$$
L(p) \;\equiv\; \sum_{x,y} \tilde p(x,y) \log p(y \mid x) \tag{2}
$$

Furthermore, we introduce the *gain* of a feature, which is the increase in
log-likelihood of the model $p_\alpha$ relative to $q$:

$$
G(\alpha) \;\equiv\; L(p_\alpha) - L(q) \;=\; \alpha\, \tilde p(f) \;-\; \sum_x \tilde p(x) \log Z_\alpha(x) \tag{3}
$$

$G(\alpha)$ arises as a natural and efficient approximation to the exact gain attained by adjoining *f* and refitting all the parameters of the model;
see section 4.3 of the paper for details. Here $\tilde p(f) \equiv \sum_{x,y} \tilde p(x,y) f(x,y)$ is the fraction of times *f* was active in the empirical sample; in
other words, the ``empirical expected value'' of *f*.
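
Continuing the toy sketch above (same `q`, `feat`, `alpha`, `Z`, and `p_alpha`), the following lines introduce a made-up empirical distribution and check that the closed form (3) agrees with the difference of log-likelihoods in (2):

```python
# Continuing the sketch above (same q, feat, alpha, Z, p_alpha).
p_tilde_xy = np.array([[0.30, 0.05, 0.05],    # made-up tilde-p(x, y)
                       [0.05, 0.45, 0.10]])
p_tilde_x = p_tilde_xy.sum(axis=1)            # marginal tilde-p(x)
p_tilde_f = (p_tilde_xy * feat).sum()         # empirical expected value of f

def log_likelihood(p):
    """Equation (2): sum over x,y of tilde-p(x,y) log p(y|x)."""
    return (p_tilde_xy * np.log(p)).sum()

# Equation (3): the gain in closed form ...
gain = alpha * p_tilde_f - (p_tilde_x * np.log(Z)).sum()
# ... equals the log-likelihood increase of p_alpha over q.
assert np.isclose(gain, log_likelihood(p_alpha) - log_likelihood(q))
```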

**Deriving $G'(\alpha)$ and $G''(\alpha)$**

The efficient feature selection algorithm computes for each feature *f* the
optimal weight $\alpha^\star$ that *f* would receive if it were adjoined to the set of active features
$S$, under the assumption that the other parameters of the model remain fixed. Since the
optimal $\alpha^\star$ occurs where $G'(\alpha^\star) = 0$, we can apply a numerical root-finding
technique (such as Newton's method) to $G'(\alpha)$ to locate the optimal
$\alpha^\star$; a complete numerical sketch appears at the end of this note. Under suitable conditions, Newton's method is
guaranteed to move monotonically closer to the root with each iteration.
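
The iteration itself does not depend on the particular forms of $G'$ and $G''$, which we derive next. As a minimal sketch in the same illustrative Python, with `gp` and `gpp` standing for callables that evaluate $G'$ and $G''$:

```python
def newton_root(gp, gpp, alpha=0.0, iters=20):
    """Generic Newton iteration for gp(alpha) = 0, using its derivative gpp."""
    for _ in range(iters):
        alpha -= gp(alpha) / gpp(alpha)
    return alpha
```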

Differentiating (3) with respect to $\alpha$, and using $\frac{d}{d\alpha} \log Z_\alpha(x) = Z'_\alpha(x)/Z_\alpha(x)$, we get

$$
G'(\alpha) \;=\; \tilde p(f) \;-\; \sum_x \tilde p(x)\, \frac{\sum_y q(y \mid x)\, f(x,y)\, e^{\alpha f(x,y)}}{Z_\alpha(x)} \tag{4}
$$

Introducing the abbreviation $p_\alpha(f) \equiv \sum_x \tilde p(x) \sum_y p_\alpha(y \mid x)\, f(x,y)$, we can rewrite this equation as

$$
G'(\alpha) \;=\; \tilde p(f) \;-\; p_\alpha(f) \tag{5}
$$

which matches equation (37) in the paper.
For $G''(\alpha)$, we differentiate (4) once more:

$$
G''(\alpha) \;=\; -\sum_x \tilde p(x)\, \frac{d}{d\alpha} \left[ \frac{\sum_y q(y \mid x)\, f(x,y)\, e^{\alpha f(x,y)}}{Z_\alpha(x)} \right]
$$

Before things get any hairier, we will write *f* for $f(x,y)$
and introduce $\bar f \equiv f(x, \bar y)$. The bar notation (in the latter case) is meant to
indicate that $\bar y$ is a constant, not taking part in
summations. Carrying on (by the quotient rule), we have

$$
\begin{aligned}
G''(\alpha) \;&=\; -\sum_x \tilde p(x) \sum_y \frac{Z_\alpha(x)\, q(y \mid x)\, f^2 e^{\alpha f} \;-\; q(y \mid x)\, f e^{\alpha f} \sum_{\bar y} q(\bar y \mid x)\, \bar f e^{\alpha \bar f}}{Z_\alpha(x)^2} \\
&=\; -\sum_x \tilde p(x) \left[ \sum_y p_\alpha(y \mid x)\, f^2 \;-\; \sum_y p_\alpha(y \mid x)\, f \sum_{\bar y} p_\alpha(\bar y \mid x)\, \bar f \right] \\
&=\; -\sum_x \tilde p(x) \left[ \sum_y p_\alpha(y \mid x)\, f^2 \;-\; \left( \sum_y p_\alpha(y \mid x)\, f \right)^{\!2} \right]
\end{aligned}
$$

The last line matches equation (38) in the paper; it says that $G''(\alpha)$ is the negative of the $\tilde p(x)$-weighted variance of *f* under $p_\alpha$, so $G$ is concave in $\alpha$.
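
Putting the pieces together: a short sketch (continuing the illustrative Python example, and using `newton_root` from above) that evaluates the closed forms (37) and (38) and solves $G'(\alpha^\star) = 0$; all array values remain made-up toy numbers.

```python
# Continuing the sketch: closed forms (37) and (38), then Newton's method.
def model(a):
    """p_a(y|x) as in equation (1)."""
    w = q * np.exp(a * feat)
    return w / w.sum(axis=1, keepdims=True)

def g_prime(a):
    """Equation (37): tilde-p(f) - p_a(f)."""
    return p_tilde_f - (p_tilde_x * (model(a) * feat).sum(axis=1)).sum()

def g_double_prime(a):
    """Equation (38): minus the tilde-p(x)-weighted variance of f under p_a."""
    p = model(a)
    mean = (p * feat).sum(axis=1)
    return -(p_tilde_x * ((p * feat**2).sum(axis=1) - mean**2)).sum()

alpha_star = newton_root(g_prime, g_double_prime)
assert abs(g_prime(alpha_star)) < 1e-9  # tilde-p(f) = p_alpha(f) at the optimum
```

Because $G''(\alpha) < 0$ wherever *f* has nonzero variance under $p_\alpha$, each Newton step in the sketch is well defined and the iterate settles at the unique maximum of the concave gain.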

Thu Aug 1 13:39:45 EDT 1996