An elaboration on the feature induction equations

In ``A maximum entropy approach to natural language processing'' (Computational Linguistics 22:1, March 1996), the appendix describes an approach to computing the gain of a single feature f. This note elaborates on the equations presented there; in particular, we show how to derive equations (37) and (38).


Recall the form of the model:



A typical measure of the quality of a model relative to some empirical distribution tex2html_wrap_inline396 is the log-likelihood:


Furthermore, we introduce the gain of a feature, which is the increase in log-likelihood of the model tex2html_wrap_inline398 relative to tex2html_wrap_inline400 :


tex2html_wrap_inline402 arises as a natural and efficient approximation to tex2html_wrap_inline404 ; see section 4.3 of the paper for details. Here tex2html_wrap_inline406 is the fraction of times tex2html_wrap_inline408 in the empirical sample; in other words, the ``empirical expected value'' of f.

Deriving tex2html_wrap_inline412

The efficient feature selection algorithm computes for each feature f the optimal weight tex2html_wrap_inline416 if f were to be adjoined to the set of active features tex2html_wrap_inline420 , under the assumption that the other parameters remain fixed. Since the optimal tex2html_wrap_inline422 occurs at tex2html_wrap_inline424 , we can apply a numerical root-finding technique (such as Newton's method) to tex2html_wrap_inline426 to locate the optimal tex2html_wrap_inline428 . Under suitable conditions, a numerical root-finding technique is guaranteed to move monotonically closer to the root with each iteration.

Differentiating (3), we get


Introducing the abbreviation tex2html_wrap_inline430 , we can rewrite this equation as


For tex2html_wrap_inline432 , we differentiate once more:


Before things get any hairier, we will write f for tex2html_wrap_inline436 and introduce tex2html_wrap_inline438 . The notation (in the latter case) is meant to indicate that tex2html_wrap_inline440 is a constant, not taking part in summations. Carrying on, we have


The last line matches equation (38) in the paper.

Adam Berger
Thu Aug 1 13:39:45 EDT 1996