
Features and constraints

Our goal is to construct a statistical model of the process which generated the training sample \tilde{p}(x, y). The building blocks of this model will be a set of statistics of the training sample. In the current example we have employed several such statistics: the frequency with which in translated to either dans or en was 3/10; the frequency with which it translated to either dans or au cours de was 1/2; and so on. These particular statistics were independent of the context, but we could also consider statistics which depend on the conditioning information x. For instance, we might notice that, in the training sample, if April is the word following in, then the translation of in is en with frequency 9/10.

To express the event that in translates as en when April is the following word, we can introduce the indicator function

  f(x, y) = 1   if y = en and April follows in
          = 0   otherwise
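As a concrete sketch, the indicator can be written as a small function. Here the context x is simplified, purely for illustration, to the single word that follows in; in general x could carry much richer conditioning information:

```python
def f(x, y):
    """Indicator feature: 1 if 'in' translates as 'en' and 'April' follows.

    For illustration only, the context x is reduced to the single word
    that follows 'in'; a realistic context would be richer.
    """
    return 1 if y == "en" and x == "April" else 0

print(f("April", "en"))    # 1
print(f("April", "dans"))  # 0
```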
The expected value of f with respect to the empirical distribution \tilde{p}(x, y) is exactly the statistic we are interested in. We denote this expected value by

  \tilde{p}(f) = \sum_{x, y} \tilde{p}(x, y) f(x, y)     (1)
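Since the empirical distribution places mass 1/N on each of the N observed pairs, \tilde{p}(f) is simply the sample average of f. A minimal sketch, using an invented toy sample consistent with the in/April example:

```python
def f(x, y):
    # Indicator feature; x is simplified to the word following 'in'.
    return 1 if y == "en" and x == "April" else 0

def empirical_expectation(sample, f):
    # ~p(x, y) puts mass 1/N on each observed pair, so ~p(f) is just
    # the average of f over the sample.
    return sum(f(x, y) for x, y in sample) / len(sample)

# Invented sample: 'in April' rendered as 'en' 9 times, 'dans' once.
sample = [("April", "en")] * 9 + [("April", "dans")]
print(empirical_expectation(sample, f))  # 0.9
```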
We can express any statistic of the sample as the expected value of an appropriate binary-valued indicator function f. We call such a function a feature function or feature for short. (As with probability distributions, we will sometimes abuse notation and use f(x, y) to denote both the value of f at a particular pair (x, y) as well as the entire function f.)

When we discover a statistic that we feel is useful, we can acknowledge its importance by requiring that our model accord with it. We do this by constraining the expected value that the model assigns to the corresponding feature function f. The expected value of f with respect to the model p(y|x) is

  p(f) = \sum_{x, y} \tilde{p}(x) p(y|x) f(x, y)     (2)
where \tilde{p}(x) is the empirical distribution of x in the training sample. We constrain this expected value to be the same as the expected value of f in the training sample. That is, we require

  p(f) = \tilde{p}(f)     (3)
Combining (1), (2) and (3) yields the more explicit equation

  \sum_{x, y} \tilde{p}(x) p(y|x) f(x, y) = \sum_{x, y} \tilde{p}(x, y) f(x, y)
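To make the two sides of this equation concrete, the following sketch computes both expectations and checks that constraint (3) holds. The sample and the conditional model p(y|x) below are invented for illustration; any model satisfying the constraint would agree with the sample on the feature's expected value in the same way:

```python
from collections import Counter

def f(x, y):
    # Indicator feature; x is simplified to the word following 'in'.
    return 1 if y == "en" and x == "April" else 0

# The five candidate French renderings of 'in'.
OUTPUTS = ["dans", "en", "a", "au cours de", "pendant"]

# Invented sample: 'in April' rendered as 'en' 9 times, 'dans' once.
sample = [("April", "en")] * 9 + [("April", "dans")]

def p(y, x):
    # A hypothetical conditional model: after 'April', put 9/10 on 'en'
    # and spread the remaining mass evenly over the other four outputs.
    if x == "April":
        return 0.9 if y == "en" else 0.025
    return 0.2

def empirical_expectation(sample, f):
    # Right-hand side: sum_{x,y} ~p(x,y) f(x,y), the sample average of f.
    return sum(f(x, y) for x, y in sample) / len(sample)

def model_expectation(sample, p, f, outputs):
    # Left-hand side: sum_x ~p(x) sum_y p(y|x) f(x,y), where ~p(x) is the
    # empirical distribution of contexts in the sample.
    counts = Counter(x for x, _ in sample)
    n = len(sample)
    return sum((c / n) * sum(p(y, x) * f(x, y) for y in outputs)
               for x, c in counts.items())

lhs = model_expectation(sample, p, f, OUTPUTS)
rhs = empirical_expectation(sample, f)
print(abs(lhs - rhs) < 1e-12)  # True: this model satisfies constraint (3)
```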
We call the requirement (3) a constraint equation or simply a constraint. By restricting attention to those models p(y|x) for which (3) holds, we are eliminating from consideration those models which do not agree with the training sample on how often the output of the process should exhibit the feature f.

To sum up so far, we now have a means of representing statistical phenomena inherent in a sample of data (namely, \tilde{p}(f)), and also a means of requiring that our model of the process exhibit these phenomena (namely, p(f) = \tilde{p}(f)).

One final note about features and constraints bears repeating: though the words ``feature'' and ``constraint'' are often used interchangeably in discussions of maximum entropy, we will be vigilant to distinguish the two and urge the reader to do likewise: a feature is a binary-valued function of the pair (x, y); a constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.


Adam Berger
Fri Jul 5 11:43:50 EDT 1996