Our goal is to construct a statistical model of the process which generated the
training sample. The building blocks of this model will be a set of
statistics of the training sample. In the current example we have employed
several such statistics: the frequency with which *in* translated to either
*dans* or *en*; the frequency with which it translated to either *dans* or
*au cours de*; and so on. These particular statistics
were independent of the context, but we could also consider statistics which
depend on the conditioning information *x*. For instance, we might notice
that, in the training sample, if *April* is the word following *in*,
then the translation of *in* is *en* with some particular frequency.

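To express the event that *in* translates as *en* when *April* is
the following word, we can introduce the indicator function

$$f(x, y) = \begin{cases} 1 & \text{if } y = \textit{en} \text{ and } \textit{April} \text{ follows } \textit{in} \\ 0 & \text{otherwise} \end{cases}$$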

The expected value of *f* with respect to the empirical distribution
$\tilde{p}(x, y)$ is exactly the statistic we are interested in. We denote this
expected value by

$$\tilde{p}(f) \equiv \sum_{x, y} \tilde{p}(x, y)\, f(x, y) \tag{1}$$

We can express any statistic of the sample as the expected value of an
appropriate binary-valued indicator function *f*. We call such a function a
*feature function* or *feature* for short. (As with probability
distributions, we will sometimes abuse notation and use $f(x, y)$ to denote both
the value of *f* at a particular pair $(x, y)$ and the entire function
*f*.)
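
As a minimal sketch of these definitions, the following Python computes the empirical expected value of an indicator feature over a small sample. The sample is invented purely for illustration and is not the training data discussed above; the names `f_en_april` and `empirical_expectation`, and the choice to represent the context *x* as just the word following *in*, are assumptions made here.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical training sample of (x, y) pairs. As a simplifying assumption,
# x is the word following "in" and y is the French translation of "in".
sample = [
    ("April", "en"),
    ("April", "en"),
    ("April", "dans"),
    ("the", "dans"),
    ("the", "au cours de"),
]

def f_en_april(x, y):
    """Indicator feature: 1 if y is "en" and "April" follows "in", else 0."""
    return 1 if (y == "en" and x == "April") else 0

def empirical_expectation(f, sample):
    """Sum over (x, y) of p~(x, y) * f(x, y), where p~ is relative frequency."""
    counts = Counter(sample)
    n = len(sample)
    return sum(Fraction(c, n) * f(x, y) for (x, y), c in counts.items())

expected = empirical_expectation(f_en_april, sample)
print(expected)
```

Exact fractions are used rather than floats so that expected values can later be compared for equality without rounding concerns.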

When we discover a statistic that we feel is useful, we can acknowledge its
importance by requiring that our model accord with it. We do this by
constraining the expected value that the model assigns to the corresponding
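feature function *f*. The expected value of *f* with respect to the model
$p(y \mid x)$ is

$$p(f) \equiv \sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) \tag{2}$$

where $\tilde{p}(x)$ is the empirical distribution of *x* in the training sample.
We constrain this expected value to be the same as the expected value of *f* in
the training sample. That is, we require

$$p(f) = \tilde{p}(f) \tag{3}$$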

Combining (1), (2) and (3) yields the more explicit equation

$$\sum_{x, y} \tilde{p}(x)\, p(y \mid x)\, f(x, y) = \sum_{x, y} \tilde{p}(x, y)\, f(x, y)$$

We call the requirement (3) a *constraint
equation* or simply a *constraint*. By restricting attention to those
models for which (3) holds, we are
eliminating from consideration those models which do not agree with the
training sample on how often the output of the process should exhibit the
feature *f*.
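
To make the constraint concrete, here is a hedged sketch that computes a model's expected value of a feature and checks whether it matches the empirical one. The sample, the two candidate models, and all function names are hypothetical, and a conditional model is represented as a nested dictionary mapping each context *x* to a distribution over outputs *y*.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical sample: x is the word following "in", y is its translation.
sample = [
    ("April", "en"),
    ("April", "en"),
    ("April", "dans"),
    ("the", "dans"),
    ("the", "au cours de"),
]

def f(x, y):
    """Indicator feature: 1 if y is "en" and "April" follows "in", else 0."""
    return 1 if (y == "en" and x == "April") else 0

def empirical_expectation(f, sample):
    """Sum over (x, y) of p~(x, y) * f(x, y)."""
    n = len(sample)
    return sum(Fraction(c, n) * f(x, y) for (x, y), c in Counter(sample).items())

def model_expectation(f, model, sample):
    """Sum over (x, y) of p~(x) * p(y|x) * f(x, y)."""
    n = len(sample)
    x_counts = Counter(x for x, _ in sample)
    return sum(
        Fraction(c, n) * p_y * f(x, y)
        for x, c in x_counts.items()
        for y, p_y in model[x].items()
    )

def satisfies_constraint(f, model, sample):
    """True when the model and the sample agree on the expected value of f."""
    return model_expectation(f, model, sample) == empirical_expectation(f, sample)

# Two hypothetical candidate models p(y|x).
model_a = {
    "April": {"en": Fraction(2, 3), "dans": Fraction(1, 3)},
    "the": {"dans": Fraction(1, 2), "au cours de": Fraction(1, 2)},
}
model_b = {
    "April": {"en": Fraction(1, 2), "dans": Fraction(1, 2)},
    "the": {"dans": Fraction(1, 2), "au cours de": Fraction(1, 2)},
}

print(satisfies_constraint(f, model_a, sample))
print(satisfies_constraint(f, model_b, sample))
```

In this toy setup `model_a` satisfies the constraint and `model_b` does not, so restricting attention to models for which the constraint holds would eliminate `model_b`.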

To sum up so far, we now have a means of representing statistical phenomena inherent in a sample of data (namely, $\tilde{p}(f)$), and also a means of requiring that our model of the process exhibit these phenomena (namely, $p(f) = \tilde{p}(f)$).

One final note about features and constraints bears repeating. Though the words "feature" and "constraint" are often used interchangeably in discussions of maximum entropy, we will be vigilant to distinguish the two and urge the reader to do likewise: a feature is a binary-valued function of $(x, y)$; a constraint is an equation between the expected value of the feature function in the model and its expected value in the training data.

Fri Jul 5 11:43:50 EDT 1996