Our goal is to construct a statistical model of the process which generated the
training sample $\tilde{p}(x, y)$. The building blocks of this model will be a set of
statistics of the training sample. In the current example we have employed
several such statistics: the frequency with which in translated to either
dans or en; the frequency with which it translated to either
dans or au cours de; and so on. These particular statistics
were independent of the context, but we could also consider statistics which
depend on the conditioning information x. For instance, we might measure,
in the training sample, how often the translation of in is en
when April is the word following in.
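To express the event that in translates as en when April is the following word, we can introduce the indicator function
\[
f(x, y) = \begin{cases} 1 & \text{if } y = \textit{en} \text{ and \textit{April} follows \textit{in}} \\ 0 & \text{otherwise.} \end{cases}
\]
The expected value of f with respect to the empirical distribution
$\tilde{p}(x, y)$ is exactly the statistic we are interested in. We denote this expected value
by
\[
\tilde{p}(f) \equiv \sum_{x, y} \tilde{p}(x, y) \, f(x, y). \tag{1}
\]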
We can express any statistic of the sample as the expected value of an
appropriate binary-valued indicator function f. We call such a function a
feature function or feature for short. (As with probability
distributions, we will sometimes abuse notation and use $f(x, y)$ to denote both
the value of f at a particular pair $(x, y)$
as well as the entire function
f.)
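As a minimal sketch of how the empirical expected value of a feature could be computed, the Python below counts relative frequencies in a small sample. The sample itself, the candidate translations, and the helper name `empirical_expectation` are hypothetical and chosen only for illustration; exact fractions are used so the arithmetic is easy to check by hand.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample of (x, y) pairs, invented for illustration:
# x is the word following "in", y is the translation chosen for "in".
sample = [
    ("April", "en"),
    ("April", "en"),
    ("April", "dans"),
    ("the", "dans"),
    ("a", "pendant"),
]


def f(x, y):
    """Indicator feature: 1 if "in" is translated "en" and "April" follows."""
    return 1 if y == "en" and x == "April" else 0


def empirical_expectation(feature, pairs):
    """Sum over distinct (x, y) of the relative frequency times feature(x, y)."""
    counts = Counter(pairs)
    n = len(pairs)
    return sum(Fraction(c, n) * feature(x, y) for (x, y), c in counts.items())


p_tilde_f = empirical_expectation(f, sample)
print(p_tilde_f)  # 2/5
```

Only the pair ("April", "en") activates the feature, and it accounts for two of the five sample points, so the expected value is 2/5.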
When we discover a statistic that we feel is useful, we can acknowledge its
importance by requiring that our model accord with it. We do this by
constraining the expected value that the model assigns to the corresponding
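feature function f. The expected value of f with respect to the model
$p(y \mid x)$ is
\[
p(f) \equiv \sum_{x, y} \tilde{p}(x) \, p(y \mid x) \, f(x, y), \tag{2}
\]
where $\tilde{p}(x)$ is the empirical distribution of x in the training sample.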
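The model expectation can be sketched in the same way: contexts are weighted by their relative frequency in the sample, while the outputs are weighted by the conditional model. Everything concrete here is an assumption made for illustration, including the toy sample, the list of candidate translations, and the uniform model standing in for an arbitrary conditional model.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample of (x, y) pairs, invented for illustration.
sample = [
    ("April", "en"),
    ("April", "en"),
    ("April", "dans"),
    ("the", "dans"),
    ("a", "pendant"),
]
# Hypothetical set of candidate translations of "in".
translations = ["dans", "en", "pendant"]


def f(x, y):
    """Indicator feature: 1 if "in" is translated "en" and "April" follows."""
    return 1 if y == "en" and x == "April" else 0


def uniform_model(y, x):
    """Placeholder conditional model: every translation is equally likely."""
    return Fraction(1, len(translations))


def model_expectation(feature, model, pairs, outputs):
    """Sum over contexts x and outputs y of freq(x) * model(y | x) * feature."""
    context_counts = Counter(x for x, _ in pairs)
    n = len(pairs)
    return sum(
        Fraction(c, n) * model(y, x) * feature(x, y)
        for x, c in context_counts.items()
        for y in outputs
    )


p_f = model_expectation(f, uniform_model, sample, translations)
print(p_f)  # 1/5
```

Here the context "April" has relative frequency 3/5 and the uniform model gives "en" probability 1/3, so the model expectation of the feature is 1/5.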
We constrain this expected value to be the same as the expected value of f in
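the training sample. That is, we require
\[
p(f) = \tilde{p}(f). \tag{3}
\]
Combining (1), (2) and (3) yields the more explicit equation
\[
\sum_{x, y} \tilde{p}(x) \, p(y \mid x) \, f(x, y) = \sum_{x, y} \tilde{p}(x, y) \, f(x, y).
\]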
We call the requirement (3) a constraint
equation or simply a constraint. By restricting attention to those
models for which (3) holds, we are
eliminating from consideration those models which do not agree with the
training sample on how often the output of the process should exhibit the
feature f.
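To make the filtering effect of a constraint concrete, the sketch below checks two candidate models against a single feature: a uniform model, and a model that copies the conditional relative frequencies of the sample. The sample, the candidate translations, and all function names are hypothetical; this is one possible way to test the equality, not a training procedure.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy sample of (x, y) pairs, invented for illustration.
sample = [
    ("April", "en"),
    ("April", "en"),
    ("April", "dans"),
    ("the", "dans"),
    ("a", "pendant"),
]
outputs = ["dans", "en", "pendant"]  # hypothetical candidate translations


def f(x, y):
    """Indicator feature: 1 if "in" is translated "en" and "April" follows."""
    return 1 if x == "April" and y == "en" else 0


def empirical_expectation(feature, pairs):
    """Average of the feature over the observed (x, y) pairs."""
    n = len(pairs)
    return sum(Fraction(feature(x, y), n) for x, y in pairs)


def model_expectation(feature, model, pairs, outputs):
    """Average over observed contexts of the model-weighted feature value.

    Giving each sample point weight 1/n is equivalent to weighting each
    distinct context by its relative frequency in the sample.
    """
    n = len(pairs)
    return sum(
        Fraction(1, n) * model(y, x) * feature(x, y)
        for x, _ in pairs
        for y in outputs
    )


def satisfies_constraint(feature, model, pairs, outputs):
    """True when model and sample agree on the feature's expected value."""
    return model_expectation(feature, model, pairs, outputs) == (
        empirical_expectation(feature, pairs)
    )


def uniform_model(y, x):
    """Every candidate translation is equally likely, whatever the context."""
    return Fraction(1, len(outputs))


def empirical_conditional(y, x):
    """Relative frequency of y among sample points whose context is x."""
    joint = Counter(sample)
    context = Counter(cx for cx, _ in sample)
    return Fraction(joint[(x, y)], context[x])


print(satisfies_constraint(f, uniform_model, sample, outputs))          # False
print(satisfies_constraint(f, empirical_conditional, sample, outputs))  # True
```

The uniform model assigns the feature an expected value of 1/5 against 2/5 in the sample, so it would be eliminated; the second model reproduces the sample value and survives. Exact fractions are used so that the equality test is not affected by floating-point rounding.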
To sum up so far, we now have a means of representing statistical phenomena
inherent in a sample of data (namely, $\tilde{p}(f)$), and also a means of
requiring that our model of the process exhibit these phenomena (namely,
$p(f) = \tilde{p}(f)$).
One final note about features and constraints bears repeating: though the words
``feature'' and ``constraint'' are often used interchangeably in discussions of
maximum entropy, we will be vigilant to distinguish the two and urge the reader
to do likewise: a feature is a binary-valued function of $(x, y)$; a constraint
is an equation between the expected value of the feature function in the model
and its expected value in the training data.