Suppose that we are given n feature functions , which determine statistics we feel are important in modeling the process. We would like our model to accord with these statistics. That is, we would like to lie in the subset of defined by
Figure 1 provides a geometric interpretation of this setup. Here is the space of all (unconditional) probability distributions on 3 points, sometimes called a simplex. If we impose no constraints (depicted in (a)), then all probability models are allowable. Imposing one linear constraint restricts us to those which lie on the region defined by , as shown in (b). A second linear constraint could determine exactly, if the two constraints are satisfiable; this is the case in (c), where the intersection of and is non-empty. Alternatively, a second linear constraint could be inconsistent with the first--for instance, the first might require that the probability of the first point is and the second that the probability of the third point is --this is shown in (d). In the present setting, however, the linear constraints are extracted from the training sample and cannot, by construction, be inconsistent. Furthermore, the linear constraints in our applications will not even come close to determining uniquely as they do in (c); instead, the set of allowable models will be infinite.
Figure 1: Four different scenarios in constrained optimization. represents the space of all probability distributions. In (a), no constraints are applied, and all are allowable. In (b), the constraint narrows the set of allowable models to those which lie on the line defined by the linear constraint. In (c), two consistent constraints and define a single model . In (d), the two constraints are inconsistent (i.e. ); no can satisfy them both.
Among the models , the maximum entropy philosophy dictates that we select the distribution which is most uniform. But now we face a question left open earlier: what does ``uniform'' mean?
A mathematical measure of the uniformity of a conditional distribution is provided by the conditional entropy
The entropy is bounded from below by zero, the entropy of a model with no uncertainty at all, and from above by , the entropy of the uniform distribution over all possible values of y. With this definition in hand, we are ready to present the principle of maximum entropy.
To select a model from a set of allowed probability distributions, choose the model with maximum entropy :
It can be shown that is always well-defined; that is, there is always a unique model with maximum entropy in any constrained set .