We introduce the concept of maximum entropy through a simple example. Suppose
we wish to model an expert translator's decisions concerning the proper French
rendering of the English word *in*. Our model *p* of the expert's
decisions assigns to each French word or phrase *f* an estimate, *p*(*f*), of
the probability that the expert would choose *f* as a translation of *in*.
To guide us in developing *p*, we collect a large sample of instances of the
expert's decisions. Our goal is to extract a set of facts about the
decision-making process from the sample (the first task of modeling) that will
aid us in constructing a model of this process (the second task).

One obvious clue we might glean from the sample is the list of allowed
translations. For example, we might discover that the expert translator always
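chooses among the following five French phrases: {*dans*, *en*,
*à*, *au cours de*, *pendant*}. With this information in hand, we
can impose our first constraint on our model *p*:

$$
p(\textit{dans}) + p(\textit{en}) + p(\textit{à}) + p(\textit{au cours de}) + p(\textit{pendant}) = 1
$$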

This equation represents our first statistic of the process; we can now proceed
to search for a suitable model which obeys this equation. Of course, there are
an infinite number of models for which this identity holds. One model
which satisfies the above equation is *p*(*dans*) = 1; in other words, the model
always predicts *dans*. Another model which obeys this constraint predicts
*pendant* with a probability of 1/2, and *à* with a probability
of 1/2. But both of these models offend our sensibilities: knowing only that
the expert always chose from among these five French phrases, how can we
justify either of these probability distributions? Each seems to be making
rather bold assumptions, with no empirical justification. Knowing only that the
expert chose exclusively from among these five French phrases, the
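most intuitively appealing model is

$$
\begin{aligned}
p(\textit{dans}) &= 1/5 \\
p(\textit{en}) &= 1/5 \\
p(\textit{à}) &= 1/5 \\
p(\textit{au cours de}) &= 1/5 \\
p(\textit{pendant}) &= 1/5
\end{aligned}
$$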

This model, which allocates the total probability evenly among the
five possible phrases, is the most uniform model subject to our knowledge. (It
is not, however, the most uniform overall; that model would grant an equal
probability to every *possible* French phrase.)
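
As a rough numerical illustration of why the even split looks least presumptuous, the sketch below scores the three candidate models with Shannon entropy. Using entropy as the yardstick is an assumption at this point in the text, anticipating the maximum entropy method named at the end of this section. The names `entropy`, `always_dans`, `half_half` and `uniform` are illustrative, and the second model is assumed to split its mass 1/2 and 1/2 between *pendant* and *à*.

```python
import math

PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

def entropy(p):
    """Shannon entropy in nats; zero-probability phrases contribute nothing."""
    return sum(-q * math.log(q) for q in p.values() if q > 0)

# Three models that all satisfy the first constraint (probabilities sum to 1).
always_dans = {f: (1.0 if f == "dans" else 0.0) for f in PHRASES}
half_half = {f: (0.5 if f in ("pendant", "à") else 0.0) for f in PHRASES}
uniform = {f: 1.0 / len(PHRASES) for f in PHRASES}

for name, model in [("always dans", always_dans),
                    ("pendant / à", half_half),
                    ("uniform", uniform)]:
    print(f"{name:12s} entropy = {entropy(model):.4f}")
```

Under this measure the model that always predicts *dans* scores zero, the two-phrase model scores ln 2, and the even split over five phrases scores ln 5, the largest value any distribution over five outcomes can reach.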

We might hope to glean more clues about the expert's decisions from our
sample. Suppose we notice that the expert chose either *dans* or *en*
30% of the time. We could apply this knowledge to update our model of the
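translation process by requiring that *p* satisfy two constraints:

$$
\begin{aligned}
p(\textit{dans}) + p(\textit{en}) &= 3/10 \\
p(\textit{dans}) + p(\textit{en}) + p(\textit{à}) + p(\textit{au cours de}) + p(\textit{pendant}) &= 1
\end{aligned}
$$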

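Once again there are many probability distributions consistent with these two constraints. In the absence of any other knowledge, a reasonable choice for *p* is again the most uniform, that is, the distribution which allocates its probability as evenly as possible, subject to the constraints:

$$
\begin{aligned}
p(\textit{dans}) &= 3/20 \\
p(\textit{en}) &= 3/20 \\
p(\textit{à}) &= 7/30 \\
p(\textit{au cours de}) &= 7/30 \\
p(\textit{pendant}) &= 7/30
\end{aligned}
$$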
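
The arithmetic behind this choice can be sketched in a few lines: the 3/10 of probability mass reserved for *dans* and *en* is shared equally between those two phrases, and the remaining 7/10 is shared equally among the other three. The helper `spread_evenly` is a hypothetical name introduced for illustration. It works here only because the two groups of phrases do not overlap.

```python
from fractions import Fraction

PHRASES = ["dans", "en", "à", "au cours de", "pendant"]

def spread_evenly(groups):
    """Give every phrase in a group an equal share of that group's mass.

    `groups` is a list of (phrases, total_mass) pairs with no phrase
    appearing in more than one group.
    """
    model = {}
    for members, mass in groups:
        for f in members:
            model[f] = mass / len(members)
    return model

constrained = ["dans", "en"]                      # together chosen 30% of the time
rest = [f for f in PHRASES if f not in constrained]
model = spread_evenly([(constrained, Fraction(3, 10)),
                       (rest, Fraction(7, 10))])

for f in PHRASES:
    print(f, model[f])
```

Exact fractions are used so that the result can be read off directly as 3/20 for each of *dans* and *en*, and 7/30 for each of the other three phrases.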

Say we inspect the data once more, and this time notice another interesting
fact: in half the cases, the expert chose either *dans* or *à*. We
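can incorporate this information into our model as a third constraint:

$$
\begin{aligned}
p(\textit{dans}) + p(\textit{en}) &= 3/10 \\
p(\textit{dans}) + p(\textit{en}) + p(\textit{à}) + p(\textit{au cours de}) + p(\textit{pendant}) &= 1 \\
p(\textit{dans}) + p(\textit{à}) &= 1/2
\end{aligned}
$$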

We can once again look for the most uniform *p* satisfying these constraints, but now the choice is not as obvious. As we have added complexity, we have encountered two problems. First, what exactly is meant by "uniform," and how can one measure the uniformity of a model? Second, having determined a suitable answer to these questions, how does one find the most uniform model subject to a set of constraints like those we have described?

The maximum entropy method answers both these questions. Intuitively, the principle is simple: model all that is known and assume nothing about that which is unknown. In other words, given a collection of facts, choose a model which is consistent with all the facts, but otherwise as uniform as possible. This is precisely the approach we took in selecting our model at each step in the above example.
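
To make the three-constraint case concrete, here is a minimal numerical sketch. It assumes that uniformity is measured by Shannon entropy, which this section names but does not define, and it is not presented as the method the text goes on to develop. With the three constraints (sum equal to 1, *dans* plus *en* equal to 3/10, *dans* plus *à* equal to 1/2), fixing *p*(*dans*) = *a* determines *p*(*en*) and *p*(*à*), and the leftover mass 0.2 + *a* is split evenly between the two phrases that no constraint distinguishes. A ternary search over *a* then finds the entropy maximum, since entropy is concave. The names `model_for` and `most_uniform` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability phrases contribute nothing."""
    return sum(-q * math.log(q) for q in p.values() if q > 0)

def model_for(a):
    """The distribution meeting all three constraints with p(dans) = a."""
    tail = (0.2 + a) / 2  # leftover mass, shared by the two unconstrained phrases
    return {"dans": a, "en": 0.3 - a, "à": 0.5 - a,
            "au cours de": tail, "pendant": tail}

def most_uniform(lo=0.0, hi=0.3, iterations=200):
    """Ternary search for the value of a that maximizes entropy on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if entropy(model_for(m1)) < entropy(model_for(m2)):
            lo = m1   # the maximum lies to the right of m1
        else:
            hi = m2   # the maximum lies to the left of m2
    return model_for((lo + hi) / 2)

best = most_uniform()
for phrase, prob in best.items():
    print(f"{phrase:12s} {prob:.4f}")
```

Under these assumptions the search settles near 0.186 for *dans*, 0.114 for *en*, 0.314 for *à*, and 0.193 for each of *au cours de* and *pendant*. These values are much less obvious than the simple fractions of the earlier steps, which is why a systematic method is needed.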

Fri Jul 5 11:43:50 EDT 1996