This page contains pedagogically oriented material on maximum entropy and exponential models. The emphasis is on modelling the discrete-valued stochastic processes which arise in human language applications, such as language modelling.
All links point to postscript files unless otherwise indicated.
This is a high-level tutorial on how to use MaxEnt for modelling discrete stochastic processes. The motivating example is the task of determining the most appropriate translation of a French word in context. The tutorial discusses the process of growing an exponential model by automatic feature selection ("inductive learning," if you will) and also the task of estimating maximum-likelihood parameters for a model containing a fixed set of features.
Convexity, Maximum Likelihood and All That
This note is meant as a gentle but comprehensive introduction to the topics of its title: convexity and maximum-likelihood estimation for exponential models.
A gentle introduction to iterative scaling
This note concerns the improved iterative scaling algorithm for computing
maximum-likelihood estimates of the parameters of exponential models. The
algorithm was invented by members of the machine translation group (myself not
included) at IBM Watson Research in the early 1990s. The point of this document
is twofold: first, to motivate the improved iterative scaling algorithm for
conditional models, and second, to do so in a way that minimizes the
mathematical burden on the reader.
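As a concrete companion to the note, here is a sketch of generalized iterative scaling (GIS), the older relative of the improved algorithm, fit to a tiny conditional model. The data, features, and constant C below are all invented for illustration; the improved algorithm's contribution is precisely to remove GIS's requirement that the feature sum be constant over all (x, y).

```python
import math

# A simplified generalized iterative scaling (GIS) sketch for a toy
# conditional maxent model. The data, features, and constant C are all
# invented for illustration.

data = [("rain", "umbrella"), ("rain", "umbrella"),
        ("sun", "hat"), ("sun", "umbrella")]
Y = sorted({y for _, y in data})

# Binary features f_i(x, y), plus a "slack" feature that pads the feature
# sum up to the constant C, as GIS requires.
features = [
    lambda x, y: 1.0 if (x, y) == ("rain", "umbrella") else 0.0,
    lambda x, y: 1.0 if (x, y) == ("sun", "hat") else 0.0,
]
C = 2.0
features.append(lambda x, y: C - sum(f(x, y) for f in features[:2]))

def p_cond(w, x):
    scores = [math.exp(sum(wi * f(x, y) for wi, f in zip(w, features)))
              for y in Y]
    Z = sum(scores)
    return [s / Z for s in scores]

n = len(data)
emp = [sum(f(x, y) for x, y in data) / n for f in features]

w = [0.0] * len(features)
for _ in range(200):
    mod = [0.0] * len(features)
    for x, _ in data:                      # contexts weighted by count
        probs = p_cond(w, x)
        for i, f in enumerate(features):
            mod[i] += sum(pr * f(x, y) for pr, y in zip(probs, Y)) / n
    # GIS update: w_i += (1/C) * log(empirical / model expectation).
    w = [wi + math.log(e / m) / C for wi, e, m in zip(w, emp, mod)]

print(p_cond(w, "rain"))   # P(umbrella | rain) approaches 1.0
```

Each update nudges every weight so that the model's expectation of the corresponding feature moves toward its empirical value; at convergence the two agree, which is exactly the maximum-likelihood condition for exponential models.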
Notes from Eric Ristad's MaxEnt tutorial at ACL'97
This tutorial explains how to build maximum entropy models for natural
language applications such as information retrieval and speech
recognition. We review the maximum entropy framework, explore the art
of effective feature design, and show how to implement models using
the instructor's publicly available Maximum Entropy Modeling Toolkit.
In "A maximum entropy approach to natural language processing"
(Computational Linguistics 22:1, March 1996), the appendix describes an
approach to computing the gain of a single feature f. This note elaborates on
the equations presented there; in particular, we show how to derive equations
(37) and (38).
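The quantity being derived there, the gain of a candidate feature f, can be illustrated numerically: form the one-parameter family q_alpha(y|x) proportional to q(y|x) exp(alpha f(x, y)) and measure how much the best choice of alpha raises the training log-likelihood. The toy model, data, and grid search below are invented for illustration and are not the paper's closed-form computation.

```python
import math

# Numeric illustration of single-feature gain: how much the training
# log-likelihood improves when a candidate feature f is added to a model q
# with its weight alpha tuned. Model, data, and grid search are invented.

Y = [0, 1]
data = [("a", 0), ("a", 0), ("a", 1), ("b", 0), ("b", 1)]

def q(y, x):                    # current model: uniform over labels
    return 0.5

def f(x, y):                    # candidate feature
    return 1.0 if (x, y) == ("a", 0) else 0.0

def q_alpha(y, x, alpha):
    # One-parameter family q_alpha(y|x) ~ q(y|x) * exp(alpha * f(x, y)).
    Z = sum(q(yy, x) * math.exp(alpha * f(x, yy)) for yy in Y)
    return q(y, x) * math.exp(alpha * f(x, y)) / Z

def loglik(alpha):
    return sum(math.log(q_alpha(y, x, alpha)) for x, y in data) / len(data)

# Gain G(f) = max_alpha L(q_alpha) - L(q), found by coarse grid search.
best = max((i / 100.0 for i in range(1001)), key=loglik)
gain = loglik(best) - loglik(0.0)
print(best, round(gain, 4))
```

At the optimal alpha the empirical expectation of f equals its expectation under q_alpha: here the empirical value is 2/5, the model value is (3/5) * sigmoid(alpha), so the maximizer sits at sigmoid(alpha) = 2/3, i.e. alpha = ln 2, which the grid search recovers.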
This link is to the Maximum Entropy Modeling Toolkit, which performs parameter estimation and prediction for maximum entropy models in discrete domains. The software comes with documentation, and was used as the basis of the 1996 Johns Hopkins workshop on language modelling.
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra
Computational Linguistics 22:1, March 1996.
The concept of maximum entropy can be traced back along multiple
threads to Biblical times. Only recently, however, have computers become
powerful enough to permit the widescale application of this concept to real
world problems in statistical estimation and pattern recognition. In this
paper we describe a method for statistical modeling based on maximum entropy.
We present a maximum-likelihood approach for automatically constructing maximum
entropy models and describe how to implement this approach efficiently, using
as examples several problems in natural language processing.
Inducing features of random fields
Stephen Della Pietra, Vincent Della Pietra, and John Lafferty
IEEE Transactions on Pattern Analysis and Machine Intelligence 19:4, pp.380--393, April, 1997.
We present a technique for constructing random fields
from a set of training samples. The learning paradigm builds
increasingly complex fields by allowing potential functions, or
features, that are supported by increasingly large subgraphs. Each
feature has a weight that is trained by minimizing the
Kullback-Leibler divergence between the model and the empirical
distribution of the training data. A greedy algorithm determines how
features are incrementally added to the field and an iterative scaling
algorithm is used to estimate the optimal values of the weights.
The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classification in natural language processing.
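The induction loop sketched in the abstract can be caricatured in a few lines: maintain a set of active features, and repeatedly add the candidate that most increases the training log-likelihood (equivalently, most shrinks the KL divergence from the empirical distribution) after refitting the weights. Everything below (the state space, the candidate features, and plain gradient ascent in place of iterative scaling) is invented for illustration.

```python
import itertools
import math

# Caricature of greedy feature induction for an exponential model
# p(x) ~ exp(sum_i w_i f_i(x)) over a tiny state space. States, candidate
# features, and the gradient-ascent fitting are invented for illustration.

states = list(itertools.product([0, 1], repeat=3))   # all 3-bit strings
sample = [(0, 0, 0)] * 3 + [(1, 1, 0)] * 3 + [(0, 1, 1)] * 2
emp = {s: sample.count(s) / len(sample) for s in states}

# Candidate features: indicator "bit i equals bit j".
candidates = [(i, j) for i in range(3) for j in range(i + 1, 3)]
feat = lambda ij, s: 1.0 if s[ij[0]] == s[ij[1]] else 0.0

def model(active, w):
    scores = {s: math.exp(sum(wi * feat(ij, s)
                              for wi, ij in zip(w, active))) for s in states}
    Z = sum(scores.values())
    return {s: sc / Z for s, sc in scores.items()}

def fit(active, iters=500, lr=0.5):
    # Maximize likelihood by ascending its gradient: empirical minus
    # model expectation of each active feature.
    w = [0.0] * len(active)
    for _ in range(iters):
        p = model(active, w)
        for k, ij in enumerate(active):
            grad = (sum(emp[s] * feat(ij, s) for s in states)
                    - sum(p[s] * feat(ij, s) for s in states))
            w[k] += lr * grad
    return w

def loglik(active, w):
    p = model(active, w)
    return sum(emp[s] * math.log(p[s]) for s in states)

# Greedy loop: add the candidate whose inclusion, after refitting,
# raises the log-likelihood most (i.e. most shrinks KL(emp || model)).
active, w = [], []
for _ in range(2):
    gain = lambda c: loglik(active + [c], fit(active + [c]))
    active.append(max((c for c in candidates if c not in active), key=gain))
    w = fit(active)
    print(active[-1], round(loglik(active, w), 3))
```

Note that this toy refits all weights for every candidate; the paper instead scores candidates by an approximate gain that holds the existing weights fixed, which is what makes induction tractable on large fields.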
A maximum entropy approach to adaptive statistical language modelling
Computer, Speech and Language 10:187--228, 1996.
An adaptive statistical language model is described, which
successfully integrates long distance linguistic information with other
knowledge sources. Most existing statistical language models exploit only the
immediate history of a text. To extract information from further back in the
document's history, we propose and use trigger pairs as the basic information
bearing elements. This allows the model to adapt its expectations to the topic
of discourse. Next, statistical evidence from multiple sources must be
combined. Traditionally, linear interpolation and its variants have been used,
but these are shown here to be seriously deficient. Instead, we apply the
principle of Maximum Entropy (ME). Each information source gives rise to a set
of constraints, to be imposed on the combined estimate. The intersection of
these constraints is the set of probability functions which are consistent with
all the information sources. The function with the highest entropy within that
set is the ME solution. Given consistent statistical evidence, a unique ME
solution is guaranteed to exist, and an iterative algorithm exists which is
guaranteed to converge to it. The ME framework is extremely general: any
phenomenon that can be described in terms of statistics of the text can be
readily incorporated. An adaptive language model based on the ME approach was
trained on the Wall Street Journal corpus, and showed 32%--39% perplexity
reduction over the baseline. When interfaced to SPHINX-II, Carnegie Mellon's
speech recognizer, it reduced the recognizer's error rate by 10%--14%. This illustrates
the feasibility of incorporating many diverse knowledge sources in a single,
unified statistical framework.
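To make the trigger-pair idea concrete: a pair (A -> B) contributes a binary feature that fires when A has occurred earlier in the document and the next word is B, and the ME model is constrained to match that feature's empirical expectation. The tiny corpus and the particular pair below are invented for illustration.

```python
# Sketch of a trigger-pair feature and the constraint it contributes to a
# maximum entropy language model. Corpus and trigger pair are invented.

corpus = ("the stock market fell as stock prices slid "
          "and the market closed").split()

def trigger_feature(history, next_word, a="stock", b="market"):
    # f(h, w) = 1 if trigger word A occurred in the history and the
    # predicted next word is B.
    return 1.0 if a in history and next_word == b else 0.0

# Empirical expectation of the feature over the corpus: this is the value
# the maximum entropy model is constrained to reproduce.
events = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]
emp_expectation = sum(trigger_feature(h, w) for h, w in events) / len(events)
print(emp_expectation)
```

In the full model this expectation becomes one constraint among many; the ME solution is the highest-entropy distribution satisfying all of them simultaneously.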