This page contains pedagogically oriented material on maximum entropy and exponential models. The emphasis is on modelling discrete-valued stochastic processes that arise in human language applications, such as language modelling.

All links point to postscript files unless otherwise indicated.

This is a high-level tutorial on how to use MaxEnt for modelling discrete stochastic processes. The motivating example is the task of determining the most appropriate translation of a French word in context. The tutorial discusses the process of growing an exponential model by automatic feature selection ("inductive learning," if you will) and also the task of estimating maximum-likelihood parameters for a model containing a fixed set of features.

Convexity, Maximum Likelihood and All That

This note is meant as a gentle but comprehensive introduction to the expectation-maximization (EM) and improved iterative scaling (IIS) algorithms, two popular techniques in maximum likelihood estimation. The focus of this tutorial is the foundation common to the two algorithms: convex functions and their convenient properties. Where examples are called for, we draw from applications in human language technology.

A gentle introduction to iterative scaling

This note concerns the improved iterative scaling algorithm for computing maximum-likelihood estimates of the parameters of exponential models. The algorithm was invented by members of the machine translation group (myself not included) at IBM Watson Research in the early 1990s. The point of this document is twofold: first, to motivate the improved iterative scaling algorithm for conditional models, and second, to do so in a way that minimizes the mathematical burden on the reader.

Notes from Eric Ristad's MaxEnt tutorial at ACL'97

This tutorial explains how to build maximum entropy models for natural language applications such as information retrieval and speech recognition. We review the maximum entropy framework, explore the art of effective feature design, and show how to implement models using the instructor's publicly available Maximum Entropy Modeling Toolkit.
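As a concrete illustration of the iterative scaling idea in the notes above, here is a minimal sketch of generalized iterative scaling (GIS), a close relative of IIS, applied to a toy word-translation problem. The outcomes, features, and empirical distribution are all invented for illustration.

```python
import math

# Outcomes: four candidate translations of a French word (invented toy data).
outcomes = ["in", "on", "at", "during"]
features = [
    lambda x: 1.0 if x in ("in", "on") else 0.0,   # hypothetical feature f1
    lambda x: 1.0 if x in ("in", "at") else 0.0,   # hypothetical feature f2
]
empirical = {"in": 0.5, "on": 0.2, "at": 0.2, "during": 0.1}

# GIS requires a constant total feature count C per outcome, so we add a
# slack feature that tops every outcome up to C.
C = 2.0
all_feats = features + [lambda x: C - sum(f(x) for f in features)]
lam = [0.0] * len(all_feats)

def model():
    # Exponential model: p(x) proportional to exp(sum_i lam_i * f_i(x)).
    w = {x: math.exp(sum(l * f(x) for l, f in zip(lam, all_feats)))
         for x in outcomes}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def expect(p, f):
    return sum(p[x] * f(x) for x in outcomes)

for _ in range(200):
    p = model()
    for i, f in enumerate(all_feats):
        # Move each weight by (1/C) * log(empirical / model expectation).
        lam[i] += math.log(expect(empirical, f) / expect(p, f)) / C

p = model()  # expectations now match: E_p[f1] = E_p[f2] = 0.7
```

The updates drive the model's feature expectations toward the empirical ones, converging to the unique maximum entropy distribution satisfying the constraints; here p("in") converges to 0.49.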

In ``A maximum entropy approach to natural language processing'' (Computational Linguistics 22:1, March 1996), the appendix describes an approach to computing the gain of a single feature f. This note elaborates on the equations presented there; in particular, we show how to derive equations (37) and (38).
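For intuition about what such a gain computation looks like, here is a toy sketch (not the paper's actual derivation). For an unconditional model and a single binary feature, one can check that the best log-likelihood improvement obtainable by tuning only that feature's weight is the Kullback-Leibler divergence between Bernoulli(q) and Bernoulli(r), where q and r are the empirical and model feature expectations, and the optimal weight increment is log(q(1-r)/(r(1-q))). All names and numbers below are invented.

```python
import math

def bernoulli_kl(q, r):
    """KL divergence between Bernoulli(q) and Bernoulli(r)."""
    return q * math.log(q / r) + (1 - q) * math.log((1 - q) / (1 - r))

# Toy setup: current model is uniform over four outcomes; candidate binary
# features are scored by the gain from tuning only their own weight.
outcomes = ["a", "b", "c", "d"]
model = {x: 0.25 for x in outcomes}
empirical = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
candidates = {
    "f_ab": lambda x: 1.0 if x in ("a", "b") else 0.0,  # invented feature
    "f_ac": lambda x: 1.0 if x in ("a", "c") else 0.0,  # invented feature
}

def expect(p, f):
    return sum(p[x] * f(x) for x in outcomes)

gains = {name: bernoulli_kl(expect(empirical, f), expect(model, f))
         for name, f in candidates.items()}
best = max(gains, key=gains.get)  # "f_ab": expectation 0.7 vs 0.5 wins
```

This closed form makes ranking a large pool of candidate binary features cheap: each gain needs only the two expectations, not a full refit of the model.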

This link is to the Maximum Entropy Modeling Toolkit, for parameter estimation and prediction for maximum entropy models in discrete domains. The software comes with documentation, and was used as the basis of the 1996 Johns Hopkins workshop on language modelling.

A maximum entropy approach to natural language processing

Adam Berger, Stephen Della Pietra, and Vincent Della Pietra

*Computational Linguistics* 22:1, March 1996.

The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.

Inducing features of random fields

Stephen Della Pietra, Vincent Della Pietra, and John Lafferty

*IEEE Transactions on Pattern Analysis and Machine Intelligence* 19:4, pp. 380--393, April 1997.

We present a technique for constructing random fields from a set of training samples. The learning paradigm builds increasingly complex fields by allowing potential functions, or features, that are supported by increasingly large subgraphs. Each feature has a weight that is trained by minimizing the Kullback-Leibler divergence between the model and the empirical distribution of the training data. A greedy algorithm determines how features are incrementally added to the field and an iterative scaling algorithm is used to estimate the optimal values of the weights. The random field models and techniques introduced in this paper differ from those common to much of the computer vision literature in that the underlying random fields are non-Markovian and have a large number of parameters that must be estimated. Relations to other learning approaches, including decision trees, are given. As a demonstration of the method, we describe its application to the problem of automatic word classification in natural language processing.
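The induce-and-fit loop in the abstract can be sketched in miniature: greedily add the binary feature with the largest closed-form gain, then re-estimate the active weights by iterative scaling. Real random fields operate over subgraphs of graphical structures; this toy, with invented outcomes and features, works over a four-outcome distribution.

```python
import math

outcomes = ["a", "b", "c", "d"]
empirical = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
candidates = {  # hypothetical candidate features
    "f_ab": lambda x: 1.0 if x in ("a", "b") else 0.0,
    "f_a":  lambda x: 1.0 if x == "a" else 0.0,
    "f_d":  lambda x: 1.0 if x == "d" else 0.0,
}

def model_dist(lam):
    w = {x: math.exp(sum(v * candidates[n](x) for n, v in lam.items()))
         for x in outcomes}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

def expect(p, f):
    return sum(p[x] * f(x) for x in outcomes)

def gain(p, f):
    # Closed-form gain for one binary feature: Bernoulli KL divergence
    # between empirical and model expectations.
    q, r = expect(empirical, f), expect(p, f)
    return q * math.log(q / r) + (1 - q) * math.log((1 - q) / (1 - r))

def fit(lam, sweeps=300):
    # Iterative scaling over the active features: each update matches one
    # binary feature's expectation exactly, holding the others fixed.
    for _ in range(sweeps):
        for n in lam:
            p = model_dist(lam)
            q, r = expect(empirical, candidates[n]), expect(p, candidates[n])
            lam[n] += math.log(q * (1 - r) / (r * (1 - q)))
    return lam

lam = {}
for _ in range(2):  # two rounds of greedy induction
    p = model_dist(lam)
    best = max((n for n in candidates if n not in lam),
               key=lambda n: gain(p, candidates[n]))
    lam[best] = 0.0
    fit(lam)
```

On this toy data the first round selects "f_ab" and the second "f_d"; the fitted model then reproduces both constrained expectations while staying maximally uniform elsewhere.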

A maximum entropy approach to adaptive statistical language modelling

Roni Rosenfeld

*Computer Speech and Language* 10:187--228, 1996.

An adaptive statistical language model is described, which successfully integrates long-distance linguistic information with other knowledge sources. Most existing statistical language models exploit only the immediate history of a text. To extract information from further back in the document's history, we propose and use trigger pairs as the basic information-bearing elements. This allows the model to adapt its expectations to the topic of discourse. Next, statistical evidence from multiple sources must be combined. Traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. Instead, we apply the principle of Maximum Entropy (ME). Each information source gives rise to a set of constraints, to be imposed on the combined estimate. The intersection of these constraints is the set of probability functions which are consistent with all the information sources. The function with the highest entropy within that set is the ME solution. Given consistent statistical evidence, a unique ME solution is guaranteed to exist, and an iterative algorithm exists which is guaranteed to converge to it. The ME framework is extremely general: any phenomenon that can be described in terms of statistics of the text can be readily incorporated. An adaptive language model based on the ME approach was trained on the Wall Street Journal corpus, and showed 32%--39% perplexity reduction over the baseline. When interfaced to SPHINX-II, Carnegie Mellon's speech recognizer, it reduced the recognizer's error rate by 10%--14%. This illustrates the feasibility of incorporating many diverse knowledge sources in a single, unified statistical framework.
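To make the trigger-pair idea concrete, here is a small sketch that ranks candidate pairs (A, B) by a simplified association score: the ratio P(next word is B | A occurred earlier in the document) / P(next word is B). Rosenfeld's actual selection criterion is average mutual information; this score and the tiny corpus below are illustrative inventions.

```python
# Toy trigger-pair scoring. The three "documents" are invented; a real
# system would use a large corpus and an average mutual information score.
docs = [
    "the stock market fell and the stock price dropped".split(),
    "the stock index rose as the market rallied".split(),
    "the weather today is sunny and the weather tomorrow is rain".split(),
]

def trigger_score(a, b):
    """P(word == b | a seen earlier in the document) / P(word == b)."""
    hits = ctx = b_total = total = 0
    for doc in docs:
        seen = set()
        for w in doc:
            if a in seen:          # positions where the trigger a has fired
                ctx += 1
                if w == b:
                    hits += 1
            if w == b:
                b_total += 1
            total += 1
            seen.add(w)
    p_b = b_total / total
    p_b_given_a = hits / ctx if ctx else 0.0
    return p_b_given_a / p_b if p_b else 0.0

# "stock" raises the probability of "price" (score > 1) but carries no
# information about "weather" (score 0).
```

A score above 1 means the trigger raises the probability of the target; pairs selected this way become ME constraints of the form "when A has occurred in the history, the expected probability of B is such-and-such".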