Language Modelling

Overview

A language model is a conditional distribution on the identity of the i-th word in a sequence, given the identities of all previous words. A trigram model treats language as a second-order Markov process, making the computationally convenient approximation that a word depends only on the previous two words. By restricting the conditioning information to the previous two words, the trigram model makes the simplifying assumption---clearly false---that the use of language one finds in television, radio, and newspapers can be modeled by a second-order Markov process. Although words more than two positions back certainly bear on the identity of the next word, higher-order models are impractical: the number of parameters in an n-gram model grows exponentially with n, and computing and storing all of them becomes daunting for n > 3.
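As a concrete illustration, the parameters of a trigram model can be estimated by maximum likelihood directly from counts. The sketch below is a minimal illustration (the corpus, padding symbols, and function name are my own, not taken from any particular system), with no smoothing of unseen trigrams; note also where the parameter blow-up comes from, since a vocabulary of V words yields on the order of V^3 conditional probabilities.

    from collections import defaultdict

    def train_trigram(corpus):
        """Maximum-likelihood trigram probabilities p(w | w2, w1) from a
        list of sentences, each given as a list of words."""
        tri_counts = defaultdict(int)
        bi_counts = defaultdict(int)
        for sentence in corpus:
            # Pad so the first real word is conditioned on two boundary symbols.
            words = ["<s>", "<s>"] + sentence + ["</s>"]
            for i in range(2, len(words)):
                history = (words[i - 2], words[i - 1])
                tri_counts[history + (words[i],)] += 1
                bi_counts[history] += 1
        return {tri: c / bi_counts[tri[:2]] for tri, c in tri_counts.items()}

    model = train_trigram([["the", "cat", "sat"], ["the", "cat", "slept"]])
    print(model[("the", "cat", "sat")])   # 0.5: "sat" follows "the cat" in half the training cases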


Relevant Publications

Recognition performance of a large-scale dependency-grammar language model
A. Berger, H. Printz
International Conference on Spoken Language Processing, Sydney, Australia (1998)

In this paper we report on a large-scale investigation of dependency grammar language models. Our work includes several significant departures from earlier studies, notably a much larger training corpus, improved model structure, different feature types, new feature selection methods, and more coherent training and test data. We report the effect of this model, applied in a rescoring paradigm, on the word error rate (WER) of the IBM speech recognition system.

A Comparison of Criteria for Maximum Entropy/Minimum Divergence Language Modelling
A. Berger, H. Printz
Third Conference on Empirical Methods in Natural Language Processing. Granada, Spain (1998)

In this paper we study the gain, a naturally arising statistic from the theory of MEMD modeling, as a figure of merit for selecting features for an MEMD language model. We compare the gain with two popular alternatives---empirical activation and mutual information---and argue that the gain is the preferred statistic, on the grounds that it directly measures a feature's contribution to improving upon the base model.
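For reference, the quantity at issue can be sketched as follows; the notation is mine and may differ in detail from the paper's. Given a base model q and a candidate feature f, the gain of f is the largest improvement in training-data log-likelihood obtainable by adding f alone:

    G(f) = \max_{\alpha} \Big[ L(q_{\alpha f}) - L(q) \Big],
    \qquad
    q_{\alpha f}(w \mid h) = \frac{q(w \mid h)\, e^{\alpha f(h, w)}}{Z_{\alpha}(h)},

where L denotes the log-likelihood of the training data and Z_{\alpha}(h) normalizes over the vocabulary. Features with large gain are those whose single-parameter extension most improves upon the base model, which is the sense in which the gain directly measures a feature's contribution.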

Just in Time Language Modelling
A. Berger, R. Miller
IEEE Conference on Acoustics, Speech and Signal Processing. Seattle, WA (1998)

Traditional approaches to language modelling have relied on a fixed corpus of text to inform the parameters of a probability distribution over word sequences. Increasing the corpus size often leads to better-performing language models, but no matter how large, the corpus is a static entity, unable to reflect information about events which postdate it. In these pages we introduce an online paradigm which interleaves the estimation and application of a language model. We present a Bayesian approach to online language modelling, in which the marginal probabilities of a static trigram model are dynamically updated to match the topic being dictated to the system. We also describe the architecture of a prototype we have implemented which uses the World Wide Web (WWW) as a source of information, and provide results from some initial proof of concept experiments.
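One simple way to picture the dynamic adjustment of marginals is unigram rescaling: tilt each static trigram probability toward topical unigram frequencies gathered at query time (here imagined to come from retrieved WWW documents) and renormalize. The sketch below illustrates that idea only; the function and variable names are mine, and the paper's actual Bayesian update need not take this form.

    from collections import defaultdict

    def adapt_trigram(static_trigram, topical_unigram, background_unigram, beta=0.5):
        """Rescale a static trigram model toward topical unigram statistics.

        static_trigram:     dict (w2, w1, w) -> p(w | w2, w1)
        topical_unigram:    dict w -> relative frequency in recently retrieved text
        background_unigram: dict w -> relative frequency in the static corpus
        beta:               strength of the adjustment (0 means no change)
        """
        adapted = {}
        for (w2, w1, w), p in static_trigram.items():
            ratio = topical_unigram.get(w, 1e-9) / background_unigram.get(w, 1e-9)
            adapted[(w2, w1, w)] = p * ratio ** beta
        # Renormalize each history so the probabilities sum to one again.
        totals = defaultdict(float)
        for (w2, w1, w), p in adapted.items():
            totals[(w2, w1)] += p
        return {key: p / totals[key[:2]] for key, p in adapted.items()}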

Cyberpunc: A lightweight punctuation annotation system for speech
D. Beeferman, A. Berger, J. Lafferty
IEEE Conference on Acoustics, Speech and Signal Processing. Seattle, WA (1998)

This paper describes a lightweight method for the automatic insertion of intra-sentence punctuation into text. Despite the intuition that pauses in an acoustic stream are a positive indicator for some types of punctuation, this work will demonstrate the feasibility of a system which relies solely on lexical information. Besides its potential role in a speech recognition system, such a system could serve equally well in non-speech applications such as automatic grammar correction in a word processor and parsing of spoken text. After describing the design of a punctuation-restoration system, which relies on a trigram language model and a straightforward application of the Viterbi algorithm, we summarize results, both quantitative and subjective, of the performance and behavior of a prototype system.
A demo of the system is available.
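To make the combination of a trigram model and the Viterbi algorithm concrete, the sketch below restores punctuation by dynamic programming over a lattice of hypotheses: after each word it either emits a punctuation mark or does not, keeping the best-scoring path for each language-model state. The punctuation inventory, state handling, and scoring are deliberate simplifications of my own, not the Cyberpunc design.

    PUNCT = [None, ","]   # illustrative inventory: no mark, or a comma

    def restore_punctuation(words, lm_logprob):
        """Insert intra-sentence punctuation by Viterbi search.

        words:      list of spoken words, e.g. ["yes", "however", "we", "agree"]
        lm_logprob: function (w2, w1, w) -> log p(w | w2, w1) from a trigram LM
        """
        # A state is the last two emitted tokens; each beam entry maps a
        # state to the best (score, emitted token sequence) reaching it.
        beams = {("<s>", "<s>"): (0.0, [])}
        for w in words:
            new_beams = {}
            for (w2, w1), (score, out) in beams.items():
                base = score + lm_logprob(w2, w1, w)
                # Option 1: emit the word only.
                candidates = [(base, out + [w], (w1, w))]
                # Option 2: emit the word followed by a punctuation mark.
                for p in PUNCT[1:]:
                    candidates.append((base + lm_logprob(w1, w, p), out + [w, p], (w, p)))
                for s, toks, state in candidates:
                    if state not in new_beams or s > new_beams[state][0]:
                        new_beams[state] = (s, toks)
            beams = new_beams
        return max(beams.values(), key=lambda entry: entry[0])[1]

Because every path through the lattice is scored by the same trigram model that scores ordinary word sequences, no acoustic or prosodic information is required, which is the sense in which such a system relies solely on lexical information.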

Text segmentation using exponential models
D. Beeferman, A. Berger, J. Lafferty
Second Conference on Empirical Methods in Natural Language Processing. Providence, RI. (1997)

This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts.
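The error metric proposed in this paper is the one now commonly referred to as P_k. A sketch of the idea, under my reading of it: slide a probe of width k across the text and count how often the reference and the hypothesis disagree about whether the two ends of the probe lie in the same segment. The labeling scheme and default choice of k below are illustrative.

    def pk(reference, hypothesis, k=None):
        """Probabilistic segmentation error (a sketch of the P_k form).

        reference, hypothesis: segment labels, one per unit; e.g.
            [0, 0, 0, 1, 1, 2] puts units 0-2, 3-4, and 5 in separate segments.
        k: probe width, conventionally about half the mean reference segment length.
        """
        n = len(reference)
        if k is None:
            k = max(1, round(n / (2 * len(set(reference)))))
        windows = n - k
        disagreements = 0
        for i in range(windows):
            same_ref = reference[i] == reference[i + k]
            same_hyp = hypothesis[i] == hypothesis[i + k]
            if same_ref != same_hyp:
                disagreements += 1
        return disagreements / windows if windows > 0 else 0.0

Unlike precision and recall on exact boundary placement, a probe-based measure of this kind gives partial credit to near-miss boundaries and penalizes both over- and under-segmentation.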

A Model of Lexical Attraction and Repulsion
D. Beeferman, A. Berger, J. Lafferty
ACL-EACL'97 Joint Conference, Madrid, Spain (1997)

This paper introduces new techniques based on exponential families for modeling the correlations between words in text and speech. The motivation for this work is to build improved statistical language models by treating a static trigram model as a default distribution, and adding sufficient statistics, or ``features,'' to a family of conditional exponential distributions in order to model the nonstationary characteristics of language. We focus on features based on pairs of mutually informative words which allow the trigram model to adapt to recent context. While previous work assumed the effects of these word pairs to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. In particular, empirical samples drawn from both written text and conversational speech reveal that the ``attraction'' between words decays exponentially, while stylistic and syntactic constraints create a ``lexical exclusion'' effect that discourages close co-occurrence. We show that these characteristics are well described by mixture models based on two-stage exponential distributions. These models are a common tool in queueing theory, but they have not previously found use in speech and language processing. We show how the EM algorithm can be used to estimate the parameters of these models, which can then be incorporated as penalizing features in the posterior distribution for predicting the next word. Experimental results illustrate the benefit these techniques yield when incorporated into a long-range language model.
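In symbols (my notation), the family of models described above takes the static trigram model as the default distribution and tilts it by a weighted sum of feature functions:

    p(w \mid h) = \frac{1}{Z_{\Lambda}(h)}\, p_{\mathrm{tri}}(w \mid h)\,
                  \exp\!\Big( \sum_i \lambda_i f_i(h, w) \Big),

where h is the word history, each f_i(h, w) is a feature such as a trigger pair (the recent occurrence of a word s raising the probability of a related word t), and Z_{\Lambda}(h) normalizes over the vocabulary. The attraction and repulsion findings then amount to letting a trigger's influence depend on the distance d back to its most recent occurrence. For the two-stage exponential building block standard in queueing theory (the sum of two independent exponential stages with distinct rates \mu_1 and \mu_2), the density is

    f(d) = \frac{\mu_1 \mu_2}{\mu_2 - \mu_1} \big( e^{-\mu_1 d} - e^{-\mu_2 d} \big),
    \qquad \mu_1 \neq \mu_2,

which rises from zero near d = 0 (consistent with an exclusion effect at very short range) and then decays exponentially (attraction falling off with distance). The paper's mixtures of such densities, with parameters fit by EM, may of course be parameterized differently.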