Abstracts: (All papers are in PostScript format unless otherwise indicated.)
 

Information retrieval as statistical translation

We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or ``translate'' a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well-motivated model of the document-to-query translation process, and describe an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents. As we show, one can view this approach as a generalization and justification of the ``language modeling'' strategy recently proposed by Ponte and Croft. In a series of experiments on TREC data, a simple translation-based retrieval system performs well in comparison to conventional retrieval techniques. This prototype system only begins to tap the full potential of translation-based retrieval.
Citation:
A. Berger, J. Lafferty. Information retrieval as statistical translation. Proceedings of ACM SIGIR'99. Berkeley, CA (to appear)

Online documents:
Conference paper
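
As a rough illustration of the scoring idea in the abstract above, the sketch below ranks documents by an estimate of p(query | document) combined with a document prior. The mixture form (each query word is produced by translating a document word picked in proportion to its frequency), the smoothing weight alpha, the toy translation table, and every function name are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, trans, background, alpha=0.1):
    """log p(query | doc) under a simple word-translation mixture.

    trans[w][q] is a hypothetical table of p(q | w); a document word with no
    entry is assumed to translate only to itself. background[q] is a
    corpus-wide unigram probability used for smoothing (assumed).
    """
    counts = Counter(doc)
    n = len(doc)
    total = 0.0
    for q in query:
        # Sum over document words w of p(q | w) * p(w | doc).
        p = sum(
            trans.get(w, {w: 1.0}).get(q, 0.0) * c / n
            for w, c in counts.items()
        )
        # Interpolate with the background model so unseen words keep p > 0.
        p = (1 - alpha) * p + alpha * background.get(q, 1e-6)
        total += math.log(p)
    return total

def rank(query, docs, trans, background, log_prior=None):
    """Return document names sorted by log prior + log p(query | doc)."""
    log_prior = log_prior or {}
    scores = {
        name: log_prior.get(name, 0.0)
        + query_log_likelihood(query, words, trans, background)
        for name, words in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

With a table in which "car" can translate to "automobile", a document about cars can outrank an unrelated one for the query "automobile" even though the query word never appears in it, which is the behavior that separates this scheme from plain term matching.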
   

Error-correcting output coding for text classification

This paper applies error-correcting output coding (ECOC) to the task of document categorization. ECOC, of recent vintage in the AI literature, is a method for decomposing a multiway classification problem into many binary classification tasks, and then combining the results of the subtasks into a hypothesized solution to the original problem. There has been much recent interest in the machine learning community about algorithms which integrate ``advice'' from many subordinate predictors into a single classifier, and error-correcting output coding is one such technique. We provide experimental results on several real-world datasets, extracted from the Internet, which demonstrate that ECOC can offer significant improvements in accuracy over conventional classification algorithms.
Citation:
A. Berger (1999). Error-correcting output coding for text classification. IJCAI'99: Workshop on machine learning for information filtering. Stockholm, Sweden.

Online documents:
preprint
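
The decomposition described in the abstract above can be sketched as follows: each class gets a binary codeword, one binary classifier is trained per bit position, and a test document is assigned to the class whose codeword is closest in Hamming distance to the predicted bits. The class names, the 7-bit codebook, and the function names are hypothetical; the paper's actual codes and base classifiers are not reproduced here.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical codebook for 4 classes. Every pair of codewords differs in
# 4 positions, so a single wrong binary prediction can still be corrected.
CODEBOOK = {
    "sports":   (1, 1, 1, 1, 1, 1, 1),
    "politics": (0, 0, 0, 0, 1, 1, 1),
    "science":  (0, 0, 1, 1, 0, 0, 1),
    "business": (0, 1, 0, 1, 0, 1, 0),
}

def binary_targets(labels, codebook, bit):
    """Relabel multiclass training labels as 0/1 targets for one bit position."""
    return [codebook[label][bit] for label in labels]

def ecoc_decode(predicted_bits, codebook):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(predicted_bits, codebook[label]))
```

For example, if the seven binary classifiers output (0, 0, 1, 1, 0, 1, 1), which is the "science" codeword with one bit flipped, `ecoc_decode` still returns "science".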
   

Statistical models for text segmentation

This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model by identifying features correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric enables a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
Citation:
D. Beeferman, A. Berger, and J. Lafferty. Statistical models for text segmentation. Machine Learning, 34. Special Issue on Natural Language Learning (C. Cardie and R. Mooney, eds). 1999.

Online documents:
journal article
Conference paper (shorter; presented at EMNLP '97)
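
The abstract above does not spell out the error metric. The sketch below implements a windowed disagreement measure of the kind usually called P_k in the segmentation literature: the fraction of position pairs k apart on which the reference and hypothesized segmentations disagree about whether the two positions fall in the same segment. It may differ in detail from the definition in the paper, and the function names and boundary representation are assumptions.

```python
def segment_ids(boundaries, n):
    """Map each of n positions to a segment index.

    boundaries holds the positions at which a new segment starts
    (position 0 always starts the first segment).
    """
    starts = set(boundaries)
    ids, seg = [], 0
    for i in range(n):
        if i in starts and i > 0:
            seg += 1
        ids.append(seg)
    return ids

def windowed_error(ref_bounds, hyp_bounds, n, k):
    """Fraction of pairs (i, i + k) classified inconsistently; assumes 0 < k < n."""
    ref = segment_ids(ref_bounds, n)
    hyp = segment_ids(hyp_bounds, n)
    pairs = n - k
    disagreements = 0
    for i in range(pairs):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        disagreements += same_in_ref != same_in_hyp
    return disagreements / pairs
```

A hypothesis that misses a boundary and one that inserts a spurious boundary are both penalized, and a near miss costs less than a boundary placed far from the true one, which is how a single number can reflect both recall-like and precision-like errors.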
   

Just in Time Language Modelling

Traditional approaches to language modelling have relied on a fixed corpus of text to inform the parameters of a probability distribution over word sequences. Increasing the corpus size often leads to better-performing language models, but no matter how large, the corpus is a static entity, unable to reflect information about events which postdate it. In these pages we introduce an online paradigm which interleaves the estimation and application of a language model. We present a Bayesian approach to online language modelling, in which the marginal probabilities of a static trigram model are dynamically updated to match the topic being dictated to the system. We also describe the architecture of a prototype we have implemented which uses the World Wide Web (WWW) as a source of information, and provide results from some initial proof of concept experiments.
Citation:
A. Berger, R. Miller. Just in Time Language Modelling. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Seattle, WA (1998)

Online documents:
Conference paper
Slides of ICASSP'98 talk
 

A Model of Lexical Attraction and Repulsion

This paper introduces new techniques based on exponential families for modeling the correlations between words in text and speech. The motivation for this work is to build improved statistical language models by treating a static trigram model as a default distribution, and adding sufficient statistics, or ``features,'' to a family of conditional exponential distributions in order to model the nonstationary characteristics of language. We focus on features based on pairs of mutually informative words which allow the trigram model to adapt to recent context. While previous work assumed the effects of these word pairs to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. In particular, empirical samples drawn from both written text and conversational speech reveal that the ``attraction'' between words decays exponentially, while stylistic and syntactic constraints create a ``lexical exclusion'' effect that discourages close co-occurrence. We show that these characteristics are well described by mixture models based on two-stage exponential distributions. These models are a common tool in queueing theory, but they have not previously found use in speech and language processing. We show how the EM algorithm can be used to estimate the parameters of these models, which can then be incorporated as penalizing features in the posterior distribution for predicting the next word. Experimental results illustrate the benefit these techniques yield when incorporated into a long-range language model.
Citation:
D. Beeferman, A. Berger, J. Lafferty. A Model of Lexical Attraction and Repulsion. ACL-EACL'97 Joint Conference. Madrid, Spain (1997)

Online documents:
Conference paper
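
To make the "two-stage exponential" shape mentioned in the abstract above concrete, the sketch below evaluates the density of the sum of two independent exponential stages, the standard form from queueing theory. It is assumed here that this is the building block the abstract refers to; the rates used in the usage note are illustrative, and the mixture weights and EM fitting are not shown.

```python
import math

def two_stage_exponential_pdf(x, rate1, rate2):
    """Density at x of the sum of two independent exponential stages.

    Requires rate1 != rate2. The density is zero at x = 0, rises to a
    single peak, and then decays exponentially.
    """
    if x < 0:
        return 0.0
    scale = rate1 * rate2 / (rate2 - rate1)
    return scale * (math.exp(-rate1 * x) - math.exp(-rate2 * x))
```

With rates 0.05 and 0.5 the density is 0 at distance 0, peaks near a distance of 5 words, and falls away beyond that, which matches the qualitative picture of exclusion at very short range followed by exponentially decaying attraction.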