Abstracts: (All papers are in PostScript format unless otherwise indicated.)
 
Information retrieval as statistical translation
We propose a new probabilistic approach to information retrieval based
upon the ideas and methods of statistical machine translation. The
central ingredient in this approach is a statistical model of how a
user might distill or ``translate'' a given document into a query. To
assess the relevance of a document to a user's query, we estimate the
probability that the query would have been generated as a translation
of the document, and factor in the user's general preferences in the
form of a prior distribution over documents. We propose a simple,
well-motivated model of the document-to-query translation process, and
describe an algorithm for learning the parameters of this model in an
unsupervised manner from a collection of documents. As we show, one
can view this approach as a generalization and justification of the
``language modeling'' strategy recently proposed by Ponte and Croft.
In a series of experiments on TREC data, a simple translation-based
retrieval system performs well in comparison to conventional retrieval
techniques. This prototype system only begins to tap the full
potential of translation-based retrieval.
- Citation:
- A. Berger, J. Lafferty. Information retrieval as statistical translation.
Proceedings of ACM SIGIR'99. Berkeley, CA (to appear)
- Online documents:
- Conference paper
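
The ranking rule described in this abstract can be sketched as follows. This is
a minimal illustration, not the authors' implementation. It assumes a
word-by-word mixture form for the translation probability, in which each query
word is generated by translating a word drawn from the document. The names
log_query_likelihood and rank, the nested-dict translation table, and the floor
constant are all hypothetical.

```python
import math
from collections import Counter


def log_query_likelihood(query, document, translation, floor=1e-9):
    """Return log p(query | document) under an assumed mixture model.

    translation[w][q] is the (hypothetical) probability that document
    word w is "translated" into query word q.
    """
    counts = Counter(document)
    total = sum(counts.values())
    log_p = 0.0
    for q in query:
        # p(q | d) = sum over document words w of t(q | w) * count(w) / |d|
        p = sum(translation.get(w, {}).get(q, 0.0) * c / total
                for w, c in counts.items())
        # The floor is an assumption that keeps the logarithm finite.
        log_p += math.log(max(p, floor))
    return log_p


def rank(query, documents, translation, prior=None):
    """Order document names by log prior plus log translation likelihood."""
    scores = {}
    for name, words in documents.items():
        # With no prior supplied, every document is treated as equally likely.
        log_prior = math.log(prior[name]) if prior else 0.0
        scores[name] = log_prior + log_query_likelihood(query, words, translation)
    return sorted(scores, key=scores.get, reverse=True)
```

With a table that lets "car" translate to "automobile", a query for
"automobile" can rank a document about cars first even though the query word
never appears in it.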
 
 
Error-correcting output coding for text classification
This paper applies error-correcting output coding (ECOC) to the task of
document categorization. ECOC, of recent vintage in the AI literature, is a
method for decomposing a multiway classification problem into many binary
classification tasks, and then combining the results of the subtasks into a
hypothesized solution to the original problem. There has been much recent
interest within the machine learning community in algorithms that integrate
``advice'' from many subordinate predictors into a single classifier, and
error-correcting output coding is one such technique. We provide experimental
results on several real-world datasets, extracted from the Internet, which
demonstrate that ECOC can offer significant improvements in accuracy over
conventional classification algorithms.
- Citation:
- A. Berger (1999). Error-correcting output coding for text classification.
IJCAI'99: Workshop on machine learning for information filtering.
Stockholm, Sweden.
- Online documents:
- Preprint
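
The decomposition described in this abstract can be sketched as follows. This
is a minimal illustration under stated assumptions, not the paper's system:
the four-class, seven-bit code matrix, the centroid-based binary learner, and
all function names are hypothetical choices made only for the example.

```python
# Each class gets a binary codeword; any two codewords differ in 4 positions,
# so a single wrong bit can still be decoded to the right class.
CODE = {
    "A": (1, 1, 1, 1, 1, 1, 1),
    "B": (0, 0, 0, 0, 1, 1, 1),
    "C": (0, 0, 1, 1, 0, 0, 1),
    "D": (0, 1, 0, 1, 0, 1, 0),
}


def train_centroid(X, y):
    """Stand-in binary learner: predict the label of the nearer centroid."""
    dims = len(X[0])
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in range(dims)]

    def predict(x):
        dist = {label: sum((a - b) ** 2 for a, b in zip(x, c))
                for label, c in centroids.items()}
        return min(dist, key=dist.get)

    return predict


def train_ecoc(X, labels, code, train_binary):
    """Train one binary classifier per column of the code matrix."""
    n_bits = len(next(iter(code.values())))
    return [train_binary(X, [code[c][b] for c in labels]) for b in range(n_bits)]


def predict_ecoc(x, bit_classifiers, code):
    """Predict every bit, then pick the class with the closest codeword."""
    bits = [clf(x) for clf in bit_classifiers]
    return min(code, key=lambda c: sum(a != b for a, b in zip(bits, code[c])))
```

Decoding by Hamming distance is what gives the method its error-correcting
character: an individual binary classifier may be wrong as long as the
predicted bit string stays closest to the correct codeword.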
 
 
Statistical models for text segmentation
This paper introduces a new statistical approach to
automatically partitioning text into coherent segments. The approach is based
on a technique that incrementally builds an exponential model by identifying
features correlated with the presence of boundaries in labeled training text.
The models use two classes of features: topicality features that use
adaptive language models in a novel way to detect broad changes of topic, and
cue-word features that detect occurrences of specific words, which may
be domain-specific, that tend to be used near segment boundaries. Assessment
of our approach on quantitative and qualitative grounds demonstrates its
effectiveness in two very different domains, Wall Street Journal news
articles and television broadcast news story transcripts. Quantitative results
on these domains are presented using a new probabilistically motivated error
metric, which combines precision and recall in a natural and flexible way.
This metric enables a quantitative assessment of the relative contributions of
the different feature types, as well as a comparison with decision trees and
previously proposed text segmentation algorithms.
- Citation:
- D. Beeferman, A. Berger, and J. Lafferty. Statistical models for text
segmentation. Machine Learning, 34. Special Issue on Natural Language
Learning (C. Cardie and R. Mooney, eds). 1999.
- Online documents:
- Journal article
- Conference paper (shorter; presented at EMNLP '97)
 
 
Just in Time Language Modelling
Traditional approaches to language modelling have relied on a fixed
corpus of text to inform the parameters of a probability distribution
over word sequences. Increasing the corpus size often leads to
better-performing language models, but no matter how large, the corpus
is a static entity, unable to reflect information about events which
postdate it. In these pages we introduce an online paradigm which
interleaves the estimation and application of a language model. We
present a Bayesian approach to online language modelling, in which the
marginal probabilities of a static trigram model are dynamically
updated to match the topic being dictated to the system. We also
describe the architecture of a prototype we have implemented which
uses the World Wide Web (WWW) as a source of information, and provide
results from some initial proof of concept experiments.
- Citation:
- A. Berger, R. Miller. Just in Time Language Modelling.
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Seattle, WA (1998)
- Online documents:
- Conference paper
- Slides of ICASSP'98 talk
 
A Model of Lexical Attraction and Repulsion
This paper introduces new techniques based on exponential
families for modeling the correlations between words in
text and speech.
The motivation for this work is to build improved
statistical language models by treating a static trigram model as a
default distribution, and adding sufficient statistics, or ``features,''
to a family of conditional exponential distributions in order to model
the nonstationary characteristics of language. We focus on features
based on pairs of mutually informative words which allow the trigram
model to adapt to recent context. While previous work assumed the
effects of these word pairs to be constant over a window of several
hundred words, we show that their influence is nonstationary on a much
smaller time scale. In particular, empirical samples drawn from both
written text and conversational speech reveal that the ``attraction''
between words decays exponentially, while stylistic and syntactic
constraints create a ``lexical exclusion'' effect that discourages close
co-occurrence. We show that these characteristics are well described by
mixture models based on two-stage exponential distributions. These
models are a common tool in queueing theory, but they have not
previously found use in speech and language processing. We show how the
EM algorithm can be used to estimate the parameters of these models,
which can then be incorporated as penalizing features in the posterior
distribution for predicting the next word. Experimental results
illustrate the benefit these techniques yield when incorporated into
a long-range language model.
- Citation:
- D. Beeferman, A. Berger, J. Lafferty. A Model of Lexical Attraction and Repulsion.
ACL-EACL'97 Joint Conference.
Madrid, Spain (1997)
- Online documents:
- Conference paper
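
The shape of a two-stage exponential distribution, as mentioned in this
abstract, can be sketched as follows. This is a minimal illustration of the
standard density of a sum of two independent exponential stages, not the
paper's fitted mixture model. The function name and the rate values used with
it are hypothetical.

```python
import math


def two_stage_exponential_pdf(k, rate1, rate2):
    """Density at distance k of the sum of two independent exponential stages.

    The density is zero at k = 0 (close co-occurrence is discouraged),
    rises to a single peak, and then decays exponentially.
    """
    if rate1 <= 0 or rate2 <= 0 or rate1 == rate2:
        raise ValueError("rates must be positive and distinct")
    if k < 0:
        return 0.0
    scale = rate1 * rate2 / (rate2 - rate1)
    return scale * (math.exp(-rate1 * k) - math.exp(-rate2 * k))
```

Evaluating the function at increasing distances shows both effects named in
the abstract: the value at zero is exactly zero, and beyond the peak the slower
of the two rates governs the exponential decay.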