Abstracts:

Automatic partitioning of web pages using clustering

This paper introduces a method for automatically partitioning richly-formatted electronic documents. An automatic partitioning system has many potential uses, but we focus here on one: dividing web content into fragments small enough to be delivered to and rendered on a mobile phone or PDA. The segmentation algorithm is analyzed on both theoretical and empirical grounds, with a suite of measurements.
Citation:
In Brewster and Dunlop (Eds.), "Mobile Human-Computer Interaction - Mobile HCI 2004", Lecture Notes in Computer Science, Vol. 3160, ISBN: 3-540-23086-6, September 2004

Online documents:
Conference paper [pdf]  

Centaur: A two-panel user interface for mobile document access

This paper introduces a novel user interface designed to mitigate some of the usability problems in mobile web access. The interface consists of two side-by-side panels for representing a richly-formatted document (e.g. a web page). In one panel is a "wide angle" view of the entire document, partitioned into regions. In the other panel is a "zoomed in" view of the currently-active region. We describe a prototype system, called CENTAUR, which automatically reformats electronic documents into this side-by-side view in real time.
Citation:
In Brewster and Dunlop (Eds.), "Mobile Human-Computer Interaction - Mobile HCI 2004", Lecture Notes in Computer Science, Vol. 3160, ISBN: 3-540-23086-6, September 2004

Online documents:
Conference paper [pdf]  

Query-relevant summarization using FAQs

This paper introduces a statistical model for query-relevant summarization: succinctly characterizing the relevance of a document to a query. Learning parameter values for the proposed model requires a large collection of summarized documents, which we do not have; as a proxy, we use a collection of FAQ (frequently-asked question) documents. Taking a learning approach enables a principled, quantitative evaluation of the proposed system, and the results of some initial experiments--on a collection of Usenet FAQs and on a FAQ-like set of customer-submitted questions to several large retail companies--suggest the plausibility of learning for summarization.
Citation:
V. Mittal and A. Berger (2000). Query-relevant summarization using FAQs. Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL). Hong Kong.

Online documents:
Conference paper [pdf]  

Agglomerative clustering of a search engine query log

This paper introduces a technique for mining a collection of user transactions with an Internet search engine to discover clusters of similar queries and similar URLs. The information we exploit is "clickthrough data": each record consists of a user's query to a search engine along with the URL which the user selected from among the candidates offered by the search engine. By viewing this dataset as a bipartite graph, with the vertices on one side corresponding to queries and on the other side to URLs, one can apply an agglomerative clustering algorithm to the graph's vertices to identify related queries and URLs. One noteworthy feature of the proposed algorithm is that it is "content-ignorant"--the algorithm makes no use of the actual content of the queries or URLs, but only how they co-occur within the clickthrough data. We describe how to enlist the discovered clusters to assist users in web search, and measure the effectiveness of the discovered clusters in the Lycos search engine.
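
As a rough illustration of the "content-ignorant" idea, the sketch below clusters queries purely by the overlap of the URLs they were clicked with. The toy click log, the Jaccard similarity, and the merge threshold are assumptions for illustration, not the paper's exact algorithm.

    # Content-ignorant agglomerative clustering of clickthrough data:
    # queries are similar if they lead to clicks on similar URL sets.
    from itertools import combinations

    clicks = [  # toy (query, clicked URL) records
        ("cheap flights", "expedia.com"),
        ("airline tickets", "expedia.com"),
        ("airline tickets", "delta.com"),
        ("used cars", "autotrader.com"),
    ]

    # One side of the bipartite graph: query -> set of clicked URLs.
    neighbors = {}
    for q, u in clicks:
        neighbors.setdefault(q, set()).add(u)

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # Greedy agglomeration: start from singleton clusters and repeatedly
    # merge the most similar pair until no similarity exceeds a threshold.
    clusters = [{q} for q in neighbors]
    THRESHOLD = 0.3  # assumed stopping criterion
    while True:
        best, pair = THRESHOLD, None
        for (i, a), (j, b) in combinations(enumerate(clusters), 2):
            urls_a = set().union(*(neighbors[q] for q in a))
            urls_b = set().union(*(neighbors[q] for q in b))
            s = jaccard(urls_a, urls_b)
            if s > best:
                best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i] |= clusters[j]
        del clusters[j]

    print(clusters)  # "cheap flights" and "airline tickets" merge

The same procedure runs unchanged on the URL side of the graph by swapping the roles of queries and URLs.
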
Citation:
D. Beeferman and A. Berger (2000). Agglomerative clustering of a search engine query log. Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD). Boston, MA.

Online documents:
Conference paper [pdf]
Slides from presentation [ps]
 

Statistical machine learning for information retrieval

The purpose of this work is to introduce and experimentally validate a framework, based on statistical machine learning, for handling a broad range of problems in information retrieval (IR)....[truncated for length]
Citation:
A. Berger (2001). Statistical machine learning for information retrieval. Carnegie Mellon University Computer Science Department Technical Report CMU-CS-01-110.

Online documents:
PhD Thesis [pdf] (Caveat: some of the figures are blank. This is a known problem; I plan to fix it when I get the chance...)

 

Ocelot: A system for summarizing web pages

We introduce Ocelot, a prototype system for automatically generating the "gist" of a web page by summarizing it. Although most text summarization research to date has focused on news articles, web pages are quite different in both structure and content. Instead of coherent text with a well-defined discourse structure, they are more often a chaotic jumble of phrases, links, graphics and formatting commands. Such text provides little foothold for extractive summarization techniques, which attempt to generate a summary of a document by excerpting a contiguous, coherent span of text from it. This paper builds upon recent work in non-extractive summarization, producing the gist of a web page by "translating" it into a more concise representation rather than attempting to extract a text span verbatim. Ocelot uses probabilistic models to guide it in selecting and ordering words into a gist. This paper describes a technique for learning these models automatically from a collection of human-summarized web pages.
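
As a toy rendering of the select-then-order idea, the sketch below picks the highest-scoring words and then orders them with a bigram model. Frequency-based selection and the hand-set bigram table are illustrative stand-ins for the learned models described in the paper.

    # Non-extractive "gisting": choose k words, then order them with a
    # bigram language model (brute force over permutations is fine for
    # a handful of words).
    from collections import Counter
    from itertools import permutations

    def gist(words, bigram_logprob, k=3):
        # Word selection: favor words frequent in this page (a stand-in
        # for the learned word-selection model).
        chosen = [w for w, _ in Counter(words).most_common(k)]

        def score(seq):
            return sum(bigram_logprob.get(pair, -10.0)  # unseen-pair penalty
                       for pair in zip(seq, seq[1:]))

        return " ".join(max(permutations(chosen), key=score))

    page = "discount hotel rooms discount rates hotel booking".split()
    lm = {("discount", "hotel"): -1.0, ("hotel", "rooms"): -2.0}
    print(gist(page, lm))  # -> "discount hotel rooms"
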
Citation:
A. Berger, V. Mittal (2000). Ocelot: A system for summarizing web pages. Proceedings of the 23rd Annual Conference on Research and Development in Information Retrieval (ACM SIGIR). Athens, Greece.

Online documents:
Conference paper [ps]
Slides from presentation [ps]
   
 

Bridging the lexical chasm: Statistical approaches to answer-finding

This paper investigates whether a machine can automatically learn the task of finding, within a large collection of candidate responses, the answers to questions. The learning process consists of inspecting a collection of answered questions and characterizing the relation between question and answer with a statistical model. For the purpose of learning this relation, we propose two sources of data: Usenet FAQ documents and customer service call-center dialogues from a large retail company. We will show that the task of "answer-finding" differs from both document retrieval and traditional question-answering, presenting challenges different from those found in these problems. The central aim of this work is to discover, through theoretical and empirical investigation, those statistical techniques best suited to the answer-finding problem.
Citation:
A. Berger, R. Caruana, D. Cohn, D. Freitag, and V. Mittal (2000). Bridging the lexical chasm: Statistical approaches to answer-finding. Proceedings of the 23rd Annual Conference on Research and Development in Information Retrieval (ACM SIGIR). Athens, Greece.

Online documents:
Conference paper [pdf]
   
 

The Weaver system for document retrieval

This paper introduces Weaver, a probabilistic document retrieval system under development at Carnegie Mellon University, and discusses its performance in the TREC-8 ad hoc evaluation. We begin by describing the architecture and philosophy of the Weaver system, which represents a departure from traditional approaches to retrieval. The central ingredient is a statistical model of how a user might distill or "translate" a given document into a query. The retrieval-as-translation approach is based on the noisy channel paradigm and statistical language modeling, and has much in common with other recently proposed models. After the initial high-level overview, the bulk of the paper contains a discussion of implementation details and the empirical performance of the Weaver retrieval system.
Citation:
A. Berger, J. Lafferty (1999). The Weaver system for document retrieval. Proceedings of the 8th Annual Text Retrieval Conference (TREC-8). Gaithersburg, MD.

Online documents:
Conference paper [ps]
Slides from presentation [ps]
   

Automatic construction of conditional exponential models from elementary features

An apparatus for translating a series of source words in a first language to a series of target words in a second language. For an input series of source words, at least two target hypotheses, each comprising a series of target words, are generated. Each target word has a context comprising at least one other word in the target hypothesis. For each target hypothesis, a language model match score comprises an estimate of the probability of occurrence of the series of words in the target hypothesis. At least one alignment connecting each source word with at least one target word in the target hypothesis is identified. For each source word and each target hypothesis, a word match score comprises an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context in the target hypothesis of the target word which is connected to the source word. For each target hypothesis, a translation match score comprises a combination of the word match scores for the target hypothesis and the source words in the input series of source words.
Citation:
A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Lafferty, R. Mercer (2001). Automatic construction of conditional exponential models from elementary features. U.S. patent #6,304,841.

 

Information retrieval as statistical translation

We propose a new probabilistic approach to information retrieval based upon the ideas and methods of statistical machine translation. The central ingredient in this approach is a statistical model of how a user might distill or "translate" a given document into a query. To assess the relevance of a document to a user's query, we estimate the probability that the query would have been generated as a translation of the document, and factor in the user's general preferences in the form of a prior distribution over documents. We propose a simple, well motivated model of the document-to-query translation process, and describe an algorithm for learning the parameters of this model in an unsupervised manner from a collection of documents. As we show, one can view this approach as a generalization and justification of the "language modeling" strategy recently proposed by Ponte and Croft. In a series of experiments on TREC data, a simple translation-based retrieval system performs well in comparison to conventional retrieval techniques. This prototype system only begins to tap the full potential of translation-based retrieval.
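
The sketch below scores documents under the simplest special case, where a document word can only "translate" to itself: a query term is generated either from the document's word distribution or from a background corpus model, mixed with weight alpha, and the prior over documents is taken as uniform. The mixture weight and toy collection are illustrative.

    # Retrieval as translation, degenerate self-translation case:
    # rank documents by log P(query | document).
    import math
    from collections import Counter

    docs = {
        "d1": "the statistical translation of documents".split(),
        "d2": "cooking recipes for the weekend".split(),
    }
    background = Counter(w for d in docs.values() for w in d)
    N = sum(background.values())
    ALPHA = 0.7  # document/background mixture weight (assumed)

    def score(query, doc_words):
        dc, dn = Counter(doc_words), len(doc_words)
        s = 0.0
        for w in query.split():
            p = ALPHA * dc[w] / dn + (1 - ALPHA) * background[w] / N
            s += math.log(p) if p > 0 else -1e9  # term unseen everywhere
        return s  # add a log-prior here for a non-uniform document prior

    query = "statistical translation"
    print(sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))
    # -> ['d1', 'd2']
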
Citation:
A. Berger, J. Lafferty (1999). Information retrieval as statistical translation. Proceedings of the 22nd Annual Conference on Research and Development in Information Retrieval (ACM SIGIR). Berkeley, CA, 222-229.

Online documents:
Conference paper [ps]
Short note deriving equation (4) in paper [ps]
   

Error-correcting output coding for text classification

This paper applies error-correcting output coding (ECOC) to the task of document categorization. ECOC, of recent vintage in the AI literature, is a method for decomposing a multiway classification problem into many binary classification tasks, and then combining the results of the subtasks into a hypothesized solution to the original problem. There has been much recent interest in the machine learning community about algorithms which integrate "advice" from many subordinate predictors into a single classifier, and error-correcting output coding is one such technique. We provide experimental results on several real-world datasets, extracted from the Internet, which demonstrate that ECOC can offer significant improvements in accuracy over conventional classification algorithms.
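
The sketch below runs the ECOC recipe end to end on toy data: each class gets a 5-bit codeword, one tiny perceptron is trained per bit, and a new document receives the class whose codeword is nearest in Hamming distance to the predicted bits. The code matrix, training set, and choice of learner are all illustrative.

    # Error-correcting output coding for a 3-class text problem.
    CODE = {"sports": (0, 0, 1, 1, 0),
            "tech":   (1, 0, 0, 1, 1),
            "arts":   (1, 1, 1, 0, 0)}

    train = [("the team won the game", "sports"),
             ("new chip boosts laptop speed", "tech"),
             ("the gallery opened a new exhibit", "arts")]

    def featurize(text):
        return set(text.split())

    def train_bit(column):
        # Tiny perceptron deciding bit=1 vs bit=0 for one code column.
        w = {}
        for _ in range(10):
            for text, label in train:
                target = CODE[label][column]
                pred = 1 if sum(w.get(f, 0.0) for f in featurize(text)) > 0 else 0
                for f in featurize(text):
                    w[f] = w.get(f, 0.0) + (target - pred)
        return w

    models = [train_bit(c) for c in range(5)]

    def classify(text):
        bits = tuple(1 if sum(m.get(f, 0.0) for f in featurize(text)) > 0 else 0
                     for m in models)
        # Decode: nearest codeword in Hamming distance wins.
        return min(CODE, key=lambda c: sum(a != b for a, b in zip(CODE[c], bits)))

    print(classify("the team played a great game"))  # -> sports

The error-correcting property is what the extra bits buy: even if one or two of the binary predictors vote wrong, the nearest codeword is usually still the right class.
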
Citation:
A. Berger (1999). Error-correcting output coding for text classification. IJCAI'99: Workshop on machine learning for information filtering. Stockholm, Sweden.

Online documents:
Workshop paper [pdf]
   

Statistical models for text segmentation

This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model by identifying features correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric enables a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
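
The sketch below implements a windowed error metric in the spirit of the one proposed in the paper (commonly written P_k): slide a window across the text and count how often reference and hypothesis disagree about whether the window's endpoints lie in the same segment. Segmentations are encoded here as lists of segment-start positions; the window width and toy data are illustrative.

    # P_k-style segmentation error: fraction of width-k windows on which
    # the two segmentations disagree about same-vs-different segment.
    def same_segment(boundaries, i, j):
        # True iff no segment starts strictly after i and at or before j.
        return not any(i < b <= j for b in boundaries)

    def p_k(reference, hypothesis, n, k):
        disagree = sum(
            same_segment(reference, i, i + k) != same_segment(hypothesis, i, i + k)
            for i in range(n - k))
        return disagree / (n - k)

    # 20 sentences; true segments start at 5 and 12, hypothesis says 5 and 14.
    print(p_k([5, 12], [5, 14], n=20, k=4))  # -> 0.25

Because a near-miss boundary only shifts a few windows from "agree" to "disagree", the metric degrades gracefully, unlike exact-boundary precision and recall.
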
Citation:
D. Beeferman, A. Berger, and J. Lafferty (1999). Statistical models for text segmentation. Machine Learning, (34):177-210. Special Issue on Natural Language Learning (C. Cardie and R. Mooney, eds).

Online documents:
Journal article [pdf]
Conference paper (shorter; presented at EMNLP '97) [ps]
   

A maximum entropy approach to natural language processing

The concept of maximum entropy can be traced back along multiple threads to Biblical times. Only recently, however, have computers become powerful enough to permit the widescale application of this concept to real world problems in statistical estimation and pattern recognition. In this paper we describe a method for statistical modeling based on maximum entropy. We present a maximum-likelihood approach for automatically constructing maximum entropy models and describe how to implement this approach efficiently, using as examples several problems in natural language processing.
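
As a concrete toy, the sketch below trains a conditional maximum entropy model by maximum likelihood. Plain gradient ascent stands in for the iterative scaling algorithms discussed in the paper, and the tagging data and indicator features are invented for illustration.

    # Conditional maxent: p(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z(x).
    import math

    data = [("bank", "NOUN"), ("run", "VERB"), ("bank", "NOUN"), ("run", "NOUN")]
    TAGS = ["NOUN", "VERB"]

    def features(x, y):
        return [(x, y)]  # one indicator feature per (word, tag) pair

    lam = {}  # feature weights

    def p(y, x):
        scores = {t: math.exp(sum(lam.get(f, 0.0) for f in features(x, t)))
                  for t in TAGS}
        return scores[y] / sum(scores.values())

    for _ in range(200):  # gradient ascent on training log-likelihood
        grad = {}
        for x, y in data:
            for f in features(x, y):          # observed feature counts
                grad[f] = grad.get(f, 0.0) + 1.0
            for t in TAGS:                    # minus model-expected counts
                for f in features(x, t):
                    grad[f] = grad.get(f, 0.0) - p(t, x)
        for f, g in grad.items():
            lam[f] = lam.get(f, 0.0) + 0.1 * g

    print(round(p("NOUN", "bank"), 3))  # approaches 1.0
    print(round(p("VERB", "run"), 3))   # stays 0.5: "run" is ambiguous here

At convergence the model's expected feature counts match the observed counts, which is exactly the constrained maximum entropy solution.
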
Citation:
A. Berger, S. Della Pietra and V. Della Pietra (1996). A maximum entropy approach to natural language processing. Computational Linguistics 22(1). Also available as IBM Research Report RC-19694.

Online documents:
Journal article [ps]
IBM Tech Report [ps]
 

The Candide system for machine translation

We present an overview of Candide, a system for automatic translation of French text to English text. Candide uses methods of information theory and statistics to develop a probability model of the translation process. This model, which is made to accord as closely as possible with a large body of French and English sentence pairs, is then used to generate English translations of previously unseen French sentences. This paper provides a tutorial in these methods, discussions of the training and operation of the system, and a summary of test results.
Citation:
A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, J. Gillett, J. Lafferty, H. Printz, L. Ures (1994). The Candide system for machine translation. ARPA Workshop on Speech and Natural Language. Morgan Kaufmann Publishers, 157-163.

Online documents:
Workshop paper [ps]
 

Just in Time Language Modelling

Traditional approaches to language modelling have relied on a fixed corpus of text to inform the parameters of a probability distribution over word sequences. Increasing the corpus size often leads to better-performing language models, but no matter how large, the corpus is a static entity, unable to reflect information about events which postdate it. In these pages we introduce an online paradigm which interleaves the estimation and application of a language model. We present a Bayesian approach to online language modelling, in which the marginal probabilities of a static trigram model are dynamically updated to match the topic being dictated to the system. We also describe the architecture of a prototype we have implemented which uses the World Wide Web (WWW) as a source of information, and provide results from some initial proof of concept experiments.
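
The sketch below is a toy version of the interleaved estimate-and-apply loop: a static model is mixed with a "cache" distribution re-estimated from freshly retrieved topical text. Unigrams and a fixed mixture weight stand in for the paper's trigram model and Bayesian update; the corpora are invented.

    # Just-in-time language modelling, unigram caricature.
    from collections import Counter

    static_counts = Counter("the of a language model the of".split())
    STATIC_N = sum(static_counts.values())

    dynamic_counts = Counter()  # grows as topical text is retrieved

    def observe_topical_text(text):
        dynamic_counts.update(text.split())

    def prob(word, beta=0.3):  # beta: dynamic-model weight (assumed)
        p_static = static_counts[word] / STATIC_N
        n = sum(dynamic_counts.values())
        p_dynamic = dynamic_counts[word] / n if n else 0.0
        return (1 - beta) * p_static + beta * p_dynamic

    print(prob("election"))  # 0.0: unseen in the static corpus
    observe_topical_text("election results election night coverage")
    print(prob("election"))  # boosted by just-retrieved web text
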
Citation:
A. Berger, R. Miller (1998). Just in Time Language Modelling. IEEE Conference on Acoustics, Speech and Signal Processing. Seattle, WA.

Online documents:
Conference paper [pdf]
Slides of ICASSP'98 talk [ps]
 

Probabilistic decoding of low-density Cayley codes

We report on some investigations into the behavior of a class of low-density codes constructed using algebraic techniques. Recent work shows expansion to be an essential property of the graphs underlying the low-density parity-check codes first introduced by Gallager. In addition, it has recently been shown that certain spectral techniques similar to those based on Fourier analysis for classical cyclic codes can be applied to codes constructed from Cayley graphs. This motivates us to compare the behavior of algebraically constructed expanders and randomly generated bipartite graphs using a probabilistic decoding algorithm. Preliminary results indicate that the performance of the explicit, algebraic expanders is comparable to that of random graphs in the case where each variable is associated with only two parity checks, while such codes are inferior to randomly generated codes with three or more constraints for each variable.
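
For concreteness, the sketch below runs Gallager-style bit-flipping, a simple iterative decoder in the same family as the probabilistic decoding studied here, on a toy parity-check matrix. The matrix is a small stand-in, not a Cayley-graph or random expander construction.

    # Bit-flipping decoding on a toy low-density parity-check code.
    H = [  # each row is one parity check over the 6 code bits
        [1, 1, 1, 0, 0, 0],
        [1, 0, 0, 1, 1, 0],
        [0, 1, 0, 1, 0, 1],
        [0, 0, 1, 0, 1, 1],
    ]

    def unsatisfied(word):
        return [row for row in H if sum(h * b for h, b in zip(row, word)) % 2]

    def bit_flip_decode(word, max_iters=10):
        word = list(word)
        for _ in range(max_iters):
            bad = unsatisfied(word)
            if not bad:
                return word  # all parity checks satisfied
            # Flip the bit involved in the most unsatisfied checks.
            votes = [sum(row[i] for row in bad) for i in range(len(word))]
            word[votes.index(max(votes))] ^= 1
        return word

    received = [1, 1, 0, 1, 1, 1]  # codeword [1,1,0,0,1,1] with bit 3 flipped
    print(bit_flip_decode(received))  # -> [1, 1, 0, 0, 1, 1]

Expansion matters for such decoders because a good expander guarantees that corrupted bits keep touching many unsatisfied checks, so the flipping heuristic makes progress.
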
Citation:
A. Berger and J. Lafferty (1997). Probabilistic decoding of low-density Cayley codes. Proceedings of the Canadian Workshop on Information Theory. Toronto, Canada.

Online documents:
Workshop paper [pdf]

A Model of Lexical Attraction and Repulsion

This paper introduces new techniques based on exponential families for modeling the correlations between words in text and speech. The motivation for this work is to build improved statistical language models by treating a static trigram model as a default distribution, and adding sufficient statistics, or "features," to a family of conditional exponential distributions in order to model the nonstationary characteristics of language. We focus on features based on pairs of mutually informative words which allow the trigram model to adapt to recent context. While previous work assumed the effects of these word pairs to be constant over a window of several hundred words, we show that their influence is nonstationary on a much smaller time scale. In particular, empirical samples drawn from both written text and conversational speech reveal that the "attraction" between words decays exponentially, while stylistic and syntactic constraints create a "lexical exclusion" effect that discourages close co-occurrence. We show that these characteristics are well described by mixture models based on two-stage exponential distributions. These models are a common tool in queueing theory, but they have not previously found use in speech and language processing. We show how the EM algorithm can be used to estimate the parameters of these models, which can then be incorporated as penalizing features in the posterior distribution for predicting the next word. Experimental results illustrate the benefit these techniques yield when incorporated into a long-range language model.
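
The sketch below fits a two-component exponential mixture to toy inter-occurrence distances with EM, the basic recipe underlying these models; the data and initialization are invented, and the paper's two-stage distributions and lexical-exclusion effect add structure beyond this.

    # EM for a mixture of two exponentials over word-pair distances.
    import math

    # Toy distances (in words) between occurrences of a trigger pair:
    # a short-range "attraction" cluster plus a long-range tail.
    d = [2, 3, 2, 4, 3, 40, 55, 70, 120, 90]

    pi, r1, r2 = 0.5, 1.0, 0.05  # mixing weight and two rates (initial guess)

    def exp_pdf(x, rate):
        return rate * math.exp(-rate * x)

    for _ in range(50):  # EM iterations
        # E-step: responsibility of the short-range component for each x.
        resp = [pi * exp_pdf(x, r1) /
                (pi * exp_pdf(x, r1) + (1 - pi) * exp_pdf(x, r2))
                for x in d]
        # M-step: re-estimate the weight and rates from soft counts.
        n1 = sum(resp)
        pi = n1 / len(d)
        r1 = n1 / sum(g * x for g, x in zip(resp, d))
        r2 = (len(d) - n1) / sum((1 - g) * x for g, x in zip(resp, d))

    # Mixing weight and the two mean distances (short- vs long-range).
    print(round(pi, 2), round(1 / r1, 1), round(1 / r2, 1))
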
Citation:
D. Beeferman, A. Berger, J. Lafferty (1997). A model of lexical attraction and repulsion. ACL-EACL'97 Joint Conference. Madrid, Spain.

Online documents:
Conference paper [ps]
 

Language translation apparatus and method using context-based translation models

An apparatus for translating a series of source words in a first language to a series of target words in a second language. For an input series of source words, at least two target hypotheses, each including a series of target words, are generated. Each target word has a context comprising at least one other word in the target hypothesis. For each target hypothesis, a language model match score comprises an estimate of the probability of occurrence of the series of words in the target hypothesis. At least one alignment connecting each source word with at least one target word in the target hypothesis is identified. For each source word and each target hypothesis, a word match score comprises an estimate of the conditional probability of occurrence of the source word, given the target word in the target hypothesis which is connected to the source word and given the context in the target hypothesis of the target word which is connected to the source word. For each target hypothesis, a translation match score comprises a combination of the word match scores for the target hypothesis and the source words in the input series of source words. A target hypothesis match score comprises a combination of the language model match score for the target hypothesis and the translation match score for the target hypothesis. The target hypothesis having the best target hypothesis match score is output.
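
A toy sketch of this scoring scheme: each hypothesis receives a language model match score plus a translation match score (computed here under a trivial fixed alignment of source word i to target word i), and the best-scoring hypothesis is output. All tables and numbers below are invented.

    # Noisy-channel hypothesis scoring: LM match score + translation
    # match score; the best total wins.
    source = ["le", "chat"]
    hypotheses = [["the", "cat"], ["the", "hat"]]

    lm_logprob = {("the", "cat"): -1.0, ("the", "hat"): -4.0}
    trans_logprob = {("le", "the"): -0.5, ("chat", "cat"): -0.5,
                     ("chat", "hat"): -5.0}

    def match_score(hyp):
        lm = lm_logprob.get(tuple(hyp), -10.0)  # language model match score
        # Word match scores under the fixed alignment source[i] -> hyp[i].
        tm = sum(trans_logprob.get((s, t), -10.0) for s, t in zip(source, hyp))
        return lm + tm  # target hypothesis match score

    print(max(hypotheses, key=match_score))  # -> ['the', 'cat']
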
Citation:
A. Berger, P. Brown, S. Della Pietra, V. Della Pietra, A. Kehler, R. Mercer (1996). Language translation apparatus and method using context-based translation models. U.S. patent #5,510,981.

 

Recognition performance of a large-scale dependency-grammar language model

In this paper we report on a large-scale investigation of dependency grammar language models. Our work includes several significant departures from earlier studies, notably a much larger training corpus, improved model structure, different feature types, new feature selection methods, and more coherent training and test data. We report the effect of this model, used in a rescoring paradigm, on the word error rate (WER) of the IBM speech recognition system.
Citation:
A. Berger and H. Printz (1998). Recognition performance of a large-scale dependency-grammar language model. Int'l Conference on Spoken Language Processing (ICSLP'98). Sydney, Australia.

Online documents:
Conference paper [ps]
 

A Comparison of Criteria for Maximum Entropy/Minimum Divergence Feature Selection

In this paper we study the gain, a naturally-arising statistic from the theory of MEMD modeling, as a figure of merit for selecting features for an MEMD language model. We compare the gain with two popular alternatives--empirical activation and mutual information--and argue that the gain is the preferred statistic, on the grounds that it directly measures a feature's contribution to improving upon the base model.
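
The sketch below computes the gain for one candidate feature: the improvement in training-data log-likelihood when the feature is added to a fixed base model with its weight alpha tuned optimally (here by grid search). The base model, candidate feature, and data are invented for illustration.

    # Gain of a candidate feature over a base unigram model.
    import math

    VOCAB = ["the", "cat", "sat", "mat"]
    base = {"the": 0.4, "cat": 0.2, "sat": 0.2, "mat": 0.2}  # base model
    data = ["the", "cat", "cat", "sat", "cat", "the"]        # training text

    def feature(w):
        return 1.0 if w == "cat" else 0.0  # candidate feature fires on "cat"

    def loglik(alpha):
        # Log-likelihood under base * exp(alpha * feature), renormalized.
        z = sum(base[w] * math.exp(alpha * feature(w)) for w in VOCAB)
        return sum(math.log(base[w] * math.exp(alpha * feature(w)) / z)
                   for w in data)

    base_ll = loglik(0.0)
    best_ll = max(loglik(a / 100.0) for a in range(-300, 301))  # 1-D search
    print("per-event gain:", (best_ll - base_ll) / len(data))

Because "cat" is far more frequent in the training text than the base model expects, this feature earns a positive gain and would be a strong candidate for selection.
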
Citation:
A. Berger, H. Printz (1998). A comparison of criteria for maximum entropy/minimum divergence feature selection. Third Conference on Empirical Methods in Natural Language Processing, 96-107. Granada, Spain.

Online documents:
Conference paper [compressed ps]

 

Cyberpunc: A lightweight punctuation annotation system for speech

This paper describes a lightweight method for the automatic insertion of intra-sentence punctuation into text. Despite the intuition that pauses in an acoustic stream are a positive indicator for some types of punctuation, this work will demonstrate the feasibility of a system which relies solely on lexical information. Besides its potential role in a speech recognition system, such a system could serve equally well in non-speech applications such as automatic grammar correction in a word processor and parsing of spoken text. After describing the design of a punctuation-restoration system, which relies on a trigram language model and a straightforward application of the Viterbi algorithm, we summarize results, both quantitative and subjective, of the performance and behavior of a prototype system.
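
A toy sketch of the idea: between each pair of words, insert a comma exactly when the language model prefers the punctuated reading. A hand-set bigram table stands in for the paper's trigram model; under a bigram model the per-gap decisions decouple, so the Viterbi search reduces to a local comparison at each gap (the trigram case needs the full dynamic program).

    # Lexical comma restoration with a bigram stand-in model.
    logp = {("i", "know"): -1, ("know", ","): -2, ("know", "however"): -6,
            (",", "however"): -1, ("however", ","): -1, ("however", "that"): -5,
            (",", "that"): -2, ("that", "works"): -1}

    def lp(a, b):
        return logp.get((a, b), -8.0)  # back-off score for unseen bigrams

    def punctuate(words):
        out = [words[0]]
        for prev, nxt in zip(words, words[1:]):
            # Insert "," iff the detour through the comma scores higher.
            if lp(prev, ",") + lp(",", nxt) > lp(prev, nxt):
                out.append(",")
            out.append(nxt)
        return " ".join(out)

    print(punctuate("i know however that works".split()))
    # -> "i know , however , that works"
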
Citation:
D. Beeferman, A. Berger, J. Lafferty (1998). Cyberpunc: A lightweight punctuation annotation system for speech. IEEE Conference on Acoustics, Speech and Signal Processing. Seattle, WA.

Online documents:
Conference paper [ps]

 

Experiments in spoken document retrieval at CMU

We describe our submission to the TREC-7 Spoken Document Retrieval (SDR) track, including the underlying speech recognition and information retrieval engines. We present SDR evaluation results and a brief analysis. A few developments are also described in greater detail, including:
  • a new, probabilistic retrieval engine based on language models,
  • a new, TFIDF-based weighting function that incorporates word error probability (see the sketch after this abstract), and
  • the use of a simple confidence estimate for word probability based on speech recognition lattices.
Although improvements over a development test set were promising, the new techniques failed to yield significant gains on the evaluation test set.
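
A sketch of the second development listed above: replace integer term counts with each recognized word's probability of being correct, so acoustically unreliable words contribute less to a document's TFIDF weights. The confidences and idf values below are invented.

    # TFIDF with soft term frequencies from recognition confidence.
    doc = [("balkan", 0.9), ("peace", 0.8), ("peas", 0.3), ("talks", 0.7)]
    idf = {"balkan": 2.5, "peace": 1.8, "peas": 2.2, "talks": 1.5}

    soft_tf = {}
    for word, p_correct in doc:
        soft_tf[word] = soft_tf.get(word, 0.0) + p_correct  # expected count

    weights = {w: tf * idf[w] for w, tf in soft_tf.items()}
    print(weights)  # "peas" (a likely misrecognition) is down-weighted
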
Citation:
M. Siegler, A. Berger, M. Witbrock, A. Hauptmann (1998). Experiments in spoken document retrieval at CMU. Proceedings of TREC-7, the Seventh Text Retrieval Conference.

Online documents:
Conference paper [pdf]