Publications

  • Automating the Construction of Internet Portals with Machine Learning
    Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
    To appear in Information Retrieval, Kluwer Academic Publishers. 2000.

    [ps.gz]

    Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.

  • Learning Hidden Markov Model Structure for Information Extraction
    Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld.
    AAAI'99 Workshop on Machine Learning for Information Extraction. 1999.
    [ps.gz]
    You can get a copy of the data sets used in this work here.
    Statistical machine learning techniques, while well proven in fields such as speech recognition, are just beginning to be applied to the information extraction domain. We explore the use of hidden Markov models for information extraction tasks, specifically focusing on how to learn model structure from data and how to make the best use of labeled and unlabeled data. We show that a manually-constructed model that contains multiple states per extraction field outperforms a model with one state per field, and discuss strategies for learning the model structure automatically from data. We also demonstrate that the use of distantly-labeled data to set model parameters provides a significant improvement in extraction accuracy. Our models are applied to the task of extracting important fields from the headers of computer science research papers, and achieve an extraction accuracy of 92.9%.
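
    As an illustration of how a hidden Markov model can label header tokens, here is a minimal Viterbi decoding sketch. It is not the paper's implementation: the two states, the toy probabilities, the `unk` log-probability for unseen words, and the function names are all assumptions made for this example, and it uses one state per field rather than the multi-state structure the abstract reports as more accurate.

```python
import math

def to_log(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

def viterbi(obs, states, log_start, log_trans, log_emit, unk=-20.0):
    """Return the most likely state sequence for a non-empty token list."""
    # best[t][s]: best log-probability of any path ending in state s at step t
    best = [{s: log_start[s] + log_emit[s].get(obs[0], unk) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s].get(obs[t], unk)
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy model (hypothetical numbers, for illustration only).
states = ["title", "author"]
log_start = to_log({"title": 0.9, "author": 0.1})
log_trans = {
    "title": to_log({"title": 0.7, "author": 0.3}),
    "author": to_log({"title": 0.1, "author": 0.9}),
}
log_emit = {
    "title": to_log({"learning": 0.5, "hidden": 0.5}),
    "author": to_log({"kristie": 0.5, "seymore": 0.5}),
}

tokens = ["learning", "hidden", "kristie", "seymore"]
path = viterbi(tokens, states, log_start, log_trans, log_emit)
```

    In this toy setup each token is assigned to the field whose state explains it best, given the transition structure.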

  • A Machine Learning Approach to Building Domain-Specific Search Engines
    Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
    Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99). 1999.

    [ps.gz]

    Building Domain-Specific Search Engines with Machine Learning Techniques (longer version)
    Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore.
    AAAI Spring Symposium on Intelligent Agents in Cyberspace. 1999.

    [ps.gz]
    Domain-specific search engines are growing in popularity because they offer increased accuracy and extra functionality not possible with the general, Web-wide search engines. For example, www.campsearch.com allows complex queries by age-group, size, location and cost over summer camps. Unfortunately these domain-specific search engines are difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, identifying informative text segments, and populating topic hierarchies. Using these techniques, we have built a demonstration system: a search engine for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com.

  • Nonlinear Interpolation of Topic Models for Language Model Adaptation
    Kristie Seymore, Stanley Chen and Ronald Rosenfeld,
    Proceedings of ICSLP98, December, 1998.

    [ps.gz] Updated details of this work will eventually be found here.
    Topic adaptation for language modeling is concerned with adjusting the probabilities in a language model to better reflect the expected frequencies of topical words for a new document. The language model to be adapted is usually built from large amounts of training text and is considered representative of the current domain. In order to adapt this model for a new document, the topic (or topics) of the new document are identified. Then, the probabilities of words that are more likely to occur in the identified topic(s) than in general are boosted, and the probabilities of words that are unlikely for the identified topic(s) are suppressed.

    We present a novel technique for adapting a language model to the topic of a document, using a nonlinear interpolation of n-gram language models. A three-way, mutually exclusive division of the vocabulary into general, on-topic and off-topic word classes is used to combine word predictions from a topic-specific and a general language model. We achieve a slight decrease in perplexity and speech recognition word error rate on a Broadcast News test set using these techniques. Our results are compared to results obtained through linear interpolation of topic models.

  • The 1997 CMU Sphinx-3 English Broadcast News Transcription System
    Kristie Seymore, Stanley Chen, Sam-Joo Doh, Maxine Eskenazi, Evandro Gouvea,
    Bhiksha Raj, Mosur Ravishankar, Ronald Rosenfeld, Matthew Siegler, Richard Stern, and Eric Thayer,
    Proceedings of the 1998 DARPA Speech Recognition Workshop, 1998.

    [ps.gz] [HTML]
    This paper describes the 1997 Hub-4 Broadcast News Sphinx-3 speech recognition system. This year's system includes full-bandwidth acoustic models trained on Broadcast News and Wall Street Journal acoustic training data, an expanded vocabulary, and a 4-gram language model for N-best list rescoring. The system structure, acoustic and language models, and adaptation components are described in detail, and results are presented to establish the contributions of multiple recognition passes. Additionally, experimental results are presented for several different acoustic and language model configurations.

  • Topic Adaptation for Language Modeling Using Unnormalized Exponential Models
    Stanley Chen, Kristie Seymore and Ronald Rosenfeld,
    Proceedings of ICASSP '98, 1998.

    [ps.gz]
    In this paper, we present novel techniques for performing topic adaptation on an n-gram language model. Given training text labeled with topic information, we automatically identify the most relevant topics for new text. We adapt our language model toward these topics using an exponential model, by adjusting probabilities in our model to agree with those found in the topical subset of the training data. For efficiency, we do not normalize the model; that is, we do not require that the probabilities in the language model sum to 1. With these techniques, we were able to achieve a modest reduction in speech recognition word-error rate in the Broadcast News domain.
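
    The effect of skipping normalization can be sketched in a few lines. This is only an illustration of the general idea, not the paper's model: the unigram setting, the per-word weights in `lambdas`, and all numbers are assumptions chosen for the example.

```python
import math

def adapt_scores(base_probs, lambdas):
    """Scale each base probability by exp(lambda_w), with no renormalization."""
    return {w: p * math.exp(lambdas.get(w, 0.0)) for w, p in base_probs.items()}

# Hypothetical base model and topic weights: positive weights boost
# on-topic words, negative weights suppress off-topic ones.
base = {"the": 0.5, "stock": 0.1, "market": 0.1, "goal": 0.3}
lambdas = {
    "stock": math.log(2.0),
    "market": math.log(2.0),
    "goal": math.log(0.5),
}

scores = adapt_scores(base, lambdas)
total = sum(scores.values())  # generally not equal to 1
```

    Because no normalizing constant is computed, the adapted scores need not sum to 1. That is acceptable when the scores are only used to rank competing hypotheses, and it avoids a sum over the whole vocabulary for every history.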

  • Experiments in Spoken Document Retrieval at CMU
    Matthew Siegler, Michael Witbrock, Sean Slattery, Kristie Seymore,
    Rosie Jones, and Alex Hauptmann,
    Proceedings of TREC-6, The Sixth Text Retrieval Conference, 1997.

    [ps.gz]

  • Using Story Topics for Language Model Adaptation
    Kristie Seymore and Ronald Rosenfeld,
    Proceedings of Eurospeech '97, September 1997.

    [ps.gz]

    Large-scale Topic Detection and Language Model Adaptation (longer version)
    Kristie Seymore and Ronald Rosenfeld, Carnegie Mellon University Tech Report CMU-CS-97-152, June 1997.
    [ps.gz]
    The subject matter of any conversation or document can typically be described as some combination of elemental topics. We have developed a language model adaptation scheme that takes a piece of text, chooses the most similar topic clusters from a set of over 5000 elemental topics, and uses topic-specific language models built from the topic clusters to rescore N-best lists. We are able to achieve a 15% reduction in perplexity and a small improvement in WER by using this adaptation. We also investigate the use of a topic tree, where the amount of training data for a specific topic can be judiciously increased in cases where the elemental topic cluster has too few word tokens to build a reliably smoothed and representative language model. Our system is able to fine-tune topic adaptation by interpolating models chosen from thousands of topics, allowing for adaptation to unique, previously unseen combinations of subjects.
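
    The select-then-interpolate idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the system described above: it uses unigram models, cosine similarity over word counts as the similarity measure, fixed interpolation weights, and a probability floor for unseen words, and every name and number here is hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count mappings."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_topics(text_counts, topic_counts, k):
    """Return the names of the k topics most similar to the text."""
    ranked = sorted(
        topic_counts,
        key=lambda t: cosine(text_counts, topic_counts[t]),
        reverse=True,
    )
    return ranked[:k]

def interpolate(models, weights):
    """Linearly interpolate unigram models; weights should sum to 1."""
    vocab = set().union(*models)
    return {
        w: sum(wt * m.get(w, 0.0) for m, wt in zip(models, weights))
        for w in vocab
    }

def perplexity(model, tokens, floor=1e-6):
    """Unigram perplexity, flooring unseen-word probabilities."""
    log_sum = sum(math.log(max(model.get(w, 0.0), floor)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

# Toy data (hypothetical).
topic_counts = {
    "finance": Counter({"stock": 5, "market": 4}),
    "sports": Counter({"goal": 5, "team": 4}),
}
topic_models = {
    "finance": {"stock": 0.5, "market": 0.4, "rally": 0.1},
    "sports": {"goal": 0.5, "team": 0.4, "rally": 0.1},
}
general = {"stock": 0.1, "market": 0.1, "goal": 0.1, "team": 0.1, "rally": 0.6}

text = "stock market rally".split()
chosen = top_topics(Counter(text), topic_counts, k=1)
adapted = interpolate([general, topic_models[chosen[0]]], [0.5, 0.5])
```

    On this toy text the adapted model assigns lower perplexity than the general model alone, which is the kind of effect the abstract reports at much larger scale.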

  • Language and Pronunciation Modeling in the CMU 1996 Hub 4 Evaluation
    Kristie Seymore, Stanley Chen, Maxine Eskenazi and Ronald Rosenfeld,
    Proceedings of the 1997 ARPA Speech Recognition Workshop, 1997.

    [ps.gz] [HTML]
    We describe several language and pronunciation modeling techniques that were applied to the 1996 Hub 4 Broadcast News transcription task. These include topic adaptation, the use of remote corpora, vocabulary size optimization, n-gram cutoff optimization, modeling of spontaneous speech, handling of unknown linguistic boundaries, higher order n-grams, weight optimization in rescoring, and lexical modeling of phrases and acronyms.

  • Scalable Backoff Language Models
    Kristie Seymore and Ronald Rosenfeld,
    ICSLP96, October, 1996.

    [ps.gz]

    Scalable Trigram Backoff Language Models (longer version)
    Kristie Seymore and Ronald Rosenfeld,
    Carnegie Mellon University Tech Report CMU-CS-96-139, May 1996.

    [ps.gz]
    When a trigram backoff language model is created from a large body of text, trigrams and bigrams that occur only a few times in the training text are often excluded from the model in order to decrease the model size. Generally, the elimination of n-grams with very low counts is believed not to significantly affect model performance. This project investigates the degradation of a trigram backoff model's perplexity and word error rates as bigram and trigram cutoffs are increased. The advantage of reduction in model size is compared to the increase in word error rate and perplexity scores.

    More importantly, this project also investigates alternative ways of excluding bigrams and trigrams from a backoff language model, using criteria other than the number of times an n-gram occurs in the training text. Specifically, a difference method has been investigated where the difference in the logs of the original and backed off trigram and bigram probabilities is used as a basis for n-gram exclusion from the model. We show that excluding trigrams and bigrams based on a weighted version of this difference method results in better perplexity and word error rate performance than excluding trigrams and bigrams based on counts alone.
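
    The contrast between the two exclusion criteria can be sketched as below. This is an illustration only: the abstract does not say how the difference is weighted, so weighting by the n-gram's training count is an assumption here, as are the threshold values, the function names, and the toy probabilities.

```python
import math

def prune_by_count(ngrams, cutoff):
    """Keep n-grams whose training count exceeds the cutoff."""
    return {g for g, (count, p, p_bo) in ngrams.items() if count > cutoff}

def weighted_difference(count, p, p_backoff):
    """Assumed weighting: count times the log-probability difference."""
    return count * (math.log(p) - math.log(p_backoff))

def prune_by_weighted_difference(ngrams, threshold):
    """Keep n-grams whose removal would change the model the most."""
    return {
        g
        for g, (count, p, p_bo) in ngrams.items()
        if weighted_difference(count, p, p_bo) > threshold
    }

# Each entry: (training count, explicit probability, backed-off probability).
# All values are hypothetical.
trigrams = {
    ("new", "york", "city"): (3, 0.60, 0.05),
    ("of", "the", "the"): (1, 0.001, 0.0009),
    ("in", "the", "end"): (10, 0.02, 0.019),
}

kept_by_count = prune_by_count(trigrams, cutoff=1)
kept_by_diff = prune_by_weighted_difference(trigrams, threshold=1.0)
```

    In this toy example the count cutoff keeps the frequent trigram "in the end" even though backing off would predict it almost as well, while the difference-based criterion drops it and keeps only the trigram whose explicit probability differs sharply from its backed-off estimate.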

  • The 1996 Hub-4 Sphinx-3 System
    Paul Placeway, Stanley Chen, Maxine Eskenazi,
    Uday Jain, Vipul Parikh, Bhiksha Raj, Mosur Ravishankar, Ronald Rosenfeld,
    Kristie Seymore, Matthew Siegler, Richard Stern and Eric Thayer,
    Proceedings of the 1997 ARPA Speech Recognition Workshop, 1997.

    [HTML]
    This paper describes the CMU Sphinx-3 system, and the configuration we used for the 1996 DARPA (Hub-4) evaluation. The model structure, acoustic modeling, language modeling, lexical modeling, and system structure are summarized. We also discuss the experimental results obtained with this system on the most recent DARPA evaluation, along with some subsequent results.


    last updated on 4/6/99