Research interests: The application of information theory to
real-world problems involving text. Specifically, my thesis work
focused on applying information theory to problems in information
retrieval, including the ranking, classification, translation and
summarization of documents. Other interests include machine learning,
statistical inference, data mining, speech processing, coding
Selected publications (including PhD thesis) are here.
- The StART project: Statistical methods applied to retrieval
technology. Looking at real-world problems in information retrieval,
such as classification, retrieval and summarization of documents, from
a statistical perspective.
- Statistical Machine Translation: I was a member of the Candide group
at the IBM Watson Research Center for several years in the early 1990s.
The group's mission was
to explore the possibilities of fully automatic translation from one
language (say, French) to another (say, English) by having a computer
inspect a large collection of translated data and, from the collection,
"learn" how to translate. John Lafferty and I have
revived and extended that work in the Weaver project here at CMU.
- Language Modelling: Predicting the next word in a sequence of English text,
and related questions. Language models form
an integral part of commercial speech and handwriting recognition systems, and
have recently been put to use in document retrieval systems.
- Maximum Entropy and Exponential Models:
Almost all of the work I do on language modelling falls into the maxent/minimum
divergence framework. This page contains information mostly of a tutorial
nature on the use of discrete exponential models in natural language