Document retrieval is the task of ranking a collection of documents by relevance to a query. Commercial Internet search engines such as Lycos and Excite devote considerable resources to solving the problem of discovering, among the millions of documents comprising the World Wide Web, those web pages most relevant to a user's query. An old and still pressing issue in text retrieval is how to account for the fact that language is nuanced, and words and meanings aren't in a one-to-one relation: polysemy refers to the case where a word has multiple meanings, and synonymy to the case where two words have the same meaning. A retrieval algorithm which doesn't account for the polysemous nature of "lemon" may err in ranking as highly relevant to the query "lemon law" a document on FDA citrus-growing regulations; likewise, an algorithm not accounting for synonymy might, for the query "automobiles," overlook documents containing "car" but not "automobile." Inspired by recent work in statistical machine translation, we have developed algorithms which account for such lexical ambiguities when ranking documents by relevance to a query.
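
To make the translation analogy concrete, here is a minimal sketch of one way a document could be scored by how likely it is to "translate" into the query. It is an illustration under stated assumptions, not the algorithm referred to above: the scoring formula, the TRANSLATION table, its made-up probabilities, and the names translate_prob, score, and rank are all hypothetical.

```python
from collections import Counter

# Hypothetical word-to-word probabilities t(query_word | doc_word).
# The numbers are invented for illustration; a real system would
# estimate them from data.
TRANSLATION = {
    "car": {"car": 0.6, "automobile": 0.3, "automobiles": 0.1},
    "automobile": {"automobile": 0.6, "automobiles": 0.2, "car": 0.2},
    "lemon": {"lemon": 0.9, "citrus": 0.1},
}


def translate_prob(query_word, doc_word):
    """t(query_word | doc_word); unlisted words translate only to themselves."""
    table = TRANSLATION.get(doc_word)
    if table is None:
        return 1.0 if query_word == doc_word else 0.0
    return table.get(query_word, 0.0)


def score(query, document, floor=1e-6):
    """Assumed model: p(query | doc) = prod_q sum_w t(q | w) * p(w | doc)."""
    doc_words = document.lower().split()
    counts = Counter(doc_words)
    total = len(doc_words)
    result = 1.0
    for q in query.lower().split():
        p = sum(translate_prob(q, w) * c / total for w, c in counts.items())
        result *= max(p, floor)  # floor keeps one unmatched word from zeroing the product
    return result


def rank(query, documents):
    """Return documents sorted from most to least relevant under score()."""
    return sorted(documents, key=lambda d: score(query, d), reverse=True)
```

With this table, a document that mentions only "car" still receives a nonzero score for the query "automobiles," which is the synonymy behavior described above; plain word matching would score it zero.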

Document classification is the task of learning, from a collection of categorized documents, how to assign categories to documents. Classifying documents by computer---for instance, automatically assigning index terms to medical research papers---has been of interest to information scientists for many years, but has gained greater importance with the growth of the Web. Internet-related classification research has addressed the problem of learning to collect interesting postings to electronic discussion groups based on a user's predilections, automatically classifying web pages by content, and suggesting web pages to a user based on his or her expressed preferences. We've applied a technique developed in the artificial intelligence and statistics literature called error-correcting output coding (ECOC) to the classification problem. Early results indicate that ECOC considerably outperforms Naive Bayes classification, a popular and accurate classification method.
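
The core idea of ECOC can be sketched briefly: each category is assigned a binary codeword, one binary classifier is trained per bit, and a new document is assigned the category whose codeword is closest to the predicted bits. The sketch below shows only the coding and decoding steps. The four categories, the 7-bit codewords, and the function names are assumptions chosen for illustration, and the per-bit classifiers (Naive Bayes or any other binary learner) are left out.

```python
# Hypothetical 7-bit codewords for four categories. Every pair of
# codewords differs in 4 positions, so one wrong bit can be corrected.
CODEWORDS = {
    "sports":   (0, 0, 0, 0, 0, 0, 0),
    "politics": (1, 1, 1, 1, 0, 0, 0),
    "science":  (1, 1, 0, 0, 1, 1, 0),
    "business": (1, 0, 1, 0, 1, 0, 1),
}


def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def bit_targets(labels, bit):
    """Binary training targets for the classifier responsible for one bit."""
    return [CODEWORDS[label][bit] for label in labels]


def decode(bits):
    """Category whose codeword is nearest (in Hamming distance) to `bits`."""
    return min(CODEWORDS, key=lambda label: hamming(CODEWORDS[label], bits))
```

For example, decode((1, 1, 0, 0, 1, 0, 0)) still returns "science" even though the sixth bit classifier made a mistake; this tolerance to individual classifier errors is what the "error-correcting" in the name refers to.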

Text segmentation is the problem of automatically partitioning text into coherent segments. For instance, a long, undivided stream of text automatically transcribed, via a speech recognition system, from news broadcasts may need to be partitioned into individual stories in order to be useful. We have developed a prototype text segmenting system based on feature-based exponential models. The system first assigns to every position in text a probability that that position represents a good place to "cut," and then uses these probabilities to decide where to actually place the divisions.
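
The second stage, turning per-position probabilities into actual divisions, can be illustrated with a small sketch. The greedy rule used here, along with the threshold and min_gap parameters and the function names, is an assumption made for illustration; the text above does not specify how the prototype chooses its cuts, and the probabilities would come from the exponential model rather than being typed in by hand.

```python
def place_cuts(probs, threshold=0.5, min_gap=3):
    """Pick cut positions from per-position boundary probabilities.

    Greedy rule (an assumption): visit candidates from most to least
    probable and keep one only if it lies at least `min_gap` positions
    away from every cut already kept.
    """
    candidates = [i for i, p in enumerate(probs) if p >= threshold]
    candidates.sort(key=lambda i: probs[i], reverse=True)
    cuts = []
    for i in candidates:
        if all(abs(i - c) >= min_gap for c in cuts):
            cuts.append(i)
    return sorted(cuts)


def split(sentences, cuts):
    """Split a list of sentences so that each cut index starts a new segment."""
    segments, start = [], 0
    for c in cuts:
        if c > start:  # ignore a cut at position 0 or a repeated cut
            segments.append(sentences[start:c])
            start = c
    segments.append(sentences[start:])
    return segments
```

Given probs = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.7, 0.2], place_cuts keeps positions 2 and 6 and drops position 3, which is probable but too close to the stronger candidate at 2. Separating the scoring stage from the placement stage in this way lets constraints such as a minimum segment length be applied without retraining the model.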

Language modeling is the problem of predicting the next word in a sequence of natural language text. Language models form an integral part of modern speech recognition systems, for instance. One area of active research is how to dynamically alter a model based on the topic of the text. We've constructed a prototype language modeling system which conducts "real-time research" on the Internet to update its model. We've also built statistical models to describe how some words augur others in text: seeing "stock," for instance, makes a later occurrence of "share" more likely.
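
One simple way to quantify how strongly one word augurs another is to compare how often the second word appears shortly after the first with how often it appears shortly after any position at all. The sketch below computes that ratio from a list of tokens. It is a stand-in for illustration, not the statistical models mentioned above; the function name trigger_lift, the window parameter, and the choice of a plain ratio are all assumptions.

```python
def trigger_lift(tokens, trigger, target, window=5):
    """Ratio of P(target in next `window` words | trigger here)
    to P(target in next `window` words) over all positions.

    Returns None when the trigger never occurs or the target is never
    found in any window, since the ratio is then undefined.
    """
    after_trigger = hits_after_trigger = 0
    positions = hits = 0
    for i, tok in enumerate(tokens):
        following = tokens[i + 1 : i + 1 + window]
        found = target in following
        positions += 1
        hits += found  # bools count as 0 or 1
        if tok == trigger:
            after_trigger += 1
            hits_after_trigger += found
    if after_trigger == 0 or hits == 0:
        return None
    return (hits_after_trigger / after_trigger) / (hits / positions)
```

A value above 1 suggests that seeing the trigger makes the target more likely than usual in the following words, and a value near 1 suggests no relationship. Estimates from small samples are noisy, so a real model would need far more text and some form of smoothing.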