Document classification is the task of learning, from a collection of categorized documents, how to assign categories to new documents. Classifying documents by computer---for instance, automatically assigning index terms to medical research papers---has been of interest to information scientists for many years, and has gained greater importance with the growth of the Web. Internet-related classification research has addressed the problems of learning to collect interesting postings to electronic discussion groups based on a user's predilections, automatically classifying web pages by content, and suggesting web pages to a user based on his or her expressed preferences. We've applied error-correcting output coding (ECOC), a technique developed in the artificial intelligence and statistics literature, to the classification problem. Early results indicate that ECOC considerably outperforms Naive Bayes classification, a popular and accurate classification method.
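To make the ECOC idea concrete, here is a minimal sketch rather than the system described above. Each category gets a binary codeword, one binary classifier is trained per codeword bit, and a document is assigned the category whose codeword is nearest in Hamming distance to the predicted bits. The four categories, the toy documents, the 7-bit codebook, and the choice of a perceptron as the per-bit learner are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical 7-bit codewords for four made-up categories. Every pair of
# rows differs in 4 positions, so a single wrong bit can still be corrected.
CODEBOOK = {
    "sports":  (1, 1, 1, 1, 1, 1, 1),
    "finance": (0, 0, 0, 0, 1, 1, 1),
    "health":  (0, 0, 1, 1, 0, 0, 1),
    "tech":    (0, 1, 0, 1, 0, 1, 0),
}

# Toy training set (illustrative only): each document is a list of words.
TRAIN = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "score"], "sports"),
    (["stock", "share", "market"], "finance"),
    (["market", "bank", "profit"], "finance"),
    (["doctor", "patient", "drug"], "health"),
    (["drug", "trial", "clinic"], "health"),
    (["software", "chip", "network"], "tech"),
    (["network", "server", "code"], "tech"),
]

def hamming(a, b):
    """Number of positions at which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(bits):
    """Return the category whose codeword is closest to the predicted bits."""
    return min(CODEBOOK, key=lambda label: hamming(CODEBOOK[label], bits))

def predict_bit(model, words):
    """Apply one binary classifier: 1 if the linear score is positive."""
    weights, bias = model
    return 1 if bias + sum(weights[w] for w in words) > 0 else 0

def train_bit_classifier(docs, bits, epochs=20):
    """Perceptron over word occurrences, standing in for any binary learner."""
    weights, bias = Counter(), 0
    for _ in range(epochs):
        for words, bit in zip(docs, bits):
            if predict_bit((weights, bias), words) != bit:
                step = 1 if bit == 1 else -1
                for w in words:
                    weights[w] += step
                bias += step
    return weights, bias

def train_ecoc(examples):
    """Train one binary classifier per codeword bit."""
    docs = [words for words, _ in examples]
    n_bits = len(next(iter(CODEBOOK.values())))
    return [
        train_bit_classifier(docs, [CODEBOOK[label][i] for _, label in examples])
        for i in range(n_bits)
    ]

def classify(models, words):
    """Predict every bit, then decode to the nearest codeword."""
    return decode(tuple(predict_bit(m, words) for m in models))

MODELS = train_ecoc(TRAIN)
print(classify(MODELS, ["stock", "share", "market"]))
```

The redundancy in the codewords is the point of the technique: because the rows are far apart, a mistake by one of the seven binary classifiers does not necessarily change the decoded category.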
Text segmentation is the problem of automatically partitioning text into coherent segments. For instance, a long, undivided stream of text automatically transcribed, via a speech recognition system, from news broadcasts may need to be partitioned into individual stories in order to be useful. We have developed a prototype text segmentation system based on feature-based exponential models. The system first assigns to every position in the text a probability that the position is a good place to "cut," and then uses these probabilities to decide where to place the divisions.
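The two-stage procedure can be sketched as follows. This is an illustration under stated assumptions, not the prototype itself: the two features, their weights, the cue-word list, and the greedy cut-selection rule with a minimum gap are all hypothetical. With two outcomes (cut or no cut), an exponential model over weighted features reduces to the logistic form used here.

```python
import math

# Hypothetical feature weights; a real system would estimate them from data.
WEIGHTS = {"bias": -2.0, "cue_word_follows": 3.0, "no_shared_words": 2.5}
CUE_WORDS = {"meanwhile", "elsewhere", "turning"}  # illustrative only

def features(sentences, i):
    """Features for the candidate cut just before sentence i (i >= 1)."""
    before, after = set(sentences[i - 1]), set(sentences[i])
    return {
        "bias": 1.0,
        "cue_word_follows": 1.0 if sentences[i][0] in CUE_WORDS else 0.0,
        "no_shared_words": 1.0 if not (before & after) else 0.0,
    }

def boundary_probability(feats):
    """Two-outcome exponential model: p(cut) = 1 / (1 + exp(-sum(w * f)))."""
    score = sum(WEIGHTS[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))

def segment(sentences, threshold=0.5, min_gap=2):
    """Stage 1: score every position. Stage 2: greedily keep the most
    probable cuts above the threshold that are at least min_gap apart."""
    probs = {
        i: boundary_probability(features(sentences, i))
        for i in range(1, len(sentences))
    }
    cuts = []
    for i in sorted(probs, key=probs.get, reverse=True):
        if probs[i] < threshold:
            break
        if all(abs(i - c) >= min_gap for c in cuts):
            cuts.append(i)
    return sorted(cuts), probs

# Toy transcript: three short "stories" with no marked divisions.
SENTENCES = [
    ["the", "senate", "passed", "the", "budget"],
    ["the", "budget", "vote", "was", "close"],
    ["meanwhile", "storms", "hit", "the", "coast"],
    ["the", "storms", "caused", "flooding"],
    ["stocks", "fell", "sharply", "today"],
    ["investors", "sold", "stocks"],
]

CUTS, PROBS = segment(SENTENCES)
print(CUTS)
```

Separating scoring from cut placement lets the second stage enforce global constraints, such as a minimum segment length, that a purely local probability cannot express.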
Language modeling is the problem of predicting the next word in a sequence of natural language text. Language models form an integral part of modern speech recognition systems, for instance. One area of active research is how to dynamically adapt a model to the topic of the text. We've constructed a prototype language modeling system which conducts "real-time research" on the Internet to update its model. We've also built statistical models to describe how some words augur others in text---seeing "stock," for instance, means that "share" is subsequently more likely.
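One simple way to quantify how one word augurs another is to compare how often the second word appears shortly after the first with how often it appears shortly after an arbitrary position. The sketch below computes that ratio on a toy corpus. The function name `trigger_lift`, the five-word window, and the corpus are assumptions for illustration, and the statistic is a stand-in rather than the models referred to above.

```python
def follows_within(tokens, i, target, window):
    """True if target occurs among the `window` tokens after position i."""
    return target in tokens[i + 1 : i + 1 + window]

def trigger_lift(tokens, trigger, target, window=5):
    """Ratio of P(target soon follows | trigger at this position) to the
    base rate P(target soon follows an arbitrary position).
    Returns None when either probability cannot be estimated."""
    positions = range(len(tokens))
    base_hits = sum(follows_within(tokens, i, target, window) for i in positions)
    trigger_positions = [i for i in positions if tokens[i] == trigger]
    if base_hits == 0 or not trigger_positions:
        return None
    base_rate = base_hits / len(tokens)
    conditional = sum(
        follows_within(tokens, i, target, window) for i in trigger_positions
    ) / len(trigger_positions)
    return conditional / base_rate

# Toy corpus (illustrative only).
TOKENS = (
    "the stock rose and each share gained value while the weather stayed "
    "mild and the stock fell so every share lost value"
).split()

print(trigger_lift(TOKENS, "stock", "share"))
print(trigger_lift(TOKENS, "weather", "share"))
```

A ratio above 1 suggests that seeing the trigger raises the chance of the target appearing soon afterward; a language model could use such pairs to boost the target word's probability once the trigger has been observed.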