Thesis Research - Abstract

In an era where electronic text is growing exponentially and time is a critical resource, it has become virtually impossible for any user to browse or read large numbers of individual documents. It is therefore important to explore methods that allow users to locate and browse information quickly within collections of documents. Automatic text summarization of multiple documents serves such information-seeking goals by giving the user a way to quickly view highlights and/or relevant portions of document collections. As yet, there has been little work on multi-document summarization, although single-document summarization has been a subject of focus in the last few years. Multi-document summarization differs from single-document summarization in that the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. If multi-document summarization is to be useful across subject areas and languages, it must be relatively independent of natural language understanding. A statistical approach offers this independence and allows for rapid passage selection. The maximal marginal relevance (MMR) metric is used to provide "relevant" novelty in passage selection, i.e., selecting passages that are relevant to a query while reducing redundancy and maximizing diversity among the selected passages. The approach builds on previous work in single-document summarization by using additional available information about the document set as a whole and the relationships between the documents, as well as properties of individual documents. The underlying framework is modular, allowing easy parameterization to take into account different document genres or corpus characteristics, user requirements, and linguistic properties of languages that can enhance summarization results.
The principal question being addressed is "Can multi-document summarization effectively indicate the textual content of document collections and help users rapidly find their desired information?" I will explore this question by evaluating the system in the domains of newswire articles, web pages, and, time permitting, computer science technical reports.
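To make the MMR idea in the abstract concrete, the following is a minimal sketch of greedy MMR passage selection. It is not the thesis system. It assumes the commonly used formulation in which each candidate passage is scored as lam * (similarity to the query) minus (1 - lam) * (maximum similarity to any passage already selected). The names bag_of_words, cosine, mmr_select and lam are hypothetical, and the whitespace tokenizer and raw term-frequency cosine are stand-ins for whatever similarity measures a real system would use.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased term-frequency vector (assumed stand-in for a real tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, passages, k, lam=0.5):
    """Greedily pick up to k passage indices by marginal relevance.

    lam = 1.0 ranks by query relevance alone; lower values penalize
    passages that resemble ones already selected.
    """
    q = bag_of_words(query)
    vecs = [bag_of_words(p) for p in passages]
    selected = []
    candidates = list(range(len(passages)))

    def score(i):
        relevance = cosine(vecs[i], q)
        # Redundancy is the closest match among already selected passages.
        redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


passages = [
    "earthquake hits city center",
    "earthquake hits city center",
    "rescue teams reach earthquake survivors",
]
picked = mmr_select("earthquake city", passages, k=2, lam=0.5)
```

With lam=0.5 the duplicate of the first passage is penalized heavily enough that the less relevant but novel third passage is chosen second. With lam=1.0 the selection reduces to plain relevance ranking and the duplicate is kept.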

Committee:
Jaime Carbonell (Chair)
Jamie Callan
Vibhu Mittal
Jan Pedersen