Summarization in Lemur


Contents

  1. Overview
  2. Applications
  3. Summarization API

1. Overview

The Lemur summarization library includes an abstract class for general automatic summary generation. This class is also useful for users who want to build a prototype or evaluation system. Two sample methods are implemented: a basic sentence selection algorithm and one based on Maximum Marginal Relevance (MMR). Both are designed with generic summarization in mind, although they can also produce query-based summaries. The goal of the abstract class is to provide a reasonable structure in which summarization algorithms that work off of a Lemur index can be swapped easily.

2. Applications

BasicSummApp

This application demonstrates the simplest summarizer one can create with the provided API. The sentence selection algorithm is a quick scoring algorithm that scores every passage in a document. The application then "pulls" the passages back out of the summarizer.

NOTE: This summarizer attempts to locate end-of-sentence markers in the document vector for a particular document. This is currently done by looking for the token "*eos". If no such tokens are found, it chops the document into sequential passages of a fixed length and scores those. The file webparser_extended.l is provided if you wish to use it. Replacing the parser with one generated from this lex file will translate <s> tokens in a source document into *eos tokens for you. It will also identify titles in HTML documents by inserting the special token *title prior to terms in the document vectors that appear inside the HTML <title> tag.
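The segmentation rule described in the note can be sketched as follows. This is only an illustration under stated assumptions, not Lemur's implementation: the function name splitIntoPassages, the Passage alias, and the fixedLen parameter are hypothetical, and the document is modeled as a plain vector of token strings rather than a Lemur document vector. The "*eos" token is treated purely as a delimiter and is not kept in any passage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types and names for illustration; not part of the Lemur API.
using Passage = std::vector<std::string>;

// Split a token sequence into passages. If at least one "*eos" marker is
// present, passages are the token runs between markers. Otherwise the
// sequence is chopped into consecutive chunks of fixedLen tokens, with any
// shorter remainder kept as a final passage.
std::vector<Passage> splitIntoPassages(const std::vector<std::string>& tokens,
                                       std::size_t fixedLen) {
  assert(fixedLen > 0);  // a zero-length chunk size makes no sense
  const std::string kEos = "*eos";

  bool sawEos = false;
  for (const auto& t : tokens) {
    if (t == kEos) { sawEos = true; break; }
  }

  std::vector<Passage> passages;
  Passage current;
  for (const auto& t : tokens) {
    if (sawEos) {
      if (t == kEos) {
        // Close the current sentence; skip empty runs between markers.
        if (!current.empty()) passages.push_back(current);
        current.clear();
      } else {
        current.push_back(t);
      }
    } else {
      current.push_back(t);
      if (current.size() == fixedLen) {
        passages.push_back(current);
        current.clear();
      }
    }
  }
  // Keep trailing tokens that were not closed by a marker or a full chunk.
  if (!current.empty()) passages.push_back(current);
  return passages;
}
```

For example, the tokens a b *eos c *eos yield two passages (a b and c), while a b c d e with no markers and fixedLen = 2 yields three passages, the last holding only e.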

MMRSummApp

This application demonstrates a more complex summarizer that compares passages with one another. The algorithm requires a query, as it is query-based by nature, although if none is provided it will auto-generate one appropriate for the document. The application itself is nevertheless a simple program; the complexity is encapsulated in the class MMRSumm.

NOTE: See the note above regarding *eos and *title markers. This implementation also uses pronoun identification in the same way, if available; the algorithm will work without it. The previously mentioned webparser_extended.l will recognize a <pronoun> tag in a source document, assuming it appears just prior to a pronoun in the text, as used by this application.
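To make the passage-to-passage comparison concrete, here is a minimal sketch of greedy MMR selection in its commonly cited form: at each step, pick the unselected passage that maximizes lambda * sim(passage, query) - (1 - lambda) * max sim(passage, already selected). Everything specific here is an assumption for illustration: the names TermVec, cosine, and mmrSelect, the use of cosine similarity over term-weight maps, and the lambda parameter. This document does not state which similarity measure or weighting MMRSumm uses, and automatic query generation is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation: term -> weight. Not a Lemur type.
using TermVec = std::map<std::string, double>;

// Cosine similarity between two sparse term vectors (0 if either is empty).
double cosine(const TermVec& a, const TermVec& b) {
  double dot = 0.0, na = 0.0, nb = 0.0;
  for (const auto& kv : a) {
    na += kv.second * kv.second;
    auto it = b.find(kv.first);
    if (it != b.end()) dot += kv.second * it->second;
  }
  for (const auto& kv : b) nb += kv.second * kv.second;
  if (na == 0.0 || nb == 0.0) return 0.0;
  return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Greedy MMR: returns the indices of up to k passages in selection order.
// lambda = 1 ranks purely by relevance to the query; smaller values
// penalize redundancy with already selected passages more heavily.
std::vector<std::size_t> mmrSelect(const std::vector<TermVec>& passages,
                                   const TermVec& query,
                                   std::size_t k, double lambda) {
  assert(lambda >= 0.0 && lambda <= 1.0);
  std::vector<std::size_t> selected;
  std::vector<bool> used(passages.size(), false);

  while (selected.size() < k && selected.size() < passages.size()) {
    std::size_t best = passages.size();  // sentinel: nothing chosen yet
    double bestScore = 0.0;
    for (std::size_t i = 0; i < passages.size(); ++i) {
      if (used[i]) continue;
      // Redundancy is the highest similarity to any selected passage.
      double redundancy = 0.0;
      for (std::size_t j : selected) {
        redundancy = std::max(redundancy, cosine(passages[i], passages[j]));
      }
      double score = lambda * cosine(passages[i], query)
                     - (1.0 - lambda) * redundancy;
      // Ties keep the earlier passage.
      if (best == passages.size() || score > bestScore) {
        best = i;
        bestScore = score;
      }
    }
    used[best] = true;
    selected.push_back(best);
  }
  return selected;
}
```

With two identical passages and a third that only partly overlaps the query, a low lambda makes the second pick the partly overlapping passage rather than the duplicate, which is the behavior that distinguishes MMR from plain relevance ranking.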

3. Summarization API

The basic Summarizer class provides a generic interface for various summary generation techniques. It describes essentially two ways for an application to access each summarizer. In the first, the application specifies a predetermined summary length (in number of passages); the summarizer determines what those passages are and hands them back to the application all at once. The second is iterative: the application can continually request subsequent passages from the summarizer. Two implementations are provided to demonstrate how to use this interface: BasicSumm, which implements a simple sentence selection algorithm, and MMRSumm, which implements an MMR algorithm that includes automatic query generation for generic summaries. Passage is a simple container for passages (often sentences or fixed-length word sequences), which are the basic unit of each summary. The classes BasicPassage and MMRPassage are tailored implementations for BasicSumm and MMRSumm respectively.
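The two access patterns can be illustrated with a small stand-in interface. This is a sketch only: SketchSummarizer, RankedSummarizer, ScoredPassage, and the methods summarize, next, and rewind are hypothetical names chosen for this example and are not the signatures of Lemur's Summarizer, BasicSumm, or Passage classes. Passage scores are supplied by the caller here, standing in for whatever scoring a real summarizer performs against the index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical passage container: text plus a precomputed score.
struct ScoredPassage {
  std::string text;
  double score;
};

// Hypothetical abstract interface exposing both access patterns.
class SketchSummarizer {
 public:
  virtual ~SketchSummarizer() = default;
  // Fixed-length mode: return the best numPassages passages at once.
  virtual std::vector<ScoredPassage> summarize(std::size_t numPassages) = 0;
  // Iterative mode: write the next-best passage to out; false when exhausted.
  virtual bool next(ScoredPassage& out) = 0;
  // Restart iteration from the best passage.
  virtual void rewind() = 0;
};

// Toy implementation: rank passages by score, highest first.
class RankedSummarizer : public SketchSummarizer {
 public:
  explicit RankedSummarizer(std::vector<ScoredPassage> passages)
      : ranked_(std::move(passages)), cursor_(0) {
    // stable_sort keeps document order among passages with equal scores.
    std::stable_sort(ranked_.begin(), ranked_.end(),
                     [](const ScoredPassage& a, const ScoredPassage& b) {
                       return a.score > b.score;
                     });
  }

  std::vector<ScoredPassage> summarize(std::size_t numPassages) override {
    const std::size_t n = std::min(numPassages, ranked_.size());
    return std::vector<ScoredPassage>(
        ranked_.begin(), ranked_.begin() + static_cast<std::ptrdiff_t>(n));
  }

  bool next(ScoredPassage& out) override {
    assert(cursor_ <= ranked_.size());  // cursor never runs past the end
    if (cursor_ == ranked_.size()) return false;
    out = ranked_[cursor_++];
    return true;
  }

  void rewind() override { cursor_ = 0; }

 private:
  std::vector<ScoredPassage> ranked_;
  std::size_t cursor_;
};
```

An application written against the abstract class can call summarize(n) when it knows the summary length up front, or loop on next() until it returns false. Swapping in a different algorithm then only means constructing a different subclass, which is the structure the Summarizer abstraction is meant to provide.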


The Lemur Project
Last modified: Fri Feb 13 18:28:24 EST 2004