Text Segmentation Using Exponential Models


This paper introduces a new statistical approach to automatically partitioning text into coherent segments. Our proposed model enlists both long-range and short-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the model consults a set of simple lexical hints that it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing community, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm, together with evaluation under this new metric, demonstrates the effectiveness of our approach in two very different domains: Wall Street Journal articles and broadcast news transcripts.
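
The abstract does not define the proposed error metric, but its description (probabilistically motivated, a replacement for precision and recall) is consistent with a window-based error probability: the chance that two positions a fixed distance apart are classified inconsistently, as same-segment or different-segment, by the reference and hypothesized segmentations. The sketch below is an illustrative implementation of such a metric, not the paper's exact definition; the function name `pk_metric`, the boundary representation, and the default window size are assumptions made for this example.

```python
def pk_metric(ref_bounds, hyp_bounds, n, k=None):
    """Illustrative window-based segmentation error (an assumption,
    not the paper's exact metric): the fraction of position pairs
    (i, i+k) that the reference and hypothesis classify differently
    as belonging to the same segment or to different segments.

    ref_bounds, hyp_bounds: sets of indices at which a new segment
    begins; n: number of positions (e.g., sentences) in the text.
    """
    if k is None:
        # A common convention: half the mean reference segment length.
        k = max(1, n // (2 * (len(ref_bounds) + 1)))

    def segment_ids(bounds):
        # Assign each position the id of the segment containing it.
        ids, cur = [], 0
        for i in range(n):
            if i in bounds and i > 0:
                cur += 1
            ids.append(cur)
        return ids

    ref, hyp = segment_ids(ref_bounds), segment_ids(hyp_bounds)
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(n - k)
    )
    return errors / (n - k)
```

Because the score is an error probability estimated over a sliding window, near-miss boundaries incur only a partial penalty, whereas precision and recall treat a boundary placed one position off as both a false positive and a false negative.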