Text Segmentation in the Informedia Project

 

Faculty Mentor:

 

Alex Hauptmann (alex@cs.cmu.edu)

Students:

 

Zhirong Wang (Zhirong@cs)

Jichuan Chang (cjc@cs)

Ningning Hu (hnn@cs)

 

 

1. Problem Description

Our research will address the problem of story segmentation in the Informedia Digital Video Library project, focusing on how to use closed-captioning cues to automatically detect story boundaries and segment the word stream into coherent regions of text.

With a collection of text data annotated with the boundaries between regions, the problem of text segmentation is to learn how to place breaks in unlabeled text by observing a set of labeled examples. In our source data, the boundaries are denoted with paragraph separators “>>>”.
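
As an illustration (not part of the proposal itself), labeled examples can be derived from the annotated text by splitting on the separator. Below is a minimal Python sketch; `label_boundaries` is a hypothetical helper name chosen here for clarity:

```python
def label_boundaries(text, sep=">>>"):
    """Turn ">>>"-annotated text into (word, starts_new_story)
    training pairs: a word is labeled positive iff it immediately
    follows a separator."""
    segments = [seg.split() for seg in text.split(sep) if seg.strip()]
    examples = []
    for k, words in enumerate(segments):
        for i, w in enumerate(words):
            # only the first word of a non-initial segment starts a story
            examples.append((w, k > 0 and i == 0))
    return examples
```

Each pair then serves as one labeled example for a learner that predicts whether a boundary precedes a given word.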

Segmentation is an integral and critical process in the Informedia digital video library. The success of information retrieval in Informedia hinges on an important assumption: that we can segment the whole news broadcast into individual paragraphs or stories. For example, given a large unpartitioned collection of expository text and a user's query, the system should return a collection of coherent segments matching the query. Without a segmentation technique, an IR application may be able to locate positions in its database that are strongly matched to the user's query, but be unable to determine how much of the surrounding data to provide to the user.

One method of segmenting news broadcasts is to use text (such as transcripts) as an aid. Usually closed captioning accompanies the news broadcast, with hints in it denoting story boundaries. How can we learn from this information to segment text into stories? [1] points out that a seemingly simple problem can actually prove quite difficult to automate, and that a tool for partitioning a stream of text (or multimedia) into coherent regions would be of great benefit to a number of existing applications.

Training data is provided by the news provider and stored on Informedia disks, with paragraph separators (>>>) denoting story boundaries. For example, cwv199910091830.cc contains the closed-captioning text for CNN WorldView on Oct. 9, 1999. The closed-captioning text contains (possibly incorrect) timing information, and each newscast lasts about 18:30 (mm:ss). Below is an example:

000076 ANDRIA: FROM CNN IN ATLANTA,

000079 SEEN LIVE AROUND THE WORLD, THIS

000081 IS "WORLDVIEW." I'M ANDRIA

000082 HALL. >>> THE SEARCH FOR

000084 SURVIVORS CONTINUES IN SOUTHEASTERN

000087 MEXICO, WHERE THE DEATH TOLL IS

000088 RISING BY THE HOUR. IT'S NOW

000089 CLOSE TO 300. RESCUERS ARE

000091 DIGGING THROUGH A SEA OF MUD,

000093 SEARCHING FOR HUNDREDS WHO ARE

000094 MISSING. THE U.S. STATE DEPARTMENT,

000096 WARNING TRAVELERS TO USE

000099 EXTREME CAUTION IN THE REGION.

We can remove all the separators to produce testing data. The goal of our work is to build an algorithm and system that can restore the separators to their original positions. Since the separators themselves are not always correct, ideally the system should even improve on the closed-caption boundaries.
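
A minimal sketch of this test-data preparation, assuming boundaries are represented as word offsets (the names here are illustrative, not from the proposal):

```python
def make_test_data(labeled_text, sep=">>>"):
    """Strip all separators, returning the bare word stream plus the
    word offsets where the gold-standard boundaries stood."""
    words, gold = [], []
    for tok in labeled_text.split():
        if tok == sep:
            gold.append(len(words))  # boundary falls before the next word
        else:
            words.append(tok)
    return words, gold

# Tiny example in the style of the WorldView excerpt above.
words, gold = make_test_data("I'M ANDRIA HALL. >>> THE SEARCH FOR")
```

This assumes the separator is whitespace-delimited, as in the sample transcript; the gold offsets are kept aside for scoring the system's output.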

People have tried various approaches to text segmentation. [1] introduces a new statistical approach to automatically partitioning text into coherent segments; we will use its probabilistically motivated error metric to measure our results. [2] and [6] use the TextTiling approach to segment text into multi-paragraph units by topic. [3] concentrates on automatic segmentation of stories from news broadcasts using phrase templates; its BNN system is heavily tailored toward a specific news show format, namely CNN Prime News. In this project, we will use machine learning and information retrieval techniques to attack the problem.

2. Research Goals

The goal of our research is to find effective learning algorithms, as well as features, for text segmentation. We will focus on two methods for the segmentation task: topic-change detection and adjacent-sentence feature extraction. For both methods, we will try different learning algorithms, such as decision trees, neural networks, and SVMs, and compare their results.
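
To make the adjacent-sentence idea concrete, the sketch below (illustrative only; the proposal does not fix a feature set) computes a simple lexical-overlap feature between neighboring sentences. A low overlap suggests a topic change, and vectors of such features would feed the classifiers named above:

```python
def overlap_features(left, right):
    """Jaccard word overlap between two adjacent sentences: near 0
    when the vocabularies are disjoint (a likely topic change)."""
    a, b = set(left.lower().split()), set(right.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

# Candidate break points taken from the WorldView excerpt above.
sents = ["I'M ANDRIA HALL.",
         "THE SEARCH FOR SURVIVORS CONTINUES IN SOUTHEASTERN MEXICO.",
         "THE DEATH TOLL IS RISING BY THE HOUR."]
feats = [overlap_features(sents[i], sents[i + 1])
         for i in range(len(sents) - 1)]
```

A classifier (decision tree, neural network, SVM) would then be trained on one such feature vector per candidate break point, with the gold boundaries as labels.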

One simple way of testing is to compare the result of our segmentation with the original closed captioning. There may be several kinds of errors: (1) incorrect location of a detected boundary; (2) a boundary not detected; (3) an extra boundary detected. We will use the error metric introduced in [1] as well as the traditional precision and recall metrics to measure the output.
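
The metric of [1] can be sketched as follows: for a window size k, it estimates the probability that two positions k words apart are classified inconsistently (same story in one segmentation, different stories in the other). A minimal Python sketch, assuming boundaries are given as word offsets (the function names are ours, not from [1]):

```python
def to_labels(n_words, boundaries):
    """Turn boundary offsets into a per-word segment id."""
    labels, seg = [], 0
    for i in range(n_words):
        if i in boundaries:
            seg += 1
        labels.append(seg)
    return labels

def pk(ref_bounds, hyp_bounds, n_words, k):
    """Window-based error: fraction of position pairs (i, i+k) on
    which reference and hypothesis disagree about being in the same
    segment. Assumes k < n_words."""
    ref = to_labels(n_words, set(ref_bounds))
    hyp = to_labels(n_words, set(hyp_bounds))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(n_words - k))
    return errors / (n_words - k)
```

Unlike raw precision and recall, this metric gives partial credit to near-miss boundaries, which matters when closed-caption timing is noisy.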

In this project, we will try to answer the following questions:

(1)   Which method performs best in our implementation and testing, and why?

(2)   Is our segmentation better than the segmentation currently used in the Informedia Digital Video Library project, and why?

(3)   How does our method compare with the latest text segmentation techniques, and what explains the differences?

We will regard our work on this project as successful if we find a method that produces a satisfactory error rate. If none of the methods we try proves good enough, we will provide an analysis explaining why.

3. Project Plan

A project plan, outlining in detail what type of experiments will be performed and when (take your best guess). Also describe other means of achieving the project goal, such as consulting specific literature or talking to specific experts.

4. Individual Tasks

A description of the tasks that will be performed by the individual team member. Our requirement: Each team member has to engage in interesting machine learning work, so having one student program the software interface while another runs machine learning experiments is not a good division of labor.

 

References

[1] Beeferman, D., Berger, A., and Lafferty, J., "Text Segmentation Using Exponential Models," Proceedings of Empirical Methods in Natural Language Processing, AAAI 97, Providence, RI, 1997

[2] Hearst, M.A. and Plaunt, C., "Subtopic Structuring for Full-Length Document Access," Proc. ACM SIGIR-93 Int'l Conf. on Research and Development in Information Retrieval, pp. 59-68, Pittsburgh, PA, 1993.

[3] Merlino, A., Morey, D., and Maybury, M., "Broadcast News Navigation Using Story Segmentation," ACM Multimedia 1997, November 1997.

 

[4] Hauptmann, A., Witbrock, M., "Story Segmentation and Detection of Commercials in Broadcast News Video," ADL-98 Advances in Digital Libraries Conference, Santa Barbara, CA., April 22-24, 1998


[5] Beeferman, D., Berger, A., and Lafferty, J., "Statistical Models for Text Segmentation," Machine Learning, 34(1-3), Special Issue on Natural Language Learning (C. Cardie and R. Mooney, eds.), pp. 177-210, 1999.

 

[6] Hearst, M.A., "Multi-Paragraph Segmentation of Expository Text," Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.