Faculty Mentor: Alex Hauptmann (alex@cs.cmu.edu)
Students: Jichuan Chang (cjc@cs.cmu.edu), Ningning Hu (hnn@cs.cmu.edu), Zhirong Wang (zhirong@cs.cmu.edu)
In this paper, we report our experience of building a text segmenter for the Informedia project. Several methods were applied in our project; their experimental results are compared and analyzed. Given the application background of the Informedia project, we chose relaxed error metrics for performance evaluation.
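A relaxed error metric of the kind mentioned above can be sketched as follows. This is a minimal illustration under our own assumptions (function names and the tolerance parameter are ours, not from the project code): a predicted boundary counts as correct if it falls within a small tolerance of a true boundary, measured in sentences.

```python
# Hedged sketch of a relaxed recall/precision metric for boundary
# detection. `predicted` and `actual` are lists of sentence indices.

def relaxed_pr(predicted, actual, tol=1):
    """Return (recall, precision) with a +/- tol sentence tolerance."""
    # An actual boundary is "found" if some prediction lies within tol of it.
    hit_actual = sum(1 for a in actual
                     if any(abs(a - p) <= tol for p in predicted))
    # A predicted boundary is "correct" if some actual boundary is within tol.
    hit_pred = sum(1 for p in predicted
                   if any(abs(p - a) <= tol for a in actual))
    recall = hit_actual / len(actual) if actual else 0.0
    precision = hit_pred / len(predicted) if predicted else 0.0
    return recall, precision
```

With `tol=0` this reduces to the usual exact-match recall and precision.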
In this section we briefly discuss some previous approaches to the text segmentation problem.
2.1 Exponential Models
Beeferman’s paper “Text Segmentation Using Exponential Models” [3] introduces a new statistical approach to automatically partitioning text into coherent segments. The approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data.
2.2 Machine Learning
Litman’s paper “Combining Multiple Knowledge Sources for Discourse Segmentation” [9] predicts discourse segment boundaries from linguistic features of utterances, using a corpus of spoken narratives as data. The paper presents two methods for developing segmentation algorithms from training data: hand tuning and machine learning. When multiple types of features are used, results approach human performance on an independent test set (both methods) and under cross-validation (machine learning).
2.3 Lexical cohesion
Kozima’s paper “Text Segmentation Based on Similarity Between Words” [10] proposes a new indicator of text structure, called the lexical cohesion profile (LCP), which locates segment boundaries in a text. A text segment is a coherent scene; the words in a segment are linked together via lexical cohesion relations. LCP records the mutual similarity of words in a sequence of text. The similarity of words, which represents their cohesiveness, is computed using a semantic network. Comparison with the text segments marked by a number of subjects shows that LCP correlates closely with human judgments. LCP may also provide valuable information for resolving anaphora and ellipsis.
Kozima generalizes lexical
cohesiveness to apply to a window of text, and plots the cohesiveness of
successive text windows in a document, identifying the valleys in the measure
as segment boundaries.
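The valley-finding step can be sketched roughly as follows. This is a simplified illustration under our own assumptions, not Kozima’s actual procedure: given a cohesion profile over successive windows, we report local minima that dip sufficiently below the surrounding peaks.

```python
# Hedged sketch: find "valleys" in a cohesion profile. The depth
# threshold is an illustrative parameter of our own choosing.

def valleys(profile, depth=0.1):
    """Indices i where profile[i] is a local minimum lying at least
    `depth` below both the highest point to its left and to its right."""
    out = []
    for i in range(1, len(profile) - 1):
        if profile[i] <= profile[i - 1] and profile[i] <= profile[i + 1]:
            left_peak = max(profile[:i])
            right_peak = max(profile[i + 1:])
            if min(left_peak, right_peak) - profile[i] >= depth:
                out.append(i)
    return out
```

Each returned index would then be proposed as a segment boundary.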
2.4 Text Tiling
Hearst’s paper “Multi-paragraph Segmentation of Expository Text” [8] describes TextTiling, an algorithm for partitioning expository texts into coherent multi-paragraph discourse units which reflect the subtopic structure of the texts. The algorithm uses domain-independent lexical frequency and distribution information to recognize the interactions of multiple simultaneous themes.
Two fully implemented
versions of the algorithm are described and shown to produce segmentation that
corresponds well to human judgments of the major subtopic boundaries of
thirteen lengthy texts.
A cosine measure is used to
gauge the similarity between constant-size blocks of morphologically analyzed
tokens. First-order rates of change of this measure are then calculated to
decide the placement of boundaries between blocks, which are then adjusted to
coincide with the paragraph segmentation, provided as input to the algorithm.
This approach leverages the observation that text segments are dense with repeated content words. Relying on this fact, however, may limit precision, because the repetition of concepts within a document is subtler than can be recognized by only a “bag of words” tokenizer and morphological filter.
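The block-comparison step at the heart of this approach can be illustrated with a minimal sketch. This is our own simplification of TextTiling’s cosine comparison (ignoring morphological analysis and paragraph adjustment); the function name is illustrative.

```python
# Hedged sketch: cosine similarity between two blocks of tokens,
# each represented as a bag of word counts.
from collections import Counter
import math

def cosine(block_a, block_b):
    """Cosine similarity between two token lists via word counts."""
    a, b = Counter(block_a), Counter(block_b)
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Plotting this score for successive block pairs and looking at its first-order rate of change gives the boundary signal the paragraph above describes.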
2.5 Dragon’s Approach to Text Segmentation
Allan’s paper “Topic Detection and Tracking Pilot Study Final Report” [11] describes several approaches to text segmentation; one is from Dragon. Dragon's approach to
segmentation is to treat a story as an instance of some underlying topic, and
to model an unbroken text stream as an unlabeled sequence of these topics. In
this model, finding story boundaries is equivalent to finding topic
transitions. At a certain level of abstraction, identifying topics in a text
stream is similar to recognizing speech in an acoustic stream. Each topic block
in a text stream is analogous to a phoneme in speech recognition, and each word
or sentence (depending on the granularity of the segmentation) is analogous to
an “acoustic frame”. Identifying the sequence of topics in
an unbroken transcript
therefore corresponds to recognizing phonemes in a continuous speech stream.
Just as in speech recognition, this situation is subject to analysis using classic
Hidden Markov Model (HMM) techniques, in which the hidden states are topics and
the observations are words or sentences.
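Under this analogy, boundary detection reduces to decoding the most likely hidden topic sequence. The following is a minimal log-space Viterbi sketch under our own assumptions (plain-dict probability tables, word-level observations); Dragon’s actual topic models are far richer.

```python
# Hedged sketch: Viterbi decoding where hidden states are topics and
# observations are words. Probability tables are plain dicts; a tiny
# floor avoids log(0) for unseen words.
import math

def viterbi(obs, topics, start_p, trans_p, emit_p):
    """Return the most likely topic sequence for a word sequence."""
    floor = 1e-12
    lp = lambda p: math.log(max(p, floor))
    V = [{t: lp(start_p[t]) + lp(emit_p[t].get(obs[0], 0)) for t in topics}]
    back = []  # back-pointers, one dict per step after the first
    for w in obs[1:]:
        row, ptr = {}, {}
        for t in topics:
            best = max(topics, key=lambda s: V[-1][s] + lp(trans_p[s][t]))
            row[t] = V[-1][best] + lp(trans_p[best][t]) + lp(emit_p[t].get(w, 0))
            ptr[t] = best
        V.append(row)
        back.append(ptr)
    last = max(topics, key=lambda t: V[-1][t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Story boundaries would then be placed wherever consecutive topics in the decoded path differ.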
· Rainbow: Rainbow is a statistical text classification tool built on top of the Bow library. It first builds an index file by counting words in the training data; then an SVM classifier is trained and used to classify the testing data. In our experiment, its performance is similar to Naive Bayes classification (although achieved in much longer time).
· SVMlight: SVMlight is an SVM tool built for text classification. It accepts vector (of word counts) or sparse-vector input. After counting the number of distinct words, we realized that even for an SVM, there are too many features to train a classifier.
In order to reduce the dimension of the feature space to several hundred, we decided to choose only the words with the highest average mutual information. To simplify the computation, we actually chose words from the sentences sitting just before and after the boundary which have the largest difference between their occurrence probability in boundary blocks and background blocks.
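This selection heuristic can be sketched as follows. It is a simplified stand-in for average mutual information, under our own assumptions (function and variable names are ours): rank words by the absolute difference of their per-block occurrence probabilities in boundary versus background blocks.

```python
# Hedged sketch: pick the k words whose occurrence probability differs
# most between boundary blocks and background blocks.
from collections import Counter

def select_features(boundary_blocks, background_blocks, k=10):
    """Blocks are lists of token lists; returns the top-k feature words."""
    def probs(blocks):
        # Document-frequency style: count each word once per block.
        c = Counter(w for b in blocks for w in set(b))
        n = len(blocks)
        return {w: c[w] / n for w in c}
    p_bound = probs(boundary_blocks)
    p_back = probs(background_blocks)
    vocab = set(p_bound) | set(p_back)
    return sorted(vocab,
                  key=lambda w: abs(p_bound.get(w, 0) - p_back.get(w, 0)),
                  reverse=True)[:k]
```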
The result is disappointing, mainly because SVMlight takes too long to learn. For a simple case with 500 training examples (actually 3 passages), the training process finished in 15 minutes (our machine runs Linux on an Intel Pentium III box). The training time increases quickly with the training data size: when we increased the training set to 1600 examples (about 8 passages), it failed to finish after 15 hours [1].
· Number of topics
· Size of sliding window
Table 1 shows the results of segmenters built with different sliding window sizes and numbers of topics. These results show that clustering into more topics improves the overall classification accuracy; limited by Rainbow’s processing speed, however, we only had time to obtain these results.
| Topics \ Window size | Recall | | | Precision | | |
| | 4 | 6 | 8 | 4 | 6 | 8 |
| 8 | 0.321839 | 0.256177 | 0.311724 | 0.196568 | 0.248424 | 0.365696 |
| 16 | 0.421456 | 0.360208 | 0.380690 | 0.198198 | 0.267633 | 0.353846 |
Table 1. Segmentation performance using the topic change detection method
We cut our news stories into small passages with a fixed window. These passages were manually classified into two classes: if a passage includes a boundary sentence, it is labeled “yes” (meaning there is a boundary in this passage); otherwise it is labeled “no”. We divided the dataset into two parts, one for training and the other for testing. The testing data includes CNN World View 2000 from July to October.
Our objective is then to use these two categories of data to build a classifier. Since Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents, we applied a Naive Bayes classifier to our problem.
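A minimal version of this window-classification setup might look like the following. This is a from-scratch multinomial Naive Bayes sketch under our own assumptions, not the exact tool we used; windows are token lists and labels are the “yes”/“no” classes described above.

```python
# Hedged sketch: multinomial Naive Bayes with add-one smoothing for the
# two-class ("yes" = window contains a boundary) setup.
from collections import Counter
import math

class NaiveBayes:
    def fit(self, windows, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for w, c in zip(windows, labels):
            self.counts[c].update(w)
        self.vocab = set(w for c in self.classes for w in self.counts[c])
        return self

    def predict(self, window):
        def score(c):
            # Log prior plus smoothed log likelihood of each word.
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / total) for w in window)
        return max(self.classes, key=score)
```

Each fixed-size window from the transcript would be fed to `predict`, and a “yes” output proposes a boundary inside that window.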
| | 0.263 | 0.516 |
| | 0.331 | 0.648 |
| 2 | 0.336 | 0.772 | 0.383 | 0.749 |
Table 2. Performance of ANN segmentation
(1) SVM: Rainbow and SVMlight

| | Rainbow | SVMlight |
| Recall | 0.07 | ?? |
| Precision | 0.223 | ?? |
(2) ANN
2.1. Impact of Training Data Distribution
Figure 3. Impact of Training Data Distribution
| 100 Input Units | No Stop Words | No Stop Words |
| Recall | 0.597 | 0.721 | 0.705 | 0.846 |
| Precision | 0.201 | 0.115 | 0.327 | 0.208 |
| | 100 Input Units (No merge) | 200 Input Units |
| Recall | 0.597 | 0.628 | 0.705 | 0.739 |
| Precision | 0.201 | 0.137 | 0.327 | 0.219 |
| | %Y = 25% | %Y = 33% | %Y = 50% |
| Recall | 0.589 | 0.777 | 0.888 |
| Precision | 0.122 | 0.100 | 0.009 |
Correct: 55313 out of 65359 (84.63 percent accuracy)

| | Class name | No | Yes | Total | Acc (%) |
| 0 | No | 48330 | 7177 | 56507 | 87.30 |
| 1 | Yes | 2869 | 5983 | 8852 | 67.59 |

Confusion details: rows are actual classes, columns are predicted. Average accuracy: 84.63%.
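The overall and per-class accuracies in such a table follow directly from the raw confusion counts; the helper below is an illustrative sketch of our own (not part of the project toolchain) showing the computation on a generic 2x2 matrix with rows as actual classes and columns as predicted.

```python
# Hedged sketch: derive overall and per-class accuracy from a confusion
# matrix (rows = actual, columns = predicted).

def confusion_stats(matrix, classes):
    """Return (overall_accuracy, {class: per_class_accuracy})."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(classes)))
    per_class = {classes[i]: matrix[i][i] / sum(matrix[i])
                 for i in range(len(classes))}
    return correct / total, per_class
```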
| 30 words (Fixed window size) | |
| Recall | 0.68 |
| Precision | 0.45 |
(6) Performance of different segmentation methods
Figure 4. Segmentation Accuracy
References
[1] Y. Yang, T. Ault, T. Pierce and C. Lattimer, “Improving text categorization methods for event tracking”, Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 65-72, 2000.
[2] T. Mitchell, Machine Learning, McGraw-Hill, 1997.
[3] D. Beeferman, A. Berger and J. Lafferty, “Text Segmentation Using Exponential Models”, Proceedings of Empirical Methods in Natural Language Processing, AAAI 97, Providence, RI, 1997.
[4] M. A. Hearst and C. Plaunt, “Subtopic structuring for full-length document access”, Proceedings of the ACM SIGIR-93 Int’l Conf. on Research and Development in Information Retrieval, pp. 59-68, Pittsburgh, PA, 1993.
[5] A. Merlino, D. Morey and M. Maybury, “Broadcast News Navigation using Story Segmentation”, ACM Multimedia 1997, November 1997.
[6] A. Hauptmann and M. Witbrock, “Story Segmentation and Detection of Commercials in Broadcast News Video”, ADL-98 Advances in Digital Libraries Conference, Santa Barbara, CA, April 22-24, 1998.
[7] J. Lafferty, A. Berger and D. Beeferman, “Statistical Models for Text Segmentation”, Special Issue on Natural Language Learning, C. Cardie and R. Mooney, eds., 34(1-3), pp. 177-210, 1999.
[8] M. A. Hearst, “Multi-paragraph segmentation of expository text”, Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 1994.
[9] D. Litman and R. J. Passonneau, “Combining Multiple Knowledge Sources for Discourse Segmentation”, Proceedings of the ACL, 1995.
[10] H. Kozima, “Text Segmentation Based on Similarity between Words”, Proceedings of the ACL, 1993.
[11] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron and Y. Yang, “Topic Detection and Tracking Pilot Study Final Report”, Proceedings of the Broadcast News Transcription and Understanding Workshop (sponsored by DARPA), Feb. 1998.
| Without Stop Word Removal | | After Stop Word Removal | |
| CNN | OF | CNN | WORLDVIEW |
| THE | A | REPORTS | __USA__ |
| OF | I | CAPTIONING | UNITED |
| IT | THAT | __NUM__ | CLINTON |
| TO | AND | CLOSED | THINK |
| REPORTS | IN | WORLDVIEW | PRESIDENT |
| AND | IT | ADDITION | STATES |
| THAT | WORLDVIEW | REPORTING | __NUM__ |
| IS | THEY | BELL | CNN |
| THIS | WE | LONDON | SPACE |
| CAPTIONING | __USA__ | PROVIDED | WORLD |
| ON | ON | ORDERED | IRAQ |
| A | YOU | THINK | PRINCESS |
| __NUM__ | HE | HEART | DAY |
| THEY | BE | JERUSALEM | DIANA |
| WE | THERE | ATLANTIC | WELL |
| ARE | DO | WASHINGTON | ISRAEL |
| CLOSED | TO | WALTER | NEW |
| BY | HAVE | RODGERS | JUDY |
| THERE | FOR | PEOPLE | AHEAD |
| WORLDVIEW | UNITED | COMMUNICATION | GET |
| AT | THIS | WHITE | TODAY |
| I | WILL | THANK | FIRST |
| DO | CLINTON | CORRESPONDENT | MILITARY |
| ADDITION | FROM | MOSCOW | MONEY |