CMU REPORT ON STORY LINK DETECTION
Ralf Brown, Tom Pierce
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213
Presentation for the TDT-3 Workshop
1. SYSTEM DESCRIPTION: CMU-1
Our first system is based on cosine similarity between the two
stories under test (a sketch of the pipeline follows the list):
- documents are stop-worded, stemmed, and converted to binary term
vectors (multiple occurrences of a word are treated as a single
occurrence)
- term vectors are weighted by TF*IDF; the TF*IDF statistics are initialized
from dry-run data and incrementally updated on the test data
- cosine similarity is computed and optionally discounted by a time-based
decay (the more time has elapsed between stories, the lower the score)
- resultant score is thresholded with a split threshold (separate values
depending on whether both stories are from the same source)
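A minimal Python sketch of this pipeline, written from the description
above rather than from the actual CMU-1 code: the exponential decay
form, the threshold values, the story field names, and the add-one IDF
smoothing are all assumptions, and the incremental TF*IDF update is
omitted.

    import math

    def idf(term, doc_freq, num_docs):
        # Inverse document frequency; doc_freq maps each term to the number
        # of stories containing it (add-one smoothing assumed here).
        return math.log(num_docs / (1 + doc_freq.get(term, 0)))

    def weighted_vector(tokens, doc_freq, num_docs):
        # Binary term vector: each distinct term counts once; with a binary
        # "TF" the TF*IDF weight of a present term is just its IDF.
        return {t: idf(t, doc_freq, num_docs) for t in set(tokens)}

    def cosine(u, v):
        dot = sum(w * v[t] for t, w in u.items() if t in v)
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def linked(story_a, story_b, doc_freq, num_docs,
               same_source_threshold=0.20, cross_source_threshold=0.25,
               decay_rate=0.0):
        # story_* are dicts with pre-stemmed, stop-worded "tokens", a
        # "source" label, and a "time" in days (field names illustrative).
        u = weighted_vector(story_a["tokens"], doc_freq, num_docs)
        v = weighted_vector(story_b["tokens"], doc_freq, num_docs)
        score = cosine(u, v)
        # Optional time-based decay: the more time has elapsed between the
        # two stories, the lower the score (exponential form assumed).
        score *= math.exp(-decay_rate * abs(story_a["time"] - story_b["time"]))
        # Split threshold: separate cutoffs for same-source and
        # cross-source story pairs.
        threshold = (same_source_threshold
                     if story_a["source"] == story_b["source"]
                     else cross_source_threshold)
        return score >= threshold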
2. SYSTEM DESCRIPTION: CMU-2
Our second system is also based on cosine similarity, but differs
from CMU-1 in the following ways (the weighting change is sketched
after the list):
- term vectors are not binary
- uses (1+log(TF))*IDF instead of straight TF*IDF
- optional probabilistic modeler (not used)
- no time-based decay
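A corresponding sketch of the CMU-2 term weighting; as before, the
document-frequency smoothing and the function name are illustrative
assumptions, not taken from the system itself.

    import math
    from collections import Counter

    def log_tf_idf_vector(tokens, doc_freq, num_docs):
        # Non-binary term vector: raw term counts are kept and weighted by
        # (1 + log(TF)) * IDF instead of straight TF*IDF.
        counts = Counter(tokens)
        return {t: (1.0 + math.log(tf)) * math.log(num_docs / (1 + doc_freq.get(t, 0)))
                for t, tf in counts.items()}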
3. SYSTEM DESCRIPTION: COMMON CODE
Both systems use the same code library for all data access (loading
the story collection, etc.), for processing the test data, for
internal scoring, and so on. As a result, CMU-1 contains only about
500 lines of code specific to it -- the remainder is the common
library code.
The common code not only contains all the logic for processing the
SLD input data, it also has the (as yet unused) capability to invoke
multiple decision procedures on each story pair and to combine their
decisions in various ways (majority vote, weighted vote, all-but-one,
etc.), as sketched below.
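A sketch of how such a combination step might look; the report names
the combination modes but not their exact semantics, so the function
name, the tie handling, and the all-but-one rule below are assumptions.

    def combine_decisions(decisions, mode="majority", weights=None):
        # decisions: one boolean per decision procedure for a single story pair.
        if mode == "majority":
            return 2 * sum(decisions) > len(decisions)
        if mode == "weighted":
            # weights: one non-negative weight per decision procedure
            yes = sum(w for d, w in zip(decisions, weights) if d)
            return 2 * yes > sum(weights)
        if mode == "all-but-one":
            # declare a link unless more than one procedure dissents
            return sum(decisions) >= len(decisions) - 1
        raise ValueError("unknown combination mode: " + mode)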
4. PERFORMANCE - THEN
System | Transcription | Deferral | Clink  |
CMU-1  | ASR           |        1 | 1.1260 |
CMU-1  | ASR           |       10 | 1.0943 |
CMU-1  | ASR           |      100 | 1.0921 |
CMU-1  | manual        |        1 | 1.1477 |
CMU-1  | manual        |       10 | 1.1657 |
CMU-1  | manual        |      100 | 1.0974 |
CMU-2  | ASR           |       10 | 0.4667 |
5. PERFORMANCE - NOW
System | Data Set  | Transcription | Deferral | Clink  |
CMU-1  | dry run   | ASR           |       10 | 0.1399 |
CMU-1  | Dec.eval  | ASR           |       10 | 1.1320 |
CMU-1  | alternate | ASR           |       10 | 0.1392 |
CMU-2  | dry run   | ASR           |       10 | 0.1267 |
CMU-2  | Dec.eval  | ASR           |       10 | 1.2867 |
CMU-2  | alternate | ASR           |       10 | 0.1269 |
6. PERFORMANCE COMPARISON BY DATA SET
7. WHAT WENT WRONG?
Basically, the training set was not representative of the test set:
- Optimal decision thresholds are very different for the dry-run and
official evaluation sets, and nearly the same for the dry-run and
alternate data sets -- 0.65-0.75 on the dry-run and alternate sets
versus 0.18-0.22 on the evaluation set for CMU-1.
- The distribution of similarity scores differs.
- The dry-run and evaluation sets have different temporal
characteristics -- adding a time-based score decay hurts performance
on the dry-run set but helps on the evaluation set.
The dry-run data is representative of the alternate data set, which
was generated in the same manner as the dry-run set. When tuned on
the dry-run data, both of our systems performed as well on the
alternate set as on the dry-run data (or even slightly better).
In some sense, the evaluation data is harder than either of the other
two data sets, as even tuning on the evaluation set yields
Clink values three to four times higher than on either
of the other sets.
8. DISTRIBUTION OF SCORES
9. FUTURE WORK
- Design and implement additional link-detection strategies
- Create a multi-strategy SLD system from those new methods and one or
both existing methods
- Expand the split thresholds into separate thresholds for each possible
combination of sources (see the sketch below).
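One way such per-source-pair thresholds could be looked up (a sketch
only; the function name, source labels, and threshold values are
hypothetical):

    def pair_threshold(source_a, source_b, thresholds, default=0.2):
        # thresholds maps an unordered pair of sources to its decision
        # threshold; a same-source pair collapses to a one-element frozenset,
        # e.g. {frozenset({"CNN"}): 0.18, frozenset({"CNN", "ABC"}): 0.22}
        return thresholds.get(frozenset({source_a, source_b}), default)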