CMU REPORT ON STORY LINK DETECTION

Ralf Brown, Tom Pierce

Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213

Presentation for the TDT-3 Workshop



1. SYSTEM DESCRIPTION: CMU-1

Our first system is based on cosine similarity between the two stories under test; a sketch of these steps follows the list:
  1. documents are stop-worded, stemmed, and converted to binary term vectors (multiple occurrences of a word are treated as a single occurrence)
  2. term vectors are weighted by TF*IDF for the terms; TF*IDF initialized from dry-run data and incrementally updated on the test data
  3. cosine similarity is computed and optionally discounted by a time-based decay (the more time has elapsed between stories, the lower the score)
  4. resultant score is thresholded with a split threshold (separate values depending on whether both stories are from the same source)
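
The following is a minimal sketch of these four steps in Python, assuming the stories have already been stop-worded and stemmed into token lists. The function names, the exponential form of the decay, the default IDF weight, and the threshold values are illustrative assumptions, not the actual CMU-1 implementation.

    import math

    def cmu1_link_score(tokens_a, tokens_b, idf, days_apart=0.0, decay_rate=0.0):
        """Sketch of CMU-1 scoring: binary term vectors, TF*IDF weights
        (TF is 0/1 here, so the weight reduces to IDF), cosine similarity,
        and an optional time-based decay.  Illustrative only."""
        # Step 1: binary term vectors (repeated occurrences collapse to one)
        terms_a, terms_b = set(tokens_a), set(tokens_b)

        # Step 2: weight each present term by TF*IDF; with binary TF this is
        # just the term's IDF (unseen terms get an assumed default weight)
        vec_a = {t: idf.get(t, 1.0) for t in terms_a}
        vec_b = {t: idf.get(t, 1.0) for t in terms_b}

        # Step 3: cosine similarity between the weighted vectors
        dot = sum(vec_a[t] * vec_b[t] for t in terms_a & terms_b)
        norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
        norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
        score = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

        # Optional time-based decay: the larger the time gap between the
        # stories, the lower the score (exponential form assumed here)
        return score * math.exp(-decay_rate * days_apart)

    def cmu1_decision(score, same_source, same_thresh=0.75, diff_thresh=0.65):
        # Step 4: split threshold -- one value when both stories come from
        # the same source, another otherwise (values are made up)
        return score >= (same_thresh if same_source else diff_thresh)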



2. SYSTEM DESCRIPTION: CMU-2

Our second system is also based on cosine similarity, but differs from CMU-1 in the following ways (see the weighting sketch after this list):
  1. term vectors are not binary
  2. uses (1+log(TF))*IDF instead of straight TF*IDF
  3. optional probabilistic modeler (not used)
  4. no time-based decay
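
Only the weighting step changes relative to the sketch above; cosine similarity is computed the same way, with no decay applied afterwards. The snippet below is an illustrative sketch of the (1+log(TF))*IDF weighting on raw (non-binary) counts; the function name and default IDF weight are assumptions.

    import math
    from collections import Counter

    def cmu2_weights(tokens, idf):
        """Sketch of the CMU-2 term weighting: raw term frequencies
        weighted by (1 + log(TF)) * IDF.  Illustrative only."""
        counts = Counter(tokens)              # non-binary term counts
        return {t: (1.0 + math.log(c)) * idf.get(t, 1.0)
                for t, c in counts.items()}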



3. SYSTEM DESCRIPTION: COMMON CODE

Both systems use the same code library for all data access (loading the story collection, etc.), for processing the test data, for internal scoring, and so on. As a result, CMU-1 contains only about 500 lines of code specific to it -- the remainder is the common library code.

The common code not only contains all the logic for processing the SLD input data, but also has the (as yet unused) capability to invoke multiple decision procedures on each story pair and to combine their decisions in various ways (majority vote, weighted vote, all-but-one, etc.).
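
As a rough illustration of that combination capability, the sketch below merges per-strategy yes/no decisions by majority vote, weighted vote, or all-but-one. The function name and interface are hypothetical, not the library's actual API.

    def combine_decisions(decisions, method="majority", weights=None):
        """Combine boolean link decisions from several strategies.
        Hypothetical interface, for illustration only."""
        if method == "majority":
            return sum(decisions) * 2 > len(decisions)
        if method == "weighted":
            yes_weight = sum(w for d, w in zip(decisions, weights) if d)
            return yes_weight * 2 > sum(weights)
        if method == "all-but-one":
            return sum(decisions) >= len(decisions) - 1
        raise ValueError("unknown combination method: %s" % method)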



4. PERFORMANCE - THEN



System   Transcription   Deferral   Clink
CMU-1    ASR                  1     1.1260
CMU-1    ASR                 10     1.0943
CMU-1    ASR                100     1.0921
CMU-1    manual               1     1.1477
CMU-1    manual              10     1.1657
CMU-1    manual             100     1.0974
CMU-2    ASR                 10     0.4667



5. PERFORMANCE - NOW



System   Data Set    Transcription   Deferral   Clink
CMU-1    dry run     ASR               10       0.1399
CMU-1    Dec. eval   ASR               10       1.1320
CMU-1    alternate   ASR               10       0.1392
CMU-2    dry run     ASR               10       0.1267
CMU-2    Dec. eval   ASR               10       1.2867
CMU-2    alternate   ASR               10       0.1269



6. PERFORMANCE COMPARISON BY DATA SET



[Figure: DET curves for CMU-1]
[Figure: DET curves for CMU-2]



7. WHAT WENT WRONG?

Basically, the training set was not representative of the test set:
  1. Optimal decision thresholds are very different for the dry-run and official evaluation sets, but nearly the same for the dry-run and alternate data sets -- 0.65-0.75 on the dry-run and alternate sets versus 0.18-0.22 on the evaluation set for CMU-1.
  2. The distribution of similarity scores differs between the sets.
  3. The dry-run and evaluation sets have different temporal characteristics -- adding a time-based score decay hurts performance on the dry-run set but helps on the evaluation set.

The dry-run data is representative of the alternate data set, which was generated in the same manner as the dry-run set. When tuned on the dry-run data, both of our systems performed as well on the alternate set as on the dry-run data, or even slightly better.

In some sense, the evaluation data is harder than either of the other two data sets, as even tuning on the evaluation set yields Clink values three to four times higher than on either of the other sets.



8. DISTRIBUTION OF SCORES



[Figure: Score distribution on Dry Run data]
[Figure: Score distribution on Evaluation data]
[Figure: Score distribution on Alternate data]



9. FUTURE WORK

  1. Design and implement additional link-detection strategies
  2. Create a multi-strategy SLD system from those new methods and one or both existing methods
  3. Expand split thresholds into separate thresholds for each possible combination of sources (see the sketch below)
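
A minimal sketch of what item 3 could look like: the single same/different split is replaced by a threshold looked up per (unordered) pair of sources, with a fallback for unseen combinations. The source names and threshold values below are invented for illustration.

    # Hypothetical per-source-pair thresholds; names and values are made up.
    PAIR_THRESHOLDS = {
        ("ABC", "ABC"): 0.75,
        ("ABC", "VOA"): 0.68,
        ("PRI", "VOA"): 0.66,
    }

    def pair_threshold(source_a, source_b, default=0.65):
        """Look up the decision threshold for a pair of sources,
        falling back to a default for unseen combinations."""
        key = tuple(sorted((source_a, source_b)))
        return PAIR_THRESHOLDS.get(key, default)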