Dataset for Novelty and Redundancy Detection (CMUNRF1)

Introduction

We created a one-gigabyte dataset by combining AP News and Wall Street Journal data from TREC CDs 1, 2, and 3, available from the LDC. We chose these corpora because they are widely available, because information needs and relevance judgments are available from NIST, and because the two newswire corpora cover the same time period (1988 to 1990) and many of the same topics, guaranteeing a certain amount of redundancy in the document stream. Documents were ordered chronologically for filtering purposes.
50 TREC topics (101 to 150) were used to simulate user profiles.

We hired undergraduate students, who were otherwise unaffiliated with our research, to read the relevant documents for a profile in chronological order and to provide redundancy judgments. The decision to restrict their attention to relevant documents made the task more manageable, and was consistent with a filtering system where another component makes decisions about relevance.

Assessors made their judgments one topic at a time. They were instructed to make a decision for each document about whether the information it contained was redundant with document(s) seen previously for that topic, and to identify the prior document(s).

For most of the profiles (101-121, 123-125, 127-129, 132, 135-141), documents were judged by two independent assessors, who then resolved their differences.

Students reported that the choice of corpus ("old") and topics made this a dull task, so we were unable to collect assessments for all 50 topics. For some profiles (122, 133, 134, 143, 144, 146, 148, 149), documents were judged by only one assessor.

Metrics

We believe that in operational environments different people will have different definitions of redundancy and different redundancy thresholds. We modeled this environment by not giving assessors a precise definition of redundancy. Instead, we provided two degrees of redundancy, ABSOLUTE REDUNDANT and SOMEWHAT REDUNDANT. If the assessor thought a person would definitely not want to read a document because it contained absolutely no new information, the document was marked as ABSOLUTE REDUNDANT. If the assessor thought that a document had some new information that a person might want to read, even though much of the document was redundant with a prior document, the document could be marked as SOMEWHAT REDUNDANT. We assume that unmarked documents are NOVEL.

We assume that redundancy is transitive: if doc1 makes doc2 redundant and doc2 makes doc3 redundant, then doc1 makes doc3 redundant. The annotators worked under the same assumption and recorded only the direct relations:
doc1 makes doc2 redundant
doc2 makes doc3 redundant
Our post-processing program automatically added another entry to indicate that doc1 makes doc3 redundant.
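
The transitivity assumption above can be sketched in a few lines of Python. This is only an illustration, not the post-processing program used to build the dataset: the function name close_transitively and the dictionary layout (redundant document id mapped to the set of earlier documents that make it redundant) are assumptions made for the example, and the sketch ignores the degree of redundancy.

```python
def close_transitively(made_redundant_by):
    """Return a copy in which each document also lists indirect predecessors.

    made_redundant_by maps a redundant document id to the set of earlier
    document ids that make it redundant (hypothetical layout for this sketch).
    """
    closed = {doc: set(prior) for doc, prior in made_redundant_by.items()}
    changed = True
    while changed:  # repeat until no new entry can be added
        changed = False
        for doc, prior in closed.items():
            inherited = set()
            for p in prior:
                # Anything that makes p redundant also makes doc redundant.
                inherited |= closed.get(p, set())
            inherited -= prior
            inherited.discard(doc)
            if inherited:
                prior |= inherited
                changed = True
    return closed


# The annotators recorded only the two direct relations.
judgments = {"doc2": {"doc1"}, "doc3": {"doc2"}}
closed = close_transitively(judgments)
print(sorted(closed["doc3"]))  # doc1 is added alongside doc2
```

The loop runs to a fixed point, so chains longer than three documents are also closed.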

Files

You can download all the files here. You can also download each file individually:

redundancy.apwsj.results: This file contains redundancy judgments. An example of the redundancy assessments is shown below. The first field is a profile id. The second field is the document id of a redundant document. Subsequent document ids are the documents that preceded it and made it redundant. A ? after the redundant document id indicates that the document is SOMEWHAT REDUNDANT; otherwise it is ABSOLUTE REDUNDANT.

[q121 AP880214-0049 ? AP880214-0002 ] if user q121 read document AP880214-0002, then AP880214-0049 is somewhat redundant.
[q121 AP880217-0031 AP880216-0137 ] if user q121 read document AP880216-0137, then AP880217-0031 is absolutely redundant.
[q128 AP880218-0137 AP880218-0113 AP880218-0112 ] if user q128 read AP880218-0113 and AP880218-0112, then AP880218-0137 is absolutely redundant.
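
As a rough illustration of the record layout described above, the following Python sketch parses one judgment line. It assumes that fields are separated by whitespace and it tolerates the square brackets used in the examples, since it is not certain whether they appear in the file itself. The names RedundancyJudgment and parse_redundancy_line are hypothetical and are not part of the dataset.

```python
from typing import List, NamedTuple


class RedundancyJudgment(NamedTuple):
    profile: str           # e.g. "q121"
    doc_id: str            # the redundant document
    degree: str            # "SOMEWHAT REDUNDANT" or "ABSOLUTE REDUNDANT"
    prior_docs: List[str]  # earlier documents that make doc_id redundant


def parse_redundancy_line(line):
    """Parse one judgment record (assumed whitespace-separated fields)."""
    # Brackets are stripped in case they are part of the file format.
    fields = line.replace("[", " ").replace("]", " ").split()
    if len(fields) < 3:
        raise ValueError("expected profile, document id and prior document(s)")
    profile, doc_id, rest = fields[0], fields[1], fields[2:]
    if rest[0] == "?":
        degree = "SOMEWHAT REDUNDANT"
        rest = rest[1:]
    else:
        degree = "ABSOLUTE REDUNDANT"
    return RedundancyJudgment(profile, doc_id, degree, rest)


somewhat = parse_redundancy_line("[q121 AP880214-0049 ? AP880214-0002 ]")
absolute = parse_redundancy_line("[q128 AP880218-0137 AP880218-0113 AP880218-0112 ]")
print(somewhat.degree, somewhat.prior_docs)
print(absolute.degree, absolute.prior_docs)
```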

Topics: This file contains information about TREC topics 101-150 as provided by NIST. These topics were used to simulate user profiles.

apwsj.qrels: This file contains relevance judgments for TREC topics 101-150 as provided by NIST. An example of the relevance judgments is shown below. The first field is a profile id. The third field is a relevant document for that profile.

[q128 0 AP901228-0001 1]: document AP901228-0001 is relevant for profile q128.
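
A minimal sketch of reading these judgments in Python follows. The first and third fields are used as described above. Treating the fourth field as a relevance flag, where a value greater than zero means relevant, follows the usual TREC qrels convention and is an assumption here, as is the function name parse_qrels.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Map each profile id to the set of documents judged relevant for it."""
    relevant = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        profile, _second_field, doc_id, judgment = fields
        # Assumption: a fourth field greater than zero marks a relevant document.
        if int(judgment) > 0:
            relevant[profile].add(doc_id)
    return dict(relevant)


qrels = parse_qrels(["q128 0 AP901228-0001 1"])
print(qrels)
```

With the real file, the same function can be called on an open file handle, for example parse_qrels(open("apwsj.qrels")).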

Profile2AnnotatorMap.txt: We assigned a unique ID to each annotator. This file lists the ID(s) of the annotator(s) who provided redundancy judgments for each profile, in case annotator information is useful for your experiments.

apwsj88-90.docno.sorted: contains the document IDs of the AP News and Wall Street Journal data set. In experiments, the filtering system should process documents in the order given in this file.

apwsj88-90.rel.docno.sorted: contains the document IDs from the AP News and Wall Street Journal data set that are relevant for at least one user profile.

As described in our paper Novelty and Redundancy Detection in Adaptive Filtering, our adaptive filtering system processes all the documents in the order given in apwsj88-90.docno.sorted, and the Novelty/Redundancy module in the system processes all the relevant documents in the order given in apwsj88-90.rel.docno.sorted. In the adaptive filtering environment, the order of documents is important. Please refer to the paper for more information about doing experiments.
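
To make the ordering requirement concrete, here is a small Python sketch that walks a document stream in the given order, keeps only the documents relevant to one profile, and attaches a reference label to each. It is an illustration under stated assumptions, not the system from the paper: the function name label_relevant_stream is hypothetical, the short id list stands in for the contents of the sorted docno files, and "DOC-X" is a made-up placeholder for a non-relevant document. Documents without a redundancy judgment are labeled NOVEL.

```python
def label_relevant_stream(ordered_doc_ids, relevant_doc_ids, redundancy_labels):
    """Return (doc_id, label) pairs for one profile, preserving stream order."""
    labeled = []
    for doc_id in ordered_doc_ids:
        if doc_id not in relevant_doc_ids:
            continue  # the novelty module only sees relevant documents
        # Unmarked relevant documents are assumed to be NOVEL.
        labeled.append((doc_id, redundancy_labels.get(doc_id, "NOVEL")))
    return labeled


# Illustrative excerpt for profile q121; the real order comes from the sorted files.
ordered_doc_ids = [
    "AP880214-0002",
    "AP880214-0049",
    "DOC-X",  # hypothetical non-relevant document
    "AP880216-0137",
    "AP880217-0031",
]
relevant_doc_ids = {"AP880214-0002", "AP880214-0049", "AP880216-0137", "AP880217-0031"}
redundancy_labels = {
    "AP880214-0049": "SOMEWHAT REDUNDANT",
    "AP880217-0031": "ABSOLUTE REDUNDANT",
}

stream = label_relevant_stream(ordered_doc_ids, relevant_doc_ids, redundancy_labels)
for doc_id, label in stream:
    print(doc_id, label)
```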

This directory does not contain the AP News and Wall Street Journal data from TREC CDs 1, 2, and 3. You need to contact NIST or the LDC to obtain this data set.

Reference

Y. Zhang, J. Callan, and T. Minka. Novelty and Redundancy Detection in Adaptive Filtering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002.

The following 33 topics were used for the experiments reported in the paper: q101, q102, q103, q104, q105, q106, q107, q108, q109, q111, q112, q113, q114, q115, q116, q117, q118, q119, q120, q121, q123, q124, q125, q127, q128, q129, q132, q135, q136, q137, q138, q139, q141.

Contact Information

If you have any questions or comments, please send email to