Topical Shard Definitions

As part of my research on selective search, I created topical partitions of three document collections: Gov2, ClueWeb09-CategoryB, and ClueWeb09-CategoryA-English. The topical shard definitions for these datasets are shared below. Each link is a tarball containing one file per "topic". (We do not attempt to identify or label these topics.) Each topic file lists the identifiers (one per line) of documents that were inferred to be about a similar topic. The number of topics for each dataset is given below; I chose it heuristically, for example based on the desired average number of documents per shard. For details of the methodology used to create these topical shards, please refer to my CIKM 2010 paper.

Document Collection: Gov2
Number of Topics: 50
Topical Shard Definitions: gov2-50TopicalShards.tar.gz (File size: 129MB)

Document Collection: ClueWeb09-CategoryB
Number of Topics: 100
Topical Shard Definitions: categoryB-100TopicalShards.tar.gz (File size: 150MB)

Document Collection: ClueWeb09-CategoryA-English
Number of Topics: 1000
Topical Shard Definitions: categoryA-English-1000TopicalShards.tar.gz (File size: 1.6GB) (Note: The latest, correct shard definitions for ClueWeb09-CategoryA-English are from Jan 23, 2012.)
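
For readers who want to load these definitions programmatically, the following is a minimal Python sketch, not the tooling used in the original work. It assumes only what is stated above: each tarball holds one file per topic, and each file lists one document identifier per line. The function names `load_shard_definitions` and `invert_shards` are hypothetical, and the internal file naming of the tarballs is not assumed; each shard is simply keyed by its member name inside the archive.

```python
import tarfile


def load_shard_definitions(tar_path):
    """Return {topic_file_name: [doc_id, ...]} for a gzipped tarball.

    Assumption: every regular file in the archive is a topic file
    with one document identifier per line.
    """
    shards = {}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = tar.extractfile(member)
            # Decode each line, drop surrounding whitespace, skip blank lines.
            doc_ids = [
                line.decode("utf-8").strip()
                for line in handle
                if line.strip()
            ]
            shards[member.name] = doc_ids
    return shards


def invert_shards(shards):
    """Return {doc_id: topic_file_name}, the lookup used to route a document."""
    doc_to_shard = {}
    for name, doc_ids in shards.items():
        for doc_id in doc_ids:
            doc_to_shard[doc_id] = name
    return doc_to_shard
```

Calling `load_shard_definitions("gov2-50TopicalShards.tar.gz")` on a downloaded tarball would return one entry per topic file. The larger archives hold many identifiers, so keeping the full inverted mapping in memory may be costly; processing one topic file at a time is a reasonable alternative.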

Last Updated: 23 Jan 2012