The Forum Dataset

The Forum Dataset is no longer available for download. The supplemental material and evaluation tool are still available below.

The Forum Dataset was created with the cooperation of in an effort to promote research on information retrieval, language technologies, and social network analysis. It contains a full snapshot of the online forum from July 2010. The message board is large: it holds over 22 million messages from more than 3.7 million unique authors and spans more than ten years of active participation.

In addition to the document collection, we distribute queries sampled from the forum's query log and pairwise preference relevance judgements for a message thread retrieval task on this forum.

This webpage describes the dataset, gives instructions for obtaining it, and describes the supplemental data for thread search information retrieval experiments. Further details can be found in the tech report describing the collection.

Contact: Jonathan Elsas.

Document Collection

The Online Forum document collection is a full snapshot of the online forum, from July 2010.

Number of Messages:        22,054,728
Number of Threads:         9,040,958
Number of Sub-forums:      165,358
Number of Unique Authors:  3,775,670
Message Date Range:        December 1995 - July 2010
Size:                      5 GB (compressed)

The documents distributed in the collection are in the TRECTEXT SGML format, similar to other collections used at the Text REtrieval Conference. An example document is shown below:

<DATE_STR>03 May 1999</DATE_STR>
<TEXT>I am interested in any information available on a Thomas Darcy, born in NYC, NY to Michael & Catherine SMITH Darcy. He is the eldest child and appears in the census with the family as well as residing with his mother at the time of her death in the early 1910's. He is listed in the 1913 NY Directory as a clerk. Please contact me at XXXXX@XXXX.XXX if any of this information sounds familar.</TEXT>

All the messages have the following fields:

DOCNO: Unique message identifier, containing thread membership information.
PID: Unique numeric message identifier.
SUBFORUM: Subforum containing the post.
DATE_STR: Publication date of the post. "01-01-1900" if missing.
DATE_NUM: Numeric representation of the publication date.
THREAD_ID: Thread identifier. Unique per subforum.
POST_ID: Post identifier. Unique per subforum.
POST_URL: URL of the post on the Forum website.
AUTHOR_NAME: Author name.
AUTHOR: Unique numeric author identifier. "0" if missing.
POST_TITLE: Title of the post.
TEXT: Text content of the post.
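
As an illustration of how these fields might be read, here is a minimal Python sketch. It assumes each message is wrapped in a <DOC>...</DOC> element, as is usual for TRECTEXT collections; that wrapper, the function name parse_documents, and the sample values are assumptions for illustration and are not confirmed by this page. The only dataset-specific behavior is mapping the two documented "missing" sentinels to None.

```python
import re

# <DOC> wrapper is an assumption based on the usual TRECTEXT layout.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
# Matches <FIELD>value</FIELD> pairs with identical open and close tags.
FIELD_RE = re.compile(r"<([A-Z_]+)>(.*?)</\1>", re.DOTALL)

# Sentinel values documented in the field list above.
MISSING = {"DATE_STR": "01-01-1900", "AUTHOR": "0"}


def parse_documents(raw):
    """Yield one dict (field name -> text, or None if missing) per record."""
    for doc in DOC_RE.finditer(raw):
        fields = {}
        for name, value in FIELD_RE.findall(doc.group(1)):
            value = value.strip()
            # Replace the documented "missing" sentinels with None.
            fields[name] = None if MISSING.get(name) == value else value
        yield fields


# Hypothetical record; the identifier and text are made up.
sample = """<DOC>
<DOCNO>example-docno</DOCNO>
<DATE_STR>01-01-1900</DATE_STR>
<AUTHOR>0</AUTHOR>
<TEXT>Hello & welcome.</TEXT>
</DOC>"""

docs = list(parse_documents(sample))
```

Regular expressions are used here rather than an XML parser because SGML-style text of this kind can contain unescaped characters such as "&", which a strict XML parser would reject.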

The message threading structure can be identified from the content of the SUBFORUM, THREAD_ID and POST_ID fields:

[Figure: an example threading structure, with the POST_IDs of a single thread shown in the nodes and edges representing message-response relationships.]
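
Because THREAD_ID and POST_ID are unique only within a subforum, collecting the messages of a thread needs a composite key. The Python sketch below groups parsed messages by (SUBFORUM, THREAD_ID). It covers thread membership only and does not attempt to recover reply edges or the first message of a thread; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def group_by_thread(messages):
    """Group message dicts by the composite key (SUBFORUM, THREAD_ID)."""
    threads = defaultdict(list)
    for msg in messages:
        threads[(msg["SUBFORUM"], msg["THREAD_ID"])].append(msg)
    return dict(threads)


# Hypothetical messages: the same THREAD_ID appears in two subforums.
messages = [
    {"SUBFORUM": "subforum.a", "THREAD_ID": "1", "POST_ID": "1"},
    {"SUBFORUM": "subforum.a", "THREAD_ID": "1", "POST_ID": "2"},
    {"SUBFORUM": "subforum.b", "THREAD_ID": "1", "POST_ID": "1"},
]

threads = group_by_thread(messages)
```

Keying on THREAD_ID alone would merge the two unrelated threads in this example into one.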

Obtaining a Copy of the Dataset

Thread Retrieval Task

In addition to the document collection, we distribute a query set and relevance judgements appropriate to use as an information retrieval test collection for studying message thread retrieval.

The queries distributed with this dataset were sampled from the query log. The query set reflects the primary type of information need expressed by the users of the forum. Every query in the set contains at least a person's name, and half of the queries contain additional information such as a location or other keyword.

Pairwise preference assessments were collected using a simulated pool of retrieval runs. The preference assessment collection followed the guidelines in Carterette et al.'s SIGIR 2008 paper "A Test Collection of Preference Judgments". The pooling and assessment process is described in detail in the tech report distributed with this dataset.

Along with the queries and relevance judgments, you can also download the evaluation tool to compute pairwise-preference performance measures. This tool is released as open source.

IMPORTANT NOTE: The pairwise preference data uses the DOCNO of the FIRST message of a thread as the document identifier. See the document collection section for an explanation of how the POST_ID and THREAD_ID fields are used to identify the first message of a thread.
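
To make the idea of scoring a ranking against pairwise preferences concrete, here is a small Python sketch. It computes the fraction of judged pairs that a ranking orders correctly. This is a simplified illustration, not the distributed evaluation tool, and it does not necessarily match any measure that tool reports. The input layout (a ranked list of thread DOCNOs plus (preferred, other) pairs), the function name, and the sample identifiers are all assumptions.

```python
def preference_precision(ranking, preferences):
    """Fraction of judged pairs that the ranking orders correctly.

    ranking: list of DOCNOs, best first.
    preferences: iterable of (preferred, other) DOCNO pairs.
    Pairs with neither document retrieved are skipped; an unretrieved
    document is treated as ranked below every retrieved one.
    """
    rank = {docno: i for i, docno in enumerate(ranking)}
    unranked = len(ranking)
    correct = total = 0
    for preferred, other in preferences:
        if preferred not in rank and other not in rank:
            continue
        total += 1
        if rank.get(preferred, unranked) < rank.get(other, unranked):
            correct += 1
    return correct / total if total else 0.0


# Hypothetical thread identifiers (DOCNO of each thread's first message).
ranking = ["t1", "t2", "t3"]
preferences = [("t1", "t2"), ("t3", "t2"), ("t1", "t9"), ("t8", "t9")]

score = preference_precision(ranking, preferences)
```

In this example three pairs are scored (the last pair has no retrieved document) and two of them are ordered correctly.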


Publishing Research Using the Dataset

If you use the Forum Dataset in published research, you must provide attribution to as the source of the dataset. Also, please include a reference to the tech report describing the dataset:
Jonathan Elsas, "The Forum Dataset", CMU LTI Tech Report CMU-LTI-017, 2011.