TREC Blog Track

From ScribbleWiki: Analysis of Social Media

The TREC Blog Track was started in 2006 by NIST to promote research on information-seeking behavior in the blogosphere. The track ran again in 2007 and will continue in 2008 (and possibly beyond). Fourteen groups participated in the Opinion Detection task in 2006 and 20 groups in 2007. The Blog Distillation task was run for the first time in 2007, with 9 groups participating.

Official TREC Blog Track Resources

Dataset

In 2006 and 2007 the TREC Blog Track used the BLOG06 test collection. This collection consists of HTML and XML (RSS/Atom) documents from 100,649 blogs, collected during an 11-week period from December 2005 to February 2006. There are three parts to the collection:

  • Feed documents (38.6GB): 753,681 RSS/Atom XML documents corresponding to the web feeds of the 100,649 unique blogs in the collection. There is no guarantee that these documents contain the full post text; the feeds are often incomplete "summaries" or truncations of the blog post.
  • Permalink documents (88.8GB): 3,215,171 HTML documents corresponding to the unique blog posts collected over the eleven weeks. These pages often include user comments in response to the post, advertisements and other text not necessarily relating to the post.
  • Homepage documents (20.8GB): 324,880 HTML documents corresponding to each blog's homepage, or main entry point into the blog. Note that there are far fewer homepage documents than permalink documents, and some blogs have no homepage document at all because the blog site blocked crawling of its homepage via the robots.txt exclusion file.

The collection also intentionally includes some amount of "noise" in the form of non-English blogs, non-blog web pages, and splogs (spam blogs).

Opinion Detection Task (2006, 2007, 2008)

The opinion detection task was the first task run with the Blog track in 2006. It was run again in 2007 and in 2008.

Task overview

The intention of the opinion detection task is to address the question "What is the public sentiment towards X?", where X is some person, organization, event, or other entity. This task aims to identify sentiment at the document level. Groups are given a set of "topics" (or queries) and are tasked with retrieving permalink documents containing an opinionated discussion of the topic. The topics given to groups follow the typical TREC format, consisting of three fields:

  • title: the object of interest. For this task, the topics were pulled directly from the query log of a commercial search engine. Some example topics include: "carrie underwood", "sag awards" and "pfizer".
  • description: A 1-2 sentence description of the information need. For example: Find opinions of the drug company Pfizer and its products.
  • narrative: A several-sentence long description of what is and is not considered relevant for the given topic.

Groups typically run their systems using only the title field in order to mimic the 2-3 word queries common in web search engines.
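The three-field topic structure can be illustrated with a short parsing sketch. It assumes the classic TREC topic file layout (a <top> block containing <num>, <title>, <desc> and <narr> tags); the sample topic below reuses the "pfizer" title and description from this page, while the topic number and narrative text are placeholders invented for the example, and parse_topics is a hypothetical helper name.

```python
import re

# Sample topic in the classic TREC layout. The number and narrative are
# placeholders for illustration, not taken from the official topic file.
SAMPLE_TOPICS = """
<top>
<num> Number: 0
<title> pfizer
<desc> Description:
Find opinions of the drug company Pfizer and its products.
<narr> Narrative:
Placeholder narrative text.
</top>
"""

FIELD_TAGS = "num|title|desc|narr"


def parse_topics(text):
    """Return one dict per <top> block, keyed by field tag."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.DOTALL):
        fields = {}
        # Each field runs from its tag to the next field tag or the block end.
        pattern = rf"<({FIELD_TAGS})>(.*?)(?=<(?:{FIELD_TAGS})>|\Z)"
        for tag, body in re.findall(pattern, block, flags=re.DOTALL):
            # Drop the optional "Number:" / "Description:" / "Narrative:" label.
            body = re.sub(r"^\s*(Number|Description|Narrative):", "", body)
            fields[tag] = " ".join(body.split())
        topics.append(fields)
    return topics


topics = parse_topics(SAMPLE_TOPICS)
# Title-only queries, mimicking short web search queries.
title_queries = [topic["title"] for topic in topics]
```

Keeping only the title field, as in title_queries, corresponds to the title-only runs described above.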

In 2006 the opinion detection task consisted only of ranking opinionated documents for each query. Submissions were evaluated on their ability to identify topically relevant documents as well as opinionated documents.

In 2007 an optional polarity classification sub-task was added.

Primary Approaches

Lexical and statistical approaches dominated the best-performing systems in 2007. UIC's system made extensive use of external corpora, using Wikipedia and consumer rating web sites to build lexicons of objective and opinionated language respectively, and built SVM classifiers using the opinionated lexicon. The University of Glasgow's system adapted their Divergence from Randomness (DFR) relevance feedback system to automatically identify strongly opinionated terms.
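As a rough illustration of the general lexicon-based idea only, the sketch below re-ranks topically retrieved documents by mixing their relevance score with the fraction of tokens found in an opinion lexicon. This is not UIC's SVM classifier or Glasgow's DFR term weighting; the function names, the toy lexicon, the linear mixing weight and the assumption that relevance scores are already normalized to [0, 1] are all hypothetical choices made for the example.

```python
def opinion_score(tokens, opinion_lexicon):
    """Fraction of tokens that appear in the opinion lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in opinion_lexicon)
    return hits / len(tokens)


def rerank_by_opinion(results, opinion_lexicon, weight=0.5):
    """Re-rank (doc_id, relevance, tokens) triples.

    Assumes relevance is normalized to [0, 1]; `weight` is an
    illustrative mixing parameter, not a value from any TREC system.
    """
    scored = []
    for doc_id, relevance, tokens in results:
        combined = (1 - weight) * relevance + weight * opinion_score(
            tokens, opinion_lexicon
        )
        scored.append((doc_id, combined))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for the example.
lexicon = {"love", "hate", "terrible", "great"}
results = [
    ("d1", 0.9, ["pfizer", "announced", "quarterly", "results"]),
    ("d2", 0.8, ["i", "hate", "this", "terrible", "drug"]),
]
reranked = rerank_by_opinion(results, lexicon)
```

With these toy inputs the opinionated post d2 moves ahead of the more topically relevant but neutral d1, which is the kind of trade-off an opinion-finding system has to balance.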

Blog Distillation Task (2007, 2008...)

The blog distillation task was run for the first time in 2007 and will run again in 2008 (and possibly beyond).

Task overview

Blog (or feed) distillation is defined as follows: Find blogs with a central and recurring interest in X, where X is some topic. This task addresses the specific scenario of someone who wants to add a new feed to their feed aggregator and issues a topical query to a blog search engine. This differs from other blog retrieval tasks, where, for example, someone may be interested in recent blog entries from many blogs on a given topic. In the feed distillation task, the blog feed is the unit of retrieval, not the post or permalink.

As in the opinion detection task, the topics are given in the standard TREC format with a title, description and narrative. See the opinion detection task for details on the format.

Primary Approaches

The approaches to blog distillation can roughly be grouped into a "large document" approach, which treats the feed as a single monolithic document and ranks it with traditional IR techniques, and a "small document" approach, where each entry (or permalink) is treated as a single document and a permalink ranking is then aggregated into a feed ranking. The University of Glasgow system adapted their expert-finding model from enterprise search, where subject-matter experts are ranked based on the emails they compose. In this small-document approach, feeds are "voted on" by the relevant posts that are retrieved. The UMass system combined the large- and small-document approaches. Their small-document approach was adapted from a "cluster selection" model where the feeds are considered different topical document clusters and the goal is to rank clusters in order of relevance. Their large-document model takes the typical IR approach to ranking feeds.
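The small-document idea of aggregating a permalink ranking into a feed ranking can be sketched as a simple vote count. This is a minimal illustration, not the Glasgow or UMass model: the function name, the tie-breaking by summed post score, and the toy post and feed identifiers are all assumptions made for the example.

```python
from collections import defaultdict


def rank_feeds_by_votes(ranked_posts, post_to_feed):
    """Aggregate a post ranking into a feed ranking.

    ranked_posts: list of (post_id, score) pairs from a post-level retrieval run.
    post_to_feed: dict mapping each post_id to the feed it belongs to.
    Each retrieved post casts one vote for its feed; ties are broken by the
    summed post scores, then by feed id (an illustrative choice).
    """
    votes = defaultdict(int)
    score_sum = defaultdict(float)
    for post_id, score in ranked_posts:
        feed_id = post_to_feed.get(post_id)
        if feed_id is None:
            continue  # skip posts with no known feed
        votes[feed_id] += 1
        score_sum[feed_id] += score
    return sorted(votes, key=lambda feed: (-votes[feed], -score_sum[feed], feed))


# Toy retrieval output, invented for the example.
ranked_posts = [("p1", 2.0), ("p2", 1.5), ("p3", 1.2), ("p4", 0.9)]
post_to_feed = {"p1": "feedA", "p2": "feedB", "p3": "feedB", "p4": "feedC"}
feed_ranking = rank_feeds_by_votes(ranked_posts, post_to_feed)
```

Here feedB ranks first because two of its posts were retrieved, even though the single highest-scoring post belongs to feedA. A large-document system would instead concatenate each feed's posts and rank the resulting documents directly.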

Future Blog Track Tasks

The future direction of the blog track is still under discussion. The feed distillation task will run again in 2008 and the opinion detection task will be dropped. Other tasks that are under consideration include:

  • First-story detection: find the post that first "breaks" a news story in the blogosphere
  • Blogger influence: find the most influential blogs on a given topic
  • Post filtering: given a topic and ongoing user feedback, find the blog posts on that topic over time. This is similar to the filtering track that was run in TREC in the past.
  • Splog detection: similar to spam filtering, but identifying blog-spam.