Multi-Tier Indexing for Web Search Engines
Jamie Callan
Carnegie Mellon University

Project Overview

This project is adapting prior work on federated search to create a more selective approach to searching web indexes, which we call topic-partitioned indexing. Each subset (shard) of a topic-partitioned index covers specific content areas, so only the shards covering the query's topic area(s) need to be searched. Our research is developing methods to assign documents to shards efficiently. Supervised and unsupervised techniques are used to match queries to shards. The result is a selective search that delivers accuracy similar to that of more exhaustive searches but requires an order of magnitude less effort, yielding significant computational and financial savings. The project is using the Google/IBM cluster to crawl the web and to perform the data cleansing and pre-processing necessary to develop a web dataset of 500 million to 1 billion documents to support the research. Additional effort is being devoted to producing a corpus that is useful for a broad range of research purposes. A project goal is to share the dataset with other researchers on the Google/IBM cluster, and eventually with a broader research community.
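The selective search idea described above can be illustrated with a small sketch. It is not the project's implementation: the seed-based shard assignment, the cosine similarity measure, and all names (build_shards, shard_profile, selective_search, the toy documents and seeds) are assumptions chosen only to show the flow of assigning documents to topical shards, ranking shards for a query, and searching only the top-ranked shards.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately naive assumption)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_shards(docs, seeds):
    """Assign each document to the shard whose seed text it most resembles.

    A single pass against hand-written seeds is a toy stand-in for the
    document-to-shard assignment methods the project is developing.
    """
    seed_vecs = {name: Counter(tokenize(text)) for name, text in seeds.items()}
    shards = defaultdict(dict)
    for doc_id, text in docs.items():
        vec = Counter(tokenize(text))
        best = max(seed_vecs, key=lambda name: cosine(vec, seed_vecs[name]))
        shards[best][doc_id] = vec
    return shards


def shard_profile(shard):
    """Summarize a shard as the sum of its documents' term counts."""
    profile = Counter()
    for vec in shard.values():
        profile.update(vec)
    return profile


def selective_search(query, shards, k=1):
    """Rank shards against the query, then score documents in the top k only."""
    q = Counter(tokenize(query))
    ranked = sorted(
        shards,
        key=lambda name: cosine(q, shard_profile(shards[name])),
        reverse=True,
    )
    selected = ranked[:k]
    hits = []
    for name in selected:
        for doc_id, vec in shards[name].items():
            score = cosine(q, vec)
            if score > 0:
                hits.append((score, doc_id))
    hits.sort(reverse=True)
    return selected, [doc_id for _, doc_id in hits]


# Hypothetical toy collection and topic seeds, for illustration only.
DOCS = {
    "d1": "python programming language tutorial",
    "d2": "football match score league",
    "d3": "java programming compiler",
    "d4": "tennis match tournament score",
}
SEEDS = {
    "tech": "programming language compiler software",
    "sports": "match score league tournament",
}

SHARDS = build_shards(DOCS, SEEDS)
SELECTED, RESULTS = selective_search("programming tutorial", SHARDS, k=1)
```

In this toy setting only one of the two shards is searched for the query, which is the source of the savings: the cost of ranking shards is small compared with scoring every document in an exhaustive search.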

The project will have three types of broad impact. First, the data centers of large web search companies are expensive and are major consumers of electrical power, so reducing their costs has significant financial and environmental benefits. Second, lower computational costs make it practical for academic researchers to conduct research on datasets that web search companies consider credible, which increases the impact of academic research. Finally, research datasets such as ours typically have long life spans and are used for diverse research projects by scientists around the world.

 

Project Personnel

Jamie Callan, Principal Investigator
Jaime Arguello, Graduate Research Assistant
Mark Hoy, Senior Research Programmer
Anagha Kulkarni, Graduate Research Assistant

 

Dissemination of Research Results

Our research results are disseminated through research publications and as part of the open-source Lemur Toolkit.

The research dataset is disseminated from the ClueWeb09 Dataset web page.

 

Collaborating Projects

The ClueWeb09 dataset is the dataset for the Entity Detection, Million Query, Relevance Feedback, and Web tracks of the 2009 Text REtrieval Conference (TREC).


    This research is sponsored in part by National Science Foundation grant IIS-0841275 and a gift from Yahoo! Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.  


Updated on March 25, 2009.
Jamie Callan