
Selective Search of Large-Scale Text Collections

Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Project Overview

This project develops an alternative architecture for large-scale text search in which the document corpus is decomposed into index shards that are expected to have skewed utility distributions, thus enabling most index partitions to be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures, but has far lower computational costs and reveals new challenges and opportunities in large-scale search. The decomposition process creates text collections, thus inviting research on which characteristics of a text collection should be sought or avoided to enable accurate search. New resource selection algorithms are developed to address efficiency problems in existing algorithms and to dynamically adjust search costs based on query difficulty. The project includes collaboration with three research groups at other universities, to help their research, leverage their expertise in designing new approaches to problems, and investigate the effectiveness of our research in more varied situations. The result is an 'off-the-shelf' method that provides an order of magnitude reduction in search costs over the current state of the art, especially on corpora of more than a billion documents, and that can be easily customized or extended to support varied needs.
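The basic idea of the architecture can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three tiny shards, the function names, and the term-count shard ranking are hypothetical stand-ins, not the project's actual partitioning or resource selection algorithms. The sketch only shows the overall flow: summarize each shard, rank shards for a query, and search only the selected shards.

```python
from collections import Counter

# Hypothetical toy corpus, already partitioned into topical shards.
SHARDS = {
    "sports": ["football match score", "tennis match final", "football league table"],
    "cooking": ["pasta sauce recipe", "bread baking recipe", "tomato sauce pasta"],
    "travel": ["cheap flight deals", "hotel booking deals", "flight delay compensation"],
}


def shard_term_counts(shards):
    """Build one term-frequency summary per shard (a simple stand-in
    for the resource description a real system would use)."""
    return {
        name: Counter(term for doc in docs for term in doc.split())
        for name, docs in shards.items()
    }


def select_shards(query, summaries, k=1):
    """Rank shards by total query-term frequency and keep at most k
    shards that contain at least one query term."""
    terms = query.split()
    scores = {name: sum(counts[t] for t in terms) for name, counts in summaries.items()}
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return [name for name in ranked[:k] if scores[name] > 0]


def search_shard(query, docs):
    """Score documents by the number of distinct query terms they contain."""
    terms = set(query.split())
    scored = [(len(terms & set(doc.split())), doc) for doc in docs]
    return [(score, doc) for score, doc in scored if score > 0]


def selective_search(query, shards, summaries, k=1):
    """Search only the selected shards and merge their results."""
    selected = select_shards(query, summaries, k)
    results = []
    for name in selected:
        results.extend(search_shard(query, shards[name]))
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return selected, [doc for _, doc in results]


SUMMARIES = shard_term_counts(SHARDS)

if __name__ == "__main__":
    selected, docs = selective_search("pasta recipe", SHARDS, SUMMARIES, k=1)
    print(selected, docs)
```

In this toy setting the query "pasta recipe" touches only the cooking shard, so the other two shards are never searched. That skipped work is the source of the cost savings described above.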

Selective search is significant in part because it provides a new perspective on how to organize a very large collection of documents so that it can be searched accurately and efficiently. This new understanding reveals new research problems and undiscovered weaknesses in existing algorithms that will have impact within the scientific community. Research results from this project are disseminated in research papers that appear in the most competitive conferences and journals; in the Lemur Project's open-source search engines, which are used by a broad international scientific community; and in the Lemur Project's ClueWeb public search services, which integrate research and education by enabling scientists and classroom students to do experiments on large, state-of-the-art text corpora. Selective search is also of practical significance. Text search is one of the most widely used computer science technologies. The state of the art in many areas of industry and science is increasingly associated with large-scale datasets, which makes it difficult for organizations with modest computational resources to compete. This project reduces the computational costs of searching large-scale text collections by an order of magnitude or more. It has the potential to reduce the energy and other costs associated with the data centers of large search providers, which has important economic and societal benefits.

This project builds on a foundation created by the NSF-funded project "An Integrated Architecture for Federated Search", which was carried out with Jaime Arguello and Anagha Kulkarni.

 

Project Personnel

Jamie Callan, Principal Investigator
Yubin Kim, Research Assistant
Reyyan Yeniterzi, Research Assistant

 

Collaborators

J. Shane Culpepper, Royal Melbourne Institute of Technology (RMIT)
Anagha Kulkarni, San Francisco State University
Alistair Moffat, University of Melbourne
Mark Sanderson, Royal Melbourne Institute of Technology (RMIT)

 

Dissemination of Research Results

Research results are disseminated through research publications, the open-source Lemur Project, and the Lemur Project's public search services.

 

Datasets

Our research is conducted with the following datasets, which were created by Anagha Kulkarni.

gov2-50-bytopic-ak13.v1: A partitioning of the GOV2 dataset into 50 shards. (133 MB compressed, 419 MB uncompressed)
clueweb09e-807-bytopic-ak13.v1: A partitioning of the ClueWeb09-English dataset into 807 shards. (1.5 GB compressed, 13 GB uncompressed)

 


This research is sponsored by National Science Foundation grant IIS-1302206. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.

Updated on May 23, 2014.
Jamie Callan