
Selective Search of Large-Scale Text Collections

Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Project Overview

This project develops an alternative architecture for large-scale text search in which the document corpus is decomposed into index shards that are expected to have skewed utility distributions, thus enabling most index partitions to be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures but has far lower computational costs, and it reveals new challenges and opportunities in large-scale search. The decomposition process creates text collections, thus inviting research on which characteristics of a text collection should be sought or avoided to enable accurate search. New resource selection algorithms are developed to address efficiency problems in existing algorithms and to dynamically adjust search costs based on query difficulty. The project includes collaboration with three research groups at other universities, to support their research, leverage their expertise in designing new approaches to problems, and investigate the effectiveness of our research in more varied situations. The result is an 'off-the-shelf' method that provides an order-of-magnitude reduction in search costs over the current state of the art, especially on corpora of more than a billion documents, and that can be easily customized or extended to support varied needs.
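The architecture described above can be illustrated with a toy sketch. It is only an illustration under stated assumptions, not the project's implementation: the shard assignment (shard_of), the shard scoring function (shard_score), and the term-count document scoring are all hypothetical stand-ins for the partitioning, resource selection, and retrieval components that the project actually studies.

```python
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization is enough for this toy example.
    return text.lower().split()


def build_shards(docs, shard_of):
    """Group documents into shards. shard_of (doc id -> shard label) is assumed
    to be given; a real system would derive it, e.g. by clustering."""
    shards = {}
    for doc_id, text in docs.items():
        shards.setdefault(shard_of[doc_id], {})[doc_id] = Counter(tokenize(text))
    return shards


def shard_score(query_terms, shard):
    """Hypothetical resource-selection score: query-term occurrences per document."""
    hits = sum(tf[t] for tf in shard.values() for t in query_terms)
    return hits / len(shard)


def select_shards(query, shards, k):
    """Rank shards by score and keep at most k of them; all others are ignored."""
    terms = tokenize(query)
    ranked = sorted(shards, key=lambda s: (-shard_score(terms, shards[s]), s))
    return [s for s in ranked[:k] if shard_score(terms, shards[s]) > 0]


def selective_search(query, shards, k=1):
    """Search only the selected shards and merge their results into one ranking."""
    terms = tokenize(query)
    results = []
    for s in select_shards(query, shards, k):
        for doc_id, tf in shards[s].items():
            score = sum(tf[t] for t in terms)  # toy score: raw term counts
            if score > 0:
                results.append((score, doc_id))
    results.sort(key=lambda r: (-r[0], r[1]))
    return [doc_id for _, doc_id in results]


docs = {
    "d1": "jaguar speed in the rainforest",
    "d2": "jaguar habitat and diet",
    "d3": "stock market index funds",
    "d4": "bond market yields",
}
shard_of = {"d1": "animals", "d2": "animals", "d3": "finance", "d4": "finance"}
shards = build_shards(docs, shard_of)

selected = select_shards("jaguar habitat", shards, k=1)
hits = selective_search("jaguar habitat", shards, k=1)
print(selected, hits)
```

In this sketch the query touches only the "animals" shard; the "finance" shard is never searched, which is where the cost savings of a selective architecture would come from when utility is skewed across shards.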

Selective search is significant in part because it provides a new perspective on how to organize a very large collection of documents so that it can be searched accurately and efficiently. This new understanding reveals new research problems and undiscovered weaknesses in existing algorithms that will have an impact within the scientific community. Research results from this project are disseminated in research papers that appear in the most competitive conferences and journals; in the Lemur Project's open-source search engines, which are used by a broad international scientific community; and in the Lemur Project's ClueWeb public search services, which integrate research and education by enabling scientists and classroom students to do experiments on large, state-of-the-art text corpora. Selective search is also of practical significance. Text search is one of the most widely used computer science technologies. The state of the art in many areas of industry and science is increasingly associated with large-scale datasets, which makes it difficult for organizations with modest computational resources to compete. This project reduces the computational costs of searching large-scale text collections by an order of magnitude or more. It has the potential to reduce the energy and other costs associated with the data centers of large search providers, which has important economic and societal benefits.

This project builds on a foundation created by the NSF-funded project "An Integrated Architecture for Federated Search," which was carried out with Jaime Arguello and Anagha Kulkarni.

 

Project Personnel

Jamie Callan, Principal Investigator
Zhuyun Dai, Research Assistant
Hafeezul Mohammad, Research Assistant
Yubin Kim, Research Assistant
Keyang Xu, Research Assistant
Xin Qian, Research Assistant
Reyyan Yeniterzi, Research Assistant

 

Collaborators

J. Shane Culpepper, Royal Melbourne Institute of Technology (RMIT)
Anagha Kulkarni, San Francisco State University
Alistair Moffat, University of Melbourne
Mark Sanderson, Royal Melbourne Institute of Technology (RMIT)
João Magalhães, Universidade Nova de Lisboa
Flávio Martins, Universidade Nova de Lisboa

 

Dissemination of Research Results

Research results are disseminated by research publications, as part of the open-source Lemur Project, and via the Lemur Project's public search services. A partial list of project publications is provided below.

 

Datasets

Our research is conducted by partitioning several well-known large text collections. The list below describes how these datasets were partitioned for experiments in our published papers, so that others may recreate our experiments. See also the virtual appendices for the published papers above.

 


This research is sponsored by National Science Foundation grant IIS-1302206. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.

Updated on November 16, 2018.
Jamie Callan