Selective Search of Large-Scale Text Collections
This project develops an alternative architecture for large-scale text search in which the document corpus is decomposed into index shards that are expected to have skewed utility distributions, so that most index partitions can be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures but has far lower computational costs, and it reveals new challenges and opportunities in large-scale search. The decomposition process creates text collections, which invites research on which characteristics of a text collection should be sought or avoided to enable accurate search. New resource selection algorithms are developed to address efficiency problems in existing algorithms and to adjust search costs dynamically based on query difficulty. The project includes collaboration with three research groups at other universities, to support their research, to draw on their expertise in designing new approaches to problems, and to investigate the effectiveness of our research in more varied situations. The result is an 'off-the-shelf' method that provides an order of magnitude reduction in search costs over the current state of the art, especially on corpora of more than a billion documents, and that can be easily customized or extended to support varied needs.
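The architecture described above can be illustrated with a toy sketch. It is not the project's implementation: the function names, the shard summary (raw term counts), and the shard-scoring rule (fraction of a shard's tokens that match the query) are all assumptions chosen for brevity. The sketch only shows the general flow of partitioning documents into shards, selecting a few shards per query, and searching only those.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def build_shards(docs, assignments):
    """Group documents into shards and keep a term-count summary per shard.

    `assignments` maps doc_id -> shard_id; in this sketch it is given,
    standing in for whatever process decomposes the corpus.
    """
    shards = {}
    for doc_id, text in docs.items():
        shard = shards.setdefault(
            assignments[doc_id], {"docs": {}, "terms": Counter()}
        )
        tokens = tokenize(text)
        shard["docs"][doc_id] = Counter(tokens)
        shard["terms"].update(tokens)
    return shards


def select_shards(shards, query, k):
    """Hypothetical resource selection: rank shards by the fraction of
    their tokens that match the query terms, and keep the top k."""
    q = tokenize(query)

    def score(item):
        terms = item[1]["terms"]
        total = sum(terms.values())
        return sum(terms[t] for t in q) / total if total else 0.0

    ranked = sorted(shards.items(), key=score, reverse=True)
    return [shard_id for shard_id, _ in ranked[:k]]


def selective_search(shards, query, k=1, top_n=10):
    """Search only the k selected shards; all other shards are ignored."""
    q = tokenize(query)
    hits = []
    for shard_id in select_shards(shards, query, k):
        for doc_id, tf in shards[shard_id]["docs"].items():
            s = sum(tf[t] for t in q)  # simple term-frequency score
            if s > 0:
                hits.append((s, doc_id))
    hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, ties by id
    return [doc_id for _, doc_id in hits[:top_n]]
```

With topically coherent shards, a query about one topic touches a single shard, so the cost of the search depends on `k` rather than on the total number of shards.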
Selective search is significant in part because it provides a new perspective on how to organize a very large collection of documents so that it can be searched accurately and efficiently. This new understanding reveals new research problems and previously undiscovered weaknesses in existing algorithms, which will have an impact within the scientific community. Research results from this project are disseminated in research papers that appear in the most competitive conferences and journals; in the Lemur Project's open-source search engines, which are used by a broad international scientific community; and in the Lemur Project's ClueWeb public search services, which integrate research and education by enabling scientists and classroom students to do experiments on large, state-of-the-art text corpora.
Selective search is also of practical significance. Text search is one of the most widely used computer science technologies. The state of the art in many areas of industry and science is increasingly associated with large-scale datasets, which makes it difficult for organizations with modest computational resources to compete. This project reduces the computational costs of searching large-scale text collections by an order of magnitude or more. It has the potential to reduce the energy and other costs associated with the data centers of large search providers, which has important economic and societal benefits.
This project builds on a foundation created by the NSF-funded project "An Integrated Architecture for Federated Search", which was conducted with Jaime Arguello and Anagha Kulkarni.
|Jamie Callan||Principal Investigator|
|Yubin Kim||Research Assistant|
|Reyyan Yeniterzi||Research Assistant|
|J. Shane Culpepper||Royal Melbourne Institute of Technology (RMIT)|
|Anagha Kulkarni||San Francisco State University|
|Alistair Moffat||University of Melbourne|
|Mark Sanderson||Royal Melbourne Institute of Technology (RMIT)|
Research results are disseminated by research publications, as part of the open-source Lemur Project, and via the Lemur Project's public search services. A partial list of project publications is provided below.
A. Kulkarni and J. Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems. ACM. Accepted.
Our research is conducted with the following datasets, which were created by Anagha Kulkarni.
|gov2-50-bytopic-ak13.v1:||A partitioning of the GOV2 dataset into 50 shards.||(133 MB compressed, 419 MB uncompressed)|
|clueweb09e-807-bytopic-ak13.v1:||A partitioning of the ClueWeb09-English dataset into 807 shards.||(1.5 GB compressed, 13 GB uncompressed)|
This research is sponsored by National Science Foundation grant IIS-1302206. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.