Selective Search of Large-Scale Text Collections

Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Project Overview

This project develops an alternative architecture for large-scale text search in which the document corpus is decomposed into index shards that are expected to have skewed utility distributions, thus enabling most index partitions to be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures, but has far lower computational costs and reveals new challenges and opportunities in large-scale search. The decomposition process creates text collections, thus inviting research on what characteristics are desired or to be avoided in a text collection to enable accurate search. New resource selection algorithms are developed to address efficiency problems in existing algorithms and dynamically adjust search costs based on query difficulty. The project includes collaboration with three research groups at other universities, to help their research, leverage their expertise in designing new approaches to problems, and investigate the effectiveness of our research in more varied situations. The result is an 'off-the-shelf' method that provides an order of magnitude reduction in search costs over the current state-of-the-art, especially on corpora of more than a billion documents, and that can be easily customized or extended to support varied needs.

Selective search is significant in part because it provides a new perspective on how to organize a very large collection of documents so that it can be searched accurately and efficiently. This new understanding reveals new research problems and undiscovered weaknesses in existing algorithms that will have impact within the scientific community. Research results from this project are disseminated in research papers that appear in the most competitive conferences and journals; in the Lemur Project's open-source search engines, which are used by a broad international scientific community; and in the Lemur Project's ClueWeb public search services, which integrate research and education by enabling scientists and classroom students to do experiments on large, state-of-the-art text corpora. Selective search is also of practical significance. Text search is one of the most widely used computer science technologies. The state-of-the-art in many areas of industry and science is increasingly associated with large-scale datasets, which makes it difficult for organizations with modest computational resources to compete. This project reduces the computational costs of searching large-scale text collections by an order of magnitude or more. It has the potential to reduce the energy and other costs associated with the data centers of large search providers, which has important economic and societal benefits.

This project builds on a foundation created by the NSF-funded project An Integrated Architecture for Federated Search that was done with Jaime Arguello and Anagha Kulkarni.

Project Personnel

Jamie Callan,	Principal Investigator
Zhuyun Dai	Research Assistant
Hafeezul Mohammad,	Research Assistant
Yubin Kim,	Research Assistant
Keyang Xu	Research Assistant
Xin Qian	Research Assistant
Reyyan Yeniterzi	Research Assistant

Collaborators

J. Shane Culpepper,	Royal Melbourne Institute of Technology (RMIT)
Anagha Kulkarni,	San Francisco State University
Alistair Moffat,	University of Melbourne
Mark Sanderson,	Royal Melbourne Institute of Technology (RMIT)
João Magalhães,	Universidade Nova de Lisboa
Flávio Martins,	Universidade Nova de Lisboa

Dissemination of Research Results

Research results are disseminated by research publications, as part of the open-source Lemur Project, and via the Lemur Project's public search services. A partial list of project publications is provided below.

Z. Dai and J. Callan. Inverted list caching for topical index shards (short paper). In Advances in Information Retrieval - 40th European Conference on IR Research, pp. 577-583. Springer. 2018.
Z. Dai, Y. Kim and J. Callan. Learning to rank resources (short paper). In Proceedings of the 40th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM. 2017.
Virtual Appendix
Z. Dai, Y. Kim, and J. Callan. How random decisions affect selective distributed search (short paper). In Proceedings of the 38th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM. 2015.
Z. Dai, C. Xiong, and J. Callan. Query-biased partitioning for selective search. In Proceedings of the 25th ACM Conference on Information and Knowledge Management (CIKM '16). ACM. 2016.
Virtual Appendix
Y. Kim. Robust selective search. Ph.D. dissertation, Carnegie Mellon University. 2018 (expected).
Y. Kim and J. Callan. Measuring the effectiveness of selective search index partitions without supervision. In Proceedings of the 2018 International Conference on The Theory of Information Retrieval. 2018.
Y. Kim, J. Callan, S. Culpepper and A. Moffat. Efficient distributed selective search. Information Retrieval Journal. Springer, 20(3), pp 221-252. 2017.
Virtual Appendix
Y. Kim, J. Callan, S. Culpepper and A. Moffat. Does selective search benefit from WAND optimization? In Advances in Information Retrieval - 38th European Conference on IR Research, ECIR 2016, pp 145-158. Springer. 2016.
Y. Kim, J. Callan, S. Culpepper and A. Moffat. Load-balancing in distributed selective search (short paper). In Proceedings of the 39th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM. 2016.
A. Kulkarni and J. Callan. Selective search: Efficient and effective search of large textual collections. ACM Transactions on Information Systems, 33(4). ACM. 2015.
F. Martins, J. Magalhães and J. Callan. A vertical PRF architecture for microblog search. In Proceedings of the 2018 International Conference on The Theory of Information Retrieval. 2018.
H. R. Mohammad, K. Xu, J. Callan and J. S. Culpepper. Dynamic shard cutoff prediction for selective search. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 85-94. ACM. 2018.
S. Palakodety and J. Callan. Query transformations for result merging. In Proceedings of the Twenty-Third Text REtrieval Conference (TREC 2014). National Institute of Standards and Technology, special publication 500-308. 2015.

Datasets

Our research is conducted by partitioning several well-known large text collections. The list below describes how these datasets were partitioned for experiments in our published papers, so that others may recreate our experiments. See also the virtual appendices for the published papers above.

clueweb09e-807-bytopic-ak13.v1: The ClueWeb09-English dataset partitioned into 807 shards. [Kulkarni, 2013]. (1.5 GB compressed, 13 GB uncompressed)
clueweb09b-92-bytopic-ak13.v1: The ClueWeb09-B dataset partitioned into 92 shards. [Kulkarni, 2013]. (154 MB compressed, 1.3 GB uncompressed)
ecir2016_kim_cw09b.v1: The ClueWeb09-B dataset partitioned into 100 shards. [Kim, et al., 2016a]. (125 MB compressed, 1.3 GB uncompressed)
gov2-50-bytopic-ak13.v1: The GOV2 dataset partitioned into 50 shards. [Kulkarni, 2013]. (133 MB compressed, 419 MB uncompressed)
sigir2016_kim_cw09.v1: The ClueWeb09 dataset partitioned into 884 shards. [Kim, et al., 2016b; Kim, et al., 2016c]. (1.7 GB compressed, 13 GB uncompressed)

This research is sponsored by National Science Foundation grant IIS-1302206. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.

Updated on November 16, 2018.

Jamie Callan