An Integrated Architecture for Federated Search
Carnegie Mellon University
Information retrieval research has been hampered by the difficulty of conducting research on web datasets of realistic size, due to the computational resources required for such datasets. Recent IR research developed a research dataset of 1 billion web pages that will be used in community-wide research forums such as TREC, but actually using the dataset is beyond the software and hardware capabilities of many academic researchers.
Commercial web search portals increasingly consist of a large, general purpose web search engine, and smaller, more focused vertical search services that are integrated in an ad-hoc manner. As more vertical search services are added over time, ad-hoc integration becomes unwieldy and difficult to maintain.
This project treats both problems as instantiations of the classic federated search problem, in which a search interface must select among, and integrate results from, many distinct search services. Large datasets are made more manageable by dividing them into a hundred or more topic-oriented shards, and then deciding for each query which shards are most likely to contain most of the relevant documents; only those shards are searched, thus reducing computational costs by an order of magnitude or more. Vertical search services are made more manageable through the use of a general-purpose framework that uses multiple, diverse techniques to characterize the contents of each resource, and track how its content and query traffic change over time. Our framework integrates both unsupervised and supervised methods, thus enabling it to be used in academic research environments, which may have little user data, and operational environments, which typically have massive query logs.
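The shard-selection step described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes each shard is summarized by a simple unigram term-count model and that shards are ranked by a smoothed query log-likelihood. The function names (`build_shard_model`, `rank_shards`, `select_shards`) and the smoothing weight `lam` are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def build_shard_model(docs):
    """Summarize a shard as term counts pooled over all of its documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def rank_shards(query, shard_models, lam=0.5):
    """Return shard names ordered by smoothed query log-likelihood, best first.

    lam is an assumed interpolation weight between the shard model and a
    background model built from all shards (Jelinek-Mercer style smoothing).
    """
    # Background model: term counts pooled over every shard.
    background = Counter()
    for model in shard_models.values():
        background.update(model)
    bg_total = sum(background.values())
    vocab_size = len(background)

    scores = {}
    for name, model in shard_models.items():
        shard_total = sum(model.values())
        score = 0.0
        for term in query.lower().split():
            p_shard = model[term] / shard_total if shard_total else 0.0
            # Add-one smoothing keeps the background probability nonzero
            # even for query terms that occur in no shard.
            p_bg = (background[term] + 1) / (bg_total + vocab_size)
            score += math.log((1 - lam) * p_shard + lam * p_bg)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def select_shards(query, shard_models, k=1):
    """Keep only the k highest-ranked shards; only these would be searched."""
    return rank_shards(query, shard_models)[:k]
```

Under these assumptions, a query is sent only to the shards returned by `select_shards`, which is where the cost saving comes from: with a hundred or more shards and a small `k`, most of the index is never touched for a given query.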
Reducing the computational costs of searching large web datasets and supporting a diverse collection of vertical search services requires solving a variety of specific problems having to do with resource definition, resource representation, resource selection, and result merging. This research differs from most prior research in that it specifically studies the effects of resource definition policies, it addresses the requirements of dynamic resources, and it looks beyond average case analysis to characterize the range of search accuracy that a federated search service would experience.
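To make the idea of looking beyond average-case analysis concrete, the following sketch summarizes per-query differences between a federated (selective) search run and the global search engine. It is an illustration under stated assumptions, not the project's evaluation methodology: the function name `accuracy_profile`, the `tolerance` parameter, and the choice of summary statistics are all hypothetical, and the inputs are assumed to be per-query values of some accuracy metric in the same query order.

```python
from statistics import mean


def accuracy_profile(selective, exhaustive, tolerance=0.0):
    """Summarize per-query accuracy differences (selective minus exhaustive).

    selective and exhaustive are equal-length lists of per-query metric
    values. tolerance is an assumed threshold below which a difference is
    treated as neither a gain nor a loss.
    """
    if len(selective) != len(exhaustive) or not selective:
        raise ValueError("need two non-empty score lists of equal length")

    deltas = [s - e for s, e in zip(selective, exhaustive)]
    n = len(deltas)
    return {
        "mean_delta": mean(deltas),    # the usual average-case view
        "worst_delta": min(deltas),    # largest loss on any single query
        "best_delta": max(deltas),     # largest gain on any single query
        "frac_hurt": sum(d < -tolerance for d in deltas) / n,
        "frac_helped": sum(d > tolerance for d in deltas) / n,
    }
```

Two systems with the same `mean_delta` can differ sharply in `worst_delta` and `frac_hurt`, which is the kind of variation an average alone hides.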
NSF's CluE computer cluster is an important enabler for the proposed research. One of the datasets used in the research contains 1 billion web documents. The project develops and compares methods of organizing documents into topic-oriented shards, which requires building and storing large indexes. Experimental results are compared to a single search engine containing all documents (the global search engine). It would be possible to do this research with academic computing resources, but progress would be slow.
The proposed research lowers the computational requirements necessary to conduct research on realistic web corpora, thus making it easier for academic researchers to do research that companies are more likely to consider reliable. It also has the potential to reduce the energy and other costs of the large data centers operated by web search companies. The research on a more comprehensive federated search framework for vertical search services supports integration of specialized information services in web portals, which is useful to commercial information providers.
New algorithms are disseminated in open-source software published by the Lemur Project. Datasets are published in a form that enables them to be recreated by other researchers. Queries and relevance judgments are published so that they may be used by other researchers.
This project is an extension of preliminary research done in the Multi-Tier Indexing for Web Search Engines project.
Our research results are disseminated by research publications, and as part of open-source software distributed by the Lemur Project. Research data is disseminated from Callan's Data web page.
A partial listing of research publications associated with the project:
J. Arguello, J. Callan, and F. Diaz. "Classification-based resource selection." In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09). ACM. 2009.
J. Arguello, J. Callan, F. Diaz, and J.-F. Crespo. "Sources of evidence for vertical selection." In Proceedings of the Thirty-Second Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2009.
J. Arguello, F. Diaz, J. Callan, and B. Carterette. "A methodology for evaluating aggregated search results." In Proceedings of the 33rd European Conference on Information Retrieval (ECIR). British Computer Society. 2011.
A. Kulkarni and J. Callan. "Document allocation policies for selective searching of distributed indexes." In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM '10). ACM. 2010.
A. Kulkarni and J. Callan. "Topic-based index partitions for efficient and effective selective search." In SIGIR 2010 Workshop on Large-Scale Distributed Information Retrieval. ACM. 2010.
Several topic-based index partitions are available on Anagha Kulkarni's Topical Shard Definitions web page.
This research is sponsored in part by National Science Foundation grant IIS-0916553 and two gifts from Yahoo! Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.