Modeling and Predicting Term Mismatch for Full-Text Retrieval

	Jamie Callan Carnegie Mellon University

Project Overview

Many text search engines use probabilistic reasoning to determine how well a word represents a person's information need. The probability that a term appears in relevant documents (documents that satisfy the information need) is a fundamental quantity in the theory of probabilistic information retrieval; however, prior research provided few clues about how to estimate it reliably. This project uses exploratory data analysis to identify common reasons that user-specified query terms fail to match relevant documents, develops features correlated with each reason, and integrates them into a model that can be trained from data. The resulting predictions of this probability, called term necessity, can be used in state-of-the-art retrieval models to improve retrieval accuracy substantially.
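To make the central quantity concrete: when relevance judgments are available, the probability that a term appears in relevant documents can be estimated as the fraction of judged-relevant documents that contain the term. The sketch below is illustrative only. The function name `term_recall` and the toy documents are hypothetical and are not taken from the project's software.

```python
def term_recall(term, relevant_docs):
    """Estimate P(term | relevant) as the fraction of relevant documents containing the term."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    matches = sum(1 for doc in relevant_docs if term in doc)
    return matches / len(relevant_docs)


# Hypothetical toy data: each relevant document is a set of lowercased tokens.
relevant_docs = [
    {"vehicle", "emissions", "regulation"},
    {"car", "exhaust", "standards"},
    {"automobile", "emissions", "law"},
    {"car", "emissions", "testing"},
]

# A low value signals mismatch: the query term is absent from many relevant documents.
recall = {t: term_recall(t, relevant_docs) for t in ("emissions", "car", "pollution")}
```

In this toy collection, "pollution" never appears in a relevant document even though it is a plausible query term, which is the kind of mismatch the project sets out to predict without having the judgments in hand.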

Term necessity predictions are based on a two-stage approach to text retrieval. A feature-based analysis of an initial retrieval develops evidence that can be linked to a variety of common reasons that a term might not match relevant documents, for example, centrality, synonymy, and abstractness. This model-based approach can be trained from available data, making it easy to incorporate new features that test new hypotheses, or to train a corpus-specific predictive model. It also has the advantage that probability predictions are query-specific, and linked to features that can guide automatic term weighting as well as interactive or automatic query refinement. The project develops several focused interventions that refine queries interactively, through automatic query expansion, and through relevance feedback.
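As one illustration of how a predicted probability could guide term weighting, the sketch below uses the classic Binary Independence Model term weight, log[p(1 - q) / (q(1 - p))], where p is the probability that the term appears in relevant documents and q is the probability that it appears in non-relevant documents. Choosing this model is an assumption made for illustration; the overview does not say which retrieval model consumes the predictions. The function name `rsj_weight` and the smoothing constant `eps` are likewise hypothetical.

```python
import math


def rsj_weight(p_rel, p_nonrel, eps=1e-6):
    """Binary Independence Model term weight: log[p(1 - q) / (q(1 - p))].

    p_rel    -- P(term | relevant), e.g. a predicted term necessity (assumed input)
    p_nonrel -- P(term | non-relevant), often approximated by document frequency / N
    eps      -- clamp that keeps both probabilities away from 0 and 1 (assumption)
    """
    p = min(max(p_rel, eps), 1 - eps)
    q = min(max(p_nonrel, eps), 1 - eps)
    return math.log(p * (1 - q) / (q * (1 - p)))


# With p fixed at 0.5 the weight reduces to an IDF-like quantity, log((1 - q) / q).
idf_like = rsj_weight(0.5, 0.01)

# A term predicted to occur in most relevant documents gets a larger weight
# than an equally rare term predicted to occur in few of them.
high_necessity = rsj_weight(0.9, 0.01)
low_necessity = rsj_weight(0.3, 0.01)
```

Under this weighting, two terms with the same collection frequency receive different weights once their predicted probabilities differ, which is what makes a query-specific prediction useful.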

This project makes an impact on the scientific community by providing new approaches to a central problem that affects probabilistic retrieval models, and to the diagnosis and correction of problems in query formation. Improvements in search engine accuracy also affect a broad population of everyday users. The proposed research improves search accuracy for "ordinary people" using unstructured keyword queries, as well as for professional searchers who often use sophisticated structured queries to search structured documents.

Research results are disseminated in research papers. New techniques are implemented and disseminated in periodic releases of the Lemur Project's Indri search engine. Indri is used by a broad international research community; this form of dissemination therefore makes it more likely that other researchers will study and extend the proposed research.

Project Personnel

Jamie Callan, Principal Investigator
Le Zhao, Graduate Research Assistant
Guoqing Zheng, Graduate Research Assistant
David Pane, Senior Research Programmer

Dissemination of Research Results

Our research results are disseminated by research publications, and as part of the open-source Lemur Project.

A partial listing of research publications associated with the project:

E. Yao, G. Zheng, O. Jin, S. Bao, K. Chen, Z. Su, and Y. Yu. "Probabilistic text modeling with orthogonalized topics." In Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '14). ACM. 2014.
L. Zhao. Modeling and Solving Term Mismatch for Full-Text Retrieval. Ph.D. dissertation, Language Technologies Institute, Carnegie Mellon University. 2012.
L. Zhao and J. Callan. "Term necessity prediction." In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM '10). ACM. 2010.
L. Zhao and J. Callan. "How to make manual Conjunctive Normal Form queries work in patents search." In Proceedings of the Twentieth Text REtrieval Conference (TREC 2011). National Institute of Standards and Technology, special publication 500-295. 2012.
L. Zhao and J. Callan. "Automatic term mismatch diagnosis for selective query expansion." In Proceedings of the Thirty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. 2012.
L. Zhao, Z. Liu, and J. Callan. "WikiQuery -- An interactive collaboration interface for creating, storing and sharing effective CNF queries." In Proceedings of the SIGIR 2012 Workshop on Open-Source Information Retrieval. 2012.

The project also distributes TermRecallKit, a software library that supports research and experimentation with predicting term necessity / recall weights. The software is distributed as a .tar.gz file. The current version of the software is v1. Version v2 is planned for release after the SIGIR 2015 paper submission deadline.

This research is sponsored by National Science Foundation grant IIS-1018317. We thank the Information Retrieval Facility for providing data and other help. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.

Updated on December 23, 2014.

Jamie Callan