Modeling and Predicting Term Mismatch for Full-Text Retrieval
  Jamie Callan
Carnegie Mellon University

Project Overview

Many text search engines use probabilistic reasoning to determine how well a word represents a person's information need. The probability that a term appears in relevant documents (documents that satisfy the information need) is a fundamental quantity in the theory of probabilistic information retrieval; however, prior research provided few clues about how to estimate it reliably. This project uses exploratory data analysis to identify common reasons that user-specified query terms fail to match relevant documents, develops features correlated with each reason, and integrates them into a model that can be trained from data. The resulting term necessity predictions can be used in state-of-the-art retrieval models to improve retrieval accuracy substantially.
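As a rough illustration of the quantity discussed above, the sketch below estimates the probability that a term appears in relevant documents directly from a set of judged relevant documents, and then plugs that estimate into the classic Binary Independence Model term weight. The function names (`term_recall`, `rsj_weight`), the toy documents, and the use of document frequency to approximate the non-relevant probability are assumptions made for this example; they are not taken from the project's software.

```python
import math


def term_recall(term, relevant_docs):
    """Fraction of relevant documents containing `term`: an estimate of P(t | R)."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    hits = sum(1 for doc in relevant_docs if term in doc)
    return hits / len(relevant_docs)


def rsj_weight(p, doc_freq, num_docs, eps=1e-6):
    """Binary Independence Model term weight log[p(1 - s) / ((1 - p) s)].

    p is an estimate of P(t | R). s = doc_freq / num_docs approximates
    P(t | non-relevant), an assumption that holds when relevant documents
    are a tiny fraction of the collection. Both are clamped away from 0 and 1.
    """
    p = min(max(p, eps), 1 - eps)
    s = min(max(doc_freq / num_docs, eps), 1 - eps)
    return math.log(p * (1 - s) / ((1 - p) * s))


# Hypothetical relevance judgments: each relevant document is a set of terms.
relevant = [
    {"prognosis", "cancer", "treatment"},
    {"outlook", "cancer", "therapy"},
    {"cancer", "survival", "treatment"},
    {"tumor", "treatment", "outcome"},
]

recall_cancer = term_recall("cancer", relevant)        # appears in 3 of 4 documents
recall_prognosis = term_recall("prognosis", relevant)  # appears in 1 of 4 documents
```

In this toy data the query term "prognosis" mismatches three of the four relevant documents, so at equal document frequency it receives a lower weight than "cancer". In practice relevance judgments are not available at query time, which is why the probability has to be predicted.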

Term necessity predictions are based on a two-stage approach to text retrieval. A feature-based analysis of an initial retrieval develops evidence that can be linked to a variety of common reasons that a term might not match relevant documents, such as centrality, synonymy, and abstractness. This model-based approach can be trained from available data, making it easy to incorporate new features that test new hypotheses, or to train a corpus-specific predictive model. It also has the advantage that probability predictions are query-specific, and linked to features that can guide automatic term weighting as well as interactive or automatic query refinement. The project develops several focused interventions for refining queries interactively, through automatic query expansion, and through relevance feedback.
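The idea of training a predictor from per-term features can be sketched as follows. This is a minimal illustration only: the three feature columns (centrality, synonymy, abstractness), their numeric values, the training targets, and the choice of a sigmoid regression fitted by gradient descent are all assumptions made for the example. The text above does not say which features are computed in this form or which learning algorithm the project uses.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Predicted term necessity in (0, 1) for one query term."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train(examples, epochs=2000, lr=0.5):
    """Fit weights by batch gradient descent on squared error.

    examples: list of (feature_vector, observed_term_recall) pairs.
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, target in examples:
            pred = predict(weights, bias, features)
            # Gradient of 0.5 * (pred - target)^2 passed back through the sigmoid.
            delta = (pred - target) * pred * (1 - pred)
            for i, x in enumerate(features):
                grad_w[i] += delta * x
            grad_b += delta
        weights = [w - lr * g / len(examples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(examples)
    return weights, bias


# Hypothetical training data: [centrality, synonymy, abstractness] -> term recall.
examples = [
    ([0.9, 0.1, 0.2], 0.85),
    ([0.8, 0.2, 0.1], 0.80),
    ([0.6, 0.4, 0.3], 0.60),
    ([0.3, 0.7, 0.6], 0.35),
    ([0.2, 0.8, 0.9], 0.20),
]

weights, bias = train(examples)

# Query-specific predictions for two unseen (hypothetical) query terms.
p_central = predict(weights, bias, [0.85, 0.1, 0.1])
p_abstract = predict(weights, bias, [0.2, 0.9, 0.8])
```

With this toy data, a term that is central to the query and has few synonyms gets a higher predicted necessity than an abstract term with many synonyms. Because the prediction is tied to named features, a low value also hints at which remedy might apply, for example expanding a term whose synonymy feature is high.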

This project makes an impact on the scientific community by providing new approaches to a central problem that affects probabilistic retrieval models, and to the diagnosis and correction of problems in query formation. Improvements in search engine accuracy also affect a broad population of everyday users. The proposed research improves search accuracy for "ordinary people" using unstructured keyword queries, as well as for professional searchers who often use sophisticated structured queries to search structured documents.

Research results are disseminated in research papers. New techniques are implemented and disseminated in periodic releases of the Lemur Project's Indri search engine. Indri is used by a broad international research community, so this form of dissemination makes it more likely that other researchers will study and extend the proposed research.


Project Personnel

Jamie Callan, Principal Investigator
Le Zhao, Graduate Research Assistant
Guoqing Zheng, Graduate Research Assistant
David Pane, Senior Research Programmer


Dissemination of Research Results

Our research results are disseminated by research publications, and as part of the open-source Lemur Project.

A partial listing of research publications associated with the project:

The project also distributes TermRecallKit, a software library that supports research and experimentation with predicting term necessity / recall weights. The software is distributed as a .tar.gz file. The current version of the software is v1. Version v2 is planned for release after the SIGIR 2015 paper submission deadline.


This research is sponsored by National Science Foundation grant IIS-1018317. We thank the Information Retrieval Facility for providing data and other help. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.

Updated on December 23, 2014.
Jamie Callan