Language Technologies Institute
Student Research Symposium 2006

Meaning-Based Retrieval for Human Language Technologies

Matthew Bilotti

As the amount of information at our fingertips grows seemingly without bound, so too grows the demand for a class of Human Language Technologies (HLT) applications that facilitate searching, browsing and navigation within large information spaces. These applications are often built around keyword-based text retrieval systems, which have long been the de facto standard for searching for relevant information in a large collection of documents.

Sometimes keywords are not sufficient to represent the concept being searched for. Suppose a MEDLINE researcher, wanting to investigate the effect of heparin on the clotting cascade, asks the question, "What does heparin inhibit?" of a QA system built upon a keyword-based text retrieval system. The question is converted into the keyword query 'heparin inhibit,' and the following two documents are retrieved:

"Platelets contain several factors that inhibit heparin."

"Heparin inhibits thrombin via antithrombin III."

The application presents "platelets" and "thrombin" to the user as answers, but only "thrombin" is correct. Why is the wrong answer returned?

What the user wanted were instances of 'inhibit' events in which 'heparin' is the agent. What the user actually received were documents containing the keywords 'inhibit' and 'heparin,' a superset of what the user wanted. The system cannot distinguish between the two because it relies on a keyword-based representation of document meaning that is too weak to represent the agentive relationship between 'heparin' and 'inhibit.'
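The difference between the two kinds of matching can be made concrete with a small sketch. This is an illustration only, not the MBR implementation: the function names, the dictionary layout, and the hand-written role annotations are all assumptions introduced here, and a real system would obtain the annotations from a semantic role labeler.

```python
import re

# Toy collection: the two example sentences, with hand-written
# predicate/agent/patient annotations (hypothetical, for illustration only).
DOCS = [
    {
        "text": "Platelets contain several factors that inhibit heparin.",
        "events": [{"predicate": "inhibit", "agent": "factors", "patient": "heparin"}],
    },
    {
        "text": "Heparin inhibits thrombin via antithrombin III.",
        "events": [{"predicate": "inhibit", "agent": "heparin", "patient": "thrombin"}],
    },
]


def keyword_search(docs, keywords):
    """Return texts that contain every keyword (crude prefix match on tokens)."""
    hits = []
    for doc in docs:
        tokens = re.findall(r"[a-z]+", doc["text"].lower())
        if all(any(tok.startswith(kw) for tok in tokens) for kw in keywords):
            hits.append(doc["text"])
    return hits


def role_search(docs, predicate, agent):
    """Return (text, patient) pairs for events with the given predicate and agent."""
    hits = []
    for doc in docs:
        for event in doc["events"]:
            if event["predicate"] == predicate and event["agent"] == agent:
                hits.append((doc["text"], event["patient"]))
    return hits
```

Under these assumptions, `keyword_search(DOCS, ["heparin", "inhibit"])` returns both sentences, while `role_search(DOCS, "inhibit", "heparin")` returns only the second sentence, paired with the patient "thrombin".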

If we are to improve text retrieval support for HLT applications, we must support indexing and retrieval on the linguistic and semantic content of interest to HLT applications. I propose a novel approach to text retrieval problems called Meaning-Based Retrieval (MBR) in which text meaning is modeled by instances of Meaning Types from a domain-specific Meaning Type System (MTS), which can contain linguistic and semantic annotations in addition to keywords.

Newly available Information Retrieval technology allows for the indexing and retrieval of hierarchical annotations on text [3, 4]. MBR maps an arbitrary MTS onto this technology by representing linguistic and semantic content as annotations, and allows indexing and retrieval of text at the meaning level.

In this talk, I motivate and describe the MBR algorithm, then discuss preliminary results from applying the MBR prototype to an experiment over the Center for Nonproliferation Studies corpus. I compare MBR using an MTS that supports semantic role annotations with MBR using a keywords-only MTS, measured by standard precision and recall metrics.
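For readers unfamiliar with the metrics, precision and recall can be computed as in the following minimal sketch. The function name and the set-based representation are assumptions made for illustration; the values in the usage note come from the heparin toy example, not from the experiment.

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    An empty denominator yields 0.0 by convention in this sketch.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

Applied to the heparin example, a system that returns "platelets" and "thrombin" when only "thrombin" is correct scores `precision_recall({"platelets", "thrombin"}, {"thrombin"})`, which is `(0.5, 1.0)`; a system that returns only "thrombin" scores `(1.0, 1.0)`.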

[1] MEDLINE. http://www.nlm.nih.gov/pubs/factsheets/medline.html

[2] Medical Subject Headings (MeSH). http://www.nlm.nih.gov/mesh/

[3] Indri Retrieval System. http://www.lemurproject.org

[4] Ogilvie and Callan. Hierarchical Language Models for Retrieval of XML Components. In Proc. INEX 2004.