Adaptive Integration of Structured and Unstructured Data from Many Sources in a Biological Domain
Description
Our goal in this research is to construct a knowledge-based (KB)
system which will learn to more accurately integrate the many
heterogeneous sources of information that are relevant to a single
scientist's research needs. The system, called Querendipity, works by
loosely integrating data of many sorts (including unstructured text)
into a single typed directed graph, and then querying the graph using
a query language that allows "schema-free similarity queries". These
queries specify a set of query terms (e.g. keywords or entities in the
KB) and constraints on the desired output (e.g. a target data
type). The result of a query is a ranked list of KB entities, ordered
by similarity to the query terms.
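As an illustration of what such a query might look like, the sketch below ranks nodes of a target type in a small typed directed graph. It is only a minimal sketch under stated assumptions: the description above does not say how similarity is computed, so a random walk with restart is used here as a stand-in, and the class `TypedGraph`, the function `similarity_query`, the `restart` and `iterations` parameters, and the toy nodes and edge labels are all hypothetical.

```python
from collections import defaultdict


class TypedGraph:
    """A directed graph whose nodes carry a type and whose edges carry a label."""

    def __init__(self):
        self.node_type = {}
        self.out_edges = defaultdict(list)

    def add_node(self, node, node_type):
        self.node_type[node] = node_type

    def add_edge(self, src, dst, label):
        self.out_edges[src].append((label, dst))


def similarity_query(graph, query_nodes, target_type, restart=0.2, iterations=50):
    """Rank nodes of target_type by a random-walk-with-restart score.

    Assumption: the walk stands in for the system's similarity measure,
    which the project description does not specify.
    """
    start = {n: 1.0 / len(query_nodes) for n in query_nodes}
    scores = dict(start)
    for _ in range(iterations):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            edges = graph.out_edges.get(node, [])
            if not edges:
                # Dead end: send the walking mass back to the query nodes.
                for q, w in start.items():
                    nxt[q] += (1 - restart) * mass * w
                continue
            share = (1 - restart) * mass / len(edges)
            for _label, dst in edges:
                nxt[dst] += share
        # With probability `restart`, the walk jumps back to a query node.
        for q, w in start.items():
            nxt[q] += restart * w
        scores = nxt
    ranked = [
        (n, s)
        for n, s in scores.items()
        if graph.node_type.get(n) == target_type and n not in start
    ]
    return sorted(ranked, key=lambda pair: -pair[1])


# Toy data (hypothetical): one keyword, two papers, two genes.
g = TypedGraph()
g.add_node("ribosome", "term")
for paper in ("P1", "P2"):
    g.add_node(paper, "paper")
    g.add_edge("ribosome", paper, "mentioned_in")
for gene in ("G1", "G2"):
    g.add_node(gene, "gene")
g.add_edge("P1", "G1", "mentions")
g.add_edge("P2", "G1", "mentions")
g.add_edge("P2", "G2", "mentions")

ranked = similarity_query(g, ["ribosome"], target_type="gene")
for name, score in ranked:
    print(name, round(score, 3))
```

The query never names an edge label or a path, which is one reading of "schema-free": the only constraints are the query terms and the target type. In this toy graph `G1` is reached through both papers and `G2` through only one, so `G1` is ranked first.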
After a query, a user can optionally label any subset of the ranked
list of suggested answers as "relevant" or "non-relevant". These
labels drive a learning phase, the goal of which is to produce a
better ranking. Types of learning currently being investigated
include EM-based parameter tuning, learning to discriminatively
re-rank, and learning to restructure the graph (by adding or deleting
edges or vertices). Queries collected in the laboratories of working
biologists are used to evaluate these learning methods.
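To make the idea of learning from relevance labels concrete, the sketch below re-ranks a candidate list with a pairwise perceptron. This is an assumption-laden illustration, not the project's method: the description names discriminative re-ranking but not a specific learner, and the functions `train_pairwise_perceptron` and `rerank`, the two features, and the candidate data are all hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_perceptron(labeled, n_features, epochs=20, lr=1.0):
    """Learn weights so that relevant items score above non-relevant ones.

    labeled: list of (feature_vector, is_relevant) pairs from user feedback.
    Assumption: a simple perceptron stands in for the unspecified learner.
    """
    w = [0.0] * n_features
    relevant = [f for f, rel in labeled if rel]
    non_relevant = [f for f, rel in labeled if not rel]
    for _ in range(epochs):
        for p in relevant:
            for n in non_relevant:
                diff = [pi - ni for pi, ni in zip(p, n)]
                # Update only when a relevant item fails to outscore a non-relevant one.
                if dot(w, diff) <= 0:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rerank(candidates, w):
    """candidates: list of (name, feature_vector); returns names, best first."""
    ordered = sorted(candidates, key=lambda c: -dot(w, c[1]))
    return [name for name, _ in ordered]


# Hypothetical features: [initial similarity score, 1.0 if reached via a paper].
candidates = [
    ("A", [0.5, 0.0]),
    ("B", [0.3, 1.0]),
    ("C", [0.2, 1.0]),
    ("D", [0.4, 0.0]),
]
# The user labels A as non-relevant and B, C as relevant; D stays unlabeled.
feedback = [([0.5, 0.0], False), ([0.3, 1.0], True), ([0.2, 1.0], True)]

weights = train_pairwise_perceptron(feedback, n_features=2)
order = rerank(candidates, weights)
print(order)
```

The initial ranking would put `A` first on similarity score alone. After training on the three labels, both relevant items move above `A`, and the unlabeled item `D` is scored by the same learned weights.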
The broadest impact of this project is on the general problem of
learning to integrate heterogeneous data sources (including free text
and structured data). If successful, the KB system will also have a
direct impact on the biological research community; in particular, we
believe that adaptive personal KB systems of this sort will be a
valuable complement to existing biological KBs.
Acknowledgements
This project is funded by the NSF's Division of Information &
Intelligent Systems as award 0811562
from September 1, 2008 through August 31, 2011.
Project Members
Participants include
- William W. Cohen, of the Lane Center for Computational
Biology and the Machine Learning Department, PI
- John Woolford, of the Department of Biology, co-PI
- Ramnath Balasubramanyan, LTI PhD student
- Ni Lao, LTI PhD student
- Frank Lin, LTI PhD student
- Dana Movshovitz-Attias, CSD PhD student
- Katie Rivard, research programmer/analyst
- Maryam Aly, undergraduate research assistant (during fall semester 2009)
- Andrew Arnold (former MLD PhD student, now at WorldQuant)
Relevant publications
Below are some of the publications most relevant to the research behind Querendipity.
Completed
Joanna Bresee, Hajin Choi, Daniel Lee, Ellen Wu (2009): Adaptive Personalized Information Management
for Biologists: Final Report (report)
Integration Software
Integration Datasets
System Snapshots
Snapshots of the system, with code and data, as of a particular date.
Querendipity stands for Query-based User-guided
Exploration of Relations and
ENtities in Data Integrated
Probabilistically or Identified in
Text about Yeast. Development of a model-organism-independent
acronym is a subject for further research.
Last modified: Thu Jul 14 14:36:58 Eastern Daylight Time 2011