Adaptive Integration of Structured and Unstructured Data from Many Sources in a Biological Domain


Description

Our goal in this research is to construct a knowledge-based (KB) system that learns to integrate, with increasing accuracy, the many heterogeneous sources of information relevant to a single scientist's research needs. The system, called Querendipity, loosely integrates data of many sorts (including unstructured text) into a single typed directed graph, and then queries the graph using a query language that allows "schema-free similarity queries". Such a query specifies a set of query terms (e.g. keywords or entities in the KB) and constraints on the desired output (e.g. a target data type). The result of a query is a ranked list of KB entities, ordered by similarity to the query terms.
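To make the idea of a schema-free similarity query concrete, here is a minimal sketch in Python. It assumes that similarity is measured by a random walk with restart from the query terms. That choice is an assumption made for illustration, since this page does not say which similarity measure Querendipity uses. The class `TypedGraph`, the functions `similarity_query` and `build_toy_graph`, the node ids, and the edge labels are all hypothetical names, and the graph contents are toy data.

```python
from collections import defaultdict


class TypedGraph:
    """A tiny typed directed graph: every node has a type, every edge a label."""

    def __init__(self):
        self.node_type = {}                 # node id -> type name
        self.out_edges = defaultdict(list)  # node id -> [(edge label, target id)]

    def add_node(self, node, type_name):
        self.node_type[node] = type_name

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))


def similarity_query(graph, query_nodes, target_type, restart=0.5, steps=20):
    """Rank nodes of target_type by random-walk-with-restart score.

    ASSUMPTION: random walk with restart stands in for the unspecified
    similarity measure. Probability mass reaching a node with no outgoing
    edges is simply dropped on the next step.
    """
    start = {n: 1.0 / len(query_nodes) for n in query_nodes}
    scores = dict(start)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, mass in scores.items():
            edges = graph.out_edges.get(node, [])
            for _label, target in edges:
                # Spread the non-restart mass evenly over outgoing edges.
                nxt[target] += (1.0 - restart) * mass / len(edges)
        for node, mass in start.items():
            # Return a fixed share of the mass to the query nodes.
            nxt[node] += restart * mass
        scores = nxt
    # Apply the output constraint: keep only nodes of the requested type.
    ranked = [(n, s) for n, s in scores.items()
              if graph.node_type.get(n) == target_type and n not in start]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def build_toy_graph():
    """Toy data invented for illustration only."""
    g = TypedGraph()
    g.add_node("term:kinase", "term")
    g.add_node("paper:1", "paper")
    g.add_node("paper:2", "paper")
    g.add_node("gene:CDC28", "gene")
    g.add_node("gene:ACT1", "gene")
    g.add_edge("term:kinase", "in_title", "paper:1")
    g.add_edge("term:kinase", "in_title", "paper:2")
    g.add_edge("paper:1", "mentions", "gene:CDC28")
    g.add_edge("paper:2", "mentions", "gene:CDC28")
    g.add_edge("paper:2", "mentions", "gene:ACT1")
    return g


if __name__ == "__main__":
    graph = build_toy_graph()
    for node, score in similarity_query(graph, ["term:kinase"], "gene"):
        print(node, round(score, 4))
```

The query names only its terms and a target type. It never states which path through the schema links a keyword to a gene, which is the sense in which such a query is schema-free. In the toy graph the gene reached through both papers ranks above the gene reached through one.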

After a query, a user can optionally label any subset of the ranked list of suggested answers as "relevant" or "non-relevant". These labels drive a learning phase, the goal of which is to produce a better ranking. Types of learning currently being investigated include EM-based parameter tuning, learning to discriminatively re-rank, and learning to restructure the graph (by adding or deleting edges or vertices). Queries collected in the laboratories of working biologists are used to evaluate these learning methods.
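As a rough illustration of how relevance labels could drive a better ranking, the sketch below learns a linear re-ranker with a pairwise perceptron. This is only one simple way to learn to re-rank discriminatively, and it is an assumption made for illustration, not the project's method. The feature names `walk_score` and `curated`, the candidate ids, and the functions `learn_reranker`, `rerank`, and `toy_feedback_example` are all hypothetical.

```python
def score(weights, features):
    """Linear score: dot product of a weight dict and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def learn_reranker(candidates, labels, initial=None, epochs=10, rate=1.0):
    """Pairwise perceptron (an assumed, illustrative learner).

    candidates: candidate id -> feature dict
    labels:     candidate id -> True (relevant) or False (non-relevant)
    """
    weights = dict(initial or {})
    relevant = [c for c, ok in labels.items() if ok]
    non_relevant = [c for c, ok in labels.items() if not ok]
    for _ in range(epochs):
        for good in relevant:
            for bad in non_relevant:
                # Update only when a relevant item fails to outscore a non-relevant one.
                if score(weights, candidates[good]) <= score(weights, candidates[bad]):
                    for name, value in candidates[good].items():
                        weights[name] = weights.get(name, 0.0) + rate * value
                    for name, value in candidates[bad].items():
                        weights[name] = weights.get(name, 0.0) - rate * value
    return weights


def rerank(candidates, weights):
    """Return candidate ids sorted by learned score, best first."""
    return sorted(candidates, key=lambda c: score(weights, candidates[c]), reverse=True)


def toy_feedback_example():
    """Toy data invented for illustration only."""
    candidates = {
        "A": {"walk_score": 0.9, "curated": 0.0},
        "B": {"walk_score": 0.4, "curated": 1.0},
        "C": {"walk_score": 0.3, "curated": 1.0},
        "D": {"walk_score": 0.5, "curated": 0.0},
    }
    labels = {"A": False, "B": True}   # the user labeled only two answers
    initial = {"walk_score": 1.0}      # start from the original ranking
    weights = learn_reranker(candidates, labels, initial=initial)
    return rerank(candidates, weights)


if __name__ == "__main__":
    print(toy_feedback_example())
```

In this toy run the user labels only two answers, yet the learned weights also move the unlabeled candidate C above A, because C shares the feature that distinguished the relevant answer.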

The broadest impact of this project is on the problem of learning to integrate heterogeneous data sources (including free text and structured data). In addition, if successful, the KB system will have broad impact in the biological research community; in particular, we believe that adaptive personal KB systems of this sort will be a valuable complement to existing biological KBs.

Acknowledgements

This project is funded by the NSF's Division of Information & Intelligent Systems as award 0811562 from September 1, 2008 through August 31, 2011.

Project Members

Participants include

Relevant publications

Below are some of the publications most relevant to the research behind Querendipity.

Completed

  • Joanna Bresee, Hajin Choi, Daniel Lee, Ellen Wu (2009): Adaptive Personalized Information Management for Biologists: Final Report (report)

Integration Software

Integration Datasets

System Snapshots

Snapshots of the system, with code and data, as of a particular date.

Querendipity stands for Query-based User-guided Exploration of Relations and ENtities in Data Integrated Probabilistically or Identified in Text about Yeast. Development of a model-organism independent acronym is a subject for further research.

Last modified: Thu Jul 14 14:36:58 Eastern Daylight Time 2011