Thesis Proposal: Combining Substructures to Uncover The Relational Web

B. Cenk Gazen

Thesis Committee
Jaime Carbonell, Carnegie Mellon University (chair)
William Cohen, Carnegie Mellon University
John Lafferty, Carnegie Mellon University
Steven Minton, Fetch Technologies

Abstract

I describe an approach for automatically converting web sites into relational form. The approach relies on the existence of multiple types of substructure within a collection of pages from a web site. Each type of substructure has a corresponding expert that generates a set of simple hints for the collection. Each hint describes how some tokens align within relations. An optimization algorithm then finds the relational representation of the web site that maximizes the likelihood of observing the hints.
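
As a rough illustration of the optimization step, the toy sketch below scores candidate relational representations by the likelihood of a handful of hints and keeps the best one. It is not the system proposed here: the token list, the encoding of a hint as a (token, token, same-column) triple, the fixed agreement probability `P_AGREE`, the independence of hints, and the exhaustive search are all assumptions made for the example.

```python
import math
from itertools import product

# Hypothetical toy data: four tokens taken from two pages of one site.
tokens = ["Alice", "Bob", "$10", "$12"]

# Hypothetical hints, each (token_a, token_b, same_column), as experts might emit.
hints = [
    ("Alice", "Bob", True),    # e.g. a layout expert: same position on both pages
    ("$10", "$12", True),      # e.g. a format expert: both look like prices
    ("Alice", "$10", False),   # e.g. a type expert: a name is not a price
]

P_AGREE = 0.9  # assumed probability that a hint agrees with the true relation


def log_likelihood(assignment, hints, p=P_AGREE):
    """Log-probability of the hints given a token -> column assignment.

    Assumes hints are independent given the assignment.
    """
    total = 0.0
    for a, b, same in hints:
        consistent = (assignment[a] == assignment[b]) == same
        total += math.log(p if consistent else 1.0 - p)
    return total


def best_assignment(tokens, hints, n_columns=2):
    """Exhaustively search column assignments for the most likely one."""
    best, best_score = None, -math.inf
    for columns in product(range(n_columns), repeat=len(tokens)):
        assignment = dict(zip(tokens, columns))
        score = log_likelihood(assignment, hints)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


assignment, score = best_assignment(tokens, hints)
```

Exhaustive search is exponential in the number of tokens, so it is only workable at toy scale; a real collection of pages would need a far more efficient search.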

The contributions of the thesis will be a new approach for combining heterogeneous substructures in document collections, an implemented system that will make massive amounts of web data available to applications that use only structured data, and new search techniques in probabilistic constraint satisfaction.
