Language Technologies Thesis Proposal

  • Ph.D. Student
  • Language Technologies Institute
  • Carnegie Mellon University
Thesis Proposals

Crowd-Sourced Wrapper Construction with End Users

The internet today gives people access to an astounding amount of valuable human knowledge, including a large amount in the form of semistructured lists and tables. Such lists and tables present information visually in a way that people can readily understand but that computers generally cannot without recourse to a wrapper program. To avoid the intractable effort of manually programming every wrapper, wrapper induction systems learn wrappers from expert annotations of pages of interest. These expert annotations themselves become a bottleneck for wrapper induction systems. We explore the possibility of enlisting the large crowd of everyday end users in the task of annotating datasets on the web.

Making use of the efforts of the crowd relies on overcoming three challenges. First, can end users provide the desired data? Second, how can we effectively motivate them to provide the desired data? Third, how can we divide up the task among the crowd and combine the crowd's contributions into a coherent, useful whole? We present our previous work on two systems for end users. SmartWrap leverages users' familiarity with spreadsheets to present the wrapper annotation task as a set of simple operations on a concrete instance. Results from a user study and a wider deployment on Mechanical Turk provide substantial evidence that many end users, including nonprogrammers, can effectively contribute wrappers. Another system, Mixer, allows users to demonstrate and replay repeated paths through a data retrieval task. Results from Mixer indicate that end users would contribute wrapper data in order to use it.

In the thesis, we intend to further investigate how users can be motivated to contribute wrapper data by virtue of the assistance the tool provides. Additionally, we intend to demonstrate that end-user annotations can be used effectively to provide wrappers for a steadily increasing portion of the web.

Thesis Committee:
Anthony Tomasic (ISR/LTI)
John Zimmerman (HCII)
William Cohen (MLD/LTI)
Michael Franklin (University of California, Berkeley)

Proposal Document

