Rosie Jones, PhD

Thesis Dissertation

Rosie Jones Dissertation 2005: Learning to Extract Entities from Labeled and Unlabeled Text

Abstract

Imagine trying to build a system to identify people, locations, and organizations, or other arbitrary types, in a human language you are not familiar with. If we knew which words represent the classes people, locations, and organizations, then by examining enough of the text they occur in we could learn to recognize their contexts. And if we knew which contexts they occur in, we could recognize instances of the classes themselves. In this work we address this chicken-and-egg problem by assigning it to a computer and giving it a small number of examples of the class to learn from. We explore several algorithms in which alternating between noun phrases and their local contexts allows us to learn to recognize members of a semantic class in context. We examine active learning algorithms, customized to this domain, for eliciting useful labels from an expert to improve learning performance. Finally, we explore the graph structure of the underlying labeled and unlabeled data, showing how properties of this graph structure explain performance and inform the design choices we have to make when applying these methods to new tasks.

Send email to rosie DOT jones AT acm DOT org