Active Learning for
Information Extraction via Bootstrap Learning

Andrew Carlson (acarlson)
Kevin Killourhy (ksk)
Sue Ann Hong (sahong)
Sophie Wang (sophie.wang)

Read the Web, Spring 2006
School of Computer Science
Carnegie Mellon University

Overview

One of the greatest weaknesses of information extraction through bootstrap learning is that the "correctness" of the extracted facts tends to diverge over time. Some form of quality control is therefore crucial to sustaining such a bootstrapping system. Our project addresses this issue in two tiers: first, we plan to develop metrics for scoring the correctness of extractions; second, we plan to come up with corresponding active learning methods, built on those scoring metrics, to aid quality control.

Resources

Reading List
Documentation
Java Source Code
There are 3 files:
ActiveLearner.java - This is the core class. A bootstrapper using our module creates an object of this type and calls its public methods.
Question.java - The bootstrapper or the UI module can request questions to ask the user from the ActiveLearner. Question defines the object that is passed between the two. Initially it contains only the question to ask; the UI module can then pass it back to the ActiveLearner with the answer stored in the same object.
Main.java - This is an illustration of how a bootstrapper would use an ActiveLearner object.
A simple bootstrap extractor [pseudocode ps] - for testing our module
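
The round trip described above (the bootstrapper asks the ActiveLearner for a Question, the UI fills in the answer, and the Question is handed back) can be sketched roughly as follows. This is only an illustration, not the contents of the actual source files: every method name here (addClaim, nextQuestion, submitAnswer, getScore) is hypothetical, BootstrapperDemo merely stands in for the role of Main.java, and the rule "ask about the claim whose score is closest to 0.5" is a simple uncertainty-sampling heuristic assumed for the example rather than the scoring scheme the project settled on.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical message object passed between the learner and the UI. */
class Question {
    private final String claim;   // e.g. "IsCity(Pittsburgh)"
    private Boolean answer;       // null until the user has answered

    Question(String claim) { this.claim = claim; }
    String getClaim() { return claim; }
    boolean isAnswered() { return answer != null; }
    boolean getAnswer() { return answer; }
    void setAnswer(boolean answer) { this.answer = answer; }
}

/** Hypothetical core class: keeps one correctness score in [0, 1] per claim. */
class ActiveLearner {
    private final Map<String, Double> scores = new LinkedHashMap<>();
    private final Set<String> labeled = new HashSet<>();

    void addClaim(String claim, double score) { scores.put(claim, score); }

    /** Assumes the claim was previously added with addClaim. */
    double getScore(String claim) { return scores.get(claim); }

    /** Picks the unlabeled claim whose score is closest to 0.5 (assumed heuristic). */
    Question nextQuestion() {
        String best = null;
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (labeled.contains(e.getKey())) continue;
            double dist = Math.abs(e.getValue() - 0.5);
            if (dist < bestDist) { bestDist = dist; best = e.getKey(); }
        }
        return best == null ? null : new Question(best);
    }

    /** Accepts a Question that the UI has filled in and pins the claim's score. */
    void submitAnswer(Question q) {
        if (!q.isAnswered()) throw new IllegalArgumentException("question has no answer");
        scores.put(q.getClaim(), q.getAnswer() ? 1.0 : 0.0);
        labeled.add(q.getClaim());
    }
}

/** Stand-in for Main.java: how a bootstrapper might drive the learner. */
class BootstrapperDemo {
    public static void main(String[] args) {
        ActiveLearner learner = new ActiveLearner();
        learner.addClaim("IsCity(Paris)", 0.95);
        learner.addClaim("IsCity(Pittsburgh)", 0.55);
        learner.addClaim("IsCity(Banana)", 0.10);

        Question q = learner.nextQuestion();   // the most uncertain claim
        System.out.println("Ask user: " + q.getClaim());
        q.setAnswer(true);                     // the UI would set this from user input
        learner.submitAnswer(q);
        System.out.println("Updated score: " + learner.getScore(q.getClaim()));
    }
}
```

Reusing one Question object for both directions keeps the interface between the bootstrapper, the UI module, and the learner down to a single type, which matches the description of Question.java above.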

Project Proposal [pdf]

Introduction
Objectives
Methods
Evaluation

Meeting Notes

03/08/2006 - talk to Jon & Jaime
03/31/2006 - Evaluation metrics. API sketch. Common terms. (WARNING: Do not look at source. It'll burn your eyes.)
Preferred Terminology/Nomenclature
1. Relations/Predicate [~Strings]
2. Rules/Extraction Rule/Patterns/Extractors/Contexts [~string, left/right hand side]
3. Claim/Belief/Assertion [e.g. IsCity(...)]
4. Occurrence/Extraction/Instance/Span [~string, spans]
5. Entities/Concepts/Entity Pairs/Example [~string]
04/07/2006 - Action items for next Tuesday, why our project is cool (Kevin).
Rosie Jones's thesis contains a considerable amount of material on active learning for bootstrap learning (Ch. 4). However, it only addresses the setting in which the pool of unlabeled data does not change. That assumption does not hold in a bootstrap information extraction setting like the RtW system, so we may observe different behavior when using the same active learning or scoring schemes. For example, Jon and Jaime's bootstrapper using Jones's CO-EM-ish scoring scheme may never converge. Hence we'd like to investigate what kinds of active learning algorithms work well in a changing environment. We would also like to further analyze and formalize Jones's claims about which active learning algorithms/heuristics work well in which settings. In particular, we hope to generalize the settings from "identifying location noun phrases (NP)" or "identifying people NPs" to something more applicable to our information extraction environment (which hopefully encompasses, or at least overlaps with, the entity extraction tasks that were the focus of Jones's study). (Sadly, many things in that statement are vague. For example, which settings applicable to our IE environment generalize Jones's? Hopefully questions like that will be answered as we analyze our algorithms (which, again, are yet to be defined).)

Eventually it'd be nice to be able to say that, to do well (or to live long), 1. we can't live without active learning, 2. we need ->||<- this much active learning, or 3. we don't need active learning at all. Of course active learning will help; that alone probably won't be a very interesting thing to say.
04/11/2006 - Rest of the API. Mainloop.
04/11/2006 - Plan of Attack

Timeline

Date	RtW Goals	Our Goals
4/6-4/13	Module code ready. Integration on code stubs.	Finish coding our interface functions. ("proj code") Divide up the probability estimation derivation and the coding of different versions of the core function.
4/13-4/20	Integrated system based on proj code.	Code different versions of the core function. Test with simple bootstrapper.
4/20-4/27	Experiments w/ integrated code. Extensions	Merge with J&J?
4/27-5/4	Final evaluation (5/4)	Touch up on whatever needs to be done (alg, merging with J&J).
5/4-5/11	Entire system write-up due.	Write up about our module. Help merge.