ReadTheWeb - Active Learning Group Documentation

Data Flow

Components: Relation Learner (RL) (from J&J), Active Learner (AL), User Interface (UI). Each step below is tagged with the component that performs it.
Start of Initialization
A. (RL) Provide the Active Learner with the list of relations and the seed rules or seed entity pairs that will be used to initiate the bootstrapping algorithm.
B. (AL) Use the relations and the seeds to initialize all internal data structures and, if needed, prepare the user interface for Active Learning.
Start of Bootstrap Iteration
1. (RL) Having obtained the entity pairs and extraction rules from new queries, update the Active Learner with the newly extracted rules and entity pairs (and the number of times each pair was extracted).
2. (AL) Record the newly extracted rules, entity pairs, and counts from the Relation Learner. Rank unlabeled entity pairs by how much labeling them would improve performance according to some measure (e.g., accuracy of entity-relation probability estimates). Decide which entity pairs, and how many, to ask the user to label.
3. (UI) Poll the Active Learner to determine which questions to ask the user and how many. Prompt the user. Upon receiving answers from the user, return those answers to the Active Learner.
4. (AL) Use the answers from the User Interface to compute a score for each entity pair and relation (e.g., the probability that the relation holds for the entity pair). Also compute a score for each rule and relation (e.g., the precision of the rule as an extractor for that relation).
5. (RL) Retrieve the rule and entity pair scores from the Active Learner. Select a subset of high-scoring rules and entity pairs for the next iteration of bootstrapping.
End of Bootstrap Iteration. Repeat steps 1-5.
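
The message flow among the three components can be sketched as a single iteration of calls. This is a minimal illustration under stated assumptions, not the project's actual API: the class ActiveLearner, its method names, and the ask_user callback are all hypothetical, and the ranking is a placeholder (total extraction count).

```python
from collections import Counter


class ActiveLearner:
    """Hypothetical Active Learner interface; all names are assumptions."""

    def __init__(self, relations, seeds):
        # Step B: set up internal data structures from relations and seeds.
        self.relations = list(relations)
        self.labels = dict(seeds)      # entity pair -> relation label
        self.counts = Counter()        # (rule, entity pair) -> extraction count

    def update(self, extractions):
        # Step 2: record rules, entity pairs, and counts from the Relation Learner.
        for rule, pair, count in extractions:
            self.counts[(rule, pair)] += count

    def questions(self, limit):
        # Steps 2-3: placeholder ranking of unlabeled pairs by extraction count.
        totals = Counter()
        for (_rule, pair), count in self.counts.items():
            if pair not in self.labels:
                totals[pair] += count
        return [pair for pair, _ in totals.most_common(limit)]

    def record_answer(self, pair, relation):
        # Step 3: the User Interface returns the user's answer.
        self.labels[pair] = relation


def run_iteration(learner, extractions, ask_user, limit=1):
    """Steps 1-3 of one bootstrap iteration; ask_user stands in for the UI."""
    learner.update(extractions)
    for pair in learner.questions(limit):
        answer = ask_user(pair, learner.relations)
        if answer is not None:         # None models an unanswered question
            learner.record_answer(pair, answer)
    return learner.labels


learner = ActiveLearner(
    relations=["Advises", "Does Not Advise"],
    seeds={("Prof. Smith", "Sam"): "Advises", ("Thomas", "Mary"): "Advises"},
)
extractions = [
    ("B is advised by A", ("Thomas", "Adam"), 1),
    ("B is advised by A", ("Prof. Smith", "Sam"), 1),
]
# The lambda plays the user and always picks the first relation.
labels = run_iteration(learner, extractions, lambda pair, rels: rels[0])
```

Steps 4 and 5 (scoring and selection) are left out here; the sketch only shows who calls whom and what data crosses each boundary.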

Biology Scenario

The Active Learner framework is quite general but is currently linked to the Relation Learner module, so the Active Learner can and will be applied to whatever task the Relation Learner module is applied to. As an example, suppose that the Relation Learner is applied to the task of finding the Advises( prof, student ) relation.

The Relation Learner would construct an Active Learner object, passing it the list of possible relations (Advises, Does Not Advise). The relations in the list are assumed to be mutually exclusive and exhaustive. The Relation Learner also provides a list of seed advisor-student pairs that will be used to initiate the bootstrapping algorithm, for instance, Advises( Prof. Smith, Joe ), Advises( Prof. Smith, Sam ), and Advises( Thomas, Mary ). The Active Learner makes note of these seed advisor-student pairs, probably giving them high initial probabilities of satisfying the Advises relation. Then the bootstrapping process begins.

At the beginning of each bootstrap iteration, the Relation Learner uses existing advisor-student pairs to extract new rules (to be used as patterns in the extraction of new entity pairs). These new rules are provided to the Active Learner along with the entity pairs involved in extracting them. For instance, the statements "Sam is advised by Prof. Smith" and "Mary is advised by Thomas" might be used to learn the rule that "B is advised by A" extracts the entity pair (A, B). This rule and the pairs (Prof. Smith, Sam) and (Thomas, Mary) are passed to the Active Learner.

The Relation Learner then uses existing rules (including the newly added ones) to search for and extract advisor-student pairs. These pairs, along with the rules that extracted them, are passed to the Active Learner. For instance, the "is advised by" rule might find "Adam is advised by Thomas" in addition to "Sam is advised by Prof. Smith" and "Mary is advised by Thomas". All three extracted pairs are passed to the Active Learner: (Thomas, Adam), (Prof. Smith, Sam), and (Thomas, Mary). The Active Learner records which rules extracted which pairs and how often.
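
One way to picture the bookkeeping described above is a pair of nested counters, one indexed by rule and one by entity pair. This is only a sketch; the variable names and the choice of data structure are assumptions, not the Active Learner's actual internals.

```python
from collections import Counter, defaultdict

# (rule, (advisor, student)) extractions from the example above.
extractions = [
    ("B is advised by A", ("Thomas", "Adam")),
    ("B is advised by A", ("Prof. Smith", "Sam")),
    ("B is advised by A", ("Thomas", "Mary")),
]

# rule -> how often it extracted each pair
pairs_by_rule = defaultdict(Counter)
# pair -> how often each rule extracted it
rules_by_pair = defaultdict(Counter)

for rule, pair in extractions:
    pairs_by_rule[rule][pair] += 1
    rules_by_pair[pair][rule] += 1

# Total number of extractions made by the "is advised by" rule.
total_for_rule = sum(pairs_by_rule["B is advised by A"].values())
```

Keeping both indexes makes it cheap to answer either question later: which pairs a rule produced (for rule precision) and which rules support a pair (for pair scores).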

The Active Learner then decides which advisor-student pairs should be labeled in order to provide the most improvement in performance. A variety of heuristics might be used to rank the advisor-student pairs (e.g., number of mentions and highest uncertainty according to the PMI). The highest-ranked advisor-student pairs are marked as those to ask a user to label. In this example, since (Thomas, Adam) is the only entity pair that was not a seed and, hence, does not have a label, it would receive the highest ranking.
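
As an illustration of such a ranking, the sketch below orders unlabeled pairs by the entropy of a probability estimate and breaks ties by mention count. Entropy is a stand-in chosen for simplicity, not the PMI-based measure mentioned above, and the probability values and function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) estimate; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def rank_for_labeling(estimates, mentions, labeled):
    """Order unlabeled pairs by uncertainty, then by mention count."""
    candidates = [pair for pair in estimates if pair not in labeled]
    return sorted(
        candidates,
        key=lambda pair: (binary_entropy(estimates[pair]), mentions.get(pair, 0)),
        reverse=True,
    )


# Hypothetical estimates of the probability that Advises holds for each pair.
estimates = {
    ("Prof. Smith", "Sam"): 0.95,
    ("Thomas", "Mary"): 0.95,
    ("Thomas", "Adam"): 0.5,
}
mentions = {("Prof. Smith", "Sam"): 1, ("Thomas", "Mary"): 1, ("Thomas", "Adam"): 1}
labeled = {("Prof. Smith", "Sam"), ("Thomas", "Mary")}  # seed pairs

ranking = rank_for_labeling(estimates, mentions, labeled)
```

With these inputs only (Thomas, Adam) is unlabeled, so it is the sole candidate, matching the outcome described in the scenario.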

When a user is available to answer questions, the user interface polls the Active Learner to determine which questions to ask the user and how many. In this example, the Active Learner would report that it had one question to ask and would request a labeling of either Advises or Does Not Advise on the pair (Thomas, Adam). The user interface would ask this question of the user; the user would check the appropriate checkbox (or a default "I don't know" box if that is the case); and the user interface would report the answer back to the Active Learner.

The Active Learner uses the new labels to calculate scores for each entity pair and relation. For instance, assuming that the user confirms that Thomas is Adam's advisor, the new score for Advises( Thomas, Adam ) would be high and the new score for Does Not Advise( Thomas, Adam ) would be low.
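
A minimal sketch of this scoring step follows, assuming a user label simply pins the chosen relation to a fixed high score and that rule precision is measured over labeled pairs only. The value 0.99 and both function names are illustrative assumptions; the document does not specify how scores are actually computed.

```python
def pair_scores(label, relations, high=0.99):
    """Scores for one labeled pair: `high` for the labeled relation,
    with the remaining mass split evenly over the other relations."""
    low = (1.0 - high) / (len(relations) - 1)
    return {relation: (high if relation == label else low) for relation in relations}


def rule_precision(extracted_pairs, labels, relation):
    """Fraction of a rule's labeled extractions whose label is `relation`.
    Returns None when none of the extracted pairs has a label yet."""
    labeled = [pair for pair in extracted_pairs if pair in labels]
    if not labeled:
        return None
    hits = sum(1 for pair in labeled if labels[pair] == relation)
    return hits / len(labeled)


relations = ["Advises", "Does Not Advise"]
labels = {
    ("Prof. Smith", "Sam"): "Advises",   # seed
    ("Thomas", "Mary"): "Advises",       # seed
    ("Thomas", "Adam"): "Advises",       # confirmed by the user
}

scores = pair_scores(labels[("Thomas", "Adam")], relations)
precision = rule_precision(list(labels), labels, "Advises")
```

Because the relations are assumed to be mutually exclusive and exhaustive, the scores for a pair sum to 1 in this sketch.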

These scores are available for the Relation Learner to use in its next round of bootstrapping. Specifically, if the Relation Learner wants to learn more patterns to extract a particular relation, it will use entity pairs with high scores for that relation in the search for new rules.
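
The selection of high-scoring entity pairs could look like the following sketch. The threshold of 0.9, the function name, and the pair (Jones, Lee) with its score are all hypothetical and serve only to show a pair being filtered out.

```python
def select_for_bootstrapping(scores, relation, threshold=0.9, limit=None):
    """Return pairs whose score for `relation` meets the threshold,
    highest score first, optionally truncated to `limit` pairs."""
    chosen = [
        (pair, relation_scores[relation])
        for pair, relation_scores in scores.items()
        if relation_scores.get(relation, 0.0) >= threshold
    ]
    chosen.sort(key=lambda item: item[1], reverse=True)
    pairs = [pair for pair, _ in chosen]
    return pairs if limit is None else pairs[:limit]


# Hypothetical scores held by the Active Learner after one iteration.
scores = {
    ("Prof. Smith", "Sam"): {"Advises": 0.95, "Does Not Advise": 0.05},
    ("Thomas", "Adam"): {"Advises": 0.99, "Does Not Advise": 0.01},
    ("Jones", "Lee"): {"Advises": 0.20, "Does Not Advise": 0.80},
}

selected = select_for_bootstrapping(scores, "Advises")
```

The same function could be applied to rule scores, since step 5 of the data flow selects both rules and entity pairs.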