Models for Natural Language Learning using Unlabeled Data
Here is a lightly-annotated bibliography of papers on
learning from labeled and unlabeled data. It
focuses on methods especially relevant to bootstrap learning for
natural language analysis, and on theoretical models for how and when
we should expect unlabeled data to be helpful.
Please edit this file ( /afs/cs/project/theo-21/www/semisupervised.html) to add more citations.
Language bootstrap learning:
Yarowsky wrote an early paper describing how to learn to disambiguate
word senses. It makes the assumption that each occurrence of a
word (e.g., "bank") in a document has the same meaning (e.g., river
bank or financial bank). Abney's paper is a more recent
formal analysis of why Yarowsky's algorithm works.
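The self-training loop behind Yarowsky's algorithm can be sketched in a few lines. This is an illustrative toy, not the paper's exact procedure (which ranks collocations in a decision list by log-likelihood); the function and data here are hypothetical. Starting from a couple of seed collocation rules, contexts matched by a rule are labeled, and context words that always co-occur with a single sense are promoted into new rules (the "one sense per collocation" assumption):

```python
# Toy sketch of Yarowsky-style self-training (illustrative; names and
# data are hypothetical, not from the paper).

def yarowsky_selftrain(contexts, seed_rules, rounds=3):
    """contexts: list of sets of context words; seed_rules: {word: sense}."""
    rules = dict(seed_rules)
    for _ in range(rounds):
        # Label every context whose words match exactly one known sense.
        labeled = []
        for ctx in contexts:
            senses = {rules[w] for w in ctx if w in rules}
            if len(senses) == 1:          # unambiguous evidence only
                labeled.append((ctx, senses.pop()))
        # Promote context words that always co-occur with one sense
        # into new rules ("one sense per collocation").
        votes = {}
        for ctx, sense in labeled:
            for w in ctx:
                votes.setdefault(w, set()).add(sense)
        rules.update({w: s.pop() for w, s in votes.items() if len(s) == 1})
    return rules

# Disambiguating "bank": seed rules for two senses, then bootstrap.
contexts = [{"water", "fish"}, {"money", "loan"},
            {"water", "shore"}, {"loan", "interest"}]
rules = yarowsky_selftrain(contexts, {"water": "river", "money": "finance"})
```

Note that "interest" is only labeled in the second round, after "loan" has been promoted to a rule in the first; this is the bootstrapping effect.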
Cotraining uses unlabeled data together with labeled data to learn
f(x)=y when x can be expressed as a pair of features
x=<x1,x2> such that both x1 and x2 are individually
sufficient to predict y. This has been used to train web page
classifiers, named entity recognizers, image classifiers, and more.
The idea was introduced in 1998 by Blum & Mitchell
and has been applied and extended in several directions.
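The co-training loop above can be sketched as follows. This is a deliberately minimal toy, not Blum & Mitchell's exact procedure: each view's "classifier" is just a feature-to-label table, and confident labels produced from one view become training data for the other. All names and data are hypothetical.

```python
# Toy sketch of the co-training loop (illustrative, not Blum & Mitchell's
# exact algorithm). Each example has two views x = (x1, x2); when one
# view's rules label an example unambiguously, the other view learns
# rules from that example's features.

def cotrain(unlabeled, rules1, rules2, rounds=3):
    """unlabeled: list of (x1_features, x2_features) pairs of feature sets."""
    for _ in range(rounds):
        for rules, other, view in ((rules1, rules2, 0), (rules2, rules1, 1)):
            for x in unlabeled:
                labels = {rules[f] for f in x[view] if f in rules}
                if len(labels) == 1:            # this view is confident...
                    y = labels.pop()
                    for f in x[1 - view]:       # ...so it teaches the other view
                        other.setdefault(f, y)
    return rules1, rules2

# Hypothetical web page example: view 1 is words on the page, view 2 is
# anchor text of links pointing at the page.
pages = [({"homework"}, {"cs101"}),
         ({"syllabus"}, {"cs101"}),
         ({"cs101", "syllabus"}, {"fall"})]
r1, r2 = cotrain(pages, {"homework": "course"}, {})
```

The single seed rule on the page-word view propagates through the anchor-text view and back, eventually labeling features that never co-occurred with the seed.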
- Yarowsky, David. 1995. Unsupervised word sense
disambiguation rivaling supervised methods.
In Proceedings of the 33rd Annual Meeting of the Association for
Computational Linguistics.
- Abney, Steven. 2004. Understanding the
Yarowsky Algorithm. Computational Linguistics.
Another line of bootstrapping algorithms was started by Sergey Brin's
paper on using web search as a subroutine for bootstrap learning, with
the web itself serving as the training corpus.
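The core pattern/relation duality in Brin's approach can be sketched with a toy pattern-bootstrapping loop. This is illustrative only (Brin's DIPRE system issued web search queries and used richer URL-plus-text patterns; the function name and corpus here are hypothetical): seed pairs induce the text patterns that connect them, and those patterns then harvest new pairs from the corpus.

```python
# Toy sketch of DIPRE-style relation bootstrapping (illustrative, not
# Brin's exact system). Seed (author, title) pairs induce connecting
# patterns; the patterns extract new pairs; repeat.
import re

def bootstrap_pairs(corpus, seeds, rounds=2):
    pairs = set(seeds)
    for _ in range(rounds):
        # Induce patterns: the text occurring between a known pair.
        patterns = set()
        for a, b in pairs:
            for doc in corpus:
                m = re.search(re.escape(a) + r"(.{1,20}?)" + re.escape(b), doc)
                if m:
                    patterns.add(m.group(1))
        # Apply each pattern to harvest new pairs from the corpus.
        for pat in patterns:
            for doc in corpus:
                for m in re.finditer(r"(\w+)" + re.escape(pat) + r"([\w ]+)", doc):
                    pairs.add((m.group(1), m.group(2)))
    return pairs

corpus = ["Melville wrote Moby Dick.",
          "Austen wrote Emma.",
          "Tolstoy wrote War and Peace."]
found = bootstrap_pairs(corpus, {("Melville", "Moby Dick")})
```

A single seed pair induces the " wrote " pattern, which then extracts the other two pairs; with a larger corpus, later rounds induce further patterns from the newly found pairs.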
Etzioni's group has pushed very large-scale extraction from the web,
based on bootstrap learning of named entity extractors and relation
extractors.
Theoretical models for bootstrap learning:
The above papers contain a number of theoretical models, especially
the Blum & Mitchell 1998 paper and the Abney paper. Following are
more recent theoretical models for how and when unlabeled data can
help.
These papers provide PAC-style bounds on co-training and related
learning settings that go beyond those provided in the original
Blum & Mitchell paper.
This paper extends the co-training theoretical model to capture the iterative expansion of the domain of the learned function.
This paper considers multiple function approximators instead of
multiple views on the data, leading to a Boosting-style approach.
Whereas the above papers focus on PAC bounds, the following paper has a
very different focus. It presents a statistical model for
estimating accuracy for bootstrap learning of named entity and relation
extractors, under the assumption that correct entities and relations
will be repeatedly extracted from a large corpus, and that correct
extractions will be repeated more frequently than incorrect
extractions. This is used by Etzioni's system described above.
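The redundancy assumption can be made concrete with a small Bayesian sketch. This is an illustrative two-rate mixture model, not the paper's actual formulation, and all parameter values here are made up: correct extractions are assumed to match sentences at a higher rate than erroneous ones, so the count of times an extraction is seen drives a posterior probability that it is correct.

```python
# Minimal sketch of redundancy-based confidence estimation (an
# illustrative two-rate binomial mixture; the rates, prior, and function
# name are hypothetical, not from the paper).
from math import comb

def p_correct(k, n, p_hit=0.2, p_noise=0.02, prior=0.5):
    """Posterior P(correct | extraction seen k times in n chances),
    assuming correct extractions recur at rate p_hit and errors at
    the lower rate p_noise."""
    like_correct = comb(n, k) * p_hit ** k * (1 - p_hit) ** (n - k)
    like_error = comb(n, k) * p_noise ** k * (1 - p_noise) ** (n - k)
    return (prior * like_correct /
            (prior * like_correct + (1 - prior) * like_error))

# Repetition raises confidence: an extraction seen 5 times in 50
# candidate sentences is judged far more credible than one seen once.
high = p_correct(5, 50)
low = p_correct(1, 50)
```

This captures the paper's intuition in miniature: the large corpus provides many chances for a correct fact to recur, so repetition counts separate signal from noise.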