Learning with Labeled and Unlabeled Data
Motivation
For many practical machine learning tasks, it is easy to obtain large quantities of unlabeled data, but it is difficult to obtain labeled data.
So the question naturally arises:
Can we use the unlabeled data to help us learn?
The answer is, broadly, yes: unlabeled data can help.
But this only raises more questions, such as:
1) What is the theoretical bound on how much unlabeled data can help? (And what does it even mean to say that “unlabeled data can help”?)
Castelli and Cover wrote a groundbreaking paper on this subject. Can their analysis be improved, or a new analysis proposed?
2) What are some good algorithms for utilizing unlabeled data?
Several have been proposed, including co-training, graph mincuts, and various other approaches.
Which ones perform best in practice?
Can we give good theoretical motivations for these or for new algorithms?
Some previous work:
Blum and Mitchell, Co-training algorithm (1998)
Blum and Chawla, Graph Mincuts algorithm (2001)
Castelli and Cover, The relative value of unlabeled data
Zhu, Ghahramani and Lafferty, Gaussian processes algorithm (2003)
Joachims, Spectral Graph Transducer algorithm (2003)
Update (May 2004): This line of research has so far yielded one publication, “Semi-supervised Learning Using Randomized Mincuts”, which extends the graph mincut algorithm of Blum and Chawla.