Efficient Active and Semi-Supervised Learning of Disjunctions

April 17, 2013

In many modern applications, like web-based information gathering,
unlabeled data is abundant but labeling it is expensive. Consequently,
there has been substantial effort in understanding and using
semi-supervised learning (using large amounts of unlabeled data to augment
limited labeled data) and active learning (where the algorithm itself asks
for labels of carefully chosen examples with the goal of minimizing the
human labeling effort). In order to make inferences from this unlabeled
data, it is necessary to (sometimes implicitly) assume some connection
between the target concept being learned and the underlying distribution in
the form of a regularity assumption.

While good regularity assumptions can information-theoretically reduce the labeled sample complexity, in many cases efficient algorithms with provable guarantees are not known. We provide efficient algorithms learning the class of 2-Sided Disjunctions under modest assumptions, addressing an open problem in the field of semi-supervised learning. We are able to remove these assumptions in the active learning domain and present an efficient algorithm matching the label complexity lower bound.

While good regularity assumptions can information-theoretically reduce the labeled sample complexity, in many cases efficient algorithms with provable guarantees are not known. We provide efficient algorithms learning the class of 2-Sided Disjunctions under modest assumptions, addressing an open problem in the field of semi-supervised learning. We are able to remove these assumptions in the active learning domain and present an efficient algorithm matching the label complexity lower bound.

*Joint work with Nina Balcan, Chris Berlind, and Yingyu Liang.*