Active Testing

Nov 9, 2011

ABSTRACT:

One of the motivations for property testing of boolean functions is the idea that testing can serve as a preprocessing step before learning. However, in most machine learning applications, the ability to query functions at arbitrary points in the input space is considered highly unrealistic. Instead, the dominant query paradigm in applied machine learning, called active learning, is one where the algorithm may ask for examples to be labeled, but only from among those that exist in nature. That is, the algorithm may make a polynomial number of draws from the underlying distribution D and then query for labels, but only of points in its sample. In this work, we bring this well-studied model in learning to the domain of testing, calling it active testing.

We show that for a number of important properties, testing can still yield substantial benefits in this setting. This includes testing unions of intervals, testing linear separators, and testing various assumptions used in semi-supervised learning. For example, we show that testing unions of d intervals can be done with O(1) label requests in our setting, whereas it is known to require Omega(sqrt(d)) labeled examples for passive testing (where the algorithm must pay for labels on every example drawn from D) and Omega(d) for learning. In fact, our results for testing unions of intervals also yield improvements on prior work in both the membership query model (where any point in the domain can be queried) and the passive testing model [KR00] as well. In the case of testing linear separators in R^n, we show that both active and passive testing can be done with O(sqrt(n)) queries, substantially less than the Omega(n) needed for learning and also yielding a new upper bound for the passive testing model. We also show a general combination result that any disjoint union of testable properties remains testable in the active testing model, a feature that does not hold for passive testing.

In addition to these specific results, we also develop a general notion of the testing dimension of a given property with respect to a given distribution. We show this dimension characterizes (up to constant factors) the intrinsic number of label requests needed to test that property; we do this for both the active and passive testing models. We then use this dimension to prove a number of lower bounds. For instance, interestingly, one case where we show active testing does not help is for dictator functions, where we give Omega(log n) lower bounds that match the upper bounds for learning this class.

Our results show that testing can be a powerful tool in realistic models for learning, and further that active testing exhibits an interesting and rich structure. Our work in addition develops new characterizations of common function classes that may be of independent interest.

Joint work with Nina Balcan, Eric Blais and Avrim Blum. STOC 2012 submission.

One of the motivations for property testing of boolean functions is the idea that testing can serve as a preprocessing step before learning. However, in most machine learning applications, the ability to query functions at arbitrary points in the input space is considered highly unrealistic. Instead, the dominant query paradigm in applied machine learning, called active learning, is one where the algorithm may ask for examples to be labeled, but only from among those that exist in nature. That is, the algorithm may make a polynomial number of draws from the underlying distribution D and then query for labels, but only of points in its sample. In this work, we bring this well-studied model in learning to the domain of testing, calling it active testing.

We show that for a number of important properties, testing can still yield substantial benefits in this setting. This includes testing unions of intervals, testing linear separators, and testing various assumptions used in semi-supervised learning. For example, we show that testing unions of d intervals can be done with O(1) label requests in our setting, whereas it is known to require Omega(sqrt(d)) labeled examples for passive testing (where the algorithm must pay for labels on every example drawn from D) and Omega(d) for learning. In fact, our results for testing unions of intervals also yield improvements on prior work in both the membership query model (where any point in the domain can be queried) and the passive testing model [KR00] as well. In the case of testing linear separators in R^n, we show that both active and passive testing can be done with O(sqrt(n)) queries, substantially less than the Omega(n) needed for learning and also yielding a new upper bound for the passive testing model. We also show a general combination result that any disjoint union of testable properties remains testable in the active testing model, a feature that does not hold for passive testing.

In addition to these specific results, we also develop a general notion of the testing dimension of a given property with respect to a given distribution. We show this dimension characterizes (up to constant factors) the intrinsic number of label requests needed to test that property; we do this for both the active and passive testing models. We then use this dimension to prove a number of lower bounds. For instance, interestingly, one case where we show active testing does not help is for dictator functions, where we give Omega(log n) lower bounds that match the upper bounds for learning this class.

Our results show that testing can be a powerful tool in realistic models for learning, and further that active testing exhibits an interesting and rich structure. Our work in addition develops new characterizations of common function classes that may be of independent interest.

Joint work with Nina Balcan, Eric Blais and Avrim Blum. STOC 2012 submission.