Conditional Distribution Search
Apr 15, 2015

We consider a new family of machine learning tasks that involve identifying a (possibly rare) condition in a data set. As a simple example, we consider the task of identifying a condition under which some other property holds with high probability. This task captures a kind of "abductive reasoning," in which we seek an explicit condition that implies the occurrence of some other condition to be diagnosed or explained. We also propose a second task in the family: finding a condition under which the data set is fit well by a linear model.
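To make the first task concrete, here is a minimal sketch that assumes Boolean attributes and conditions restricted to short conjunctions of literals. It uses naive brute-force enumeration, chosen only for illustration; it is not the algorithm proposed in this work. The function name `find_condition`, its parameters, and the toy attributes (`fever`, `rash`, `ill`) are all hypothetical.

```python
from itertools import combinations

def find_condition(rows, target, max_literals=2, min_confidence=0.9, min_support=2):
    """Brute-force search over small conjunctions of Boolean literals.

    rows: list of dicts mapping attribute name -> bool (hypothetical format).
    target: name of the Boolean property to be explained.
    Returns (condition, confidence, support) for the condition covering the
    most rows among those on which the target holds with empirical frequency
    at least min_confidence, or None if no such condition exists.
    """
    attributes = sorted(k for k in rows[0] if k != target)
    literals = [(a, v) for a in attributes for v in (True, False)]
    best = None
    for size in range(1, max_literals + 1):
        for cond in combinations(literals, size):
            # Skip contradictory conjunctions such as "a and not a".
            if len({a for a, _ in cond}) < size:
                continue
            covered = [r for r in rows if all(r[a] == v for a, v in cond)]
            if len(covered) < min_support:
                continue  # too few examples to trust the estimate
            confidence = sum(r[target] for r in covered) / len(covered)
            if confidence >= min_confidence and (best is None or len(covered) > best[2]):
                best = (cond, confidence, len(covered))
    return best

# Toy data (invented for illustration): "ill" holds only when both
# "fever" and "rash" hold, which is true of a minority of the rows.
rows = [
    {"fever": True,  "rash": True,  "ill": True},
    {"fever": True,  "rash": True,  "ill": True},
    {"fever": True,  "rash": False, "ill": False},
    {"fever": False, "rash": True,  "ill": False},
    {"fever": False, "rash": False, "ill": False},
    {"fever": False, "rash": False, "ill": False},
]
result = find_condition(rows, "ill")
print(result)
```

On this toy data no single literal predicts `ill` reliably, but the conjunction of `fever` and `rash` does, even though it covers only a third of the rows. The `min_support` threshold is an assumed safeguard against conditions that look perfect on a single example.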

Identifying a rare event inherently requires a large data set. We propose, moreover, that some prominent success stories of "big data" are best viewed as examples of rare event identification. This suggests that the benefit of the large data set in these cases comes from its coverage of such rare events. This contrasts with the usual view of the role of data set size (in classification), namely that a larger data set enables us to fit a richer model with many parameters, which would suggest that "big data" is primarily a matter of fitting exceptionally sophisticated models.