Many deployed machine learning systems such as those for error prediction, information filtering, fraud detection, network intrusion detection and surveillance deal with building classifiers (or predictors) that are not deployed in isolation but are part of a larger interactive system with an expert in the loop. These applications involve a limited number of labeled examples with a high cost of labeling, and a large number of unlabeled examples, with majority of them being negative (skewed class distribution). Typically, once the models are trained, these systems score a large amount of new data and present the experts with a ranked list of cases to review. The immediate goal of such interactive systems is to help the experts find relevant (positive) examples.
The commonly used approaches in such skewed class distribution problems use supervised classification and focus on increasing the classifier accuracy (or related metrics). These approaches can be effective at identification (i.e., have a high precision) but may not take into account other relevant optimization goals when deployed in the interactive setting. For instance, the relative costs (or utility) of identifying different ``positive" examples need not be the same. Ignoring these criteria can have serious implications on the relevant effectiveness of the system possibly adversely affecting the business goals motivating their use. The business goal of deploying these machine learning systems is not just to maximize the performance of the classifier but to make the experts more efficient at performing their task.
Consequently, these approaches must aim to optimize multiple criteria including the 'cost' of labeling an instance, the utility in terms of the relevance (of highly ranked instances) to the experts ('exploitation') in conjunction with future utility or effectiveness ('exploration'). Even though the problems like fraud detection and surveillance have been researched and there are deployed systems for these problems, most of the related research has focused on one or two aspects of the overall problem and not managing all the trade-offs over time between the three factors: exploration, exploitation, and labeling/reviewing cost together.
In this thesis, we define and characterize the area of interactive classification for such practical data mining systems. We propose a factorization of the problem and a framework to solve this class of problems. The hypothesis is that jointly managing the trade-offs between the three factors: exploration, exploitation, and labeling/reviewing cost will lead to better performance of the system over time. We show evidence of the utility of this framework using real data for several tasks.
Jaime Carbonell (Co-Chair)
Alexander Rudnicky (Co-Chair)
Andrew E. Fano (Accenture)
Rayid Ghani (University of Chicago)
staceyy [atsymbol] cs.cmu.edu