
Introduction

  A dataset is imbalanced if the classes are not approximately equally represented. Imbalance on the order of 100 to 1 is prevalent in fraud detection, and imbalance of up to 100,000 to 1 has been reported in other applications [1]. There have been attempts to deal with imbalanced datasets in domains such as fraudulent telephone calls [2], telecommunications management [3], text classification [4,5,6,7,8] and detection of oil spills in satellite images [9].

The performance of machine learning algorithms is typically evaluated using predictive accuracy. However, this is not appropriate when the data is imbalanced and/or the costs of different errors vary markedly. As an example, consider the classification of pixels in mammogram images as possibly cancerous [10]. A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels. A simple default strategy of guessing the majority class would give a predictive accuracy of 98%. However, the nature of the application requires a fairly high rate of correct detection in the minority class and allows for a small error rate in the majority class in order to achieve this. Simple predictive accuracy is clearly not appropriate in such situations. The Receiver Operating Characteristic (ROC) curve is a standard technique for summarizing classifier performance over a range of tradeoffs between true positive and false positive error rates [11]. The Area Under the Curve (AUC) is an accepted traditional performance metric for a ROC curve [12,13,14]. The ROC convex hull can also be used as a robust method of identifying potentially optimal classifiers [1]. If a line passes through a point on the convex hull, then there is no other line with the same slope passing through another point with a larger true positive (TP) intercept. Thus, the classifier at that point is optimal under any distribution assumptions in tandem with that slope.
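The mammography example above can be made concrete with a small sketch (not from the paper; the 98%/2% split and labels are illustrative). It shows that always guessing the majority class attains 98% accuracy while detecting none of the minority class:

```python
# Illustration: why raw accuracy misleads on imbalanced data.
# Labels: 0 = normal pixel, 1 = abnormal pixel (minority class).
y_true = [1] * 2 + [0] * 98   # 2% minority, 98% majority
y_pred = [0] * 100            # default strategy: always guess the majority class

# Overall accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Minority recall (true-positive rate): fraction of abnormal pixels detected.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = tp / sum(y_true)

print(accuracy)         # 0.98 -- looks excellent
print(minority_recall)  # 0.0  -- detects no abnormal pixels at all
```

This is exactly the failure mode that ROC analysis avoids: the ROC curve plots the true positive rate against the false positive rate, so a classifier that ignores the minority class sits at the uninformative origin regardless of its accuracy.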

The machine learning community has addressed the issue of class imbalance in two ways. One is to assign distinct costs to training examples [15,16]. The other is to re-sample the original dataset, either by over-sampling the minority class and/or under-sampling the majority class [17,18,4,19]. Our approach [20] blends under-sampling of the majority class with a special form of over-sampling the minority class. Experiments with various datasets and the C4.5 decision tree classifier [21], Ripper [22], and a Naive Bayes classifier show that our approach improves over previous approaches based on re-sampling, modifying the loss ratio, and adjusting class priors, as measured by either the AUC or the ROC convex hull.
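The two plain re-sampling baselines mentioned above (as distinct from the synthetic over-sampling detailed in Section 4) can be sketched as follows. This is an illustrative sketch, not the paper's method; the function names and the balanced 1:1 target ratio are assumptions:

```python
import random

def undersample_majority(majority, minority, rng):
    """Randomly discard majority examples until the classes are balanced."""
    return rng.sample(majority, len(minority)), minority

def oversample_minority(majority, minority, rng):
    """Randomly replicate minority examples until the classes are balanced."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra

rng = random.Random(0)
majority = [("maj", i) for i in range(100)]  # toy majority-class examples
minority = [("min", i) for i in range(5)]    # toy minority-class examples

maj_u, min_u = undersample_majority(majority, minority, rng)
maj_o, min_o = oversample_minority(majority, minority, rng)
print(len(maj_u), len(min_u))  # 5 5
print(len(maj_o), len(min_o))  # 100 100
```

Note that random over-sampling only replicates existing minority examples, which can cause classifiers to overfit to those exact points; this limitation motivates generating synthetic minority examples instead.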

Section 2 gives an overview of performance measures. Section 3 reviews the most closely related work dealing with imbalanced datasets. Section 4 presents the details of our approach. Section 5 presents experimental results comparing our approach to other re-sampling approaches. Section 6 discusses the results and suggests directions for future work.


Nitesh Chawla (CS)
6/2/2002