Journal of Artificial Intelligence Research 16 (2002), 321-357. Submitted 09/01; published 06/02.
©2002 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

SMOTE: Synthetic Minority Over-sampling Technique

Nitesh V. Chawla¹, Kevin W. Bowyer²,
Lawrence O. Hall¹, W. Philip Kegelmeyer³

¹Department of Computer Science and Engineering, ENB 118
University of South Florida
4202 E. Fowler Ave.
Tampa, FL 33620-5399, USA
²Department of Computer Science and Engineering
384 Fitzpatrick Hall
University of Notre Dame
Notre Dame, IN 46556, USA
³Sandia National Laboratories
Biosystems Research Department, P.O. Box 969, MS 9951
Livermore, CA, 94551-0969, USA

chawla@csee.usf.edu, kwb@cse.nd.edu, hall@csee.usf.edu, wpk@ca.sandia.gov

Abstract:

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
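
The abstract states only that synthetic minority class examples are created; it does not give the procedure. As a rough illustration, the sketch below assumes the commonly described form of the idea: each synthetic point is placed at a random position on the line segment between a minority example and one of its nearest minority-class neighbours. The function name `smote_sketch` and the parameters `n_per_sample`, `k`, and `seed` are hypothetical, chosen for this example, and the toy data is made up.

```python
import math
import random


def smote_sketch(minority, n_per_sample=2, k=3, seed=0):
    """Illustrative only: interpolate between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # The k nearest minority-class neighbours of x (Euclidean distance),
        # excluding x itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(n_per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # uniform in [0, 1)
            # New point lies on the segment from x towards the chosen neighbour.
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4)]
new_points = smote_sketch(minority)
print(len(new_points))  # 8 synthetic points: 2 per original sample
```

Because every synthetic point is a convex combination of two existing minority examples, the new points stay inside the region already occupied by the minority class rather than duplicating existing examples exactly, which is what distinguishes this from over-sampling with replacement.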

Nitesh Chawla (CS)
6/2/2002