Datasets

We experimented on nine different datasets, which are summarized in Table 5.2. They vary extensively in size and class proportions, thus offering different domains for SMOTE. In order of increasing imbalance, they are:

1. The Pima Indian Diabetes dataset [32] has 2 classes and 768 samples. The data is used to identify the positive diabetes cases in a population near Phoenix, Arizona. Only 268 of the samples belong to the positive class. Good sensitivity to diabetes cases is a desirable attribute of the classifier.
2. The Phoneme dataset is from the ELENA project. The aim of the dataset is to distinguish between nasal (class 0) and oral sounds (class 1). There are 5 features. The class distribution is 3,818 samples in class 0 and 1,586 samples in class 1.
3. The Adult dataset [32] has 48,842 samples with 11,687 samples belonging to the minority class. This dataset has 6 continuous features and 8 nominal features. SMOTE and SMOTE-NC (see Section 6.1) algorithms were evaluated on this dataset. For SMOTE, we extracted the continuous features and generated a new dataset with only continuous features.
4. The E-state data [33] consists of electrotopological state descriptors for a series of compounds from the National Cancer Institute's (NCI) Yeast AntiCancer drug screen. The descriptors were generated by Tripos, Inc. Briefly, a series of about 60,000 compounds was tested against a series of 6 yeast strains at a given concentration. The test was a high-throughput screen at only one concentration, so the results are subject to contamination, etc. The growth inhibition of the yeast strain when exposed to the given compound (with respect to growth of the yeast in a neutral solvent) was measured. The activity classes are either active (at least one yeast strain was inhibited more than 70%) or inactive (no yeast strain was inhibited more than 70%). The dataset has 53,220 samples with 6,351 samples of active compounds.
5. The Satimage dataset [32] originally has 6 classes. We chose the smallest class as the minority class and collapsed the rest of the classes into one, as was done in [25]. This gave us a skewed 2-class dataset with 5,809 majority class samples and 626 minority class samples.
6. The Forest Cover dataset is from the UCI repository [32]. This dataset has 7 classes and 581,012 samples, and is used for the prediction of forest cover type based on cartographic variables. Since our system currently works for binary classes, we extracted data for two classes from this dataset and ignored the rest. Most other approaches also work for only two classes [19,18,17,1]. The two classes we considered are Ponderosa Pine with 35,754 samples and Cottonwood/Willow with 2,747 samples. The SMOTE technique can nevertheless be applied to a multiple-class problem by specifying which class to SMOTE for. In this paper, however, we have focused on 2-class problems, to explicitly represent positive and negative classes.
7. The Oil dataset was provided by Robert Holte and was used in their paper [9]. This dataset has 41 oil slick samples and 896 non-oil slick samples.
8. The Mammography dataset [10] has 11,183 samples with 260 calcifications. If we take predictive accuracy as the measure of goodness of the classifier for this case, the default accuracy would be 97.68% when every sample is labeled non-calcification. However, it is desirable for the classifier to predict most of the calcifications correctly.
9. The Can dataset was generated from the Can ExodusII data using the AVATAR [34] version of the Mustafa Visualization tool. The portion of the can being crushed was marked as "very interesting" and the rest of the can was marked as "unknown." A dataset of 443,872 samples with 8,360 samples marked as "very interesting" was generated.
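
The ordering by imbalance and the default accuracy quoted for the Mammography dataset can be checked directly from the class counts given above. The following is a minimal sketch for illustration only, not code from our experiments. The names `DATASETS`, `minority_fraction`, and `default_accuracy` are hypothetical, and where the text gives only a total and a minority count, the majority count is assumed to be their difference.

```python
# Class counts (majority, minority) taken from the dataset descriptions above.
# Where only the total and the minority count are stated, the majority count
# is derived by subtraction (an assumption made for this sketch).
DATASETS = [
    ("Pima", 768 - 268, 268),
    ("Phoneme", 3818, 1586),
    ("Adult", 48842 - 11687, 11687),
    ("E-state", 53220 - 6351, 6351),
    ("Satimage", 5809, 626),
    ("Forest Cover", 35754, 2747),
    ("Oil", 896, 41),
    ("Mammography", 11183 - 260, 260),
    ("Can", 443872 - 8360, 8360),
]


def minority_fraction(majority: int, minority: int) -> float:
    """Fraction of all samples that belong to the minority class."""
    return minority / (majority + minority)


def default_accuracy(majority: int, minority: int) -> float:
    """Accuracy obtained by labeling every sample as the majority class."""
    return majority / (majority + minority)


if __name__ == "__main__":
    for name, majority, minority in DATASETS:
        print(
            f"{name:13s} minority = {100 * minority_fraction(majority, minority):5.2f}%  "
            f"default accuracy = {100 * default_accuracy(majority, minority):5.2f}%"
        )
```

The minority fraction decreases from the first dataset to the last, which is what "increasing imbalance" means here, and the Mammography row reproduces the 97.68% default accuracy.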

Nitesh Chawla (CS)
6/2/2002