Next: ROC Creation
We experimented on nine different datasets. These datasets are summarized in Table 5.2.
These datasets vary extensively in their size and class proportions, thus offering different
domains for SMOTE. In order of increasing imbalance they are:
- The Pima Indian Diabetes dataset has 2 classes and
768 samples. The data is used to identify the positive diabetes cases in a population near Phoenix, Arizona.
Only 268 samples belong to the positive class. Good sensitivity to diabetes cases is a desirable
attribute of the classifier.
- The Phoneme dataset is from the ELENA project. The aim of the dataset is to distinguish
between nasal (class 0) and oral sounds (class 1). There are 5 features. The class distribution is 3,818 samples
in class 0 and 1,586 samples in class 1.
- The Adult dataset has 48,842 samples, of which 11,687 belong to the minority class. This dataset
has 6 continuous features and 8 nominal features. Both the SMOTE and SMOTE-NC (see Section 6.1) algorithms were evaluated on this dataset. For SMOTE, we extracted the continuous features and generated a new dataset containing only those features.
- The E-state data consists of electrotopological state descriptors for a
series of compounds from the National Cancer Institute's Yeast AntiCancer drug screen.
The descriptors were generated by Tripos, Inc. Briefly, a series of about
60,000 compounds was tested against a series of 6 yeast strains at a given
concentration. The test was a high-throughput screen at only one
concentration, so the results are subject to contamination, etc. The growth
inhibition of the yeast strain when exposed to the given compound (with
respect to growth of the yeast in a neutral solvent) was measured. A
compound is labeled active if at least one yeast strain was inhibited by
more than 70%, and inactive if no yeast strain was inhibited by more than
70%. The dataset has 53,220 samples, of which 6,351 are active compounds.
- The Satimage dataset originally has 6 classes. We chose the smallest class as
the minority class and collapsed the remaining classes into one, as was done in . This gave us a
skewed 2-class dataset with 5,809 majority class samples and 626 minority class samples.
- The Forest Cover dataset is from the UCI repository . This dataset has 7 classes and 581,012 samples,
and is used to predict forest cover type from cartographic variables. Since our
system currently works for binary classes, we extracted the data for two classes from this dataset and
ignored the rest. Most other approaches also work for only two classes
[19,18,17,1]. The two classes we considered are Ponderosa Pine with 35,754
samples and Cottonwood/Willow with 2,747 samples. The SMOTE technique can nevertheless be applied
to a multiple-class problem by specifying which class to SMOTE for. In this paper, however, we have focused on 2-class problems, to explicitly represent positive and negative classes.
- The Oil dataset was provided by Robert Holte and is used in their
paper . This dataset has 41 oil slick samples and 896 non-oil slick samples.
- The Mammography dataset has 11,183 samples with 260 calcifications. If we take predictive
accuracy as the measure of goodness of the classifier in this case, the default accuracy would be 97.68% when
every sample is labeled non-calcification. However, it is desirable for the classifier to predict most of the
calcifications correctly.
- The Can dataset was generated from the Can ExodusII data using
the AVATAR version of the Mustafa Visualization
tool. The portion of the can being crushed was marked
as "very interesting" and the rest of the can was marked as "unknown."
The resulting dataset has 443,872 samples, of which 8,360 are marked
"very interesting."
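
The class proportions quoted in the list can be checked with a few lines of arithmetic. The sketch below is only illustrative and is not part of the original experiments: the helper names (`minority_fraction`, `default_accuracy`, `collapse_to_binary`) are hypothetical, and the totals for Phoneme, Satimage, Forest Cover and Oil are assumed to be the sum of the two class counts given above. It computes the minority share of each dataset, the default accuracy obtained by labeling every sample as the majority class, and shows one simple way a multi-class label vector could be collapsed into a 2-class problem.

```python
# (name, minority samples, total samples), using the counts quoted in the list.
# Totals marked "sum" are assumed to be the two stated class counts added up.
DATASETS = [
    ("Pima", 268, 768),
    ("Phoneme", 1586, 3818 + 1586),        # sum
    ("Adult", 11687, 48842),
    ("E-state", 6351, 53220),
    ("Satimage", 626, 5809 + 626),         # sum
    ("Forest Cover", 2747, 35754 + 2747),  # sum of the two extracted classes
    ("Oil", 41, 41 + 896),                 # sum
    ("Mammography", 260, 11183),
    ("Can", 8360, 443872),
]


def minority_fraction(minority, total):
    """Share of samples that belong to the minority class."""
    return minority / total


def default_accuracy(minority, total):
    """Accuracy obtained by labeling every sample as the majority class."""
    return (total - minority) / total


def collapse_to_binary(labels, minority_label):
    """Map one chosen class to 1 and all remaining classes to 0."""
    return [1 if y == minority_label else 0 for y in labels]


fractions = [minority_fraction(m, t) for _, m, t in DATASETS]

for (name, m, t), frac in zip(DATASETS, fractions):
    print(f"{name:13s} minority {100 * frac:5.2f}%   "
          f"default accuracy {100 * default_accuracy(m, t):5.2f}%")
```

With these counts the minority share falls steadily from Pima to Can, which matches the stated ordering by increasing imbalance, and the Mammography default accuracy comes out at the 97.68% quoted above.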
Nitesh Chawla (CS)