To evaluate the post-processor, it was applied to all data sets containing continuous attributes from the UCI machine learning repository [Murphy and Aha, 1993] that were then held (owing to previous machine learning experimentation) in the local repository at Deakin University. These data sets are believed to be broadly representative of those in the repository as a whole. After experimentation with these eleven data sets, two additional data sets, sick euthyroid and discordant results, were retrieved from the UCI repository and added to the study in order to investigate specific issues, as discussed below.
The resulting thirteen data sets are described in Table 1. The second column gives the number of attributes by which each object is described. The third gives the percentage of these attributes that are continuous. The fourth gives the percentage of attribute values in the data that are missing (unknown). The fifth gives the number of objects that the data set contains. The sixth gives the percentage of these objects that belong to the most common class in the data set. The final column gives the number of classes that the data set describes. Note that the glass type data set uses the three-class Float/Not Float/Other classification rather than the more commonly used six-class classification.
|Name||No. of attributes||% continuous||% missing||No. of objects||% most common class||No. of classes|
|breast cancer Wisconsin||9||100||<1||699||66||2|
|Cleveland heart disease||13||46||<1||303||54||2|
|Hungarian heart disease||13||46||20||295||64||2|
|Pima indians diabetes||8||100||0||768||65||2|
Each data set was divided into training and evaluation sets 100 times. Each training set consisted of 80% of the data, randomly selected. Each evaluation set consisted of the remaining 20% of the data. Both C4.5 and C4.5X were applied to each of the resulting 1300 (13 data sets by 100 trials) training and evaluation set pairs.
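The resampling procedure described above can be sketched as follows. This is a minimal illustration only: the function name `repeated_splits`, the fixed random seed, and the rounding used to turn 80% into a whole number of objects are all assumptions, since the text does not specify them.

```python
import random

def repeated_splits(n_objects, n_trials=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, eval_indices) pairs for repeated random splits.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    not details taken from the study.
    """
    rng = random.Random(seed)
    n_train = round(train_fraction * n_objects)
    indices = list(range(n_objects))
    for _ in range(n_trials):
        shuffled = indices[:]          # copy so each trial starts from the same list
        rng.shuffle(shuffled)          # random selection without replacement
        yield shuffled[:n_train], shuffled[n_train:]

# Example: the Pima indians diabetes data set contains 768 objects.
splits = list(repeated_splits(768))
```

Each pair partitions the data, so every object appears in exactly one of the training or evaluation sets for a given trial. Both learners would then be run on the same pair, which is what makes a matched-pairs comparison possible.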
Table 2 summarizes the percentage predictive accuracy obtained for the unpruned decision trees generated by both C4.5 and C4.5X. It presents the mean (x̄) and standard deviation (s) over each set of 100 trials with respect to each data set for both C4.5 and C4.5X, along with the results of a two-tailed matched pairs t-test comparing these means. For twelve of the thirteen data sets C4.5X obtained a higher mean accuracy than C4.5. For the remaining data set, hypothyroid, C4.5 obtained higher mean predictive accuracy than C4.5X, albeit by a small margin (measured to two decimal places the mean accuracies were 99.51 and 99.46, respectively). For nine of the data sets the advantage toward C4.5X is statistically significant at the 0.05 level (p<=0.05), although the advantage with respect to the discordant results data is too small to be apparent when measured to one decimal place (measured to two decimal places the values are 98.58 and 98.62, respectively). The advantage toward C4.5 for the hypothyroid data is also statistically significant at the 0.05 level. The differences in mean predictive accuracy for the Hungarian heart disease, new thyroid and sick euthyroid data sets are not significant at the 0.05 level.
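As an illustration of the matched pairs t-test used for these comparisons, the following sketch computes the t statistic from per-trial accuracies. The function name and the sample values are hypothetical. Taking each difference as C4.5 minus C4.5X is an assumption, though it appears consistent with the negative t values reported where C4.5X has the higher mean.

```python
import math
import statistics

def paired_t_statistic(acc_a, acc_b):
    """Matched pairs t statistic for two equal-length lists of accuracies."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n))

# Hypothetical accuracies for four trials (not values from the study).
c45_acc = [70.0, 72.0, 69.0, 71.0]
c45x_acc = [71.0, 72.5, 70.5, 71.0]
t = paired_t_statistic(c45_acc, c45x_acc)
```

The two-tailed p value would then be read from Student's t distribution with n - 1 degrees of freedom (99 for the 100 trials used here); that step is omitted to keep the sketch dependency-free.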
Table 2: Percentage predictive accuracy for unpruned decision trees
|Name||C4.5 x̄||C4.5 s||C4.5X x̄||C4.5X s||t||p|
|breast cancer Wisconsin||94.1||1.8||94.4||1.7||-3.2||0.002|
|Cleveland heart disease||72.8||5.0||74.4||4.8||-6.1||0.000|
|Hungarian heart disease||77.0||5.3||77.4||5.2||-1.8||0.082|
|Pima indians diabetes||70.2||3.5||71.3||3.6||-8.1||0.000|
Table 3: Percentage predictive accuracy for pruned decision trees
|Name||C4.5 x̄||C4.5 s||C4.5X x̄||C4.5X s||t||p|
|breast cancer Wisconsin||95.1||1.7||95.2||1.7||-2.0||0.051|
|Cleveland heart disease||74.1||5.3||74.8||5.3||-3.7||0.000|
|Hungarian heart disease||79.2||4.9||79.4||4.8||-1.0||0.310|
|Pima indians diabetes||72.2||3.5||72.8||3.5||-5.9||0.000|
Table 3 uses the same format as Table 2 to summarize the predictive accuracy obtained for the pruned decision trees generated by both C4.5 and C4.5X. For the same twelve data sets C4.5X obtained a higher mean predictive accuracy than C4.5. For the remaining data set, hypothyroid, C4.5 again obtained higher mean predictive accuracy, although the difference is once more too small to be apparent at the level of precision displayed (measured to two decimal places the mean accuracies are 99.51 and 99.46). For six of the data sets the advantage toward C4.5X is statistically significant at the 0.05 level, although for the discordant results data the difference is only apparent at a precision of two decimal places (99.81 and 99.82, respectively). The advantage toward C4.5 for the hypothyroid data is also statistically significant at the 0.05 level. The differences for breast cancer Wisconsin, echocardiogram, Hungarian heart disease, iris, new thyroid and sick euthyroid are not statistically significant at the 0.05 level.
Once experimentation on the initial eleven data sets was complete, the results for the hypothyroid data stood in stark contrast to those for the other ten. This raised the possibility that there might be distinguishing features of the hypothyroid data that accounted for this difference in performance. Table 1 indicates that this data set is clearly distinguishable from the other ten initial data sets in the following six respects:
To explore these issues, the discordant results and sick euthyroid data sets were retrieved from the UCI repository and added to the study. These data sets are identical to the hypothyroid data set except that each has a different class attribute. All three data sets contain the same objects, described by the same attributes. The addition of the discordant results and sick euthyroid data did little to illuminate this issue, however. For all three data sets the changes in accuracy are of very small magnitude. For hypothyroid there is a significant advantage to C4.5. For sick euthyroid there is no significant advantage to either system. For the discordant results data there is a significant advantage to C4.5X.
The question of whether there is a distinguishing feature of the hypothyroid data that explains the observed results remains unanswered. Further investigation of this issue lies beyond the scope of the current paper but remains an interesting direction for future research.
These results suggest that, for the type of data to be found in the UCI repository, C4.5X's post-processing increases predictive accuracy more often than it decreases it. (Of the twenty-six comparisons, there was a significant increase for fifteen and a significant decrease for only two. A sign test reveals that this rate of success is significant at the 0.05 level, p=0.001.)
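The sign test on these counts can be sketched as below. Two assumptions are made here that the text does not state: the nine comparisons without a significant difference are discarded as ties, and the test is one-tailed. Under those assumptions the result rounds to the reported 0.001. The function name is hypothetical.

```python
from math import comb

def sign_test_one_tailed(wins, losses):
    """P(X >= wins) for X ~ Binomial(wins + losses, 0.5); ties are excluded."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Fifteen significant increases against two significant decreases.
p = sign_test_one_tailed(15, 2)
```

Doubling this value gives the corresponding two-tailed probability, which is also below 0.05.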
Tables 4 and 5 summarize the number of nodes in the decision trees developed. Table 4 addresses unpruned decision trees and Table 5 addresses pruned decision trees. Each post-processing modification replaces a single leaf with a split and two leaves. At most one such modification can be performed per leaf in the original tree. For all data sets the post-processed decision trees are significantly more complex than the original decision trees. In most cases post-processing has increased the mean number of nodes in the decision trees by approximately 50%. This demonstrates that the post-processing is causing substantial change.
Table 4: Number of nodes for unpruned decision trees
|Name||C4.5 x̄||C4.5 s||C4.5X x̄||C4.5X s||t||p|
|breast cancer Wisconsin||38.1||6.0||64.0||10.3||-51.5||0.000|
|Cleveland heart disease||66.7||7.1||100.2||11.3||-61.9||0.000|
|Hungarian heart disease||62.1||7.5||94.8||13.0||-50.1||0.000|
|Pima indians diabetes||164.8||10.8||238.8||16.3||-108.9||0.000|
Table 5: Number of nodes for pruned decision trees
|Name||C4.5 x̄||C4.5 s||C4.5X x̄||C4.5X s||t||p|
|breast cancer Wisconsin||19.2||5.0||33.1||8.6||-34.9||0.000|
|Cleveland heart disease||44.6||8.3||68.3||12.8||-43.6||0.000|
|Hungarian heart disease||26.8||11.4||41.2||17.3||-22.1||0.000|
|Pima indians diabetes||112.0||16.4||163.9||24.0||-62.5||0.000|
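The arithmetic behind the node counts can be made concrete with a short sketch. Because each modification replaces one leaf with a split and two leaves, it adds two nodes, and with at most one modification per original leaf the tree can grow by at most twice its number of leaves. Both function names are hypothetical.

```python
def max_nodes_after_postprocessing(n_nodes, n_leaves):
    """Upper bound on tree size: each original leaf gains at most two nodes."""
    return n_nodes + 2 * n_leaves

def relative_increase(before, after):
    """Fractional growth in mean node count."""
    return (after - before) / before

# A tree with one split and two leaves (three nodes) can reach at most seven nodes.
bound = max_nodes_after_postprocessing(3, 2)

# Mean node counts for unpruned Pima indians diabetes trees.
growth = relative_increase(164.8, 238.8)
```

For the Pima indians diabetes row this gives an increase of about 45%, in line with the approximately 50% growth noted in the text.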