Tradeoff Simplicity-Accuracy

Experiments were carried out using three variants of WIDC: with optimistic pruning (o), with pessimistic pruning (p), and without pruning ( $\emptyset$ ). Table 1 presents some results on various datasets, most of which were taken from the UCI repository of machine learning database [Blake et al.1998]. For each dataset, the eventual discretization of attributes was performed following previous recommendations and experimental setups [de Carvalho Gomes Gascuel1994]. The results were computed using a ten-fold stratified cross validation procedure [Quinlan1996]. The least errors for WIDC are underlined for each domain. For the sake of comparisons, column ``Others'' points out various results for other algorithms, intended to help getting a general picture of what can be the performances of efficient approaches with different outputs (decision lists, trees, committees, etc.), in terms of errors (and, when applicable, sizes). Some of the most relevant results for WIDC are summarized in the scatterplots of Table 2.

Table 1: Experimental results using WIDC.

	WIDC(o)			WIDC(p)			WIDC( $\emptyset$ )
Domain	err $\%$	$r_{DC}$	$l_{DC}$	err $\%$	$r_{DC}$	$l_{DC}$	err $\%$	$r_{DC}$	$l_{DC}$	Other
Australian	$\underline{15.57}$									$15.1_{39.0}$ f
Balance							$\underline{14.29}$			$20.1_{86.0}$ f
Breast-W				$\underline{4.08}$						$4.9_{21.8}$ f
Bupa	$\underline{36.57}$									$37.3_{37.0}$ f
Echo				$\underline{27.86}$						$32.3_{35.4}$ a
Glass2				$\underline{21.17}$						$26.3_{8.0}$ f
Heart-S				$\underline{19.48}$						c
Heart-C				$\underline{21.85}$						$22.5_{52.0}$ a
Heart-H							$\underline{20.00}$			$21.8_{60.3}$ a
Hepatitis							$\underline{15.29}$			$19.2_{34.0}$ a
Horse	$\underline{15.26}$									$15.7_{13.4}$ f
Iris	$\underline{5.33}$			$\underline{5.33}$						c
Labor	$\underline{15.00}$			$\underline{15.00}$						$16.31_{6.8}$ d
LED7							$\underline{24.73}$			$25.73_{12.2}$ d
LEDeven				$\underline{12.43}$						$13.00_{19.2}$ f
LEDeven2							$\underline{21.70}$			$23.1_{25.4}$ f
Lung	$\underline{42.50}$			$\underline{42.50}$			$\underline{42.50}$			e
Monk1	$\underline{15.00}$			$\underline{15.00}$			$\underline{15.00}$			$16.66_{5.0}$ d
Monk2				$\underline{21.48}$						$29.39_{18.0}$ d
Monk3	$\underline{3.04}$									$2.67_{2.0}$ d
Pima				$\underline{26.17}$						c
Pole				$\underline{33.52}$						$35.5_{81.6}$ f
Shuttle	$\underline{3.27}$			$\underline{3.27}$						$1.7_{29.8}$ f
TicTacToe				$\underline{20.10}$						$18.3_{130.9}$ f
Vehicle2	$\underline{26.47}$									$25.6_{43.0}$ f
Vote0	$\underline{6.81}$									$4.3_{49.6}$ a
Vote1				$\underline{9.98}$						$10.89_{6.4}$ d
Waveform							$\underline{20.24}$			$33.5_{21.8}$ b
Wine							$\underline{7.89}$			e
XD6	$\underline{16.73}$									$21.2_{58.0}$ f

Conventions: $l_{DC}$ is the whole number of literals of a DC, $r_{DC}$ is its number of rules. For ``Others'', numbers are given on the form $\mbox{error}_{(\mbox{size})}$ , where a is improved CN2 (CN2-POE) building DLs, size is the number of literals [Domingos1998]. b is ICDL building DLs, notations follow a [Nock Jappy1998]. c is C4.5 [Franck Witten1998]. d is IDC building DCs, notations follow a [Nock Gascuel1995]. e is 1-Nearest Neighbor rule and f is C4.5 (pruned, default parameters) building DTs; the size of a tree is its whole number of nodes.

Table 2: Scatterplots summarizing some results of Table 1 for the three flavors of WIDC, in terms of error (first row) and size ( $l_{DC}$ , second row), on the thirty datasets. Each point above the

line depicts a dataset for which the algorithm in abscissa performs better.

$\epsfig{file=../PLOTS/plot-acc-Op-Pe.eps,width=4.6cm,height=4.6cm}$	$\epsfig{file=../PLOTS/plot-acc-Pe-Em.eps,width=4.6cm,height=4.6cm}$	$\epsfig{file=../PLOTS/plot-acc-Op-Em.eps,width=4.6cm,height=4.6cm}$
$\epsfig{file=../PLOTS/plot-lit-Op-Pe.eps,width=4.6cm,height=4.6cm}$	$\epsfig{file=../PLOTS/plot-lit-Pe-Em.eps,width=4.6cm,height=4.6cm}$	$\epsfig{file=../PLOTS/plot-lit-Op-Em.eps,width=4.6cm,height=4.6cm}$

The interpretation of Table 1 using only errors gives the advantage to WIDC with pessimistic pruning, all the more as WIDC(p) has the advantage of providing simpler formulas than WIDC( $\emptyset$ ), and has a much simpler pruning stage than WIDC(o). Results also compare favorably to the ``Other'' results, building either DLs, DTs, or DCs. They are all the more interesting if we compare the errors in the light of the sizes obtained. For the ``Echo'' domain, WIDC with pessimistic pruning beats improved CN2 by two points, but the DC obtained contains roughly eight times fewer literals than CN2-POE's decision list. If we except ``Vote0'', on all other problems on which we dispose of CN2-POE's results, we outperform CN2-POE on both accuracy and size. Finally, on ``Vote0'', note that WIDC with optimistic pruning is slightly outperformed by CN2-POE by $2.51\%$ , but the DC obtained is fifteen times smaller than the decision list of CN2-POE. If we dwell on the results of C4.5, similar conclusions can be brought: on 12 out of 13 datasets on which we ran C4.5, WIDC(p) finds smaller formulas, and still beats C4.5's accuracy on 9 of them. A quantitative comparison of $l_{DC}$ against the number of nodes of the DTs shows that on 4 datasets out of the 13 (Pole, Shuttle, TicTacToe, Australian), the DCs are more than 6 times smaller, while they only incur a loss in accuracy for 2 of them, and limited to 1.8 $\%$ . For this latter problem (TicTacToe), a glimpse at Table 1 shows that the DCs, with less than 7 rules on average, keeps comparatively most of the information contained in DTs having more than a hundred leaves. On many problems where mining issues are crucial, such a size reduction would be well worth the (comparatively slight) loss in accuracy, because we keep a significant part of the information on very small classifiers, thus likely to be interpretable.