Optimistic Pruning

Next: Experiments Up: Pruning a DC Previous: Pessimistic Pruning

Optimistic Pruning

Kearns $\&$ Mansour (1998) present a novel algorithm to prune decision trees, based on a test over locally observed errors. Its principle is simple: each internal node of a DT is tested only once in a bottom-up fashion, and we estimate the local error over the learning examples reaching this node, before and after the removal of the node. If the local error after removal is not greater than the local error before, plus a penalty term, then we remove the node and its subtree. The penalty term makes the pruning essentially optimistic, that is, we tend to overprune the decision tree. However, thanks to local uniform convergence results, and due to the fact that certain sub-classes of decision trees are reasonably large, Kearns $\&$ Mansour (1998) are able to prove that with high probability, the overpruning will not be too severe with respect to the optimal subtree of the initial DT. We refer the reader to their paper for further theoretical results, not needed here. The point is that by using the results of Kearns $\&$ Mansour (1998), we can obtain a similar test for DC. We emphasize that our bound might not enjoy the same theoretical properties as for decision trees, because of the cardinality reasons briefly outlined before. However, such a test is interesting since it may lead especially to very small and interpretable decision committees, with the obvious hope that their accuracy will not decrease too much. Furthermore, the paper of Kearns $\&$ Mansour (1998) does not contain experimental results. We think our criterion as a way to test heuristically the experimental feasibility of some of the results of Kearns $\&$ Mansour (1998). The principle of our criterion is exactly the same as the original test of Kearns $\&$ Mansour (1998) : ``can we compare, when testing some rule $(t,\vec{v})$ and using the examples that satisfy the rule, the errors before and after removing the rule''? Let $\epsilon_{(t,\vec{v})}$ represent the error before removing the rule, on the local sample $LS_{(t,\vec{v})}$ satisfying monomial . Denote $\epsilon_\emptyset$ as the error before removing $(t,\vec{v})$ , still measured on the local sample $LS_{(t,\vec{v})}$ . Then we define the heuristic ``penalty'' (proof omitted: it is a rough upperbound of Kearns $\&$ Mansour (1998), Lemma 1) :

$\displaystyle \alpha'_{(t,\vec{v})}$

$\textstyle =$

$\displaystyle \sqrt{\frac{(\mbox{Set}((t,\vec{v}))+2)\log(n) + \log{1/\delta}}{\vert LS_{(t,\vec{v})}\vert}} \:\:.$

(8)

$\mbox{Set}((t,\vec{v}))$ denotes the maximum number of literals of all rules except $(t,\vec{v})$ in the current DC, that an arbitrary example could satisfy. The fast calculability of $\alpha'_{(t,\vec{v})}$ is obtained at the expense of a greater risk of overpruning, whose effects on some small datasets were experimentally dramatic for the accuracy. In our experiments, which contain very small datasets, we have chosen to tune a parameter limiting the effects of this combinatorial upperbound. More precisely, We have chosen to uniformly resample

into a larger subset of 5000 examples, when the initial

contained less than 5000 examples. By this, we artificially increase $\vert LS_{(t,\vec{v})}\vert$ and mimic for the small domains new domains with an identical larger size, with the additional benefits that reasonable comparisons may be made on pruning.
The value of Criterion( $(t,\vec{v})$ ) is therefore ``TRUE'' iff $\epsilon_{(t,\vec{v})} + \alpha'_ {(t,\vec{v})}\geq \epsilon_\emptyset$ .

Next: Experiments Up: Pruning a DC Previous: Pessimistic Pruning