next up previous
Next: The Post-processor Up: New Experimental Evidence Against Previous: New Experimental Evidence Against

Theoretical Basis for the Decision Tree Post-processor

The similarity assumption is a common assumption in machine learning--that objects that are similar have high probability of belonging to the same class [Rendell and Seshu, 1990]. The techniques to be described rely upon this assumption for their theoretical justification rather than upon the Occam thesis.

Starting from the similarity assumption, machine learning can be viewed as the inference of a suitable similarity metric for a learning task. A decision tree can be viewed as a partitioning of the instance space. Each partition, represented by a leaf, contains the objects that are similar in relevant respects and thus are expected to belong to the same class.

This raises the issue of how similarity should be measured. Instance-based learning methods [Aha, Kibler, and Albert, 1991] tend to map the instance space onto an n-dimensional geometric space and then employ geometric distance measures within that space to measure similarity. Such an approach is problematic on a number of grounds. First, it assumes that the underlying metrics of different attributes are commensurable. How is it possible to determine a priori whether a difference of five years in age signifies a greater or lesser difference in similarity than a difference of one inch in height? Second, it assumes that it is possible to provide a priori definitions of similarity with respect to a single attribute. Can one really make a universal prescription that a value of 16 is always more similar to a value of 2 than to a value of 64? Why should it never be the case that the relevant similarity metric is based on the log of the surface value, in which case 16 would be more similar to 64 than to 2?

If we wish to employ induction to learn classifiers expressed in a particular language then it would appear that we are forced to assume that the language in question in some manner captures a relevant aspect of similarity. Any potential leaf of a decision tree presents a plausible similarity metric (all objects that fall within that leaf are similar in some respect). Empirical evaluation (the performance of that leaf on the training set) can then be used to infer the relevance of that similarity metric to the induction task at hand. If a leaf l covers a large number of objects of class c and few of other classes, then this provides evidence that similarity with respect to the tests that define l is predictive of c.

Figure 1 illustrates a simple instance space and the partition that C4.5 [Quinlan, 1993] imposes thereon. Note that C4.5 forms nodes for continuous attributes, such as A and B, that consist of a test on a cut value x. This test takes the form a <= x. With respect to Figure 1 there is one such cut, on value 5 for attribute A.

Figure 1: A simple instance space

C4.5 infers that the relevant similarity metric relates to attribute A only. The partition (shown by a dashed line) is placed at value 5 for attribute A. However, if one does not accept the Occam thesis, but does accept the similarity assumption, there is no reason to believe that the area of the instance space for which B>5 and A<=5 (lightly shaded in Figure 1) should belong to class + (as determined by C4.5) rather than class -.

C4.5 uses the Occam thesis to justify the termination of partitioning of the instance space as soon as the decision tree accounts adequately for the training set. In consequence, large areas of the instance space that are occupied by no objects in the training set may be left within partitions for which the similarity assumption provides little support. For example, with respect to Figure 1, it could be argued that a more relevant similarity metric with respect to the region A<=5 and B>5 is similarity with respect to B. Within the entire instance space, all objects with values of B>5 belong to class -. There are five such objects. In contrast, there are only three objects with values of A<=5 that provide the evidence that objects in this area of the instance space belong to class +. Each of these tests represents a plausible similarity metric on the basis of the available evidence. Thus, an object within this region will be similar in a plausible respect to three positive and five negative objects. If objects that are similar in relevant respects have high probability of belonging to the same class, and the only other information available is that it is plausible that an object is similar to three positive and five negative objects, then it would appear more probable that the object is negative than positive.

The disagreement between C4.5 and the similarity assumption in this case contrasts with, for example, the area of the instance space for which A<=5 and B<1. In this region, the similarity assumption suggests that C4.5's partition is appropriate because all plausible similarity metrics will indicate that an object in this region is similar to positive objects onlygif.

The post-processor developed for this research analyses decision trees produced by C4.5 in order to identify such regions--those occupied by no objects from the training set but for which there is evidence (in terms of the similarity assumption) favoring relabeling with a different class to that assigned by C4.5. When such regions are identified, new branches are added to the decision tree, creating new partitionings of the instance space. Both trees must provide identical performance with respect to the training set as only regions of the instance space that are occupied by no objects in the training set are affected.

It is difficult to see how any plausible metric for complexity could interpret the addition of such branches as not increasing the complexity of the tree.

The end result is that the post-processor adds complexity to the decision tree without altering how the tree applies to the training data. The Occam thesis predicts that this will, in general, lower predictive accuracy while the similarity assumption predicts that it will, in general, increase predictive accuracy. As will be seen, the latter prediction is consistent with experimental evidence and the former is not.

next up previous
Next: The Post-processor Up: New Experimental Evidence Against Previous: New Experimental Evidence Against

Geoff Webb
Mon Sep 9 12:13:30 EST 1996