The resampling-based validation method is inspired by earlier work by Fisher, which identified variable frontiers within a strictly incremental (i.e., sorting) context -- no separate validation set was reserved; rather, the training set was used to identify variable frontiers as well. In particular, as each training observation was hierarchically sorted using COBWEB, each of the observation's variable values was predicted, and `correct' counts at each node were updated for all correctly anticipated variables. In Fisher's method, variable values were not masked during sorting -- knowledge of each variable value was used during sorting, thus helping to guide both classification and validation. In addition, the hierarchy changed during sorting/validation. While this incremental strategy led to desirable results in terms of pattern-completion error rate, it is likely that the variable frontiers identified by the incremental method are less desirable than frontiers identified with holdout, where the training and validation sets of observations are strictly segregated. In addition to Fisher, our work on variable frontiers can be traced back to ideas by Lebowitz and Kolodner, and more directly to Fisher [1987b], Fisher and Schlimmer, and Reich and Fenves, each of which uses a very different method to identify something similar in spirit to the frontiers defined here.
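The incremental scheme above can be sketched in a few lines. This is an illustrative reconstruction, not the original COBWEB code: the `Node` structure, the majority-vote prediction, and the `choose_child` hook are all assumptions made for the sketch. At each node along an observation's sorting path, every variable's value is predicted before the observation's statistics are incorporated, and a per-variable `correct' count is tallied.

```python
from collections import Counter

class Node:
    """Hypothetical cluster node; fields are illustrative, not COBWEB's."""
    def __init__(self):
        self.value_counts = {}    # variable -> Counter of observed values
        self.correct = Counter()  # variable -> count of correct predictions
        self.seen = 0
        self.children = []

    def predict(self, var):
        # Majority-vote prediction; None if no data has been seen yet.
        counts = self.value_counts.get(var)
        return counts.most_common(1)[0][0] if counts else None

    def update(self, obs):
        self.seen += 1
        for var, val in obs.items():
            self.value_counts.setdefault(var, Counter())[val] += 1

def incremental_validate(root, obs, choose_child):
    """Sort `obs` down from `root`. At each node, score a prediction for
    every variable first, then incorporate the observation -- so the node
    is always tested on data it has not yet absorbed."""
    node = root
    while node is not None:
        for var, val in obs.items():
            if node.predict(var) == val:
                node.correct[var] += 1
        node.update(obs)
        node = choose_child(node, obs)  # returns None at a leaf
```

Note that, as in the incremental method described above, the counts at a node reflect predictions made while the hierarchy's statistics were still changing, which is precisely why a holdout scheme may yield more trustworthy frontiers.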
Our method of validation and pruning is inspired by retrospective pruning strategies in decision tree induction such as reduced error pruning [Quinlan, 1987; Quinlan, 1993; Mingers, 1989a]. In a Bayesian clustering system such as AUTOCLASS [Cheeseman et al., 1988], or the minimum message length (MML) approach adopted by SNOB [Wallace & Dowe, 1994], the expansion of a hierarchical clustering is mediated by a tradeoff between prior belief in the existence of further structure and evidence in the data for further structure. We will not detail this fundamental tradeoff, but suffice it to say that expansion of a hierarchical clustering will cease along a path when the evidence for further structure in the data is insufficient in the face of prior bias. Undoubtedly, the Bayesian and MML approaches can be adapted to identify variable-specific frontiers, and thus be used in the kind of flexible pruning and focusing strategies that we have implied. In fact, something very similar in intent has been implemented in AUTOCLASS [Hanson, Stutz & Cheeseman, 1991] as a way of reducing the cost of clustering with this system: variables that covary may be `blocked', or in some sense treated as one. This version of AUTOCLASS searches a space of hierarchical clusterings, with blocks of variables assigned to particular clusters in the hierarchy. The interpretation of such assignments is that a cluster `inherits' the variable value distributions of variable blocks assigned to the cluster's ancestors. Conversely, the basic idea is that one need not descend below a cluster to determine the value distributions of variables assigned to that cluster.
Our experimental results suggest the utility of resampling for validation, the identification of variable frontiers, and pruning. However, the procedure described is not, per se, a method of clustering over all the available data, since it requires that a validation set be held out during initial hierarchy construction. There are several options that seem worthy of experimental evaluation in adapting this validation strategy as a tool for simplifying hierarchical clusterings. One strategy would be to hold out a validation set, cluster over a training set, identify variable frontiers with the validation set, and then sort the validation set relative to the clustering. This single-holdout methodology has its problems, however, for reasons similar to those identified for single holdout in supervised settings [Weiss & Kulikowski, 1991].
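One ingredient of the single-holdout strategy just described is choosing, for each variable, a depth at which validation accuracy peaks. The following is a minimal sketch of that selection step only, under assumptions not taken from the paper: that per-node, per-variable validation accuracies have already been computed along a root-to-leaf path, and that the frontier for a variable is the shallowest depth achieving the maximum accuracy (ties broken toward the root, favoring the simpler clustering).

```python
def frontier_depth(path_accuracies):
    """Given one variable's validation accuracies at successive depths
    along a root-to-leaf path (index 0 = root), return the shallowest
    depth achieving the maximum accuracy.  Illustrative only."""
    best_depth, best_acc = 0, path_accuracies[0]
    for depth, acc in enumerate(path_accuracies):
        if acc > best_acc:  # strict '>' keeps ties at the shallower depth
            best_depth, best_acc = depth, acc
    return best_depth
```

Running this per variable and per path yields a variable-specific frontier: nodes below the returned depth contribute nothing further to predicting that variable and are candidates for pruning with respect to it.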
A better strategy might be one akin to n-fold cross-validation: a hierarchical clustering is constructed over all available data; each observation is then removed in turn, used for validation with respect to each variable, and reinstated in its original location (restoring the original variable value statistics of the clusters along the path to this location).
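The remove-validate-reinstate cycle above can be sketched as follows. This is a hedged illustration under assumed structures (a `Cluster` with a `value_counts` dictionary, and majority-vote prediction at the leaf); the paper does not prescribe these details. The key property the sketch preserves is that the observation's own statistics are subtracted before prediction and restored exactly afterward, so the hierarchy is unchanged by validation.

```python
from collections import Counter

class Cluster:
    """Hypothetical cluster holding per-variable value counts."""
    def __init__(self, counts):
        self.value_counts = counts  # variable -> Counter of value frequencies

def leave_one_out_score(path, obs):
    """`path` lists the clusters (root to leaf) containing `obs`.
    Temporarily remove the observation's counts along the path, predict
    each variable from the depleted leaf statistics, then reinstate the
    counts.  Returns {variable: prediction_was_correct}."""
    for node in path:                       # remove the observation
        for var, val in obs.items():
            node.value_counts[var][val] -= 1
    results = {}
    leaf = path[-1]
    for var, val in obs.items():
        counts = leaf.value_counts[var]
        predicted = max(counts, key=counts.get) if any(counts.values()) else None
        results[var] = (predicted == val)
    for node in path:                       # reinstate the observation
        for var, val in obs.items():
            node.value_counts[var][val] += 1
    return results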