To test the validation procedure's promise for simplifying hierarchical clusterings, each of the data sets used in the optimization experiments of Section 3.4 was randomly divided into three subsets: 40% for training, 40% for validation, and 20% for test. A hierarchical clustering is first constructed by sorting the training set in randomized order, and this hierarchy is then optimized using iterative hierarchical redistribution. In practice, to limit cost, the hierarchy is constructed several levels at a time. The hierarchy is initially constructed to height 4, where the deepest level is the set of training observations, and is optimized using hierarchical redistribution. Clusters at the bottommost level (i.e., 4) are then removed as children of level-3 clusters, and the subset of training observations covered by each level-3 cluster is hierarchically sorted into a height-4 tree, which is optimized in turn. The root of each subordinate clustering is substituted for the corresponding cluster at depth 3 in the original tree. The process is repeated on the level-3 clusters of the subordinate trees, and of subsequent trees thereafter, until no further decomposition is possible. The final hierarchy, which is not of constant-bounded height, decomposes the entire training set into singleton clusters, each containing a single training observation. The validation set is then used to identify variable frontiers within the entire hierarchy.
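The level-at-a-time construction can be sketched as follows. This is a simplified illustration, not the actual system: `sort_to_height` here merely builds a balanced tree over the observations, and `optimize` is an identity stand-in for iterative hierarchical redistribution; both names are hypothetical.

```python
# Simplified sketch of the level-at-a-time construction described above.
# `sort_to_height` and `optimize` are hypothetical stand-ins: the real
# system sorts observations incrementally and optimizes each tree with
# iterative hierarchical redistribution; here sorting just builds a
# balanced binary tree and optimization is the identity.

def sort_to_height(observations, height=4, branching=2):
    """Build a tree of the given height whose deepest level holds the
    observations (stand-in for incremental hierarchical sorting)."""
    if height == 1 or len(observations) <= 1:
        return {"members": list(observations), "children": []}
    size = -(-len(observations) // branching)  # ceiling division
    chunks = [observations[i:i + size]
              for i in range(0, len(observations), size)]
    return {"members": list(observations),
            "children": [sort_to_height(c, height - 1, branching)
                         for c in chunks]}

def optimize(tree):
    """Identity stand-in for iterative hierarchical redistribution."""
    return tree

def expand_to_singletons(observations):
    """Build and optimize a height-4 tree, then repeatedly re-sort the
    members of each level-3 cluster into a fresh optimized height-4
    subtree until every leaf covers a single observation."""
    tree = optimize(sort_to_height(observations, height=4))

    def expand(node, depth):
        if depth == 3:
            if len(node["members"]) > 1:
                # Drop the old level-4 children; substitute the children
                # of an optimized subordinate height-4 tree over the same
                # members (the node itself stands in for the subordinate root).
                sub = optimize(sort_to_height(node["members"], height=4))
                node["children"] = sub["children"]
                for child in node["children"]:
                    expand(child, 2)  # children sit at subordinate depth 2
            else:
                node["children"] = []
        else:
            for child in node["children"]:
                expand(child, depth + 1)

    expand(tree, 1)
    return tree

def leaves(tree):
    """All leaf clusters of a tree."""
    if not tree["children"]:
        return [tree]
    return [l for c in tree["children"] for l in leaves(c)]
```

With, say, ten training observations, `expand_to_singletons(list(range(10)))` yields a tree whose leaves are exactly the ten singletons, mirroring the final hierarchy described above.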

During testing of a validated clustering, each variable of each test observation is masked in turn; when classification reaches a cluster on the masked variable's frontier, the most probable value at that cluster is predicted as the observation's value for that variable, and the proportion of correct predictions for each variable over the test set is recorded. For comparative purposes, we also use the test set to evaluate predictions stemming from the unvalidated tree, where all variable predictions are made at the leaves (singleton clusters) of this tree.
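As a sketch, assuming clusters that store per-variable value counts and a precomputed frontier given as a set of cluster names per variable (both hypothetical representations; the text does not fix one), the masked-variable test might look like:

```python
from collections import Counter

def predict_masked(root, obs, masked_var, frontier):
    """Classify `obs` downward with `masked_var` hidden; stop at the first
    cluster on that variable's frontier and predict the modal value there.
    `frontier` maps each variable to a set of cluster names (hypothetical)."""
    node = root
    while node["name"] not in frontier[masked_var] and node["children"]:
        # Descend to the child whose stored counts best match the
        # unmasked variables of the observation.
        node = max(node["children"],
                   key=lambda c: sum(c["counts"][v][obs[v]]
                                     for v in obs if v != masked_var))
    return node["counts"][masked_var].most_common(1)[0][0]

def variable_accuracy(root, frontier, test_set, var):
    """Proportion of test observations whose masked value of `var`
    is predicted correctly at the frontier."""
    hits = sum(predict_masked(root, obs, var, frontier) == obs[var]
               for obs in test_set)
    return hits / len(test_set)
```

For the unvalidated comparison, the same procedure applies with every variable's "frontier" taken to be the set of leaves.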

**Table 6:** Characteristics of optimized clusterings before and after validation. Averages and standard deviations over 20 trials.

Table 6 shows results from 20 experimental trials
using optimized, unvalidated and validated
clusterings generated as just described from random orderings.
The first row of each domain lists the average
number of leaves (over the 20 experimental trials)
for the unvalidated and validated trees.
The unvalidated clusterings decompose the
training data to single-observation leaves -- the number of leaves
equals the number of training observations.
In the validated clustering, we assume that clusters are pruned if they lie
below the frontiers of *all* variables.
Thus, a leaf in a validated clustering
is a cluster (in the original clustering) that is on the
frontier of *at least one* variable, and none of its descendant
clusters (in the original clustering) are on the frontier of any variable.
For example, if we assume that the tree of Figure 4
covers data described only in terms of three variables,
then the number of leaves in this validated clustering
would be **7**.
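Under this pruning assumption, counting validated leaves takes a single traversal. The sketch below assumes a hypothetical node representation (`name`, `children`) and the same per-variable frontier sets; neither is fixed by the text.

```python
def count_validated_leaves(root, frontiers):
    """Count clusters that lie on the frontier of at least one variable
    and have no descendant on the frontier of any variable.
    `frontiers` maps each variable to a set of cluster names (hypothetical)."""
    def walk(node):
        # Returns (validated leaves in subtree, any frontier cluster in subtree?)
        below_leaves, below_hit = 0, False
        for child in node["children"]:
            n, hit = walk(child)
            below_leaves += n
            below_hit = below_hit or hit
        here = any(node["name"] in f for f in frontiers.values())
        if here and not below_hit:
            return 1, True  # this cluster is a validated leaf
        return below_leaves, below_hit or here
    return walk(root)[0]
```

For instance, if cluster A (with children A1, A2) and cluster B hang off the root, with variable x's frontier at {A, B} and variable y's at {A1, root}, the validated leaves are A1 and B: A is on a frontier but has a descendant (A1) that is also on one.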

Prediction accuracies in the second row of each domain entry
are the mean proportion of correct predictions over *all*
variables over 20 trials. Predictions were generated at
leaves (singleton clusters)
in the unvalidated hierarchical clusterings and at the appropriate
variable frontiers in the validated clusterings.
In all cases, validation/pruning substantially
reduces clustering size without diminishing accuracy.

The number of leaves in the validated case, as we have
described it, assumes a very coarse pruning strategy;
it will not necessarily discriminate a clustering with uniformly
deep frontiers from one with a single or very few deep frontiers.
We have suggested that more flexible pruning or 'attention'
strategies might be possible when an analyst is focusing on one or
a few variables. We will not specify such strategies here, but the statistic
given in row 3 of each domain entry suggests that clusterings
can be rendered in considerably simpler forms when an analyst's
attention is selective. Row 3
is the *average number of frontier clusters per variable*,
taken over all variables and all experimental trials.
In the validated tree of Figure 4 the average
frontier size is .
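The row-3 statistic is straightforward to compute from per-variable frontier sets (the `frontiers` mapping is again a hypothetical representation):

```python
def average_frontier_size(frontiers):
    """Mean number of frontier clusters per variable, where `frontiers`
    maps each variable to its set of frontier clusters."""
    return sum(len(f) for f in frontiers.values()) / len(frontiers)
```

For example, a variable with two frontier clusters and another with one give an average frontier size of 1.5.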

Intuitively, a frontier cluster of a variable is a 'leaf' as far as prediction of that variable is concerned. The 'frontier size' for unvalidated clusterings is simply given by the number of leaves, since this is where all variable predictions are made in the unvalidated case. Our results suggest that when attention is selective, a partial clustering that captures the structure involving selected variables can be presented to an analyst in very simplified form.

Douglas H. Fisher

Sat Mar 30 11:37:23 CST 1996