Ideally, clustering quality as measured by the objective function
should be well correlated with clustering utility as determined by
a performance task: the higher the quality of a
clustering as judged by the objective function, the greater the
performance improvement (e.g., reduction of error rate),
and the lower the quality, the less that performance improves.
However, several authors [Fisher et
al., 1992; Nevins, 1995; Devaney & Ram, 1993]
have pointed out that **PU** scores do not
seem well-correlated with error rates. More precisely,
hierarchical clusterings (constructed by hierarchical sorting)
in which the top-level partition has
a low **PU** score lead to roughly the same error rates as
hierarchies in which the top-level partition has a high **PU**
score, when variable-value predictions are made at leaves
(singleton clusters). Apparently, even with poor
partitions at each level as measured by **PU**, test
observations are classified to the same or similar
observations at the leaves of a hierarchical clustering.
Pattern-completion error rate under these
circumstances seems insufficient to discriminate
what we might otherwise consider to be good and poor clusterings.
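
To make the performance task concrete, the sketch below classifies a test observation (with one variable value masked) to a leaf and predicts the masked value there. It is a minimal illustration under our own assumptions, not the sorting procedure of Section 3: `similarity` is a simple matching score standing in for **PU**-guided placement, and all names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    centroid: dict                  # modal variable values of covered observations
    children: list = field(default_factory=list)

def similarity(obs, centroid):
    """Fraction of shared variables on which obs and centroid agree."""
    shared = [v for v in centroid if v in obs]
    return sum(obs[v] == centroid[v] for v in shared) / max(len(shared), 1)

def classify_to_leaf(node, obs):
    """Sort obs down the tree, choosing the most similar child at each level."""
    while node.children:
        node = max(node.children, key=lambda c: similarity(obs, c.centroid))
    return node

def predict(root, obs, masked):
    """Pattern completion: predict the masked variable at the leaf reached."""
    return classify_to_leaf(root, obs).centroid[masked]

# Tiny two-leaf hierarchy; the test observation omits 'color'.
root = Node({}, [Node({'shape': 'round',  'color': 'red'}),
                 Node({'shape': 'square', 'color': 'blue'})])
print(predict(root, {'shape': 'round'}, 'color'))    # -> 'red'
```

The observation above is that even when internal partitions score poorly on **PU**, this leaf-level prediction tends to reach the same or similar training observations, so error rate alone changes little.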

Our work on simplification in Section 4 suggests that in addition
to error rate, we might choose to judge competing
hierarchical clusterings based on simplicity or some
similarly-intended criterion.
Both error rate and simplicity
are used to judge classifiers in supervised contexts.
We have seen that holdout can be used to substantially
`simplify' a hierarchical clustering. The question we now ask
is whether hierarchical clusterings that have been
optimized relative to **PU** can be simplified more substantially
than unoptimized clusterings, with no degradation in pattern-completion
error rate.

To answer this question we repeated the validation experiments
of Section 4.2 under a second experimental condition: hierarchical
clusterings were constructed from similarity orderings of the
observations using hierarchical sorting.
We saw in Section 3.4 that similarity orderings
tend to result in clusterings judged poor by the **PU** function.
We do not apply hierarchical optimization to these hierarchies.
Table 7 shows accuracies, number of leaves, and
average frontier sizes for unoptimized hierarchies constructed
from similarity orderings in the case
where they have been subjected to holdout-based validation
and in the case where they have not. These results are given under the
heading `Unoptimized'.
For convenience, we copy the
results of Table 6 under the heading `Optimized'.

**Table 7:** Characteristics of unoptimized and optimized clusterings
before and after validation.
Average and standard deviations over 20 trials.

As in the optimized case, identifying and exploiting variable frontiers in unoptimized clusterings appears to simplify a clustering substantially with no degradation in error rate. Of most interest here, however, is that optimized clusterings are simplified to a substantially greater extent than unoptimized clusterings with no degradation in error rate.

Thus far, we have focused on the criteria of error rate and simplicity, but in many applications our real interest in simplicity stems from a broader interest in minimizing the expected cost of exploiting a clustering during classification: we expect that simpler clusterings have lower expected classification costs. We can view the various distinctions between unvalidated/validated and unoptimized/optimized clusterings in terms of expected classification cost. Table 8 shows some additional data obtained from our experiments with validation. In particular, the table shows:

**Leaves (L)** - The mean number of leaves (over 20 trials) before and after validation (assuming the coarse pruning strategy described in Section 4.2) in both the optimized and unoptimized cases (copied from Table 7).

**EPL** - The mean total path length (over 20 trials). The total path length of an unvalidated tree, where each leaf corresponds to a single observation, is the sum of the depths of the leaves in the tree. In the case of a validated tree, where a leaf may cover multiple observations, the contribution of a leaf to the total path length is the depth of the leaf times the number of observations at that leaf.

**Depth (D)** - The average depth of a leaf in the tree, computed by $D = EPL/L$.

**Breadth (B)** - The average branching factor of the tree. Given that $L \approx B^D$, $B = L^{1/D}$, or equivalently $\log_m B = (\log_m L)/D$ for any base $m$.

**Cost (C)** - The expected cost of classifying an observation from the root to a leaf, in terms of the number of nodes (clusters) examined during classification. At each level we examine each cluster and select the best; thus, cost is the product of the number of levels and the number of clusters per level, so $C = D \times B$ (see the sketch below).
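
Before turning to the table, a minimal sketch of how the derived quantities follow from the two measured ones; the input numbers are illustrative, not values from Table 8.

```python
def derived_costs(leaves, epl):
    """Compute Depth, Breadth, and expected Cost from Leaves (L) and EPL."""
    D = epl / leaves             # average depth of a leaf: D = EPL / L
    B = leaves ** (1.0 / D)      # from L ~ B**D, so B = L**(1/D)
    C = D * B                    # ~B clusters examined at each of ~D levels
    return D, B, C

# Illustrative values only.
D, B, C = derived_costs(leaves=100.0, epl=650.0)
print(f"Depth={D:.2f}  Breadth={B:.2f}  Cost={C:.2f}")   # 6.50, 2.03, 13.20
```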

**Table 8:** Cost characteristics of unoptimized and optimized clustering
before and after validation.
Average and standard deviations over 20 trials.
Derived characteristics (Depth, Breadth, and Cost) are computed from the mean
values of `Leaves' and EPL.

Table 8 illustrates that the expected cost of
classification is less for optimized clusterings than for unoptimized
clusterings in both the unvalidated and validated cases. However, these
results should be taken with a grain of salt, and not simply
because they are estimated values. In particular, we have expressed
cost in terms of the expected number of nodes that would need to be examined
during classification. An implicit assumption is that cost
of examination is constant across nodes. In fact, the cost
per examination *is roughly constant* (per domain) across nodes in our
implementation and many others: at each node, all variables
are examined. Consider that by this measure of cost, the
least cost (unvalidated) clustering is one that splits the observations
in thirds at each node, thus forming a balanced ternary tree,
regardless of the form of structure in the data.
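
The ternary observation can be made precise: a balanced $B$-ary tree over $N$ observations has depth $D = \log_B N$, so, treating $B$ as continuous,

$$C(B) = B \log_B N = \frac{B}{\ln B}\,\ln N, \qquad \frac{dC}{dB} = \frac{\ln B - 1}{(\ln B)^2}\,\ln N = 0 \;\Longrightarrow\; B = e \approx 2.72,$$

and among integer branching factors $B = 3$ minimizes $B/\ln B$ ($2/\ln 2 \approx 2.89$, $3/\ln 3 \approx 2.73$, $4/\ln 4 \approx 2.89$).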

Of course, if such a tree does not reasonably capture structure in data, then we might expect this to be reflected in error rate and/or post-validation simplicity. Nonetheless, there are probably better measures of cost available. In particular, Gennari [1989] observed that when classifying an observation, evaluating the objective function over a proper subset of variables is often sufficient to categorize the observation relative to the same cluster that would have been selected if evaluation had occurred over all variables. Under ideal circumstances, when clusters of a partition are well separated (decoupled), testing a very few `critical' variables may be sufficient to advance classification.

Gennari implemented a *focusing* algorithm that sequentially evaluated
the objective function over the variables, one additional variable
at a time from most to least `critical',
until a categorization with respect to one of the
clusters could be made unambiguously.
Using Gennari's
procedure, examination cost is not constant
across nodes.
Carnes and Fisher [Fisher, Xu, Carnes,
Reich, Fenves, Chen, Shiavi, Biswas &
Weinberg, 1993]
adapted Gennari's procedure to good effect in
a diagnosis task, where the intent was to minimize the number
of probes necessary to diagnose a fault. While Gennari
offers a principled focusing strategy that can be used in
conjunction with an objective function, the general idea
of focusing on selected features during classification
can be traced back to UNIMEM [Lebowitz, 1982; 1987] and
CYRUS [Kolodner, 1983].
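
Although we do not reproduce Gennari's algorithm here, the following sketch conveys the general shape of focusing under our own simplifying assumptions: a hypothetical per-variable score increment stands in for the objective function, and a fixed margin stands in for his unambiguity test.

```python
def score_increment(cluster, obs, var):
    """Hypothetical per-variable contribution: +1 if obs matches the
    cluster's modal value on var, else 0 (a stand-in for the PU term)."""
    return 1.0 if cluster.get(var) == obs.get(var) else 0.0

def focus_classify(clusters, obs, ordered_vars, margin=1.0):
    """Choose among sibling clusters by testing variables from most to
    least 'critical', stopping once the leader is unambiguous.
    Returns (index of chosen cluster, number of variables tested)."""
    scores = [0.0] * len(clusters)
    tested = 0
    for var in ordered_vars:            # most to least critical (assumed given)
        for i, c in enumerate(clusters):
            scores[i] += score_increment(c, obs, var)
        tested += 1
        ranked = sorted(scores, reverse=True)
        if len(ranked) < 2 or ranked[0] - ranked[1] >= margin:
            break                       # simplified unambiguity test
    return scores.index(max(scores)), tested

# 'shape' alone separates the two clusters, so only one variable is tested.
clusters = [{'shape': 'round',  'color': 'red'},
            {'shape': 'square', 'color': 'red'}]
print(focus_classify(clusters, {'shape': 'round', 'color': 'red'},
                     ordered_vars=['shape', 'color']))    # -> (0, 1)
```

Under a scheme like this, examination cost varies across nodes: well-separated siblings are resolved after a few variables, while poorly separated ones require many.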

The results of Table 8 illustrate the form of an expected classification-cost analysis, but we might also have measured cost directly as time, using a test set. In fact, comparisons between the time requirements of sorting in the random and similarity ordering conditions of Tables 4 and 5 suggest cost differences between good and poor clusterings in terms of time as well. Regardless of the form of analysis, however, it seems desirable to express branching factor and cost in terms of the number of variables that need to be tested, assuming a focusing strategy such as Gennari's; this would likely make better distinctions between clusterings.
