This section compares iterative optimization strategies under two
experimental conditions.
In the first condition, a random ordering of observations is
generated and hierarchically
sorted. Each of the optimization strategies is then applied independently
to the resultant hierarchical
clustering. These experiments assume that the primary
goal of clustering is to discover a single-level partitioning of
the data that is of optimal quality. Thus,
the *objective function score of the first-level partition* is taken as the
most important dependent variable.
An independent variable is the height of the
initially-constructed clustering; this can affect the granularity of clusters
that are used in hierarchical redistribution.
A hierarchical clustering
of height 2 corresponds to a single-level partition of the data
at depth 1 (the root is at depth 0),
with leaves corresponding to individual observations
at depth 2.
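To make the height and depth convention concrete, the following minimal sketch represents a height-2 clustering as nested Python lists; the representation and the `height` helper are illustrative assumptions, not the data structures used in the experiments.

```python
# Illustration only: a height-2 hierarchical clustering as nested lists.
# Strings stand in for individual observations at the leaves.

root = [               # depth 0: root of the hierarchical clustering
    ["obs1", "obs2"],  # depth 1: cluster 1 of the first-level partition
    ["obs3"],          # depth 1: cluster 2
    ["obs4", "obs5"],  # depth 1: cluster 3 (leaves sit at depth 2)
]

def height(node):
    """Height is the length of the longest root-to-leaf path."""
    if isinstance(node, str):      # a leaf observation
        return 0
    return 1 + max(height(child) for child in node)

assert height(root) == 2           # matches the convention in the text
```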

In addition to experiments on clusterings derived by sorting random initial orderings, each redistribution strategy was tested on exceptionally poor initial clusterings generated by nonrandom orderings. Just as "dissimilarity" orderings lead to good clusterings, "similarity" orderings lead to poor clusterings [Fisher et al., 1992]. Intuitively, a similarity ordering samples observations within the same region of the data description space before sampling observations from differing regions. The reordering procedure of Section 3.1 is easily modified to produce similarity orderings by ranking each set of siblings in a hierarchical clustering from least to most probable, and appending rather than interleaving observation lists from differing clusters as the algorithm pops up the recursive levels. A similarity ordering is produced by applying this procedure to an initial clustering produced by an earlier sort of a random ordering. Another clustering is then produced by sorting the similarity-ordered data, and the three iterative optimization strategies are applied independently. We do not advocate building clusterings from similarity orderings in practice, but experiments with such orderings better test the robustness of the various optimization strategies.
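The appended, least-to-most-probable traversal is easy to sketch in code. The following is a hedged illustration under assumed data structures: `Node`, `probability`, and `observation` are hypothetical names, and the cluster probabilities stand in for whatever sibling ranking the actual system computes.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Node:
    probability: float = 1.0           # cluster probability among its siblings
    observation: Any = None            # set only at leaves
    children: List["Node"] = field(default_factory=list)

def similarity_ordering(node: Node) -> List[Any]:
    """Visit siblings from least to most probable and APPEND their
    observation lists as the recursion pops back up the levels; a
    dissimilarity ordering would interleave the lists instead."""
    if not node.children:              # leaf: a single observation
        return [node.observation]
    ordered: List[Any] = []
    for child in sorted(node.children, key=lambda c: c.probability):
        ordered.extend(similarity_ordering(child))
    return ordered

# Example: two clusters, the left one less probable, so all of its
# observations precede all of the right one's.
tree = Node(children=[
    Node(probability=0.4, children=[Node(observation="a"), Node(observation="b")]),
    Node(probability=0.6, children=[Node(observation="c"), Node(observation="d")]),
])
print(similarity_ordering(tree))       # ['a', 'b', 'c', 'd']
```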

**Table 2:** Iterative optimization strategies with
initial clusterings generated from sorting random and similarity
ordered observations. Tree height is 2.
Averages and standard deviations of **PU** scores over 20 trials.

Table 2 shows the results of
experiments with random and similarity orderings of data from
four databases of the UCI repository.
These results assume an initial clustering of height 2 (i.e., a top-level
partition + observations at leaves).
Each cell represents an average and standard deviation
over 20 trials. The first cell (labeled "sort") of each domain gives the
mean **PU** score initially obtained by sorting. Subsequent
rows under each domain reflect the mean scores obtained by
the reordering/resorting procedure of Section 3.1, iterative
redistribution of single observations described in Section 3.2,
and hierarchical redistribution described in Section 3.3.

The main findings reflected in Table 2 are:

- Initial hierarchical sorting from random input does reasonably well;
**PU** scores in this case are closer to the scores of optimized trees
than to the poorest scores obtained after sorting on similarity orderings.
This weakly suggests that initial sorting on random input takes a
substantial step in the space of clusterings toward discovery of the
final structure.
- Hierarchical redistribution achieves the highest mean **PU** score in
both the random and similarity conditions in 3 of the 4 domains; the
Small Soybean domain is the exception.
- In the House domain (random and similarity conditions) and the Mushroom
domain (random condition only), the standard deviation in **PU** scores of
clusterings optimized by hierarchical redistribution is 0.00, indicating
that it constructed level-1 partitions of the same **PU** score in all
20 trials.
- Reordering and reclustering comes closest to hierarchical
redistribution's performance in all cases, bettering it in the Small
Soybean domain.
- Single-observation redistribution modestly improves an initial sort, and is substantially worse than the other two optimization methods.

Note that with initial hierarchical clusterings of height 2, the only difference between iterative hierarchical redistribution and redistribution of single observations is that hierarchical redistribution considers "merging" clusters of the partition (by reclassifying one with respect to the others) prior to redistributing single observations during each pass through the hierarchy.
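To make the distinction concrete, here is a minimal, self-contained sketch of one optimization pass over a height-2 clustering. The scoring function is a crude within-cluster cohesion measure on toy numeric data, standing in for (and not equivalent to) the **PU** measure, and all names are hypothetical; the point is only that the hierarchical variant tries whole-cluster reclassification before single-observation moves.

```python
from statistics import mean

def score(partition):
    # Stand-in objective, NOT the paper's PU measure: higher is better
    # (tighter clusters, measured by negated within-cluster squared error).
    total = 0.0
    for cluster in partition:
        if cluster:
            c = mean(cluster)
            total -= sum((x - c) ** 2 for x in cluster)
    return total

def single_observation_pass(partition):
    # One pass: move each observation to whichever cluster scores best.
    for src in partition:
        for obs in list(src):
            src.remove(obs)
            best, best_s = src, None
            for dst in partition:
                dst.append(obs)
                s = score(partition)
                dst.remove(obs)
                if best_s is None or s > best_s:
                    best, best_s = dst, s
            best.append(obs)

def merge_pass(partition):
    # Try reclassifying whole clusters with respect to their siblings:
    # one accepted merge moves a set of observations at once, which is
    # what distinguishes hierarchical redistribution at height 2.
    i = 0
    while i < len(partition):
        best_j, best_s = None, score(partition)
        for j in range(len(partition)):
            if j == i:
                continue
            trial = [c[:] for c in partition]   # copy, then simulate merge
            trial[j].extend(trial[i])
            del trial[i]
            s = score(trial)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is None:
            i += 1
        else:
            partition[best_j].extend(partition[i])
            del partition[i]

def hierarchical_pass_height2(partition):
    # At height 2, the only extra step relative to single-observation
    # redistribution is the whole-cluster reclassification above.
    merge_pass(partition)
    single_observation_pass(partition)

partition = [[1.0, 1.1, 5.0], [5.1, 4.9], [1.2]]
hierarchical_pass_height2(partition)
print(partition)   # low-valued and high-valued observations separate
```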

Section 3.3 suggested that the expected benefits of hierarchical redistribution might be greater for deeper initial trees with more granular clusters. Table 3 shows results on the same domains and initial orderings when tree height is 4 for hierarchical redistribution; for the reader's convenience we also repeat the results from Table 2 for hierarchical redistribution when tree height is 2. In moving from height 2 to height 4, there is modest improvement in the Small Soybean domain (particularly under similarity orderings), and very slight improvement in the Large Soybean and Mushroom domains under similarity orderings.

While the improvements are very modest, moving to height-4 trees leads to near-identical performance in the random and similarity ordering conditions. This suggests that hierarchical redistribution can effectively overcome the disadvantage of initially poor clusterings.

**Table 3:** Hierarchical redistribution with
initial clusterings generated from sorting random and similarity
ordered observations. Results are shown for
tree heights of 2 (copied from Table 2) and 4.
Averages and standard deviations of **PU** scores over 20 trials.

Experiments with reorder/resort
and iterative redistribution of single
observations were also varied with respect to tree height
(e.g., height 3). For each of these methods,
the deepest set of clusters in the initial hierarchy
above the leaves was taken as the
initial partition. Reordering/resorting scores remained roughly
the same as the height 2 condition, but clusterings produced
by single-observation redistribution had **PU** scores that
were considerably worse than those given in Table 2.

We also recorded execution time for each method. For each domain, Table 4 lists the mean time (in seconds) for initial sorting, and the mean additional time for each optimization method. Ironically, these experiments demonstrate that even though hierarchical redistribution "bottoms out" in a single-observation form of redistribution, the former is consistently faster than the latter for trees of height 2 -- reclassifying a cluster simultaneously moves a set of observations that would otherwise have to be repeatedly evaluated for redistribution individually, increasing the time to stabilization.

**Table 4:** Time requirements (in seconds) of hierarchical sorting and
iterative optimization with
initial clusterings generated from sorting random and similarity
ordered observations. Tree height is 2.
Averages and standard deviations over 20 trials.

Table 4 assumes the tree constructed by initial
sorting is bounded to height 2. Table 5 gives
the time requirements of hierarchical sorting and hierarchical
redistribution when the initial tree is bounded to height 4.
As the tree gets deeper the cost of hierarchical redistribution
grows substantially, and as our comparison of performance with
height 2 and 4 trees in Table 3 suggests, there are
drastically diminishing returns in terms of partition quality.
Importantly, limited experiments with trees of height 2, 3, and 4
indicate that the cost of hierarchical redistribution is
comparable to the cost of reorder/resort at greater tree heights
and significantly
less expensive than single-observation redistribution. It is
difficult to give a cost analysis of hierarchical
redistribution (and the other methods for that matter),
since bounds on loop iterations probably depend on the nature of
the objective function. Suffice it to say that the number of nodes
subject to hierarchical
redistribution in a tree covering **n** observations
is bounded above by **2n-1**: there may be up to **n** leaves
and up to **n-1** internal nodes, since
each internal node has at least two children.
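A quick sanity check of the **2n-1** bound can be sketched in a few lines: the script below builds random trees in which every internal node has at least two children and confirms that the internal-node count never exceeds **n-1**. The tree generator is a hypothetical stand-in for any hierarchical clustering over **n** observations.

```python
import random

def random_tree_counts(n_leaves):
    """Build a random tree over n_leaves leaves in which every internal
    node has at least two children; return the internal-node count."""
    subtrees = n_leaves                   # start with each leaf as its own subtree
    internal = 0
    while subtrees > 1:
        k = random.randint(2, subtrees)   # merge k >= 2 subtrees...
        subtrees = subtrees - k + 1       # ...under one new internal node
        internal += 1
    return internal

for n in (2, 5, 17, 100):
    internal = random_tree_counts(n)
    assert internal <= n - 1              # hence total nodes <= 2n - 1
    print(n, n + internal)
```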

**Table 5:** Time requirements (in seconds) of hierarchical sorting and
hierarchical redistribution with
initial clusterings generated from sorting random and similarity
ordered observations. Results are shown
for tree heights of 2 (copied from Table 4)
and 4.
Averages and standard deviations over 20 trials.

If iterative optimization is to occur in the background, real-time response is not important, and cluster quality is paramount, then it is probably worth applying hierarchical redistribution to deeper trees; this is consistent with the philosophy behind such systems as AUTOCLASS and SNOB. For the domains examined here, however, it does not seem cost-effective to optimize with trees of height greater than 4. Thus, we adopt a tree-construction strategy that builds a hierarchical clustering three levels at a time (with hierarchical redistribution) in the experiments of Section 4.
