Our experiments demonstrate the relative abilities of three iterative optimization strategies, which have been coupled with the PU objective function and hierarchical sorting to generate initial clusterings. The reorder/resort optimization strategy of Section 3.1 makes most sense with sorting as the primary clustering strategy, but the other optimization techniques are not strongly tied to a particular initial clustering strategy. For example, hierarchical redistribution can also be applied to hierarchical clusterings generated by an agglomerative strategy [Duda & Hart, 1973; Everitt, 1981; Fisher et al., 1992], which uses a bottom-up procedure to construct hierarchical clusterings by repeatedly `merging' observations and resulting clusters until an all-inclusive root cluster is generated. Agglomerative methods do not suffer from ordering effects, but they are greedy algorithms, which are susceptible to the limitations of local decision making generally, and would thus likely benefit from iterative optimization.
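The bottom-up procedure mentioned above can be sketched as follows. This is a minimal illustration, not the method of any cited system: the `Node` class, the Hamming distance over nominal attribute values, and the single-link merging criterion are all assumptions chosen for brevity. Each merge is the locally best one and is never revisited, which is the greedy behavior that iterative optimization might later repair.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    members: list                                   # observations covered by this cluster
    children: list = field(default_factory=list)    # empty for a leaf


def hamming(a, b):
    """Number of attributes on which two nominal observations differ (assumed measure)."""
    return sum(x != y for x, y in zip(a, b))


def single_link(c1, c2):
    """Distance between the closest pair of members (assumed linkage)."""
    return min(hamming(a, b) for a in c1.members for b in c2.members)


def agglomerate(observations, linkage=single_link):
    """Repeatedly merge the two closest clusters until one root remains.

    Requires a non-empty list of observations.
    """
    clusters = [Node([obs]) for obs in observations]
    while len(clusters) > 1:
        # Greedy, local decision: pick the currently closest pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = Node(clusters[i].members + clusters[j].members,
                      [clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


observations = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "z")]
root = agglomerate(observations)
```

Because the closest pair is chosen over the whole set at each step, the resulting tree does not depend on any incremental sorting order, although ties between equally close pairs are broken by position in the list.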
In addition, all three optimization strategies can be applied regardless of objective function. Nonetheless, the relative benefits of these methods undoubtedly vary with the objective function. For example, the PU function has the undesirable characteristic that it may, under very particular circumstances, view two partitions that are very close in form as separated by a `cliff' [Fisher, 1987b; Fisher et al., 1992]. Consider a partition of M observations involving only two, roughly equal-sized clusters; its PU score has the form $PU(\{C_1, C_2\}) = [CU(C_1) + CU(C_2)]/2$. If we create a partition of three clusters by removing a single observation from, say, $C_2$, and creating a new singleton cluster $C_3$, we have $PU(\{C_1, C_2', C_3\}) = [CU(C_1) + CU(C_2') + CU(C_3)]/3$. If M is relatively large, $CU(C_3)$ will have a very small score due to the term $P(C_3) = 1/M$ (see Section 2.1). Because we are taking the average CU score of clusters, the difference between $PU(\{C_1, C_2\})$ and $PU(\{C_1, C_2', C_3\})$ may be quite large, even though they differ in the placement of only one observation. Thus, limiting experiments to the PU function may exaggerate the general advantage of hierarchical redistribution relative to the other two optimization methods. This statement is simultaneously a positive statement about the robustness of hierarchical redistribution in the face of an objective function with cliffs, and a negative statement about PU for defining such discontinuities. Nonetheless, PU and variants have been adopted in systems that fall within the COBWEB family [Gennari et al., 1989; McKusick & Thompson, 1990; Reich & Fenves, 1991; Iba & Gennari, 1991; McKusick & Langley, 1991; Kilander, 1994; Ketterlin et al., 1995; Biswas et al., 1994]. Section 5.2 suggests some alternative objective functions.
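The cliff can be reproduced numerically on toy data. The sketch below assumes the category utility commonly used in the COBWEB family, CU(C_k) = P(C_k) * sum over attributes i and values j of [P(A_i = V_ij | C_k)^2 - P(A_i = V_ij)^2], with PU taken as the average CU over the clusters of a partition. The function names and the data are illustrative assumptions, not taken from the experiments.

```python
from collections import Counter


def sum_sq_probs(observations):
    """Sum over attributes and values of P(A_i = V_ij)^2 within `observations`."""
    n = len(observations)
    total = 0.0
    for i in range(len(observations[0])):
        counts = Counter(obs[i] for obs in observations)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(cluster, all_obs):
    """CU(C_k) under the assumed definition; P(C_k) is the cluster's share of the data."""
    p_cluster = len(cluster) / len(all_obs)
    return p_cluster * (sum_sq_probs(cluster) - sum_sq_probs(all_obs))


def partition_utility(partition):
    """PU: average CU over the (non-empty) clusters of the partition."""
    all_obs = [obs for cluster in partition for obs in cluster]
    return sum(category_utility(c, all_obs) for c in partition) / len(partition)


# Toy data (assumed): M = 100 observations, two attributes, two clean groups of 50.
c1 = [("a", "x")] * 50
c2 = [("b", "y")] * 50

pu_two = partition_utility([c1, c2])                 # two equal-sized clusters
pu_three = partition_utility([c1, c2[:-1], c2[-1:]])  # one observation split off

print(f"PU, two clusters:   {pu_two:.3f}")    # 0.500
print(f"PU, three clusters: {pu_three:.3f}")  # 0.333
```

Here the singleton contributes a CU of only 0.01 because P(C_3) = 1/100, yet it counts fully in the denominator of the average, so moving a single observation lowers the score by a third.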
Beyond the nonoptimality of PU, our findings should not be taken as the best that these strategies can do when they are engineered for a particular clustering system. We could introduce forms of randomization or systematic variation to any of the three strategies. For example, while Michalski and Stepp's seed-selection methodology inspires reordering/resorting, their approach selects `border' observations when the selection of `centroids' fails to improve clustering quality from one iteration to the next; this is an example of the kind of systematic variation that one might introduce in pursuit of better clusterings. In contrast, AUTOCLASS may take large heuristically-guided `jumps' away from a current clustering. This approach might be, in fact, a somewhat less systematic (but equally successful) variation on the macro-operator theme that inspired hierarchical redistribution, and is similar to HIERARCH's approach as well. SNOB [Wallace & Dowe, 1994] employs a variety of search operators, including operators similar to COBWEB's merge and split (though without the same restrictions on local application), random restart of the clustering process with new seed observations, and `redistribution' of observations. In fact, the user can program SNOB's search strategy using these differing primitive search operators. In any case, systems such as CLUSTER/2, AUTOCLASS, and SNOB do not simply `give up' when they fail to improve clustering quality from one iteration to the next.
As SNOB illustrates, two or more strategies might be combined to advantage. As an additional example, Biswas et al. [1994] adapt Fisher, Xu, and Zard's [1992] dissimilarity ordering strategy to preorder observations prior to clustering. After sorting using PU, their ITERATE system then applies iterative redistribution of single observations using a category match measure by Fisher and Langley.
The combination of preordering and iterative redistribution appears to yield good results in ITERATE. Our results with reorder/resort suggest that preordering is primarily responsible for quality benefits over a simple sort, but the relative contribution of ITERATE's redistribution operator is not certain, since it differs in some respects from the redistribution technique described in this paper. However, the use of three different measures -- distance, PU, and category match -- during clustering may be unnecessary and adds undesirable coupling in the design of the clustering algorithm. If, for example, one wants to experiment with the merits of differing objective functions, it is undesirable to worry about the `compatibility' of this function with two other measures. In contrast, reordering/resorting generalizes Fisher et al.'s [1992] ordering strategy; this generalization and the iterative redistribution strategy we describe assume no auxiliary measures beyond the objective function. In fact, as in Fisher [1987a; 1987b], an evaluation of ITERATE's clusterings is made using measures of variable value predictability or $P(V_{ij} | C_k)$, predictiveness or $P(C_k | V_{ij})$, and their product. It is not clear that a system need exploit several related, albeit different, measures during the generation and evaluation of clusterings; undoubtedly a single, carefully selected objective function can be used exclusively during clustering.
Reordering/resorting and iterative redistribution of single observations could be combined in a manner similar to ITERATE's exploitation of certain specializations of these procedures. Our results suggest that reordering/resorting would put a clustering in a good `ballpark', while iterative redistribution would subsequently make modest refinements. We have not combined strategies, but have in a sense conducted the inverse of an `ablation' study by evaluating individual strategies in isolation. In the limited number of domains explored in Section 3.4, however, it appears difficult to improve on hierarchical redistribution.
Finally, our experiments applied various optimization techniques after all data were sorted. It may be desirable to apply the optimization procedures at intermittent points during sorting. This may improve the quality of final clusterings using reordering/resorting and redistribution of single observations, as well as reduce the overall cost of constructing final optimized clusterings using any of the methods, including hierarchical redistribution, which already appears to do quite well on the quality dimension. In fact, HIERARCH can be viewed as performing something akin to a restricted form of hierarchical redistribution after each observation. This is probably too extreme -- if iterative optimization is performed too often, the resultant cost can outweigh any savings gleaned by maintaining relatively well optimized trees throughout the sorting process. Utgoff makes a similar suggestion for intermittent restructuring of decision trees during incremental, supervised induction.
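The scheduling trade-off can be made concrete with a small driver that interleaves sorting and optimization. This is only a sketch under stated assumptions: `insert` and `optimize` are hypothetical callables standing in for a sorting step and for any of the three optimization strategies, and the fixed `period` is one simple schedule among many that could be tried.

```python
def sort_with_intermittent_optimization(observations, insert, optimize, period):
    """Sort observations one at a time, optimizing after every `period` insertions.

    `insert(clustering, obs)` and `optimize(clustering)` are hypothetical
    callables that each return an updated clustering. Returns the final
    clustering and the number of optimization passes performed.
    """
    if period < 1:
        raise ValueError("period must be at least 1")
    clustering = None
    passes = 0
    for count, obs in enumerate(observations, start=1):
        clustering = insert(clustering, obs)
        if count % period == 0:
            clustering = optimize(clustering)
            passes += 1
    # Final pass if the last few observations have not yet been optimized.
    if observations and len(observations) % period != 0:
        clustering = optimize(clustering)
        passes += 1
    return clustering, passes


# Stand-ins for illustration only: the "clustering" is a plain list,
# and "optimization" merely sorts it.
def toy_insert(clustering, obs):
    return (clustering or []) + [obs]


def toy_optimize(clustering):
    return sorted(clustering)


data = [7, 3, 9, 1, 8, 2, 6, 4, 10, 5]
result, passes = sort_with_intermittent_optimization(data, toy_insert, toy_optimize, period=4)
```

Setting `period=1` corresponds to optimizing after every observation, the extreme discussed above, while a period equal to the number of observations recovers the single post-sort optimization used in the experiments.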