The results of Section 5.1 suggest that the PU function is useful in identifying structure in data: clusterings optimized relative to this function were simpler and as accurate as clusterings that were not optimized relative to the function. Thus, PU leads to something reasonable along the error rate and simplicity dimensions, but can other objective functions do a better job along these dimensions? Based on earlier discussion on the limitations of PU, notably that averaging CU over the clusters of a partition introduced `cliffs' in the space of partitions, it is likely that better objective functions can be found. For example, we might consider Bayesian variants like those found in AUTOCLASS [Cheeseman et al., 1988] and Anderson and Matessa's  system, or the closely related MML approach of SNOB [Wallace & Dowe, 1994]. We do not evaluate alternative measures such as these here, but do suggest a number of other candidates.
Section 2.1 noted that the CU function could be viewed as a summation over Gini Indices, which measured the collective impurity of variables conditioned on cluster membership. Intuition may be helped further by an information-theoretic analog to CU [Corter & Gluck, 1992]:
The information-theoretic analog can be understood as a summation over information gain values, where information gain is an often used selection criterion for decision tree induction [Quinlan, 1986]: the clustering analog rewards clusters, , that maximize the sum of information gains over the individual variables, .
Both the Gini and Information Gain measures are often-used bases for selection measures of decision tree induction. They are used to measure the expected decrease in impurity or uncertainty of a class label, conditioned on knowledge of a given variable's value. In a clustering context, we are interested in the decrease in impurity of each variable's value conditioned on knowledge of cluster membership -- thus, we use a summation over suitable Gini Indices or alternatively, information gain scores. However, it is well known that in the context of decision tree induction, both measures are biased to select variables with more legal values. Thus, various normalizations of these measures or different measures altogether, have been devised. In the clustering adaptation of these measures normalization is also necessary, since alone or its information-theoretic analog will favor a clustering of greatest cardinality, in which the data are partitioned into singleton clusters, one for each observation. Thus, PU normalizes the sum of Gini indices by averaging.
A general observation is that many selection measures used for decision tree induction can be adapted as objective functions for clustering. There are a number of selection measures that suggest themselves as candidates for clustering, in which normalization is more principled than averaging. Two candidates are Quinlan's  Gain Ratio and Lopez de Mantaras'  normalized information gain.
[Quinlan, 1986] [Lopez de Mantaras, 1991]
From these we can derive two objective functions for clustering:
The latter of these clustering variations was defined in Fisher and Hapanyengwi . Our nonsystematic experimentation with Lopez de Mantaras' normalized information gain variant suggests that it mitigates the problems associated with PU, though conclusions about its merits must await further experimentation. In general, there are a wealth of promising objective functions based on decision tree selection measures that we might consider. We have described two, but there are others such as Fayyad's  ORT function.
The relationship between supervised and unsupervised measures also has been pointed out in the context of Bayesian systems [Duda & Hart, 1973]. Consider AUTOCLASS [Cheeseman et al.1988], which searches for the most probable clustering, H, given the available data set, D -- i.e., the clustering with highest . Under independence assumptions made by AUTOCLASS, the computation of includes the easily seen mechanisms of the simple Bayes classifier used in supervised contexts.
We have not compared the proposed derivations of decision tree selection measures or the Bayesian/MML measures of AUTOCLASS and SNOB as yet, but have proposed pattern-completion error rate, simplicity, and classification cost as external, objective criteria that could be used in such comparisons. An advantage of the Bayesian and MML approaches is that, with proper selection of prior biases, these do not require a separate strategy (e.g., resampling) for pruning, and these strategies can be adapted for variable frontier identification. Rather, the objective function used for cluster formation serves to cease hierarchical decomposition as well. Though we know of no experimental studies with the Bayesian and MML techniques along the accuracy and cost dimensions outlined here, we expect that each would perform quite well.