The results of Section 5.1 suggest that the **PU** function
is useful in identifying structure in data: clusterings optimized
relative to this function were simpler and as accurate as
clusterings that were not optimized relative to the function. Thus,
**PU** leads to something reasonable along the error rate and
simplicity dimensions, but can other objective functions
do a better job along these dimensions? Based on our earlier discussion
of the limitations of **PU**, notably that averaging **CU** over the
clusters of a partition introduces 'cliffs' in the space
of partitions, it is likely that better objective functions can be
found. For example, we might consider Bayesian variants like
those found in AUTOCLASS [Cheeseman
et al., 1988]
and Anderson and Matessa's [1991] system,
or the closely related MML approach of SNOB [Wallace & Dowe, 1994].
We do not evaluate
alternative measures such as these here, but do suggest a
number of other candidates.

Section 2.1 noted that the **CU** function could be viewed
as a summation over Gini Indices, which measured the
collective impurity of variables conditioned on cluster
membership.
Intuition may be helped further by an information-theoretic
analog to **CU** [Corter & Gluck, 1992]:

$$P(C_k) \sum_i \big[\, H(A_i) - H(A_i \mid C_k) \,\big],$$

where $H(A_i) = -\sum_j P(A_i{=}V_{ij}) \log P(A_i{=}V_{ij})$ is the
entropy of variable $A_i$. The information-theoretic analog
can be understood as a summation over *information
gain* values, where information gain is an often-used selection
criterion for decision tree induction
[Quinlan, 1986]: the clustering analog rewards
clusters, $C_k$, that maximize the sum of information gains over
the individual variables, $A_i$.
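As a minimal sketch of this computation (hypothetical function names; nominal variables and log base 2 assumed; not Corter and Gluck's actual formulation):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of nominal values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def cu_information(cluster, data):
    """P(cluster) times the summed information gain
    H(A_i) - H(A_i | cluster) over the individual variables A_i.
    `data` and `cluster` are lists of equal-length tuples of nominal values."""
    p_cluster = len(cluster) / len(data)
    return p_cluster * sum(
        entropy([row[i] for row in data]) - entropy([row[i] for row in cluster])
        for i in range(len(data[0]))
    )
```

A cluster whose members are pure on every variable earns the maximal summed gain, discounted by the cluster's probability.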

Both the Gini index and information gain are often used as
selection measures in decision tree induction. There they
measure the expected decrease in
impurity or uncertainty of a class label, conditioned
on knowledge of a given variable's value. In a clustering
context, we are interested in the decrease in impurity
of *each* variable's value conditioned on knowledge
of cluster membership -- thus, we use a summation
over suitable Gini Indices or, alternatively, information
gain scores. However, it is well known that in the context of decision
tree induction both measures are biased toward variables with more
legal values. Thus, various normalizations of these measures,
or different measures altogether, have been devised.
In the clustering adaptation of these measures normalization
is also necessary, since the unnormalized sum of **CU** scores, or its
information-theoretic analog, will favor a clustering of greatest
cardinality, in which the data are partitioned into
singleton clusters, one for each observation. Thus, **PU** normalizes
the sum of Gini indices by averaging over clusters.
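A minimal sketch of **PU** under these definitions (hypothetical function names; nominal variables assumed) makes the role of averaging concrete:

```python
from collections import Counter

def gini(values):
    """Gini impurity: 1 minus the sum of squared value probabilities."""
    counts = Counter(values)
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def category_utility(cluster, data):
    """P(cluster) times the summed Gini-impurity decrease per variable."""
    return (len(cluster) / len(data)) * sum(
        gini([row[i] for row in data]) - gini([row[i] for row in cluster])
        for i in range(len(data[0]))
    )

def partition_utility(partition, data):
    """PU: the mean CU over the clusters of a partition. Dividing by the
    number of clusters penalizes partitions of many small clusters, which
    the raw sum of CU scores would not."""
    return sum(category_utility(c, data) for c in partition) / len(partition)
```

On a small example, the raw sum of **CU** scores can tie between a coarse partition and all-singleton clusters, while the average clearly prefers the coarser partition.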

A general observation is that many selection measures used for decision tree induction can be adapted as objective functions for clustering. In particular, a number of selection measures suggest themselves as candidates for which normalization is more principled than simple averaging. Two such candidates are Quinlan's [1986] Gain Ratio and Lopez de Mantaras' [1991] normalized information gain.

$$\mbox{GainRatio}(A) = \frac{H(C) - H(C \mid A)}{-\sum_j P(A{=}V_j) \log P(A{=}V_j)} \quad \mbox{[Quinlan, 1986]}$$

$$\frac{H(C) - H(C \mid A)}{H(C, A)} \quad \mbox{[Lopez de Mantaras, 1991]}$$

where $C$ is the class label and $H(C, A)$ is the joint entropy of the class and variable $A$.

From these we can derive two objective functions for clustering, with cluster membership playing the role of the class label and the gains summed over the individual variables:

$$\sum_i \frac{H(A_i) - H(A_i \mid \mathcal{C})}{H(\mathcal{C})} \qquad \mbox{and} \qquad \sum_i \frac{H(A_i) - H(A_i \mid \mathcal{C})}{H(A_i, \mathcal{C})},$$

where $\mathcal{C}$ denotes cluster membership over the partition, $H(\mathcal{C})$ is the entropy of the partition, and $H(A_i, \mathcal{C})$ is the joint entropy of variable $A_i$ and cluster membership.

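One plausible reading of the Lopez de Mantaras-style clustering variant can be sketched as follows (an assumption on our part: each variable's information gain is normalized by the joint entropy of that variable and cluster membership; all names are hypothetical):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def lm_objective(partition, data):
    """Sum over variables of information gain normalized by the joint
    entropy of the variable and cluster membership -- a sketch of one
    clustering adaptation of Lopez de Mantaras' [1991] measure."""
    n = len(data)
    # Cluster-membership label for each observation, in partition order.
    labels = [k for k, cluster in enumerate(partition) for _ in cluster]
    total = 0.0
    for i in range(len(data[0])):
        vals = [row[i] for cluster in partition for row in cluster]
        cond = sum(len(c) / n * entropy([row[i] for row in c]) for c in partition)
        gain = entropy(vals) - cond            # H(A_i) - H(A_i | C)
        joint = entropy(list(zip(vals, labels)))  # H(A_i, C)
        if joint > 0:
            total += gain / joint
    return total
```

On a toy example this objective scores a coarse, pure two-cluster partition above the all-singleton partition, consistent with its reported mitigation of the singleton bias.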
The latter of these clustering variations was defined in
Fisher and Hapanyengwi [1993].
Our nonsystematic experimentation with Lopez de Mantaras'
normalized information gain variant suggests that
it mitigates the problems associated with
**PU**, though conclusions about its merits must await
further experimentation. In general, there are a wealth
of promising objective functions based on decision tree selection measures
that we might consider. We have described two, but there are
others such as Fayyad's [1991] ORT function.

The relationship between supervised and unsupervised measures has also
been pointed out in the context of Bayesian systems [Duda & Hart, 1973].
Consider AUTOCLASS [Cheeseman
et al., 1988], which searches for
the most probable clustering, **H**, given the available data set,
**D** -- i.e.,
the clustering with highest P(**H** | **D**). Under
the independence assumptions made by AUTOCLASS, the computation
of P(**H** | **D**) includes the familiar mechanisms of the simple Bayes
classifier used in supervised contexts.
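The simple-Bayes mechanism alluded to can be sketched as follows (a simplification: AUTOCLASS itself additionally integrates over priors on model parameters, and all names here are hypothetical):

```python
def cluster_posterior(observation, clusters):
    """P(cluster | observation) under independence of variables within a
    cluster: proportional to P(cluster) * product_i P(value_i | cluster).
    `clusters` maps a cluster name to (prior, [value->prob dict per variable])."""
    scores = {}
    for name, (prior, per_var) in clusters.items():
        p = prior
        for dist, value in zip(per_var, observation):
            p *= dist.get(value, 0.0)  # unseen values get probability 0
        scores[name] = p
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}
```

This is exactly the scoring rule of the simple Bayes classifier, with cluster membership standing in for the class label.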

We have not yet compared the proposed derivations of decision tree selection measures with the Bayesian/MML measures of AUTOCLASS and SNOB, but we have proposed pattern-completion error rate, simplicity, and classification cost as external, objective criteria that could be used in such comparisons. An advantage of the Bayesian and MML approaches is that, with proper selection of prior biases, they do not require a separate strategy (e.g., resampling) for pruning; rather, the objective function used for cluster formation also serves to terminate hierarchical decomposition, and these strategies can be adapted for variable frontier identification. Though we know of no experimental studies of the Bayesian and MML techniques along the accuracy and cost dimensions outlined here, we expect that each would perform quite well.

Douglas H. Fisher

Sat Mar 30 11:37:23 CST 1996