Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!gatech!news.mathworks.com!newsfeed.internetmci.com!howland.erols.net!news.sprintlink.net!news-stk-200.sprintlink.net!news.sprintlink.net!news-stk-11.sprintlink.net!interpath!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: what is K-means algorithm
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <DwIAx2.8tv@unx.sas.com>
Date: Wed, 21 Aug 1996 21:12:38 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <4trca8$db3@qualcuno.nettuno.it> <51k9vexq41.fsf@kal-el.ugr.es>  <4upspd$81f@nuscc.nus.sg> <4us34b$cos@eng_ser1.erg.cuhk.hk>  <Dw77n5.EzG@unx.sas.com> <5191b9qkv6.fsf@kal-el.ugr.es>
Organization: SAS Institute Inc.
Lines: 455


In article <5191b9qkv6.fsf@kal-el.ugr.es>, jmerelo@kal-el.ugr.es (J.J. Merelo Guervos) writes:
|> >>>>> "Warren" == Warren Sarle <saswss@hotellng.unx.sas.com> writes:
|> In article <Dw77n5.EzG@unx.sas.com> saswss@hotellng.unx.sas.com (Warren Sarle) writes:
|> 
|>     Warren> An algorithm that can reliably determine the number of
|>     Warren> clusters for nontrivial data and that is not as
|>     Warren> computationally intensive as bootstrapping or MCMC would
|>     Warren> be quite a revelation. Could we have some references,
|>     Warren> please?
|> 
|> I have worked on that for quite a while now... check out
|> http://kal-el.ugr.es/g-lvq/g-lvq.html. This is a genetic optimization
|> of a supervised clustering algorithm, Kohonen's LVQ.

Since the context was k-means, I was referring to unsupervised learning,
but I will be interested to see what you are doing.

|> I would like to have a reference for what you have already said, too.

Here is a summary regarding the (unsupervised) number-of-clusters
question from a statistical point of view:

Adapted Mar 16, 1996, from the SAS/STAT User's Guide (1990) and Sarle 
and Kuo (1993). Copyright 1996 by SAS Institute Inc, Cary, NC, USA.

The Number of Clusters
----------------------

There are no completely satisfactory methods for determining the number
of population clusters for any type of cluster analysis (Everitt 1979,
1980; Hartigan 1985; Bock 1985).

If your purpose in clustering is dissection, that is, to summarize the
data without trying to uncover real clusters, it may suffice to look at
R^2 for each variable and pooled over all variables.  Plots of R^2
against the number of clusters are useful.
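The R^2 in question can be sketched in a few lines of Python (illustrative code, not SAS): R^2 for a clustering is the proportion of total variance explained by cluster membership, 1 - SS_within/SS_total. Splitting a cluster can never increase SS_within, so R^2 is non-decreasing under refinement, and the dissection heuristic looks for the point of diminishing returns in the plot of R^2 against the number of clusters. The data here are hypothetical.

```python
def r_squared(data, clusters):
    """R^2 for a partition: clusters is a list of lists covering `data` (1-d)."""
    grand = sum(data) / len(data)
    ss_total = sum((x - grand) ** 2 for x in data)
    ss_within = 0.0
    for c in clusters:
        m = sum(c) / len(c)
        ss_within += sum((x - m) ** 2 for x in c)
    return 1.0 - ss_within / ss_total

data = [0, 1, 5, 6, 10, 11]
# a 2-cluster dissection and a finer 3-cluster dissection of the same data
r2_two   = r_squared(data, [[0, 1, 5, 6], [10, 11]])
r2_three = r_squared(data, [[0, 1], [5, 6], [10, 11]])
```

Plotting such values for k = 1, 2, 3, ... and looking for the elbow is the graphical procedure described above.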

It is always a good idea to look at your data graphically. If you have
only two or three variables, use PROC PLOT to make scatterplots
identifying the clusters. With more variables, use PROC CANDISC to
compute canonical variables for plotting.

Ordinary significance tests, such as analysis-of-variance F tests, are
not valid for testing differences between clusters. Since clustering
methods attempt to maximize the separation between clusters, the
assumptions of the usual significance tests, parametric or
nonparametric, are drastically violated. For example, if you take a
sample of 100 observations from a single univariate normal distribution,
have PROC FASTCLUS divide it into two clusters, and run a t test between
the clusters, you usually obtain a probability level of less than
0.0001.  For the same reason, methods that purport to test for clusters
against the null hypothesis that objects are assigned randomly to
clusters (McClain and Rao 1975; Klastorin 1983) are useless.
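The FASTCLUS example above is easy to reproduce in any language; here is a minimal Python sketch (hypothetical code, not PROC FASTCLUS): draw a single sample from N(0,1), split it into two "clusters" with one-dimensional 2-means, and compute the two-sample t statistic. The separation is manufactured by the clustering itself, so t comes out enormous even though there is only one population.

```python
import random, math

random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(100)]  # ONE normal population

# Lloyd's algorithm for k = 2 in one dimension
c1, c2 = min(x), max(x)
for _ in range(50):
    a = [v for v in x if abs(v - c1) <= abs(v - c2)]
    b = [v for v in x if abs(v - c1) >  abs(v - c2)]
    c1, c2 = sum(a) / len(a), sum(b) / len(b)

def mean_var(s):
    m = sum(s) / len(s)
    return m, sum((v - m) ** 2 for v in s) / (len(s) - 1)

# Welch two-sample t statistic between the two "clusters"
m1, v1 = mean_var(a)
m2, v2 = mean_var(b)
t = abs(m1 - m2) / math.sqrt(v1 / len(a) + v2 / len(b))
```

With roughly 50 points on each side of the cut, t lands far beyond any conventional critical value, which is exactly why the nominal p-value is meaningless here.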

Most valid tests for clusters either have intractable sampling
distributions or involve null hypotheses for which rejection is
uninformative. For clustering methods based on distance matrices, a
popular null hypothesis is that all permutations of the values in the
distance matrix are equally likely (Ling 1973; Hubert 1974). Using this
null hypothesis, you can do a permutation test or a rank test.  The
trouble with the permutation hypothesis is that with any real data, the
null hypothesis is implausible even if the data do not contain clusters.
Rejecting the null hypothesis does not provide any useful information
(Hubert and Baker 1977).

Another common null hypothesis is that the data are a random sample from
a multivariate normal distribution (Wolfe 1970, 1978; Duda and Hart
1973; Lee 1979).  The multivariate normal null hypothesis is better than
the permutation null hypothesis, but it is not satisfactory because
there is typically a high probability of rejection if the data are
sampled from a distribution with lower kurtosis than a normal
distribution, such as a uniform distribution. The tables in Engelman
and Hartigan (1969), for example, generally lead to rejection of the
null hypothesis when the data are sampled from a uniform distribution.
Hawkins, Muller, and ten Krooden (1982, 337-340) discuss a highly
conservative Bonferroni method for hypothesis testing.  The
conservativeness of this approach may compensate to some extent for the
liberalness exhibited by tests based on normal distributions when the
population is uniform.

Perhaps a better null hypothesis is that the data are sampled from a
uniform distribution (Hartigan 1978; Arnold 1979; Sarle 1983).  The
uniform null hypothesis leads to conservative error rates when the data
are sampled from a strongly unimodal distribution such as the normal.
However, in two or more dimensions and depending on the test statistic,
the results can be very sensitive to the shape of the region of support
of the uniform distribution.  Sarle (1983) suggests using a hyperbox
with sides proportional in length to the singular values of the centered
coordinate matrix.

Given that the uniform distribution provides an appropriate null
hypothesis, there are still serious difficulties in obtaining sampling
distributions.  Some asymptotic results are available (Hartigan 1978,
1985; Pollard 1981; Bock 1985) for the within-cluster sum of squares,
the criterion that PROC FASTCLUS and Ward's minimum variance method
attempt to optimize.  No distributional theory for finite sample sizes
has yet appeared.  Currently, the only practical way to obtain sampling
distributions for realistic sample sizes is by computer simulation.

Arnold (1979) used simulation to derive tables of the distribution of a
criterion based on the determinant of the within-cluster sum of squares
matrix |W|. Both normal and uniform null distributions were used.
Having obtained clusters with either FASTCLUS or CLUSTER, you can
compute Arnold's criterion with the ANOVA or CANDISC procedure.
Arnold's tables provide a conservative test because FASTCLUS and CLUSTER
attempt to minimize the trace of W rather than the determinant.
Marriott (1971, 1975) also gives useful information on |W| as a
criterion for the number of clusters.
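To make the two criteria concrete, here is a small Python sketch with hypothetical data (the procedures named above compute these quantities for you): W is the pooled within-cluster sums-of-squares-and-crossproducts matrix. FASTCLUS and Ward's method minimize trace(W); Arnold's tables are for the determinant |W|, which, unlike the trace, ranks clusterings the same way under any nonsingular linear transformation of the variables.

```python
def within_sscp(clusters):
    """Pooled within-cluster SSCP matrix W for 2-d clusters of (x, y) points."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for c in clusters:
        mx = sum(p[0] for p in c) / len(c)
        my = sum(p[1] for p in c) / len(c)
        for x, y in c:
            dx, dy = x - mx, y - my
            w[0][0] += dx * dx
            w[0][1] += dx * dy
            w[1][0] += dy * dx
            w[1][1] += dy * dy
    return w

clusters = [[(0, 0), (2, 1), (1, 2)],
            [(10, 10), (12, 11), (11, 12)]]
w = within_sscp(clusters)
trace_w = w[0][0] + w[1][1]                        # criterion FASTCLUS minimizes
det_w = w[0][0] * w[1][1] - w[0][1] * w[1][0]     # criterion in Arnold's tables
```

Because the determinant picks up the crossproduct terms that the trace ignores, the two criteria can prefer different partitions when clusters are elongated or correlated.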

Sarle (1983) used extensive simulations to develop the cubic clustering
criterion (CCC), which can be used for crude hypothesis testing and
estimating the number of population clusters.  The CCC is based on the
assumption that a uniform distribution on a hyperrectangle will be
divided into clusters shaped roughly like hypercubes. In large samples
that can be divided into the appropriate number of hypercubes, this
assumption gives very accurate results. In other cases the approximation
is generally conservative.  For details about the interpretation of the
CCC, consult Sarle (1983).

Milligan and Cooper (1985) and Cooper and Milligan (1988) compared
thirty methods for estimating the number of population clusters using
four hierarchical clustering methods.  The three criteria that performed
best in these simulation studies with a high degree of error in the data
were a pseudo F statistic developed by Calinski and Harabasz (1974), a
statistic referred to as J_e(2)/J_e(1) by Duda and Hart (1973) that can
be transformed into a pseudo t^2 statistic, and the cubic clustering
criterion.  The pseudo F statistic and the CCC are printed by FASTCLUS;
these two statistics and the pseudo t^2 statistic, which can be applied
only to hierarchical methods, are printed by CLUSTER.  It may be
advisable to look for consensus among the three statistics, that is,
local peaks of the CCC and pseudo F statistic combined with a small
value of the pseudo t^2 statistic and a larger pseudo t^2 for the next
cluster fusion.  It must be emphasized that these criteria are
appropriate only for compact or slightly elongated clusters, preferably
clusters that are roughly multivariate normal.
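The pseudo F statistic of Calinski and Harabasz is simple enough to sketch directly; this hypothetical one-dimensional Python example is not FASTCLUS output, just the formula: pseudo F = (B/(k-1)) / (W/(n-k)), where B and W are the between- and within-cluster sums of squares. A partition that matches the real structure gives a far larger value than an arbitrary cut.

```python
def pseudo_f(clusters):
    """Calinski-Harabasz pseudo F for a partition of 1-d data into clusters."""
    data = [x for c in clusters for x in c]
    n, k = len(data), len(clusters)
    grand = sum(data) / n
    # between-cluster sum of squares
    b = sum(len(c) * (sum(c) / len(c) - grand) ** 2 for c in clusters)
    # within-cluster sum of squares
    w = sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clusters)
    return (b / (k - 1)) / (w / (n - k))

good = pseudo_f([[0, 1, 2], [10, 11, 12]])   # split at the real gap
bad  = pseudo_f([[0, 1], [2, 10, 11, 12]])   # split inside one clump
```

In practice one computes this across a range of k and looks for local peaks, combining the evidence with the CCC and pseudo t^2 as described above.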

Mixture models, usually Gaussian, provide a useful statistical model for
cluster analysis. Hypothesis testing requires bootstrapping except in
special cases (Titterington, Smith, and Makov 1985; McLachlan and
Basford 1988). The Bayesian approach is promising for a variety of
mixture models, both Gaussian and non-Gaussian; see Binder (1978, 1981)
and Banfield and Raftery (1993).

Recent research has tended to deemphasize mixture models in favor of
nonparametric models in which clusters correspond to modes in the
probability density function. Hartigan and Hartigan (1985) and Hartigan
(1985) developed a test of unimodality vs. bimodality in the univariate
case.

Nonparametric tests for the number of clusters can also be based on
nonparametric density estimates.  This approach requires much weaker
assumptions than mixture models, namely, that the observations are
sampled independently and that the distribution can be estimated
nonparametrically.  Silverman (1986) describes a bootstrap test for the
number of modes using a Gaussian kernel density estimate, but problems
have been reported with this method under the uniform null distribution
(Marron, S., personal communication). Further developments in
nonparametric methods are given by Mueller and Sawitzki (1991), Minnotte
(1992), and Polonik (1993).  All of these methods suffer from heavy
computational requirements.

One useful descriptive approach to the number-of-clusters problem is
provided by Wong and Schaack (1982), based on a kth-nearest-neighbor
density estimate.  The kth-nearest-neighbor clustering method developed
by Wong and Lane (1983) is applied with varying values of k.  Each value
of k yields an estimate of the number of modal clusters.  If the
estimated number of modal clusters is constant for a wide range of k
values, there is strong evidence of at least that many modes in the
population.  A plot of the estimated number of modes against k can be
highly informative. Attempts to derive a formal hypothesis test from
this diagnostic plot have met with difficulties, but a simulation
approach similar to Silverman's (1986) does seem to work (Girman 1994).
The simulation, of course, requires considerable computer time.

Sarle and Kuo (1993) document a less expensive approximate nonparametric
test for the number of clusters that has been implemented in the
MODECLUS procedure in the SAS/STAT product. This test sacrifices
statistical efficiency for computational efficiency. The method for
conducting significance tests is as follows:
 * Estimate densities using fixed-radius uniform kernels.
 * Obtain preliminary clusters by a "valley-seeking" method.
   Other clustering methods could be used but would yield less power.
 * Compute an approximate p-value for each cluster by comparing
   the estimated maximum density in the cluster with
   the estimated maximum density on the cluster boundary.
 * Repeatedly join the least significant cluster
   with a neighboring cluster until
   all remaining clusters are significant.
 * Estimate the number of population clusters as the number of
   significant sample clusters.
 * The preceding steps can be repeated for any number of different
   radii, and the estimate of the number of population clusters can
   be taken to be the maximum number of significant sample clusters
   for any radius.
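The first two steps above can be sketched in Python on hypothetical one-dimensional data (the real procedure is documented in Sarle and Kuo 1993; this is only an illustration of the idea): estimate density with a fixed-radius uniform kernel, then form valley-seeking clusters by letting each point climb to its densest neighbor within the radius. Points with no denser neighbor are local modes, and each mode roots one cluster.

```python
R = 0.5
data = [0.0, 0.2, 0.4, 0.6, 0.8,        # one bump
        10.0, 10.2, 10.4, 10.6, 10.8]   # a second bump

# fixed-radius uniform-kernel density estimate: neighbor count within R
dens = [sum(1 for y in data if abs(x - y) <= R) for x in data]

def climb(i):
    """Follow the steepest density ascent from point i to a local mode."""
    while True:
        nbrs = [j for j, y in enumerate(data)
                if abs(data[i] - y) <= R and dens[j] > dens[i]]
        if not nbrs:
            return i                     # no denser neighbor: i is a mode
        i = max(nbrs, key=lambda j: dens[j])

modes = {climb(i) for i in range(len(data))}
n_clusters = len(modes)
```

Each climb strictly increases the density estimate, so it must terminate at a mode; the significance testing and cluster-joining steps would then operate on these preliminary clusters.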

This method has the following useful features:
 * No distributional assumptions are required.
 * The choice of smoothing parameter is not critical since you
   can try any number of different values.
 * The data can be coordinates or distances.
 * Time and space requirements for the significance tests
   are no worse than those for obtaining the clusters.
 * The power is high enough to be useful for practical purposes.

The method for computing the p-values is based on a series of plausible
approximations. There are as yet no rigorous proofs that the method is
valid. Neither are there any asymptotic results. However,
simulations for sample sizes ranging from 20 to 2000 indicate that the
p-values are almost always conservative. The only case discovered so far
in which the p-values are liberal is a uniform distribution in one
dimension for which the simulated error rates exceed the nominal
significance level only slightly for a limited range of sample sizes.


References
----------

Anderberg, M.R. (1973), Cluster Analysis for Applications, New York:
Academic Press, Inc.

Arnold, S.J. (1979), "A Test for Clusters," Journal of Marketing
Research, 16, 545-551.

Art, D., Gnanadesikan, R., and Kettenring, R. (1982), "Data-based
Metrics for Cluster Analysis," Utilitas Mathematica, 21A, 75-99.

Banfield, J.D. and Raftery, A.E. (1993), "Model-Based Gaussian and Non-
Gaussian Clustering", Biometrics, 49, 803-821.

Barnett, V., ed. (1981), Interpreting Multivariate Data, New York:
John Wiley & Sons, Inc.

Binder, D.A. (1978), "Bayesian Cluster Analysis," Biometrika, 65, 31-38.

Binder, D.A. (1981), "Approximations to Bayesian Clustering Rules,"
Biometrika, 68, 275-285.

Blashfield, R.K. and Aldenderfer, M.S. (1978), "The Literature on
Cluster Analysis," Multivariate Behavioral Research, 13, 271-295.

Bock, H.H. (1985), "On Some Significance Tests in Cluster Analysis,"
Journal of Classification, 2, 77-108.

Calinski, T. and Harabasz, J. (1974), "A Dendrite Method for Cluster
Analysis," Communications in Statistics, 3, 1-27.

Cooper, M.C. and Milligan, G.W.  (1988), "The Effect of Error on
Determining the Number of Clusters," Proceedings of the International
Workshop on Data Analysis, Decision Support and Expert Knowledge
Representation in Marketing and Related Areas of Research, 319-328.

Duda, R.O. and Hart, P.E. (1973), Pattern Classification and Scene
Analysis, New York:  John Wiley & Sons, Inc.

Duran, B.S. and Odell, P.L. (1974), Cluster Analysis, New York:
Springer-Verlag.

Engelman, L. and Hartigan, J.A. (1969), "Percentage Points of a Test
for Clusters," Journal of the American Statistical Association, 64,
1647-1648.

Everitt, B.S. (1979), "Unresolved Problems in Cluster Analysis,"
Biometrics, 35, 169-181.

Everitt, B.S. and Hand, D.J. (1981), Finite Mixture Distributions,
New York: Chapman and Hall.

Girman, C.J. (1994), "Cluster Analysis and Classification Tree
Methodology as an Aid to Improve Understanding of Benign Prostatic
Hyperplasia," Ph.D. thesis, Chapel Hill, NC: Department of
Biostatistics, University of North Carolina.

Gitman, I. (1973), "An Algorithm for Nonsupervised Pattern
Classification," IEEE Transactions on Systems, Man, and Cybernetics,
SMC-3, 66-74.

Good, I.J. (1977), "The Botryology of Botryology," in Classification
and Clustering, ed. J. Van Ryzin, New York: Academic Press, Inc.

Harman, H.H. (1976), Modern Factor Analysis, 3d Edition, Chicago:
University of Chicago Press.

Hartigan, J.A. (1975), Clustering Algorithms, New York:  John Wiley &
Sons, Inc.

Hartigan, J.A. (1977), "Distribution Problems in Clustering," in
Classification and Clustering, ed. J. Van Ryzin, New York:  Academic
Press, Inc.

Hartigan, J.A. (1978), "Asymptotic Distributions for Clustering
Criteria,"  Annals of Statistics, 6, 117-131.

Hartigan, J.A. (1981), "Consistency of Single Linkage for High-Density
Clusters," Journal of the American Statistical Association, 76, 388-394.

Hartigan, J.A. (1985), "Statistical Theory in Clustering," Journal of
Classification, 2, 63-76.

Hartigan, J.A. and Hartigan, P.M. (1985), "The Dip Test of Unimodality,"
Annals of Statistics, 13, 70-84.

Hartigan, P.M. (1985), "Computation of the Dip Statistic to Test for
Unimodality," Applied Statistics, 34, 320-325.

Hawkins, D.M., Muller, M.W., and ten Krooden, J.A. (1982), "Cluster
Analysis," in Topics in Applied Multivariate Analysis, ed. D.M.
Hawkins, Cambridge: Cambridge University Press.

Hubert, L. (1974), "Approximate Evaluation Techniques for the
Single-Link and Complete-Link Hierarchical Clustering Procedures,"
Journal of the American Statistical Association, 69, 698-704.

Hubert, L.J. and Baker, F.B. (1977), "An Empirical Comparison of
Baseline Models for Goodness-of-Fit in r-Diameter Hierarchical
Clustering," in Classification and Clustering, ed. J. Van Ryzin, New
York: Academic Press, Inc.

Huizinga, D. H. (1978), "A Natural or Mode Seeking Cluster Analysis
Algorithm," Technical Report 78-1, Behavioral Research Institute, 2305
Canyon Blvd., Boulder, Colorado 80302.

Kaufman, L. and Rousseeuw, P.J. (1990), Finding Groups in Data, New
York: John Wiley & Sons, Inc.

Klastorin, T.D. (1983), "Assessing Cluster Analysis Results," Journal of
Marketing Research, 20, 92-98.

Koontz, W.L.G. and Fukunaga, K. (1972a), "A Nonparametric Valley-Seeking
Technique for Cluster Analysis," IEEE Transactions on Computers, C-21,
171-178.

Koontz, W.L.G. and Fukunaga, K. (1972b), "Asymptotic Analysis of a
Nonparametric Clustering Technique," IEEE Transactions on Computers,
C-21, 967-974.

Koontz, W.L.G., Narendra, P.M., and Fukunaga, K. (1976), "A
Graph-Theoretic Approach to Nonparametric Cluster Analysis," IEEE
Transactions on Computers, C-25, 936-944.

Lee, K.L. (1979), "Multivariate Tests for Clusters," Journal of the
American Statistical Association, 74, 708-714.

Ling, R.F. (1973), "A Probability Theory of Cluster Analysis," Journal of
the American Statistical Association, 68, 159-169.

MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of
Multivariate Observations," Proceedings of the Fifth Berkeley Symposium
on Mathematical Statistics and Probability, 1, 281-297.

Marriott, F.H.C. (1971), "Practical Problems in a Method of Cluster
Analysis," Biometrics, 27, 501-514.

Marriott, F.H.C. (1975), "Separating Mixtures of Normal Distributions,"
Biometrics, 31, 767-769.

Massart, D.L. and Kaufman, L. (1983), The Interpretation of Analytical
Chemical Data by the Use of Cluster Analysis, New York:  John Wiley &
Sons, Inc.

McClain, J.O. and Rao, V.R. (1975), "CLUSTISZ: A Program to Test for the
Quality of Clustering of a Set of Objects," Journal of Marketing
Research, 12, 456-460.

McLachlan, G.J. and Basford, K.E. (1988), Mixture Models, New York:
Marcel Dekker, Inc.

Mezzich, J.E. and Solomon, H. (1980), Taxonomy and Behavioral Science,
New York: Academic Press, Inc.

Milligan, G.W. (1980), "An Examination of the Effect of Six Types of
Error Perturbation on Fifteen Clustering Algorithms," Psychometrika, 45,
325-342.

Milligan, G.W. (1981), "A Review of Monte Carlo Tests of Cluster
Analysis," Multivariate Behavioral Research, 16, 379-407.

Milligan, G.W. and Cooper, M.C. (1985), "An Examination of Procedures
for Determining the Number of Clusters in a Data Set," Psychometrika,
50, 159-179.

Minnotte, M.C. (1992), "A Test of Mode Existence with Applications to
Multimodality," Ph.D. thesis, Rice University, Department of Statistics.

Mizoguchi, R. and Shimura, M. (1980), "A Nonparametric Algorithm for
Detecting Clusters Using Hierarchical Structure," IEEE Transactions on
Pattern Analysis and Machine Intelligence, PAMI-2, 292-300.

Mueller, D.W. and Sawitzki, G. (1991), "Excess Mass Estimates and Tests
for Multimodality," Journal of the American Statistical Association, 86,
738-746.

Pollard, D. (1981), "Strong Consistency of k-Means Clustering," Annals
of Statistics, 9, 135-140.

Polonik, W. (1993), "Measuring Mass Concentrations and Estimating
Density Contour Clusters--An Excess Mass Approach," Technical Report,
Beitraege zur Statistik Nr. 7, Universitaet Heidelberg.

Sarle, W.S. (1982), "Cluster Analysis by Least Squares," Proceedings of
the Seventh Annual SAS Users Group International Conference, 651-653.

Sarle, W.S. (1983), Cubic Clustering Criterion, SAS Technical Report
A-108, Cary, NC: SAS Institute Inc.

Sarle, W.S. and Kuo, An-Hsiang (1993), The MODECLUS Procedure, SAS
Technical Report P-256, Cary, NC: SAS Institute Inc.

Scott, A.J. and Symons, M.J. (1971), "Clustering Methods Based on
Likelihood Ratio Criteria," Biometrics, 27, 387-397.

Silverman, B.W. (1986), Density Estimation, New York: Chapman and Hall.

Sneath, P.H.A. and Sokal, R.R. (1973), Numerical Taxonomy, San
Francisco: W.H. Freeman.

Spath, H. (1980), Cluster Analysis Algorithms, Chichester, England:
Ellis Horwood.

Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal
Mixtures," Biometrics, 37, 35-43.

Titterington, D.M., Smith, A.F.M., and Makov, U.E. (1985), Statistical
Analysis of Finite Mixture Distributions, New York: John Wiley & Sons, Inc.

Tukey, P.A. and Tukey, J.W. (1981), "Data-Driven View Selection;
Agglomeration and Sharpening," in Barnett (1981).

Ward, J.H. (1963), "Hierarchical Grouping to Optimize an Objective
Function," Journal of the American Statistical Association, 58, 236-244.

Wolfe, J.H. (1970), "Pattern Clustering by Multivariate Mixture
Analysis," Multivariate Behavioral Research, 5, 329-350.

Wolfe, J.H. (1978), "Comparative Cluster Analysis of Patterns of
Vocational Interest," Multivariate Behavioral Research, 13, 33-44.

Wong, M.A. (1982), "A Hybrid Clustering Method for Identifying
High-Density Clusters," Journal of the American Statistical Association,
77, 841-847.

Wong, M.A. and Lane, T. (1983), "A kth Nearest Neighbor Clustering
Procedure," Journal of the Royal Statistical Society, Series B, 45,
362-368.

Wong, M.A. and Schaack, C. (1982), "Using the kth Nearest Neighbor
Clustering Procedure to Determine the Number of Subpopulations,"
American Statistical Association 1982 Proceedings of the Statistical
Computing Section, 40-48.


-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
