MRKD-trees for very fast clustering: why we want them, how we use them, and what next?

Clustering is important in many fields, including manufacturing, biology, finance, and astronomy. Mixture models are a popular approach because of their statistical foundations, and EM is a very popular method for fitting mixture models. EM, however, requires many passes over the data, and has thus been dismissed as impractical for data mining of enormous datasets. We present a new algorithm, based on extensions to the multiresolution kd-trees of (Deng and Moore 96), which dramatically reduces the cost of EM-based clustering, with savings rising linearly with the number of datapoints. Although presented here for maximum likelihood estimation of Gaussian mixture models, the algorithm is also applicable to non-Gaussian models (provided class densities are monotonic in Mahalanobis distance), mixed categorical/numeric clusters, Bayesian methods such as Autoclass (Cheeseman 93), and, in new work by Dan Pelleg, to k-means clustering. We compare against other fast clustering approaches such as Priebe's one-pass method.
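To make the cost the abstract refers to concrete, here is a minimal sketch of plain (naive) EM for a Gaussian mixture — not the paper's mrkd-tree algorithm. Every iteration makes a full pass over all n datapoints in the E-step, which is exactly the per-datapoint cost the mrkd-tree method is designed to reduce. All function and variable names below are illustrative, not from the paper.

```python
import numpy as np

def em_gmm(X, k, iters=50, seed=0):
    """Naive EM for a k-component full-covariance Gaussian mixture.

    Each iteration touches all n datapoints, so the cost per iteration
    is O(n * k * d^2) -- the scaling the mrkd-tree approach attacks.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # init means at random points
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)  # init covariances
    pi = np.full(k, 1.0 / k)                           # init mixing weights
    for _ in range(iters):
        # E-step: log responsibilities, computed per datapoint (the expensive part)
        logr = np.empty((n, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(cov[j])
            maha = np.einsum('ni,ij,nj->n', diff, inv, diff)  # Mahalanobis distances
            logdet = np.linalg.slogdet(cov[j])[1]
            logr[:, j] = np.log(pi[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        r = np.exp(logr - logr.max(axis=1, keepdims=True))  # stable softmax
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from responsibility-weighted sufficient statistics
        Nj = r.sum(axis=0)
        pi = Nj / n
        mu = (r.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return pi, mu, cov
```

The key observation exploited by the mrkd-tree approach is that the E-step's sufficient statistics (the responsibility-weighted sums accumulated in the M-step above) can often be computed for an entire region of datapoints at once, rather than point by point.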