Efficient algorithms for non-parametric clustering with clutter

Weng-Keen Wong


  Detecting and counting overdensities in data is a common problem in the physical and geographic sciences. One of the most successful of recent algorithms for the counting version of the problem was introduced by Cuevas, Febrero and Fraiman [Cuevas et. al. 2000], which will be referred to as the CFF algorithm. This algorithm first determines the subset of data points that are in high density regions using a non-parametric density estimator. A clustering step follows where such high density points are agglomerated. While this algorithm was originally intended to estimate the number of clusters, it can also be used to perform non-parametric clustering against a noisy background. However, the algorithm proposed by CFF is too computationally expensive to work on large datasets with greater than two dimensions. We propose an alternative implementation of the CFF algorithm producing exactly the same results but addressing the computational problems in both the density estimation and in the agglomeration step. We will then illustrate the effectiveness of our approach on large multi-dimensional astrophysics datasets.

This is joint work with Andrew Moore.

Note that this talk is in partial fulfillment of the speaking requirement.

Back to the Main Page

Charles Rosenberg
Last modified: Wed Apr 3 17:01:42 EST 2002