Title: 'N-Body' Problems in Statistical Learning
Authors: Alexander Gray and Andrew Moore
Abstract:
We present very fast algorithms for a large class of statistical
problems, which we call all-point-pairs problems, or 'N-body'-like
problems. These are problems which abstractly require a comparison of
each of the N points in a dataset with every other point, and which
would naively be solved using N^2 computations. Such quadratic-cost
algorithms are infeasible for practical problems where N is large.
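As a concrete illustration (not taken from the paper), the naive all-point-pairs computation for one of the problems treated later, the two-point correlation, is simply a double loop over pairs; the function name and parameters here are hypothetical:

```python
import numpy as np

def two_point_naive(points, r):
    """Naive two-point correlation: count unordered pairs of points
    within distance r of each other. Requires O(N^2) distance
    computations, which becomes intractable for large N."""
    n = len(points)
    count = 0
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            if np.linalg.norm(points[i] - points[j]) <= r:
                count += 1
    return count
```

For N = 10^6 points this loop performs roughly 5 * 10^11 distance evaluations, which motivates the sub-quadratic algorithms the paper develops.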
There are many common problems in learning/statistics of this form.
In this paper we focus on several examples of all-point-pairs problems
within nonparametric statistical learning: nearest-neighbor
classification, kernel density estimation, outlier detection, and an
important set of spatial statistics: the two-point correlation,
multiple two-point correlation, and the n-point correlation. The
methodology presented in this paper is also applicable in principle to
fast learning of large-scale parametric models such as radial basis
function neural networks, mixtures of Gaussians, hidden Markov models,
and Gaussian processes, as well as several other common methods.
We present five layers of techniques resulting in a general class
of simple recursive geometric algorithms. The aforementioned examples
will illustrate some of the modifications needed for different
statistical methods.
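A minimal sketch of one such recursive geometric algorithm, here applied to range counting for the two-point correlation, assuming a kd-tree whose nodes cache bounding boxes and point counts (the cached sufficient statistics named in the keywords). The splitting rule, leaf size, and function names are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

class Node:
    """kd-tree node caching its bounding box and point count."""
    def __init__(self, points, leaf_size=8):
        self.points = points
        self.lo = points.min(axis=0)   # bounding-box lower corner
        self.hi = points.max(axis=0)   # bounding-box upper corner
        self.count = len(points)       # cached sufficient statistic
        self.left = self.right = None
        if len(points) > leaf_size:
            d = int(np.argmax(self.hi - self.lo))  # split widest dim
            order = points[:, d].argsort()
            mid = len(points) // 2
            self.left = Node(points[order[:mid]], leaf_size)
            self.right = Node(points[order[mid:]], leaf_size)

def min_dist(a, b):
    """Smallest possible distance between points in boxes a and b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    return float(np.sqrt((gap ** 2).sum()))

def max_dist(a, b):
    """Largest possible distance between points in boxes a and b."""
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return float(np.sqrt((span ** 2).sum()))

def dual_count(a, b, r):
    """Count ordered pairs (p in a, q in b) with distance <= r.
    Node pairs entirely outside r are excluded; node pairs entirely
    inside r contribute a.count * b.count without touching points."""
    if min_dist(a, b) > r:
        return 0                       # exclusion prune
    if max_dist(a, b) <= r:
        return a.count * b.count       # inclusion prune
    if a.left is None and b.left is None:
        d = np.linalg.norm(a.points[:, None] - b.points[None, :], axis=2)
        return int((d <= r).sum())     # brute force at the leaves
    if a.left is None:
        return dual_count(a, b.left, r) + dual_count(a, b.right, r)
    if b.left is None:
        return dual_count(a.left, b, r) + dual_count(a.right, b, r)
    return (dual_count(a.left, b.left, r) + dual_count(a.left, b.right, r)
            + dual_count(a.right, b.left, r) + dual_count(a.right, b.right, r))
```

Calling `dual_count(root, root, r)` reproduces the naive ordered-pair count (including self-pairs), but the two pruning rules let large node pairs be resolved in constant time, which is the source of the speedups the abstract reports.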
We give an asymptotic analysis and show empirically that this
algorithmic framework yields speedups of several orders of magnitude
over the naive computation, even for small datasets. We are aware of
no exact algorithms for these problems that are faster, either
empirically or theoretically.
Keywords:
nearest-neighbor classification, kernel density
estimation, outlier detection, two-point correlation, n-point
correlation, nonparametric statistics, machine learning, kd-trees,
hyperstructure, cached sufficient statistics, N-body problem,
computational geometry, all-nearest-neighbors, range-counting,
range-searching.