Proximity problems ask questions about distances between points,
such as various sorts of nearest-neighbor queries. Algorithms and
data structures (typically trees) are studied with theoretical goals
in the field of computational geometry and with practical goals in
many fields such as databases and machine learning.
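As a concrete illustration of the tree-based data structures mentioned above, here is a minimal kd-tree sketch in Python: a recursive median split, followed by the classic branch-and-bound descent for a single nearest-neighbor query. This is a textbook sketch for illustration only, not any particular published method.

```python
import math

def build_kdtree(points, depth=0):
    # Recursively build a kd-tree: split on alternating axes at the median.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    # Branch-and-bound descent: visit the near side first, then the far
    # side only if the splitting plane is closer than the current best.
    if node is None:
        return best
    dist = math.dist(node["point"], query)
    if best is None or dist < best[1]:
        best = (node["point"], dist)
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    if diff <= 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    if abs(diff) < best[1]:
        best = nearest(far, query, best)
    return best
```

For example, `nearest(build_kdtree([(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)]), (9,2))` returns the point `(8,1)` together with its distance to the query.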

Tutorial: Data Structures for Fast Statistics
 
Our main interest in proximity methods differs somewhat from the
usual one. We are typically not interested in performing,
say, nearest-neighbor queries per se; instead, we use the data
structures that are effective for proximity problems to accelerate
more complicated statistical and machine learning methods: see this
tutorial given at ICML '04.

The Proximity Project is a large undertaking begun in summer 2004
which attempts to answer, once and for all, the question "What is
the best nearest-neighbor method?" by implementing and empirically
comparing the best-known published methods of the last three decades
for five variations of the nearest-neighbor searching problem. Source
code and datasets will be made available. The experiments are done but
the tedious write-up is not, though I am doing my best to make it
appear here shortly.



Nearest-Neighbor Classification
 
Perhaps the most common need for nearest-neighbor search is for the
purpose of nearest-neighbor classification. It turns out,
however, that solving this problem need not be as hard as actually
finding the k nearest neighbors and then determining the
majority label. To determine the correct class label for a query
point, one need only determine whether more of its k nearest
neighbors are of the positive class or of the negative class.
We demonstrate for the first time that this can be achieved exactly using
bounds on distances, in less time than it takes to actually identify
the k points in question using the best state-of-the-art
algorithms. We show, for example, that this makes efficient search
possible for a dataset with over 1 million dimensions and a few
hundred thousand points, something which cannot be achieved with full
nearest-neighbor search. (Liu, Moore, and Gray,
New Algorithms for
Efficient High-dimensional Nonparametric Classification [pdf], [ps], NIPS 2003.) A journal
version has been accepted to JMLR but has not yet appeared. Ting Liu has also
developed the multi-class generalization of this algorithm, in another paper.
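To make the decision problem above concrete, here is a brute-force sketch of what the classifier must output: only the sign of the majority among the k nearest labels, never the neighbors themselves. The distance-bound pruning that makes this fast in the paper is not shown; this is just the problem statement as code.

```python
import math

def knn_classify(data, labels, query, k):
    # Sort reference points by distance to the query and count positive
    # labels among the k nearest.  Note that only whether positives
    # outnumber negatives is returned -- the identities of the k
    # neighbors are never reported.
    order = sorted(range(len(data)), key=lambda i: math.dist(data[i], query))
    positives = sum(1 for i in order[:k] if labels[i] == 1)
    return 1 if positives > k / 2 else -1
```

Because only this sign is needed, an algorithm may stop as soon as distance bounds prove one class has a majority, which is the source of the speedup described above.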

Approximate Nearest-Neighbor Search
 
For a finite dataset in high dimension, it may not be possible in practice
to find the nearest neighbor in less time than exhaustive search. So what
can be done?
Much attention has been paid lately to the idea of approximate
nearest-neighbor searching: instead of finding the exact nearest
neighbor, the problem is changed to that of finding a neighbor within
a factor of 1+ε of the distance to the true nearest neighbor. While this may
seem sensible, this criterion says nothing about the rank of
the points returned by such a method, which may be arbitrarily poor.
In fact, the higher the dimension, the more of the points lie within
1+ε of the true nearest-neighbor distance, so it can be that
returning virtually any point is sufficient to solve this problem.
Thus I strongly urge caution regarding using this approach in practice.
Nonetheless, some applications of nearest-neighbor searching can
withstand occasionally egregious errors. Computer vision researchers,
for example, have found this approach useful in some applications.
Our tree-based data structure and associated algorithm yields up to
30x speedups over the best-known methods, such as recent hashing approaches.
(Liu, Moore, Gray, and Yang,
An Investigation of
Practical Approximate Nearest-Neighbor Algorithms [pdf], [ps], NIPS 2004.)
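The claim above about high dimension can be checked with a quick experiment: for uniform random points, measure what fraction of the dataset lies within (1+ε) of the nearest-neighbor distance. The dimensions and sample sizes below are arbitrary choices for illustration.

```python
import math
import random

def fraction_within(dim, n_points, eps, seed=0):
    # Fraction of reference points whose distance to a random query is
    # within a factor of (1 + eps) of the true nearest-neighbor distance,
    # for points drawn uniformly from the unit hypercube.
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = sorted(
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    )
    cutoff = (1 + eps) * dists[0]
    return sum(d <= cutoff for d in dists) / n_points
```

In low dimension the fraction is tiny (the (1+ε) ball around the query contains almost nothing), while in very high dimension distances concentrate and most of the dataset qualifies as a valid answer, which is exactly the rank pathology warned about above.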

Although the nearest-neighbor searching problem is posed in terms of a
single query point and a set of reference points to be searched, very
often (perhaps most often) the user has in hand a set of query points
and wants the nearest neighbor of each. It turns out that this can be
done in less time than it takes to process each query point
individually.
See our formulation of this problem (in its most general bichromatic form)
as an N-body problem, which can utilize
virtually any space-partitioning tree. The standard nearest-neighbor
branch-and-bound algorithm is a special case of this algorithm, corresponding
to a query set of size 1.
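The batch problem itself can be stated in a few lines. The sketch below answers all queries in one vectorized pass; it only hints at the benefit of batching, since the N-body formulation described above instead shares space-partitioning tree traversals across nearby queries rather than merely vectorizing the distance computations.

```python
import numpy as np

def all_nearest_neighbors(queries, references):
    # For every query point, return the index of its nearest reference
    # point, processing the whole query set as one batch.
    # Shapes: queries (m, d), references (n, d); result (m,).
    d2 = ((queries[:, None, :] - references[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

The single-query case corresponds to passing a query array of one row, mirroring how the standard branch-and-bound algorithm is the size-1 special case of the batch formulation.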

