Stanford University

Download slides here (gzipped tar archive of PostScript files).

The theoretical part of the talk is based on the STOC '98 paper "Approximate nearest neighbors: Towards removing the curse of dimensionality", by Piotr Indyk and Rajeev Motwani. The talk also includes preliminary experimental work.

Say we have a set `S` of `n` points in
`d`-dimensional space. Given a query point `q`, the
*nearest-neighbor problem* is to
find the point in `S` closest to `q`
under some distance metric. Define `d`(`q`,
`S`) as this distance.

The very simple *brute-force algorithm* does this fairly
quickly: compute the distance from `q` to each of the points
in `S`. If the distance function is reasonably computable,
this takes O(`nd`) time. This is good enough for
most purposes, but for some it is impractical.
For example, consider information retrieval systems (such as AltaVista or
Excite) where each document is represented as a set of words: the value
of `d` is the vocabulary size (perhaps 10^{5}), and
the number `n` of documents is around 10^{8}.
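The brute-force scan described above can be sketched in a few lines; this is a minimal illustration, assuming points are coordinate tuples and the metric is Euclidean distance (the function name is ours, not from the talk):

```python
import math

def nearest_neighbor(S, q):
    """Brute-force nearest neighbor: compare q against every point in S.

    S is a list of n points, each a tuple of d coordinates. One distance
    computation costs O(d), so the whole scan takes O(n*d) time.
    """
    best_point, best_dist = None, math.inf
    for p in S:
        dist = math.dist(p, q)  # Euclidean distance in d dimensions
        if dist < best_dist:
            best_point, best_dist = p, dist
    return best_point, best_dist
```

For the document-retrieval example, `math.dist` would be replaced by whatever (approximate) document-similarity metric the system uses; the O(`nd`) scan structure is unchanged.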

Although nobody has improved on the brute-force algorithm for
high-dimensional nearest neighbors, several researchers have recently
worked on approximation algorithms: after all, in an information
retrieval system finding the closest document is not essential, and the
distance metric is just an approximation anyway.
The goal is a fast algorithm guaranteed to return a
point of `S` whose distance from `q` is at most
(1 + `eps`) `d`(`q`, `S`).
The time to process a single query should be polynomial in
log `n`, `d`, and 1/`eps`, while
the preprocessing time to build the data structure and the size of the
data structure itself should be polynomial
in `n`, `d`, and 1/`eps`.
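The (1 + `eps`) guarantee above can be stated as a simple contract; the following sketch checks it against the true nearest-neighbor distance, computed by a brute-force scan (the function name and Euclidean metric are illustrative assumptions):

```python
import math

def within_guarantee(S, q, answer, eps):
    """Check the (1 + eps)-approximation contract: the returned point
    must lie within (1 + eps) * d(q, S) of q, where d(q, S) is the true
    nearest-neighbor distance (found here by a brute-force scan)."""
    true_dist = min(math.dist(p, q) for p in S)
    return math.dist(answer, q) <= (1 + eps) * true_dist
```

With `eps` = 0 this reduces to exact nearest neighbor; larger `eps` admits proportionally more distant answers in exchange for faster queries.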

Although this goal has not yet been achieved, this talk presents
two recent algorithms. One has polynomial preprocessing and sublinear
(but not polylogarithmic in `n`) query time, provided
`eps` > 1; the other provides the desired query time but
requires storage that is nearly linear in `n` and carries a factor
of (1/`eps`)^{d}.

This talk presents the ideas behind the recent algorithms and some measurements of implementations. Briefly, the measurements indicate that the more recent algorithms are becoming viable alternatives to the brute-force algorithm.