High-Dimensional Similarity Search in Large Databases

Rajeev Motwani (rajeev@CS.Stanford.EDU)
Stanford University

Download slides here (gzipped tar archive of PostScript files).

The theoretical part of the talk is based on the STOC '98 paper ``Approximate nearest neighbors: Towards removing the curse of dimensionality'', by Piotr Indyk and Rajeev Motwani. The talk also includes preliminary experimental work.

Say we have a set S of n points in d-dimensional space. Given a query point q, the nearest-neighbor problem is to find the point in S closest to q according to metric d. Define d(q, S) as this distance.

The very simple brute-force algorithm does this fairly quickly: Compute the distance from q to each of the points in S. If the distance function is reasonably computable, this takes O(nd) time. This is good enough for most purposes, but for some purposes it is impractical. For example, consider information retrieval systems (like AltaVista or Excite) where each document is represented as a set of words: The value of d is the vocabulary size (maybe 105), and the number n of documents is around 108.

Although nobody has improved on the brute-force algorithm for high-dimensional nearest neighbors, recently several have worked on approximation algorithms: After all, in an information retrieval system finding the closest document is not essential, and the distance metric is just an approximation anyway. The goal is a quick algorithm that guarantees a point from S at distance from q at most (1 + eps) d(q, S). The time to process a single query should be polynomial in log n, d, and 1/eps, while the preprocessing time to build the data structure and the size of the data structure itself should be polynomial in n, d, and 1/eps.

Although this goal has not yet been achieved, this talk presents two recent algorithms. One has polynomial preprocessing and sublinear - but not polylogarithmic in n - query time (if eps>1); the other provides the desired query time but requires storage that is nearly linear in n but has a factor of (1/eps)d.

This talk presents the ideas behind the recent algorithms and some measurements of implementations. Briefly, the measurements indicate that more recent algorithms are moving into the range of being viable alternatives to the brute-force algorithm.