Optimal Distance Measures for the Nearest Neighbor Rule

Shyjan Mahamud

Abstract

  The nearest neighbor rule for classification is useful whenever it is difficult to model each class with generative approaches. The classification performance of NN rule is critically dependent on the distance measure used for finding the nearest neighbor. A distance measure that minimizes the mis-classification risk for the 1-nearest neighbor search can be shown to be the probability that a pair of input measurements belong to different classes, a result that is also known in the literature on Canonical Distortion Measures where a different criteria is minimized. This pair-wise probability is not in general a metric distance measure, although somewhat surprisingly it does satisfy the triangle inequality. Furthermore, it can out-perform any metric distance, approaching even the Bayes optimal performance.

Although in theory we can express the optimal distance measure in terms of generative models, in practice this approach is limited by the ability to estimate generative models from data. Instead, we us a linear logistic model for the optimal distance measure that combines the discriminative powers of more elementary distance measures associated with a collection of simple feature spaces that are easy and efficient to implement.

For performing efficient nearest neighbor search over large training sets, the linear model was extended to discretized distance measures that combines distance measures associated with discriminators. The discrete model was combined with the continuous model to yield a hierarchical distance model that is both fast as well as accurate.

We will also discuss connections to boosting and the maximum entropy approach.


Back to the Main Page

Charles Rosenberg
Last modified: Thu Oct 3 22:10:07 EDT 2002