NIPS*1999 | Statistical Learning in High Dimensions | December 1999
Breckenridge, Colorado
Abstracts
Yoshua Bengio
Université de Montréal
Using the outputs of a neural network to represent the parameters of the conditional distribution of a random variable, and using the same neural network to model many conditional distributions, one can learn the joint distribution of high-dimensional data (even high-dimensional discrete data) with a reasonable number of free parameters, while taking advantage of hidden units to capture commonalities between the different conditional distributions. Preliminary comparative experiments yielding excellent results in terms of out-of-sample log-likelihood will be described.
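As a rough illustration of this idea (not the talk's exact architecture), the sketch below factors the joint distribution of binary data into a product of conditionals and lets a single shared hidden layer serve all of them; the layer sizes and masking scheme are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

class SharedConditionalNet(nn.Module):
    """One network models every conditional p(x_i | x_1..x_{i-1});
    the joint over D binary variables is their product, and the shared
    hidden layer captures commonalities between the conditionals."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.dim = dim
        self.hidden = nn.Linear(2 * dim, hidden)  # visible values plus a mask indicator
        self.out = nn.Linear(hidden, dim)         # one Bernoulli logit per variable

    def log_likelihood(self, x):
        ll = 0.0
        for i in range(self.dim):
            mask = torch.zeros_like(x)
            mask[:, :i] = 1.0                     # only x_1..x_{i-1} is visible
            h = torch.tanh(self.hidden(torch.cat([x * mask, mask], dim=1)))
            logit_i = self.out(h)[:, i]
            ll = ll - nn.functional.binary_cross_entropy_with_logits(
                logit_i, x[:, i], reduction='sum')
        return ll                                 # summed log p(x) over the batch
```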
John W. Fisher III
Massachusetts Institute of Technology
Modern machine learning methods are being applied to data of increasingly high dimension. Classical decision-theoretic approaches are not well suited to high-dimensional data. Consequently, dimensionality reduction, or feature extraction, is often performed first in an attempt to simplify the estimation or classification task. Common methods for feature extraction, however, are either ad hoc or optimal only in the signal-reconstruction sense (e.g. eigenvector-based methods). The challenging task is to learn "informative" directions from high-dimensional data. Utilizing principles of information theory, non-parametric statistics, and machine learning, I describe a task-driven feature extraction approach. Specifically, the features preserve information related to the given estimation/classification problem. Mutual information, motivated by Fano's inequality, is the criterion used for feature extraction. The novelty of the approach is that mutual information is optimized in the feature space (thereby avoiding the curse of dimensionality) without explicit estimation or modeling of the underlying density. I present experimental results for two challenging image processing problems: pose estimation and classification of high-resolution SAR imagery.
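As a toy stand-in for this approach (not the talk's estimator or optimizer), the sketch below scores candidate one-dimensional projections by a crude histogram estimate of their mutual information with the class labels, so the only density estimates are made in the low-dimensional feature space; the random-direction search and bin count are illustrative assumptions.

```python
import numpy as np

def mi_1d(feature, labels, bins=16):
    """Histogram estimate of the mutual information between a scalar
    feature and discrete labels (computed entirely in the 1-D feature space)."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    n, mi = len(feature), 0.0
    px = np.histogram(feature, bins=edges)[0] / n
    for y in np.unique(labels):
        py = (labels == y).mean()
        joint = np.histogram(feature[labels == y], bins=edges)[0] / n
        nz = joint > 0
        mi += (joint[nz] * np.log(joint[nz] / (px[nz] * py))).sum()
    return mi

def informative_direction(X, y, n_candidates=500, seed=None):
    """Search random unit directions and keep the projection with the highest
    estimated mutual information with the labels (random search stands in for
    the gradient-based optimization used in practice)."""
    rng = np.random.default_rng(seed)
    best_w, best_mi = None, -np.inf
    for _ in range(n_candidates):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        score = mi_1d(X @ w, y)
        if score > best_mi:
            best_w, best_mi = w, score
    return best_w, best_mi
```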
Piotr Indyk
Stanford University
Recently, in the field of Computational Geometry, there has been significant progress in designing algorithms for proximity problems (e.g. nearest neighbor, closest pair, minimum spanning tree) in high-dimensional spaces. The key property of the new algorithms is that the dependence of their running time on the dimension is only polynomial (as opposed to the exponential dependence exhibited by previous techniques), while maintaining sublinear/subquadratic dependence on the number of input points. This comes at the price of approximation: the algorithms find solutions within a factor of c > 1 of the optimal ones. In this talk I will review some of these developments.
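For concreteness, here is a minimal sketch of one technique in this family, locality-sensitive hashing with random hyperplanes for approximate nearest neighbor; the table and bit counts are arbitrary illustrative choices rather than parameters from the talk.

```python
import numpy as np

def build_lsh_tables(points, n_tables=10, n_bits=12, seed=None):
    """Hash each point by the sign pattern of random hyperplane projections;
    nearby points tend to share buckets, so lookups only touch a few buckets."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_tables, n_bits, points.shape[1]))
    tables = []
    for t in range(n_tables):
        codes = points @ planes[t].T > 0           # (n_points, n_bits) sign pattern
        table = {}
        for idx, code in enumerate(codes):
            table.setdefault(code.tobytes(), []).append(idx)
        tables.append(table)
    return planes, tables

def approx_nearest(query, points, planes, tables):
    """Return the closest point among the candidates in the query's buckets."""
    candidates = set()
    for t, table in enumerate(tables):
        candidates.update(table.get((planes[t] @ query > 0).tobytes(), []))
    if not candidates:
        return None                                # may miss: the c > 1 approximation trade-off
    cand = np.array(sorted(candidates))
    return cand[np.argmin(np.linalg.norm(points[cand] - query, axis=1))]
```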
Marina Meila
Carnegie Mellon University
Chow and Liu introduced an algorithm for fitting a multivariate distribution with a tree graphical model. The original algorithm is quadratic in the domain dimension n and linear in the number of data points N that define the target distribution. We present two algorithms that aim to accelerate the Chow and Liu algorithm for high-dimensional domains and large data sets. The first algorithm constructs a Gaussian tree over a continuous domain, using sampling to direct the search towards the most promising edges. The second algorithm is tailored for discrete domains with sparse variables. We show that in the case of sparse data, a tree distribution can be fit exactly in time and memory that is typically subquadratic in both n and N. The algorithm exhibits speed-ups of up to three orders of magnitude in experiments.
Joint work with Andrew Moore.
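For reference, a minimal sketch of the baseline Chow-Liu step being accelerated here (estimate pairwise mutual information, then keep a maximum-weight spanning tree), assuming binary data and ad hoc smoothing:

```python
import numpy as np
from itertools import combinations
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree(data):
    """Baseline Chow-Liu fit: quadratic in the dimension n, linear in N."""
    N, n = data.shape
    weights = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        joint = np.histogram2d(data[:, i], data[:, j], bins=2)[0] + 1e-9
        joint /= joint.sum()
        pi, pj = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        mi = (joint * np.log(joint / (pi * pj))).sum()
        weights[i, j] = weights[j, i] = mi
    # maximum-weight spanning tree = minimum spanning tree on negated weights
    # (pairs with exactly zero estimated MI would be dropped by this shortcut)
    mst = minimum_spanning_tree(-weights)
    return list(zip(*mst.nonzero()))   # the n-1 edges of the fitted tree
```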
Sayan Mukherjee
M.I.T.
A key idea in the Support Vector Method for density estimation is the multivariate Kolmogorov-Smirnov statistic. Some properties of this statistic will be discussed: independence with respect to the distribution, and scaling as the square root of n for n large enough. Empirical plots of the distribution will be shown. Finally, the application of this statistic to density estimation will be described.
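As a rough illustration (the exact definition used in the talk may differ), the sketch below evaluates a multivariate Kolmogorov-Smirnov-type statistic as the largest gap between the empirical CDF and a reference CDF, checked at the sample points themselves:

```python
import numpy as np

def multivariate_ks(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a reference CDF,
    with the empirical CDF at x taken as the fraction of points dominated
    componentwise by x (the supremum is approximated at the sample points)."""
    dominated = (sample[None, :, :] <= sample[:, None, :]).all(axis=2)
    emp = dominated.mean(axis=1)
    ref = np.array([cdf(x) for x in sample])
    return np.max(np.abs(emp - ref))

# Example: the sqrt(n)-scaled statistic for uniform data on the unit square
rng = np.random.default_rng(0)
x = rng.random((500, 2))
print(np.sqrt(len(x)) * multivariate_ks(x, cdf=lambda p: p[0] * p[1]))
```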
Thomas Hofmann
Brown University
Many applications of machine learning methods in domains such as information retrieval, natural language processing, molecular biology, neuroscience, and economics have to be able to deal with various sorts of discrete data that is typically of very high dimensionality.
One standard approach to dealing with high-dimensional data is to perform a dimension reduction and map the data to some lower-dimensional representation. Reducing the data dimensionality is often a valuable analysis by itself, but it might also serve as a pre-processing step to improve or accelerate subsequent stages such as classification or regression. Two closely related methods that are often used in this context, and that can be found in virtually every textbook on unsupervised learning, are principal component analysis (PCA) and factor analysis. However, PCA relies on a least-squares approximation principle, and factor analysis is based on assumptions about the normality of random variables. In contrast, methods for discrete data such as qualitative or count data (as well as for continuous but non-Gaussian data) have been widely ignored.
Various techniques have been proposed in the statistical literature over the last 60 years: canonical analysis, correspondence analysis, association analysis, and latent class analysis being the most important ones. The history of these methods up to the present day has been a story of oblivion and re-discovery. It is thus important to find a systematic framework that allows one to understand the relationships between these methods, both conceptually and in terms of their computational aspects. We provide such a unifying view by clarifying the geometrical foundations of dimension reduction methods for qualitative data. We also focus on the question of how today's machine learning problems differ from traditional statistical problems and what consequences this has for the applicability of dimension reduction techniques. Experimental results from information retrieval, collaborative filtering, and linguistics are used to stress this point.
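As one concrete example of the latent class analysis mentioned above, here is a minimal EM sketch for an aspect-style latent class model of a co-occurrence count matrix; the number of classes, initialization, and iteration count are illustrative assumptions, not choices from the talk.

```python
import numpy as np

def latent_class_em(counts, n_classes=8, n_iter=50, seed=None):
    """EM for the latent class model p(r, c) = sum_z p(z) p(r|z) p(c|z)
    fitted to a rows-by-columns matrix of co-occurrence counts."""
    rng = np.random.default_rng(seed)
    R, C = counts.shape
    pz = np.full(n_classes, 1.0 / n_classes)
    pr = rng.dirichlet(np.ones(R), size=n_classes)   # p(r|z), shape (K, R)
    pc = rng.dirichlet(np.ones(C), size=n_classes)   # p(c|z), shape (K, C)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | r, c)
        joint = pz[:, None, None] * pr[:, :, None] * pc[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate the multinomials from expected counts
        expected = post * counts[None, :, :]
        mass = expected.sum(axis=(1, 2))
        pr = expected.sum(axis=2) / mass[:, None]
        pc = expected.sum(axis=1) / mass[:, None]
        pz = mass / mass.sum()
    return pz, pr, pc
```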
Bo Thiesson
Christopher Meek
David Heckerman
Microsoft Research
The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data or incomplete information. The EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present an approach for significantly reducing the computational cost of applying the EM algorithm: a version of the incremental EM algorithm described by Neal and Hinton (1998), which iterates through blocks of data in a cyclic way. The number of cases in each block dramatically affects the efficiency of the algorithm. We show how to easily select a (near) optimal block size. We also demonstrate that the method significantly reduces computational costs for two high-dimensional real-world problems with large databases.
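As a rough illustration of blocked incremental EM in the Neal-Hinton style (applied here to a one-dimensional Gaussian mixture rather than the models used in this work), the sketch below keeps per-block sufficient statistics and refreshes the parameters after every block instead of after a full pass; the block size and initialization are placeholder assumptions.

```python
import numpy as np

def incremental_em_gmm(data, k=3, block_size=500, n_sweeps=5, seed=None):
    """Blocked incremental EM for a 1-D Gaussian mixture: each block's old
    sufficient statistics are swapped for fresh ones, then the parameters
    are re-estimated, so every block visit improves the current model."""
    rng = np.random.default_rng(seed)
    n = len(data)
    means, vars_ = rng.choice(data, k), np.full(k, data.var())
    weights = np.full(k, 1.0 / k)
    blocks = [slice(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    stats = [np.zeros((3, k)) for _ in blocks]      # per-block (S0, S1, S2)
    totals = np.zeros((3, k))
    for _ in range(n_sweeps):
        for b, blk in enumerate(blocks):
            x = data[blk][:, None]
            # E-step restricted to this block
            log_p = (-0.5 * (x - means) ** 2 / vars_
                     - 0.5 * np.log(2 * np.pi * vars_) + np.log(weights))
            resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)
            new = np.stack([resp.sum(0), (resp * x).sum(0), (resp * x * x).sum(0)])
            # swap in the block's new statistics, then do a partial M-step
            totals += new - stats[b]
            stats[b] = new
            S0, S1, S2 = np.maximum(totals[0], 1e-12), totals[1], totals[2]
            weights, means = S0 / S0.sum(), S1 / S0
            vars_ = np.maximum(S2 / S0 - means ** 2, 1e-6)
    return weights, means, vars_
```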
Alessandro Verri
M.I.T. and Università di Genova
In this talk we report on experimental work in which Support Vector Machines (SVMs) are used to address classification problems in high-dimensional spaces. We consider two different applications. In the first, we describe preliminary results on a cancer classification problem based on gene expression monitoring using DNA microarrays. In the second, we discuss trainable systems for object detection and recognition. In both cases we highlight the motivation and the advantages underlying the use of SVMs in high-dimensional spaces and illustrate the main open problems.
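As a minimal illustration of the setting (synthetic data only; none of the talk's gene-expression or vision experiments are reproduced here), a linear SVM trained on inputs with far more features than examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 200 examples in a 2000-dimensional space, only a few informative directions
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X_tr, y_tr)
print('held-out accuracy:', clf.score(X_te, y_te))
```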