NIPS*1999 | Statistical Learning in High Dimensions | December 1999
Breckenridge, Colorado
Abstracts
Yoshua Bengio
Université de Montréal
Using the outputs of a neural network to represent the parameters of the conditional distribution of a random variable, and using the same neural network to model many conditional distributions, one can learn the joint distribution of high-dimensional data (even high-dimensional discrete data) with a reasonable number of free parameters, while taking advantage of hidden units to capture commonalities between the different conditional distributions. Preliminary comparative experiments yielding excellent results in terms of out-of-sample log-likelihood will be described.
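As a rough illustration of this idea (not the talk's exact architecture), the sketch below factors the joint distribution of binary data into a product of conditionals and lets a single shared hidden layer serve all of them; the layer sizes and masking scheme are assumptions made here for concreteness.

```python
import torch
import torch.nn as nn

class SharedConditionalNet(nn.Module):
    """One network models every conditional p(x_i | x_1..x_{i-1});
    the joint over D binary variables is their product, and the shared
    hidden layer captures commonalities between the conditionals."""
    def __init__(self, dim, hidden=32):
        super().__init__()
        self.dim = dim
        self.hidden = nn.Linear(2 * dim, hidden)  # visible values plus a mask indicator
        self.out = nn.Linear(hidden, dim)         # one Bernoulli logit per variable

    def log_likelihood(self, x):
        ll = 0.0
        for i in range(self.dim):
            mask = torch.zeros_like(x)
            mask[:, :i] = 1.0                     # only x_1..x_{i-1} is visible
            h = torch.tanh(self.hidden(torch.cat([x * mask, mask], dim=1)))
            logit_i = self.out(h)[:, i]
            ll = ll - nn.functional.binary_cross_entropy_with_logits(
                logit_i, x[:, i], reduction='sum')
        return ll                                 # summed log p(x) over the batch
```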
John W. Fisher III
Massachusetts Institute of Technology
Modern machine learning methods are being applied to data of increasingly high dimension. Classical decision-theoretic approaches are not well suited to high-dimensional data. Consequently, dimensionality reduction, or feature extraction, is often performed first in an attempt to simplify the estimation or classification task. Common methods for feature extraction, however, are either ad hoc or optimal only in the signal-reconstruction sense (e.g. eigenvector-based methods). The challenging task is to learn "informative" directions from high-dimensional data. Utilizing principles of information theory, non-parametric statistics, and machine learning, I describe a task-driven feature extraction approach. Specifically, the features preserve information related to the given estimation/classification problem. Mutual information, motivated by Fano's inequality, is the criterion used for feature extraction. The novelty of the approach is that mutual information is optimized in the feature space (thereby avoiding the curse of dimensionality) without explicit estimation or modeling of the underlying density. I present experimental results for two challenging image processing problems: pose estimation and classification of high-resolution SAR imagery.
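As a toy stand-in for this approach (not the talk's estimator or optimizer), the sketch below scores candidate one-dimensional projections by a crude histogram estimate of their mutual information with the class labels, so the only density estimates are made in the low-dimensional feature space; the random-direction search and bin count are illustrative assumptions.

```python
import numpy as np

def mi_1d(feature, labels, bins=16):
    """Histogram estimate of the mutual information between a scalar
    feature and discrete labels (computed entirely in the 1-D feature space)."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    n, mi = len(feature), 0.0
    px = np.histogram(feature, bins=edges)[0] / n
    for y in np.unique(labels):
        py = (labels == y).mean()
        joint = np.histogram(feature[labels == y], bins=edges)[0] / n
        nz = joint > 0
        mi += (joint[nz] * np.log(joint[nz] / (px[nz] * py))).sum()
    return mi

def informative_direction(X, y, n_candidates=500, seed=None):
    """Search random unit directions and keep the projection with the highest
    estimated mutual information with the labels (random search stands in for
    the gradient-based optimization used in practice)."""
    rng = np.random.default_rng(seed)
    best_w, best_mi = None, -np.inf
    for _ in range(n_candidates):
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        score = mi_1d(X @ w, y)
        if score > best_mi:
            best_w, best_mi = w, score
    return best_w, best_mi
```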
Piotr Indyk
Stanford University
Recently, in the field of Computational Geometry, there has been significant progress in designing algorithms for proximity problems (e.g. nearest neighbor, closest pair, minimum spanning tree) in high-dimensional spaces. The key property of the new algorithms is that the dependence of their running time on the dimension is only polynomial (as opposed to the exponential dependence exhibited by previous techniques), while maintaining sublinear/subquadratic dependence on the number of input points. This comes at the price of approximation: the algorithms find solutions within a factor of c > 1 of the optimal ones. In this talk I will review some of these developments.
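For concreteness, here is a minimal sketch of one technique in this family, locality-sensitive hashing with random hyperplanes for approximate nearest neighbor; the table and bit counts are arbitrary illustrative choices rather than parameters from the talk.

```python
import numpy as np

def build_lsh_tables(points, n_tables=10, n_bits=12, seed=None):
    """Hash each point by the sign pattern of random hyperplane projections;
    nearby points tend to share buckets, so lookups only touch a few buckets."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_tables, n_bits, points.shape[1]))
    tables = []
    for t in range(n_tables):
        codes = points @ planes[t].T > 0           # (n_points, n_bits) sign pattern
        table = {}
        for idx, code in enumerate(codes):
            table.setdefault(code.tobytes(), []).append(idx)
        tables.append(table)
    return planes, tables

def approx_nearest(query, points, planes, tables):
    """Return the closest point among the candidates in the query's buckets."""
    candidates = set()
    for t, table in enumerate(tables):
        candidates.update(table.get((planes[t] @ query > 0).tobytes(), []))
    if not candidates:
        return None                                # may miss: the c > 1 approximation trade-off
    cand = np.array(sorted(candidates))
    return cand[np.argmin(np.linalg.norm(points[cand] - query, axis=1))]
```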
Marina Meila
Carnegie Mellon University
Chow and Liu introduced an algorithm for fitting a multivariate distribution with a tree graphical model. The original algorithm is quadratic in the domain dimension n and linear in the number of data points N that define the target distribution. We present two algorithms that aim to accelerate the Chow and Liu algorithm for high-dimensional domains and large data sets. The first algorithm constructs a Gaussian tree over a continuous domain, using sampling to direct the search towards the most promising edges. The second algorithm is tailored for discrete domains with sparse variables. We show that in the case of sparse data, a tree distribution can be fit exactly in time and memory that is typically subquadratic in both n and N. The algorithm exhibits speed-ups of up to three orders of magnitude in experiments.
Joint work with Andrew Moore.
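For reference, a minimal sketch of the baseline Chow-Liu step being accelerated here (estimate pairwise mutual information, then keep a maximum-weight spanning tree), assuming binary data and ad hoc smoothing:

```python
import numpy as np
from itertools import combinations
from scipy.sparse.csgraph import minimum_spanning_tree

def chow_liu_tree(data):
    """Baseline Chow-Liu fit: quadratic in the dimension n, linear in N."""
    N, n = data.shape
    weights = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        joint = np.histogram2d(data[:, i], data[:, j], bins=2)[0] + 1e-9
        joint /= joint.sum()
        pi, pj = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        mi = (joint * np.log(joint / (pi * pj))).sum()
        weights[i, j] = weights[j, i] = mi
    # maximum-weight spanning tree = minimum spanning tree on negated weights
    # (pairs with exactly zero estimated MI would be dropped by this shortcut)
    mst = minimum_spanning_tree(-weights)
    return list(zip(*mst.nonzero()))   # the n-1 edges of the fitted tree
```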
Sayan Mukherjee
M.I.T.
A key idea in the Support Vector Method for density estimation is the multivariate Kolmogorov-Smirnov statistic. Some properties of this statistic will be discussed: independence with respect to the distribution, and scaling as the square root of n for n large enough. Empirical plots of the distribution will be shown. Finally, the application of this statistic to density estimation will be described.
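As a rough illustration (the exact definition used in the talk may differ), the sketch below evaluates a multivariate Kolmogorov-Smirnov-type statistic as the largest gap between the empirical CDF and a reference CDF, checked at the sample points themselves:

```python
import numpy as np

def multivariate_ks(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and a reference CDF,
    with the empirical CDF at x taken as the fraction of points dominated
    componentwise by x (the supremum is approximated at the sample points)."""
    dominated = (sample[None, :, :] <= sample[:, None, :]).all(axis=2)
    emp = dominated.mean(axis=1)
    ref = np.array([cdf(x) for x in sample])
    return np.max(np.abs(emp - ref))

# Example: the sqrt(n)-scaled statistic for uniform data on the unit square
rng = np.random.default_rng(0)
x = rng.random((500, 2))
print(np.sqrt(len(x)) * multivariate_ks(x, cdf=lambda p: p[0] * p[1]))
```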
Thomas Hofmann
Brown University
Many applications of machine learning methods in domains such as information retrieval, natural language processing, molecular biology, neuroscience, and economics have to be able to deal with various sorts of discrete data that is typically of very high dimensionality.
One standard approach to dealing with high-dimensional data is to perform a dimension reduction and map the data to some lower-dimensional representation. Reducing the data dimensionality is often a valuable analysis by itself, but it might also serve as a pre-processing step to improve or accelerate subsequent stages such as classification or regression. Two closely related methods that are often used in this context, and that can be found in virtually every textbook on unsupervised learning, are principal component analysis (PCA) and factor analysis. However, PCA relies on a least-squares approximation principle, and factor analysis is based on assumptions about the normality of random variables. In contrast, methods for discrete data such as qualitative or count data (as well as for continuous but non-Gaussian data) have been widely ignored.
Various techniques have been proposed in the statistical literature over the last 60 years: canonical analysis, correspondence analysis, association analysis, and latent class analysis being the most important ones. The history of these methods up to the present day has been a story of oblivion and re-discovery. It is thus important to find a systematic framework that allows one to understand the relationships between these methods, both conceptually and in terms of their computational aspects. We provide such a unifying view by clarifying the geometrical foundations of dimension reduction methods for qualitative data. We also focus on the question of how today's machine learning problems differ from traditional statistical problems and what consequences this has for the applicability of dimension reduction techniques. Experimental results from information retrieval, collaborative filtering, and linguistics are used to stress this point.
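As one concrete example of the latent class analysis mentioned above, here is a minimal EM sketch for an aspect-style latent class model of a co-occurrence count matrix; the number of classes, initialization, and iteration count are illustrative assumptions, not choices from the talk.

```python
import numpy as np

def latent_class_em(counts, n_classes=8, n_iter=50, seed=None):
    """EM for the latent class model p(r, c) = sum_z p(z) p(r|z) p(c|z)
    fitted to a rows-by-columns matrix of co-occurrence counts."""
    rng = np.random.default_rng(seed)
    R, C = counts.shape
    pz = np.full(n_classes, 1.0 / n_classes)
    pr = rng.dirichlet(np.ones(R), size=n_classes)   # p(r|z), shape (K, R)
    pc = rng.dirichlet(np.ones(C), size=n_classes)   # p(c|z), shape (K, C)
    for _ in range(n_iter):
        # E-step: responsibilities p(z | r, c)
        joint = pz[:, None, None] * pr[:, :, None] * pc[:, None, :]
        post = joint / joint.sum(axis=0, keepdims=True)
        # M-step: re-estimate the multinomials from expected counts
        expected = post * counts[None, :, :]
        mass = expected.sum(axis=(1, 2))
        pr = expected.sum(axis=2) / mass[:, None]
        pc = expected.sum(axis=1) / mass[:, None]
        pz = mass / mass.sum()
    return pz, pr, pc
```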
Bo Thiesson
Christopher Meek
David Heckerman
Microsoft Research
The EM algorithm is a popular method for parameter estimation in a variety of problems involving missing data or incomplete information. The EM algorithm often requires significant computational resources and has been dismissed as impractical for large databases. We present an approach for significantly reducing the computational cost of applying the EM algorithm: a version of the incremental EM algorithm described by Neal and Hinton (1998), which iterates through blocks of data in a cyclic way. The number of cases in each block dramatically affects the efficiency of the algorithm. We show how to easily select a (near) optimal block size. We also demonstrate that the method significantly reduces computational costs for two high-dimensional real-world problems with large databases.
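As a rough illustration of blocked incremental EM in the Neal-Hinton style (applied here to a one-dimensional Gaussian mixture rather than the models used in this work), the sketch below keeps per-block sufficient statistics and refreshes the parameters after every block instead of after a full pass; the block size and initialization are placeholder assumptions.

```python
import numpy as np

def incremental_em_gmm(data, k=3, block_size=500, n_sweeps=5, seed=None):
    """Blocked incremental EM for a 1-D Gaussian mixture: each block's old
    sufficient statistics are swapped for fresh ones, then the parameters
    are re-estimated, so every block visit improves the current model."""
    rng = np.random.default_rng(seed)
    n = len(data)
    means, vars_ = rng.choice(data, k), np.full(k, data.var())
    weights = np.full(k, 1.0 / k)
    blocks = [slice(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    stats = [np.zeros((3, k)) for _ in blocks]      # per-block (S0, S1, S2)
    totals = np.zeros((3, k))
    for _ in range(n_sweeps):
        for b, blk in enumerate(blocks):
            x = data[blk][:, None]
            # E-step restricted to this block
            log_p = (-0.5 * (x - means) ** 2 / vars_
                     - 0.5 * np.log(2 * np.pi * vars_) + np.log(weights))
            resp = np.exp(log_p - log_p.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)
            new = np.stack([resp.sum(0), (resp * x).sum(0), (resp * x * x).sum(0)])
            # swap in the block's new statistics, then do a partial M-step
            totals += new - stats[b]
            stats[b] = new
            S0, S1, S2 = np.maximum(totals[0], 1e-12), totals[1], totals[2]
            weights, means = S0 / S0.sum(), S1 / S0
            vars_ = np.maximum(S2 / S0 - means ** 2, 1e-6)
    return weights, means, vars_
```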
Alessandro Verri
M.I.T. and Università di Genova
In this talk we report on experimental work in which Support Vector Machines (SVMs) are used to address classification problems in high-dimensional spaces. We consider two different applications. In the first, we describe preliminary results on a cancer classification problem based on gene expression monitoring using DNA microarrays. In the second, we discuss trainable systems for object detection and recognition. In both cases we highlight the motivation and the advantages underlying the use of SVMs in high-dimensional spaces and illustrate the main open problems.
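As a minimal illustration of the setting (synthetic data only; none of the talk's gene-expression or vision experiments are reproduced here), a linear SVM trained on inputs with far more features than examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 200 examples in a 2000-dimensional space, only a few informative directions
X, y = make_classification(n_samples=200, n_features=2000, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel='linear', C=1.0).fit(X_tr, y_tr)
print('held-out accuracy:', clf.score(X_te, y_te))
```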