NIPS*1999

Post-Conference Workshop

Statistical Learning in High Dimensions
December 1999

Breckenridge, Colorado

Workshop Co-Chairs:
Marina Meila, Robotics Institute, Carnegie Mellon University
Andrew W. Moore, Robotics Institute, Carnegie Mellon University




PROBLEMS AND GOALS

FORMAT

SPEAKERS

ABSTRACTS

SCHEDULE





Problems and goals:
High-dimensional data is one of the challenges that machine learning researchers face increasingly often. This is a consequence both of the growing volume of available data collections and of the spread of machine learning techniques to ever wider application areas. Domains with typically high data dimensionality include pattern recognition and image processing, text and language modeling, diagnosis systems, computational biology, and genetics.

Constructing models from data in high-dimensional domains raises problems that are nonexistent or less severe in lower-dimensional cases: data, models, and error surfaces in more than three dimensions are hard to visualize and to represent intuitively. If the variables are discrete, the size of the state space grows exponentially with the number of dimensions and may exceed the number of available samples by many orders of magnitude; for instance, 30 binary variables already define 2^30, or over a billion, joint configurations. As a consequence, overfitting avoidance and feature selection become critical. Moreover, an increased dimensionality of the parameter space (in the case of parametric models) may lead to an exponential increase in the computational demands of finding the optimal set of parameters. Therefore, special attention must be paid to algorithmic issues: the need for models that can be learned and used efficiently is stringent for high-dimensional data sets.

Of the wide existing spectrum of models and machine learning methods, only a few are currently applied to high-dimensional problems, and many of these are among the simplest (naive Bayes, nearest neighbor). The reasons lie both in computational difficulties and in local minima and model selection problems. The goal of this workshop is to better understand the sources of difficulty in training (and using) high-dimensional statistical models and to expose the participants to recent solutions and to approaches outside the traditional scope of NIPS (e.g. multiscale models). We will focus on the algorithmic side of the problem by emphasizing:

  • fast and very fast exact algorithms
  • models/training algorithms with no local minima (e.g. support vectors, tree distributions; a sketch of the latter appears below)
  • data structures
  • approximate and domain-specific algorithms
We hope that the workshop will enable the application of machine learning techniques to a larger class of significant real-world problems.
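
As an illustration of the second bullet above, here is a minimal sketch of the Chow-Liu procedure for fitting a tree-structured distribution to binary data: because the optimal tree is a maximum-weight spanning tree over the pairwise mutual informations, training has a single global optimum and no local minima. The binary encoding, the NumPy dependency, and all names below are illustrative assumptions, not material from the workshop itself.

    import numpy as np

    def mutual_information(x, y):
        # Empirical mutual information between two binary (0/1) columns.
        mi = 0.0
        for a in (0, 1):
            for b in (0, 1):
                p_ab = np.mean((x == a) & (y == b))
                p_a, p_b = np.mean(x == a), np.mean(y == b)
                if p_ab > 0:
                    mi += p_ab * np.log(p_ab / (p_a * p_b))
        return mi

    def chow_liu_tree(data):
        # Grow the maximum-weight spanning tree of the mutual-information
        # graph greedily (Prim's algorithm); here greedy is exact.
        d = data.shape[1]
        mi = np.zeros((d, d))
        for i in range(d):
            for j in range(i + 1, d):
                mi[i, j] = mi[j, i] = mutual_information(data[:, i], data[:, j])
        in_tree, edges = {0}, []
        while len(in_tree) < d:
            i, j = max(((i, j) for i in in_tree
                        for j in range(d) if j not in in_tree),
                       key=lambda e: mi[e])
            edges.append((i, j))
            in_tree.add(j)
        return edges

    # Example: variable 1 mostly copies variable 0, so the learned tree
    # should contain the edge (0, 1).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(500, 5))
    X[:, 1] = X[:, 0] ^ (rng.random(500) < 0.1).astype(np.int64)
    print(chow_liu_tree(X))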

The workshop will bring together researchers from fields concerned with constructing statistical models of high-dimensional data, including data mining, graphical models, and algorithms, to discuss issues that are relevant across different fields. It aims to make the NIPS community aware of domain-specific approaches to this problem and of the typical assumptions that allow learning in various fields of application. In addition, the workshop will help other communities understand how learning and the NIPS community can be useful in solving their problems.

A key goal of the workshop will be to expose researchers to ideas and open problems like:

  • When "quadratic" is not good enough. Very fast algorithms for large problems.
  • How to use prior knowledge to speed up training and search? Lessons from domain-specific paradigms.
  • How to efficiently prune irrelevant features. Implicit versus explicit feature selection techniques.
  • Efficient computation of sufficient statistics. It is known that when learning the structure of a graphical model a large number of statistics (cooccurrence counts) must be evaluated, and that this is one of the most computationally intensive stages of the search over models. What techniques are available for computing and storing the sufficient statistics efficiently, approximating them, or predicting their values from other statistics? (A counting sketch appears after this list.)
  • Approximate belief net propagation methods that scale well.
  • Supervised versus unsupervised training. One often finds that density estimators perform well in classification/recognition tasks. What causes this behavior? Are there lessons to be learned that would improve classifier training?
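
To make the sufficient-statistics item above concrete, here is a minimal sketch of one such technique: for data stored as a 0/1 matrix (an assumed encoding, not a workshop prescription), all pairwise cooccurrence counts can be read off a single matrix product and the univariate counts, with no explicit loop over variable pairs.

    import numpy as np

    def pairwise_counts(X):
        # X: (n_samples, d) 0/1 matrix. Returns four d x d tables with
        # N_ab[i, j] = number of samples in which X_i = a and X_j = b.
        X = X.astype(np.int64)
        n = X.shape[0]
        ones = X.sum(axis=0)          # per-variable counts of X_i = 1
        n11 = X.T @ X                 # both variables equal to 1
        n10 = ones[:, None] - n11     # first variable 1, second 0
        n01 = ones[None, :] - n11     # first variable 0, second 1
        n00 = n - n11 - n10 - n01     # both variables equal to 0
        return n00, n01, n10, n11

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 20))
    n00, n01, n10, n11 = pairwise_counts(X)
    # From these tables, the mutual information or chi-square score of
    # every variable pair follows without revisiting the data.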

Format:
This will be a one-day workshop, interspersing short invited talks (20 min) with moderated discussions. The speakers will be encouraged to address challenges and controversial topics both in their prepared talks and in the ensuing discussions. The choice of topics will be balanced between problems/challenges and presentations of algorithmic solutions. For the latter, the presentation of new approaches or work in progress will be especially encouraged. The former may include tutorial material if it refers to fields outside the scope of the majority of attendees. To maximize the benefit for all participants, the focus will be on algorithmic issues that are general or common to several fields and on identifying solutions with potential for generalization. Since one of the goals of the workshop is to facilitate communication between researchers in different subfields, ample time will be given to questions. The last part of the workshop will be devoted to a discussion of the most promising approaches and ideas that will have emerged during the workshop.

Contact Info

Marina Meila
Smith Hall 208
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh PA 15213
mmp@cs.cmu.edu
Phone:(412)268-8424
Fax:(412)268-5571

Andrew W. Moore
Smith Hall 221
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh PA 15213
awm@cs.cmu.edu
Phone:(412)268-7599
Fax:(412)268-5571
