ICML 2004 Tutorial: Data Structures for Fast Statistics

DATA STRUCTURES
FOR FAST STATISTICS
A Tutorial to be presented at ICML 04

Alexander Gray and Andrew Moore
School of Computer Science, Carnegie Mellon University

NOTE: Printed handouts will be given to participants at this tutorial. There is no need to print materials beforehand for this tutorial.

Here is a gzipp'ed tar file of the notes in pdf form.

Topic
This half-day tutorial will describe the application of fundamental computer science concepts to algorithmic problems arising in machine learning and statistics. All machine learning and statistical methods must be implemented on computers -- as such, their computational properties have as much practical importance as their statistical properties. In the last 10 years, powerful data structures for both continuous and discrete data have provided orders-of-magnitude computational speedups for a wide variety of statistical problems; some of these results have been widely-publicized and some not. The goal of the tutorial is to explain these data structures, which originated in various fields (e.g. discrete algorithms, databases, computational geometry), and show how they can be used in a number of canonical calculations which occur ubiquitously as subroutines within statistical and machine learning methods.

To clarify, this tutorial is not about ``new statistics specific to large datasets'' but ``how to make your old favorite statistics tractable on large datasets''. Our approach will be to focus in detail on a few template problems and approaches to them, to ensure that at least the essential concepts for designing fast algorithms for the basic cases are conveyed. The tutorial will conclude by surveying more advanced problems and approaches, with references.

Outline

  I. INTRODUCTION. (5 min.)
    1. Why fast algorithms for statistics?
    2. FAQ.

  II. BASIC PROXIMITY PROBLEMS. (45 min.)
    1. Nearest-neighbor problems and range search problems in 
       statistics/learning.         
    2. kd-trees for fast nearest-neighbor search.                
    3. kd-trees for fast range search.
    4. Higher dimensions: metric trees.                          
    5. Five kinds of nearest-neighbor problems, and various
       search structures.                                    
    6. In brief: Fast nearest-neighbor classification.           
    7. Fast k-means: an example of embedded sub-problems.
    8. FAQ.                                                      

  III. BASIC DECISION AND SUMMATION PROBLEMS. (45 min.)
    1. Kernel decision, summation, and sum decision problems in 
       statistics/learning.
    2. Fast locally weighted regression.
    3. Fast Bayes classification: stepping stone to EM.
    4. Fast EM algorithm for mixtures of Gaussians.

  V. BASIC `ALL'-TYPE PROBLEMS. (40 min.)
    1. `All'-type problems in statistics/learning
    2. Dual-tree idea.                                           
    3. Fast bichromatic all-nearest-neighbors.
    4. Fast kernel density estimation.
    5. The well-separated pair decomposition for Cuevas 		 
       clustering.                                               
    6. In brief: Gaussian process regression.
    7. In brief: Multipole methods from computational physics.   

  VI. BASIC COUNTING PROBLEMS. (30 min.)
    1. Counting problems in statistical learning.
    2. Frequent sets.
    3. AD-trees.
    4. Fast Bayes nets learning.

  VII. FURTHER TOPICS. (15 min.)
    1. In brief: Single-pass algorithms: BIRCH, Priebe's method
    2. In brief: Incremental and sampling algorithms.
    3. In brief: Fast spatial scan statistics.
    4. In brief: n-point correlation.                   
    5. In brief: Fast kernel classification.

Intended Audience
Anyone who designs statistical and machine learning methods to be used on real-world datasets can benefit from understanding the tools we will describe. We will assume very little prior knowledge, aside from familiarity with common statistical and machine learning methods and general principles of computer programming. We correspondingly expect a large audience of participants with diverse backgrounds and needs.

The goal of the tutorial is that participants will walk away with the basic viewpoints needed to 1) apply the existing computational methods which we'll describe to their own instances of standard statistical problems, 2) understand more advanced existing computational methods based on the same underlying principles, and 3) develop new algorithmic approaches of their own to their unique problems.