Topic
This half-day tutorial will describe the application of fundamental
computer science concepts to algorithmic problems arising in machine
learning and statistics. All machine learning and statistical methods
must be implemented on computers; as such, their computational
properties have as much practical importance as their statistical
properties. In the last 10 years, powerful data structures for both
continuous and discrete data have provided orders-of-magnitude
computational speedups for a wide variety of statistical problems;
some of these results have been widely publicized and some not. The
goal of the tutorial is to explain these data structures, which
originated in various fields (e.g. discrete algorithms, databases,
computational geometry), and to show how they can be used in a number of
canonical calculations which occur ubiquitously as subroutines within
statistical and machine learning methods.
To clarify, this tutorial is not about "new statistics specific to
large datasets" but "how to make your old favorite statistics
tractable on large datasets". Our approach will be to focus in
detail on a few template problems and approaches to them, to ensure
that at least the essential concepts for designing fast algorithms for
the basic cases are conveyed. The tutorial will conclude by surveying
more advanced problems and approaches, with references.

Outline
I. INTRODUCTION. (5 min.)
1. Why fast algorithms for statistics?
2. FAQ.
II. BASIC PROXIMITY PROBLEMS. (45 min.)
1. Nearest-neighbor problems and range search problems in
statistics/learning.
2. kd-trees for fast nearest-neighbor search.
3. kd-trees for fast range search.
4. Higher dimensions: metric trees.
5. Five kinds of nearest-neighbor problems, and various
search structures.
6. In brief: Fast nearest-neighbor classification.
7. Fast k-means: an example of embedded subproblems.
8. FAQ.
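To make the kd-tree idea from items 2-3 concrete, here is a minimal
sketch (not part of the tutorial materials) of building a kd-tree and
answering a nearest-neighbor query; the pruning test in `nearest` is
what turns the brute-force scan into a fast search:

```python
# A minimal kd-tree nearest-neighbor search sketch. Points are tuples;
# the tree splits on coordinates cyclically. A subtree is skipped when
# the splitting plane lies farther away than the best point found so far.

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested (point, axis, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return the stored point closest to `query` (Euclidean distance)."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or dist2(point, query) < dist2(best, query):
        best = point
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Descend the far side only if the splitting plane could hide a closer point.
    if diff * diff < dist2(best, query):
        best = nearest(far, query, best)
    return best
```

On low-dimensional data this visits only a small fraction of the tree
per query; as the dimension grows the pruning test fires less often,
which is the motivation for the metric trees of item 4.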
III. BASIC DECISION AND SUMMATION PROBLEMS. (45 min.)
1. Kernel decision, summation, and sum decision problems in
statistics/learning.
2. Fast locally weighted regression.
3. Fast Bayes classification: stepping stone to EM.
4. Fast EM algorithm for mixtures of Gaussians.
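The kernel summations in items 2-4 all share the same bottleneck,
which a naive baseline makes visible. Below is a sketch (an
illustrative Nadaraya-Watson smoother, not the tutorial's code) of
locally weighted regression in one dimension; each prediction sums a
kernel over all N training points, and it is exactly this O(N)-per-query
sum that the tree-based methods accelerate:

```python
import math

def nw_predict(x_train, y_train, query, bandwidth=1.0):
    """Naive Nadaraya-Watson (locally weighted) regression at one query.
    Every prediction touches all N training points; tree-based kernel
    summation approximates this sum while visiting far fewer points."""
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2))
               for x in x_train]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, y_train)) / total
```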
IV. BASIC 'ALL'-TYPE PROBLEMS. (40 min.)
1. 'All'-type problems in statistics/learning.
2. Dual-tree idea.
3. Fast bichromatic all-nearest-neighbors.
4. Fast kernel density estimation.
5. The well-separated pair decomposition for Cuevas
clustering.
6. In brief: Gaussian process regression.
7. In brief: Multipole methods from computational physics.
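As a point of reference for the dual-tree idea of item 2, here is the
naive baseline for item 3 (a sketch, not the tutorial's
implementation). The O(NM) double loop below is what the dual-tree
approach beats: instead of comparing every query point against every
reference point, it compares a *node* of a query tree against a *node*
of a reference tree, pruning whole pairs of subtrees at once:

```python
def all_nearest_neighbors(queries, references):
    """Naive bichromatic all-nearest-neighbors: for each query point,
    find its nearest reference point by scanning all references.
    This O(N*M) loop is the baseline that dual-tree algorithms
    accelerate by pruning pairs of tree nodes rather than pairs of points."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(references, key=lambda r: d2(q, r)) for q in queries]
```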
V. BASIC COUNTING PROBLEMS. (30 min.)
1. Counting problems in statistical learning.
2. Frequent sets.
3. ADtrees.
4. Fast Bayes nets learning.
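To illustrate the kind of counting query behind items 2-4, here is a
small sketch (illustrative only, with hypothetical function names) of
frequent-itemset support counting. Structures such as AD-trees cache
counts like these so that repeated conjunctive queries, as arise in
Bayes-net structure search, avoid rescanning the data:

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count the support of every itemset up to `max_size` items and
    keep those occurring at least `min_support` times. This exhaustive
    count is the naive baseline that cached-count structures
    (frequent-set miners, AD-trees) are designed to speed up."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for subset in combinations(items, k):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_support}
```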
VI. FURTHER TOPICS. (15 min.)
1. In brief: Single-pass algorithms: BIRCH, Priebe's method.
2. In brief: Incremental and sampling algorithms.
3. In brief: Fast spatial scan statistics.
4. In brief: n-point correlation.
5. In brief: Fast kernel classification.
