This half-day tutorial will describe the application of fundamental
computer science concepts to algorithmic problems arising in machine
learning and statistics. All machine learning and statistical methods
must be implemented on computers -- as such, their computational
properties have as much practical importance as their statistical
properties. In the last 10 years, powerful data structures for both
continuous and discrete data have provided orders-of-magnitude
computational speedups for a wide variety of statistical problems;
some of these results have been widely-publicized and some not. The
goal of the tutorial is to explain these data structures, which
originated in various fields (e.g. discrete algorithms, databases,
computational geometry), and show how they can be used in a number of
canonical calculations which occur ubiquitously as subroutines within
statistical and machine learning methods.
To clarify, this tutorial is not about ``new statistics specific to
large datasets'' but ``how to make your old favorite statistics
tractable on large datasets''. Our approach will be to focus in
detail on a few template problems and approaches to them, to ensure
that at least the essential concepts for designing fast algorithms for
the basic cases are conveyed. The tutorial will conclude by surveying
more advanced problems and approaches, with references.
I. INTRODUCTION. (5 min.)
1. Why fast algorithms for statistics?
II. BASIC PROXIMITY PROBLEMS. (45 min.)
1. Nearest-neighbor problems and range search problems in
2. kd-trees for fast nearest-neighbor search.
3. kd-trees for fast range search.
4. Higher dimensions: metric trees.
5. Five kinds of nearest-neighbor problems, and various
6. In brief: Fast nearest-neighbor classification.
7. Fast k-means: an example of embedded sub-problems.
III. BASIC DECISION AND SUMMATION PROBLEMS. (45 min.)
1. Kernel decision, summation, and sum decision problems in
2. Fast locally weighted regression.
3. Fast Bayes classification: stepping stone to EM.
4. Fast EM algorithm for mixtures of Gaussians.
V. BASIC `ALL'-TYPE PROBLEMS. (40 min.)
1. `All'-type problems in statistics/learning
2. Dual-tree idea.
3. Fast bichromatic all-nearest-neighbors.
4. Fast kernel density estimation.
5. The well-separated pair decomposition for Cuevas
6. In brief: Gaussian process regression.
7. In brief: Multipole methods from computational physics.
VI. BASIC COUNTING PROBLEMS. (30 min.)
1. Counting problems in statistical learning.
2. Frequent sets.
4. Fast Bayes nets learning.
VII. FURTHER TOPICS. (15 min.)
1. In brief: Single-pass algorithms: BIRCH, Priebe's method
2. In brief: Incremental and sampling algorithms.
3. In brief: Fast spatial scan statistics.
4. In brief: n-point correlation.
5. In brief: Fast kernel classification.