Date: Tuesday, 12 Oct 1999
Time: 3:30 p.m. - 4:30 p.m.
Place: WeH 5409
Type: AI Seminar
Duration: 60 min.
Who: Andrew Moore, CMU Robotics Institute, CALD, and Schenley Park Research, Inc.
Topic: Inner-Loop Statistics in Automated Scientific Discovery
Host: T. S. Lee

Abstract:

Intensive statistical analysis of massive data sources ("data mining") has been embraced as one of the final areas with a need for massive computation beyond that available on a $2000 computer or $200 videogame. We begin this talk with two examples in which software, rather than hardware, gives 1000-fold speedups over traditional implementations of statistical algorithms for prediction, density estimation, and clustering. We then pause to examine the directions in which these software solutions, when faced with Physics, Biology, and commercial scientific data, seem blocked by the curse of dimensionality and by limitations on machine main memory. This is followed by four examples of new pieces of research that circumvent these barriers: Komarek's lazy cached sufficient statistics, Pelleg's exact accelerated k-means, multiresolution ball-trees for very high-dimensional real-valued data, and Gordon's filament identifier.

We then reveal the reason for our new-found respect for super-computation: when an algorithm you previously ran overnight executes in seconds, you find yourself wanting to run it ten thousand times. We show the impact that being able to run intensive statistics as an inner loop has had on our analysis of cosmology data (preliminary data from the Sloan Digital Sky Survey) and on biotoxin identification, where desirable but hopelessly extravagant operations such as model selection, bootstrapping, backfitting, randomization, and graphical model design now become somewhat non-hopeless.

* Joint work with Andy Connolly, Geoff Gordon, Paul Komarek, Bob Nichol, Dan Pelleg and Larry Wasserman
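
The "ten thousand times" point can be made concrete with the bootstrap, the simplest of the extravagant operations mentioned above: the same estimator is re-run on thousands of resampled copies of the data, so any speedup of the inner estimator multiplies directly into the whole analysis. Below is a minimal, generic sketch of a percentile bootstrap for a sample mean (illustrative only — not the speakers' code; the data values and resample count are made up):

```python
import random

def bootstrap_means(data, n_resamples=10000, seed=0):
    """Resample the data with replacement n_resamples times and
    return the mean of each resample -- the bootstrap inner loop."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return means

# Hypothetical measurements, purely for illustration.
data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.6]
means = sorted(bootstrap_means(data))

# Percentile 95% confidence interval for the mean.
lo = means[int(0.025 * len(means))]
hi = means[int(0.975 * len(means))]
print(f"sample mean {sum(data)/len(data):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

If the estimator inside the loop took a night to run, 10,000 resamples would be out of the question; once it runs in seconds, this kind of uncertainty estimate becomes routine.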