Date: Tuesday, 12 Oct 1999
Time: 3:30 p.m. - 4:30 p.m.
Place: WeH 5409
Type: AI Seminar
Duration: 60 min.
Who: Andrew Moore, CMU Robotics Institute, CALD, and Schenley Park Research, Inc.
Topic: Inner-Loop Statistics in Automated Scientific Discovery
Host: T. S. Lee

Abstract:

Intensive statistical analysis of massive data sources ("data mining") has been embraced as one of the final areas with a need for massive computation beyond that available on a $2000 computer or $200 videogame. We begin this talk with two examples in which software, rather than hardware, gives 1000-fold speedups over traditional implementations of statistical algorithms for prediction, density estimation, and clustering. We then pause to examine the directions in which these software solutions, when faced with Physics, Biology, and commercial scientific data, seem blocked by the curse of dimensionality and by limitations on machine main memory. This is followed by four examples of new pieces of research that circumvent these barriers: Komarek's lazy cached sufficient statistics, Pelleg's exact accelerated k-means, multiresolution ball-trees for very high-dimensional real-valued data, and Gordon's filament identifier.

We then reveal the reason for our new-found respect for super-computation: when an algorithm you previously ran overnight executes in seconds, you find yourself wanting to run it ten thousand times. We show the impact that being able to run intensive statistics as an inner loop has had on our analysis of cosmology data (preliminary data from the Sloan Digital Sky Survey) and on biotoxin identification, where desirable but hopelessly extravagant operations such as model selection, bootstrapping, backfitting, randomization, and graphical model design now become somewhat non-hopeless.

* Joint work with Andy Connolly, Geoff Gordon, Paul Komarek, Bob Nichol, Dan Pelleg and Larry Wasserman
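
The "ten thousand times" point can be made concrete with the bootstrap, the simplest of the extravagant operations mentioned above: the same estimator is re-run on thousands of resampled copies of the data, so any speedup of the inner estimator multiplies directly into the whole analysis. Below is a minimal, generic sketch of a percentile bootstrap for a sample mean (illustrative only — not the speakers' code; the data values and resample count are made up):

```python
import random

def bootstrap_means(data, n_resamples=10000, seed=0):
    """Resample the data with replacement n_resamples times and
    return the mean of each resample -- the bootstrap inner loop."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return means

# Hypothetical measurements, purely for illustration.
data = [2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.6]
means = sorted(bootstrap_means(data))

# Percentile 95% confidence interval for the mean.
lo = means[int(0.025 * len(means))]
hi = means[int(0.975 * len(means))]
print(f"sample mean {sum(data)/len(data):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

If the estimator inside the loop took a night to run, 10,000 resamples would be out of the question; once it runs in seconds, this kind of uncertainty estimate becomes routine.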