Dan Pelleg's Research Statement

Following recent advances in data acquisition and storage technologies, scientists find themselves in a new reality: data is abundant, and the limiting factors are now processor time and human time. I focus on probability density estimation for anomaly detection. Among the challenges are algorithmic and implementation scalability, fully automatic operation, comprehensibility of the generated models, and good detection performance. I present algorithms for these problems, implement them, and evaluate them in a variety of experiments, including case studies.

In practice, this means that I work with astrophysicists, who have large amounts of data and little computer or human time to process it. In particular, the database currently holds 200 million objects. However, practical constraints, largely the lack of human time, restrict active processing to a subset of 50 million. Towards the end of the decade, future projects are expected to generate between 10 and 15 petabytes of raw data. Data will stream in continuously, at the approximate rate of a full Sloan Survey every third night. While we expect the mechanical systems to handle data at this rate and size, there is no reason to believe there will be a matching increase in human attention.