National Science Foundation, Award IIS-0916345
III: Small: Fast Subset Scan for Anomalous Pattern Detection
PI: Daniel B. Neill (neill @ cs.cmu.edu)
Funding duration: August 1, 2009 - July 31, 2013
Funding amount: $499,991
Project personnel:
Daniel B. Neill (Associate Professor of Information Systems, Heinz College, CMU) (PI)
Seth Flaxman (Ph.D. student, Joint Ph.D. in Machine Learning and Public Policy, Heinz College and School of Computer Science, CMU)
Edward McFowland III (Ph.D. student, Heinz College, CMU)
Kenton Murray (M.S. student, Language Technologies Institute, CMU)
Sriram Somanchi (Ph.D. student, Heinz College, CMU)
Skyler Speakman (Ph.D. student, Heinz College, CMU)
Donghan (Jarod) Wang (research programmer and system administrator, CMU)
Xin Wu (M.S. student, Very Large Information Systems, CMU)
Yating Zhang (MISM student, Heinz College, CMU)
Project alumni:
Michael Baysek (research programmer and system administrator, CMU)
Sayantan Das (M.S., Information Systems Management, CMU)
Tarun Kumar (M.S., Very Large Information Systems, CMU)
Yandong Liu (M.S., Language Technologies, CMU)
Rajas Lonkar (M.S., Information Systems Management, CMU)
Amrut Nagasunder (M.S., Very Large Information Systems, CMU)
Kan Shao (Ph.D., Engineering and Public Policy, and M.S., Machine Learning, CMU)
Huanian Zheng (M.S., Information Technology, CMU)
Project description:
This project focuses on new methods for fast and scalable detection of
anomalous patterns in massive, multivariate datasets. We focus on
real-world application domains where we must detect complex, subtle, and
probabilistic patterns that are difficult to spot with existing
techniques, such as an emerging disease outbreak or a pattern of smuggling
activity. Our work is based on two key insights. First, the pattern
detection problem can be framed as a search over all subsets of the data,
in which we define a measure of the "anomalousness" of a subset and
maximize this measure over all potentially relevant subsets. We have
incorporated this insight into a general "subset scan" framework for
pattern detection. Second, and more surprisingly, we have discovered
that, for many useful detection methods (including Kulldorff's spatial
scan statistic and many recently proposed variants), we can perform an
exact search which efficiently maximizes the measure of anomalousness over
all subsets of the data. We are exploring this new optimization method,
investigating how it can be extended to constrained subset scans and to
more general multivariate pattern detection problems, and examining how it
can be incorporated into our subset scan framework, enabling us to create
a variety of fast, scalable, and useful methods for anomalous pattern
detection.
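The exact search described above can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the project's own code: it assumes the expectation-based Poisson scan statistic as the measure of anomalousness, assumes each record i has an observed count c_i and an expected baseline b_i, and uses hypothetical function names (`score`, `fast_subset_scan`, `brute_force_scan`). The idea is to sort records by the priority c_i / b_i and score only the n "top-k" subsets, then compare the result with an exhaustive search over all 2^n - 1 non-empty subsets.

```python
import math
from itertools import combinations


def score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio for an aggregate
    count C and aggregate baseline B (assumed scoring function)."""
    if count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def fast_subset_scan(counts, baselines):
    """Score only the n subsets made of the k highest-priority records."""
    # Priority of record i is c_i / b_i (assumption tied to this statistic).
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i],
                   reverse=True)
    best_score, best_subset = 0.0, []
    c_total = b_total = 0.0
    for k, i in enumerate(order, start=1):
        c_total += counts[i]
        b_total += baselines[i]
        s = score(c_total, b_total)
        if s > best_score:
            best_score, best_subset = s, sorted(order[:k])
    return best_score, best_subset


def brute_force_scan(counts, baselines):
    """Exhaustive baseline: score every non-empty subset (exponential time)."""
    n = len(counts)
    best_score, best_subset = 0.0, []
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            s = score(sum(counts[i] for i in subset),
                      sum(baselines[i] for i in subset))
            if s > best_score:
                best_score, best_subset = s, list(subset)
    return best_score, best_subset


if __name__ == "__main__":
    # Made-up illustrative data, not drawn from any real dataset.
    counts = [12, 5, 30, 8, 9, 22]
    baselines = [10.0, 6.0, 15.0, 8.0, 10.0, 12.0]
    print(fast_subset_scan(counts, baselines))
    print(brute_force_scan(counts, baselines))
```

On this toy input both functions should report the same maximum, but the sorted scan evaluates only n subsets after an O(n log n) sort, while the exhaustive search grows exponentially with n.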
Detailed descriptions of our current research and educational
activities, together with results and findings, are
available here.
Publications:
Daniel B. Neill. Fast subset sums for multivariate Bayesian scan
statistics. Proceedings of the 2009 International Society for Disease
Surveillance Annual Conference, 2010. (pdf)
Skyler Speakman and Daniel B. Neill. Fast graph scan for scalable
detection of arbitrary connected clusters. Proceedings of the 2009
International Society for Disease Surveillance Annual Conference,
2010. (pdf)
Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan
statistic for early event detection and characterization. Machine
Learning 79: 261-282, 2010. (pdf)
Daniel Oliveira, Daniel B. Neill, James H. Garrett Jr., and Lucio
Soibelman. Detection of patterns in water distribution pipe breakage
using spatial scan statistics for point events in a physical network.
Journal of Computing in Civil Engineering 25(1): 21-30,
2011. (pdf)
Daniel B. Neill. Fast Bayesian scan statistics for multivariate event
detection and visualization. Statistics in Medicine 30(5):
455-469, 2011. (pdf)
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset
scan for multivariate spatial biosurveillance. Emerging Health Threats
Journal 4: s42, 2011. (pdf)
Daniel B. Neill and Yandong Liu. Generalized fast subset sums for
Bayesian detection and visualization. Emerging Health Threats
Journal 4: s43, 2011. (pdf)
Kan Shao, Yandong Liu, and Daniel B. Neill. A generalized fast subset
sums framework for Bayesian event detection. Proceedings of the 11th
IEEE International Conference on Data Mining, 617-625, 2011. (pdf)
Yandong Liu and Daniel B. Neill. Detecting previously unseen outbreaks
with novel symptom patterns. Emerging Health Threats Journal 4:
11074, 2011. (pdf)
Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from
unlabeled data for outbreak detection. Emerging Health Threats
Journal 4: 11017, 2011. (pdf)
Skyler Speakman, Edward McFowland III, Sriram Somanchi, and Daniel B.
Neill. Scalable detection of irregular disease clusters using
soft compactness constraints. Emerging Health Threats Journal 4:
11121, 2011. (pdf)
Daniel B. Neill. Fast subset scan for spatial pattern detection.
Journal of the Royal Statistical Society (Series B: Statistical
Methodology) 74(2): 337-360, 2012. (pdf)
Daniel B. Neill. New directions in artificial intelligence for public health
surveillance. IEEE Intelligent Systems 27(1): 56-59, 2012. (pdf)
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset
scan for multivariate event detection. Statistics in Medicine,
in press, 2012. Article published online: 22 NOV 2012, DOI:
10.1002/sim.5675. (link)
Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B.
Neill. Scalable detection of anomalous subgraphs. Book
chapter, Encyclopedia of Social Network Analysis and Mining,
in press, 2012.
Seth Flaxman and Daniel B. Neill. Detecting spatially localized subsets
of leading indicators for event prediction. Submitted for publication,
2012.
Tarun Kumar and Daniel B. Neill. Fast tensor scan for event detection and
characterization. Submitted for publication, 2012.
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast
generalized subset scan for anomalous pattern detection. Submitted for
publication, 2012.
Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from
unlabeled data for event detection. Submitted for publication,
2012.
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable
detection of anomalous patterns with connectivity constraints. Submitted
for publication, 2012.
Presentations:
Daniel B. Neill. Fast subset sums for multivariate Bayesian scan
statistics. International Society for Disease Surveillance Annual
Conference, Miami, FL, December 2009. (pdf)
Skyler Speakman and Daniel B. Neill. Fast graph scan for scalable
detection of arbitrary connected clusters. International Society for
Disease Surveillance Annual Conference, Miami, FL, December 2009. (pdf)
Daniel B. Neill. Fast subset scanning for multivariate event detection.
ENAR 2010 Annual Meeting, New Orleans, LA, March 2010. (pdf)
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast
generalized subset scan for anomalous pattern detection. Sixteenth
Conference for African American Researchers in the Mathematical Sciences,
Baltimore, MD, June 2010. (pdf)
Daniel B. Neill. Fast subset sums for scalable Bayesian detection and
visualization. Fifth International Workshop on Applied Probability,
Madrid, Spain, July 2010. (pdf)
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable
detection of anomalous patterns with connectivity constraints. INFORMS
Annual Conference, Austin, TX, November 2010. (pdf)
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast
generalized subset scan for anomalous pattern detection. INFORMS Annual
Conference, Austin, TX, November 2010. (pdf)
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset
scan for multivariate spatial biosurveillance. International Society for
Disease Surveillance Annual Conference, Park City, UT, December
2010. (pdf)
Daniel B. Neill and Yandong Liu. Generalized fast subset sums for
Bayesian detection and visualization. International Society for Disease
Surveillance Annual Conference, Park City, UT, December 2010. (pdf)
Daniel B. Neill. Research challenges for biosurveillance: the next ten
years (invited plenary). International Society for Disease Surveillance
Annual Conference, Park City, UT, December 2010. (pdf)
Daniel B. Neill. Spatial and subset scanning for multivariate health
surveillance. Data Fusion Research Meeting, Ottawa, ON, March
2011. (pdf)
Daniel B. Neill. Machine learning for population health and disease
surveillance. Advanced Analytics Workshop, Washington, DC, April
2011. (pdf)
Edward McFowland III and Daniel B. Neill. Fast generalized subset scan
for anomalous pattern detection in mixed data sets. 17th Conference for
African American Researchers in the Mathematical Sciences, Los Angeles,
CA, June 2011.
Daniel B. Neill. Fast multivariate subset scanning for scalable cluster
detection. Joint Statistical Meetings 2011, Miami, FL, August
2011. (pdf)
Edward McFowland III and Daniel B. Neill. Efficient methods for anomalous
pattern detection in general datasets. INFORMS Annual Conference,
Charlotte, NC, November 2011. (pdf)
Sriram Somanchi and Daniel B. Neill. Fast learning of graph structure from
unlabeled data for anomalous pattern detection. INFORMS Annual Conference,
Charlotte, NC, November 2011. (pdf)
Skyler Speakman and Daniel B. Neill. Dynamic pattern detection with connectivity and
temporal consistency constraints. INFORMS Annual Conference, Charlotte, NC, November
2011. (pdf)
Daniel B. Neill. Analytical methods for large scale surveillance of unstructured data.
International Conference on Digital Disease Detection, Boston, MA, February 2012.
(pdf)
Daniel B. Neill and Edward McFowland III. Fast generalized subset scan for anomalous
pattern detection. Sixth International Workshop on Applied Probability, Jerusalem,
Israel, June 2012.
Daniel B. Neill, Skyler Speakman, Edward McFowland III, and Sriram Somanchi. Efficient
subset scanning with soft constraints. Sixth International Workshop on Applied
Probability, Jerusalem, Israel, June 2012.
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of
anomalous patterns with connectivity constraints. 29th Quality and Productivity Research
Conference, Long Beach, CA, June 2012.
Daniel B. Neill and Seth Flaxman. Detecting spatially localized subsets of leading
indicators for event prediction. 32nd International Symposium on Forecasting, Boston, MA,
June 2012.
Broader Impacts: The Machine Learning and Policy (MLP) Initiative
Global policy problems ranging from disease pandemics to crime and
terrorism are critically important, and policy data continue to grow in
size and complexity. As a result, machine learning has become essential
for data-driven policy analysis and for the development of new, practical
information technologies that can be directly applied for the public
good. The numerous challenges facing our
world will require broad, successful innovations at the intersection of
machine learning and public policy. This endeavor will require widespread
collaboration between machine learning and policy researchers, increased
emphasis on the education of future researchers with in-depth knowledge of
both disciplines, and a broadly shared research focus on developing novel
machine learning methods which directly address critical policy
challenges. We are working to build a multi-pronged curricular program,
the Machine Learning and Policy (MLP) initiative. This program will
facilitate the widespread use of machine learning methods for the public
good by incorporating machine learning throughout the public policy
curriculum. Key components of this program include a new Joint
Ph.D. program in Machine Learning and Public Policy, an introductory
course in machine learning ("Large Scale Data Analysis for Policy") geared
toward public policy students, a Ph.D.-level research seminar in Machine
Learning and Policy, and a course series in "Special Topics in Machine
Learning and Policy", with courses including "Event and Pattern Detection"
(Spring 2010), "Machine Learning for the Developing World" (Spring 2011),
"Harnessing the Wisdom of Crowds" (Spring 2012), and "Crime
Hot-Spot Detection and Prediction" (anticipated Spring 2013).
Tutorials and Educational Material:
Daniel B. Neill. Lecture slides for the course, Large Scale Data Analysis
for Public Policy. Last taught Fall 2011. (link)
Daniel B. Neill. Machine learning and event detection for the public good.
Guest lecture, April 2011. (pdf)
Daniel B. Neill and Weng-Keen Wong. A tutorial on event
detection. Presented at the 15th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, 2009. (pdf)
Daniel B. Neill. Spatial scan tips and tricks for practical outbreak
detection. Invited webinar for the International Society for Disease
Surveillance, January 2011. (pdf)
Awards:
The project PI, Dr. Neill, was named one of "AI's 10 to Watch" by IEEE
Intelligent Systems, Jan/Feb 2011. (link)
Edward McFowland III was awarded an NSF Graduate Research Fellowship
(link) and an AT&T Labs Research
Fellowship, 2011. (link)
Edward McFowland III was the 2012 winner of the Suresh Konda Award, presented yearly to
Heinz College's best Second Heinz Research Paper.
This material is based upon work supported by the National Science
Foundation, grants IIS-0916345 (primary funding source), IIS-0911032, and
IIS-0953330. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation.
Contact the PI: Daniel Neill, neill (at) cs (dot) cmu (dot) edu
Last update: May 18, 2012