National Science Foundation, Award IIS-0916345
III: Small: Fast Subset Scan for Anomalous Pattern Detection
PI: Daniel B. Neill (neill @ cs.cmu.edu)
Funding duration: August 1, 2009 - July 31, 2013
Funding amount: $499,991
NOTE: This project was completed in July 2013; we continued to update the lists of publications and presentations arising from this
project through the end of 2014.
Project personnel:
Daniel B. Neill (Dean's Career Development Professor and Associate
Professor of Information Systems, Heinz College, CMU)
Michael Baysek (research programmer and system administrator, CMU)
Feng Chen (Postdoctoral fellow, Heinz College, CMU)
Sayantan Das (M.S., Information Systems Management, CMU)
Seth Flaxman (Ph.D. student, Joint
Ph.D. in Machine Learning and Public Policy, Heinz College and School of
Computer Science, CMU)
Tarun Kumar (M.S., Very Large Information Systems, CMU)
Kai Liu (M.S., Very Large Information Systems, CMU)
Yandong Liu (M.S., Language
Technologies, CMU)
Rajas Lonkar (M.S., Information Systems Management, CMU)
Edward McFowland III
(Ph.D. student, Heinz College, CMU)
Kenton Murray (M.S., Language Technologies, CMU)
Amrut Nagasunder (M.S., Very Large Information Systems, CMU)
Kan Shao (Ph.D.,
Engineering and Public Policy, and M.S., Machine Learning, CMU)
Sriram Somanchi (Ph.D. student, Heinz College, CMU)
Skyler Speakman (Ph.D. student, Heinz College, CMU)
Donghan (Jarod) Wang (research programmer and system administrator,
CMU)
Xin Wu (M.S., Very Large Information Systems, CMU)
Yating Zhang (M.S., Information Systems Management, CMU)
Huanian Zheng (M.S., Information Technology, CMU)
Project description:
This project develops new methods for fast and scalable detection of
anomalous patterns in massive, multivariate datasets. We target
real-world application domains where we must detect complex, subtle, and
probabilistic patterns that are difficult to spot with existing
techniques, such as an emerging disease outbreak or a pattern of smuggling
activity. Our work is based on two key insights. First, the pattern
detection problem can be framed as a search over all subsets of the data,
in which we define a measure of the "anomalousness" of a subset and
maximize this measure over all potentially relevant subsets. We have
incorporated this insight into a general "subset scan" framework for
pattern detection. Second, and more surprisingly, we have discovered
that, for many useful detection methods (including Kulldorff's spatial
scan statistic and many recently proposed variants), we can perform an
exact search which efficiently maximizes the measure of anomalousness over
all subsets of the data. We are exploring this new optimization method,
investigating how it can be extended to constrained subset scans and to
more general multivariate pattern detection problems, and examining how it
can be incorporated into our subset scan framework, enabling us to create
a variety of fast, scalable, and useful methods for anomalous pattern
detection.
Detailed descriptions of our research and educational
activities and results are
available here.
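The two insights in the project description can be illustrated with a small sketch. The description does not specify a score function or search procedure, so the following is an assumption for illustration only: it uses the expectation-based Poisson scan statistic as the measure of anomalousness and a priority ordering by count-to-baseline ratio, which is the ordering property described in the fast subset scan literature. The function names (score, brute_force_scan, fast_scan) and the example data are hypothetical. The sketch compares an exhaustive search over all non-empty subsets with a search over only the N top-k subsets in priority order.

```python
from itertools import combinations
from math import log


def score(c, b):
    """Expectation-based Poisson log-likelihood ratio (assumed score function).

    c is the total observed count and b the total expected count (b > 0).
    The score is zero unless the count exceeds its expectation.
    """
    return c * log(c / b) + b - c if c > b else 0.0


def brute_force_scan(counts, baselines):
    """Evaluate every non-empty subset: 2^N - 1 score evaluations."""
    n = len(counts)
    best_score, best_subset = 0.0, ()
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            c = sum(counts[i] for i in subset)
            b = sum(baselines[i] for i in subset)
            s = score(c, b)
            if s > best_score:
                best_score, best_subset = s, subset
    return best_score, set(best_subset)


def fast_scan(counts, baselines):
    """Evaluate only the top-k records by priority: N score evaluations.

    Assumption: records are prioritized by the ratio count / baseline,
    and the best subset is one of the N prefixes of that ordering.
    """
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i],
                   reverse=True)
    best_score, best_subset = 0.0, set()
    c = b = 0.0
    for k, i in enumerate(order, start=1):
        c += counts[i]
        b += baselines[i]
        s = score(c, b)
        if s > best_score:
            best_score, best_subset = s, set(order[:k])
    return best_score, best_subset


# Hypothetical data: observed counts and expected baselines for six locations.
counts = [12, 3, 9, 5, 20, 7]
baselines = [8.0, 4.0, 9.5, 5.0, 11.0, 7.5]

if __name__ == "__main__":
    print("exhaustive:", brute_force_scan(counts, baselines))
    print("top-k only:", fast_scan(counts, baselines))
```

On this small example both searches should select the same subset, but the exhaustive search grows exponentially with the number of records while the prioritized search needs only a sort and a single pass. The equivalence holds only for score functions that satisfy the ordering property; it should not be assumed for an arbitrary measure of anomalousness.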
Publications:
Daniel B. Neill. Fast subset sums for multivariate Bayesian scan
statistics. Proceedings of the 2009 International Society for Disease
Surveillance Annual Conference, 2010. (pdf)
Skyler Speakman and Daniel B. Neill. Fast graph scan for scalable
detection of arbitrary connected clusters. Proceedings of the 2009
International Society for Disease Surveillance Annual Conference,
2010. (pdf)
Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan
statistic for early event detection and characterization. Machine
Learning 79: 261-282, 2010. (pdf)
Daniel Oliveira, Daniel B. Neill, James H. Garrett Jr., and Lucio
Soibelman. Detection of patterns in water distribution pipe breakage
using spatial scan statistics for point events in a physical network.
Journal of Computing in Civil Engineering 25(1): 21-30,
2011. (pdf)
Daniel B. Neill. Fast Bayesian scan statistics for multivariate event
detection and visualization. Statistics in Medicine 30(5):
455-469, 2011. (pdf)
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset
scan for multivariate spatial biosurveillance. Emerging Health Threats
Journal 4: s42, 2011. (pdf)
Daniel B. Neill and Yandong Liu. Generalized fast subset sums for
Bayesian detection and visualization. Emerging Health Threats
Journal 4: s43, 2011. (pdf)
Kan Shao, Yandong Liu, and Daniel B. Neill. A generalized fast subset
sums framework for Bayesian event detection. Proceedings of the 11th
IEEE International Conference on Data Mining, 617-625, 2011. (pdf)
Yandong Liu and Daniel B. Neill. Detecting previously unseen outbreaks
with novel symptom patterns. Emerging Health Threats Journal 4:
11074, 2011. (pdf)
Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from
unlabeled data for outbreak detection. Emerging Health Threats
Journal 4: 11017, 2011. (pdf)
Skyler Speakman, Edward McFowland III, Sriram Somanchi, and Daniel B.
Neill. Scalable detection of irregular disease clusters using
soft compactness constraints. Emerging Health Threats Journal 4:
11121, 2011. (pdf)
Daniel B. Neill. Fast subset scan for spatial pattern detection.
Journal of the Royal Statistical Society (Series B: Statistical
Methodology) 74(2): 337-360, 2012. (pdf)
Daniel B. Neill. New directions in artificial intelligence for public health
surveillance. IEEE Intelligent Systems 27(1): 56-59, 2012. (pdf)
Skyler Speakman, Yating Zhang, and Daniel B. Neill. Tracking dynamic water-borne outbreaks
with temporal consistency constraints. Online Journal of Public Health Informatics 5(1),
2013. (pdf)
Daniel B. Neill and Tarun Kumar. Fast multidimensional subset scan for outbreak detection
and characterization. Online Journal of Public Health Informatics 5(1), 2013.
(pdf)
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset
scan for multivariate event detection. Statistics in Medicine
32: 2185-2208, 2013. (pdf)
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast
generalized subset scan for anomalous pattern detection. Journal of Machine
Learning Research, 14: 1533-1561, 2013. (pdf)
Daniel B. Neill. Using artificial intelligence to improve hospital inpatient care.
IEEE Intelligent Systems 28(2): 92-95, 2013. (pdf)
Skyler Speakman, Yating Zhang, and Daniel B. Neill. Dynamic pattern detection with temporal consistency and
connectivity constraints. Proc. 13th IEEE International Conference on Data Mining, 697-706, 2013. (pdf)
Sriram Somanchi and Daniel B. Neill. Discovering anomalous patterns in large digital pathology images. Proc.
8th INFORMS Workshop on Data Mining and Health Informatics, 2013. (pdf)
Feng Chen and Daniel B. Neill. Non-parametric scan statistics for
disease outbreak detection on Twitter. Online Journal of Public
Health Informatics 6(1): e155, 2014. (pdf)
Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Disease
surveillance, case study. In R. Alhajj and J. Rokne, eds., Encyclopedia of Social Network
Analysis and Mining, pp. 380-385. Springer, 2014. (pdf)
Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and
forecasting in heterogeneous social media graphs. Proceedings of the 20th ACM SIGKDD
Conference on Knowledge Discovery and Data Mining, 1166-1175, 2014. (pdf)
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of anomalous
patterns with connectivity constraints. Journal of Computational and Graphical
Statistics, 2014, in press. (accepted author
version)
Working papers (status as of December 2014):
Sriram Somanchi, David Choi, and Daniel B. Neill. StarScan: a novel
scan statistic for irregularly-shaped spatial clusters. Accepted to
2014 International Society for Disease Surveillance Annual Conference.
Mallory Nobles, Lana Deyneka, Amy Ising, and Daniel B. Neill.
Identifying emerging novel outbreaks in textual emergency department
data. Accepted to 2014 International Society for Disease Surveillance
Annual Conference.
Daniel B. Neill. Bayesian scan statistics. Book chapter submitted for
publication.
Seth Flaxman and Daniel B. Neill. Detecting spatially localized subsets
of leading indicators for event prediction. Submitted for publication.
Tarun Kumar and Daniel B. Neill. Fast tensor scan for event detection and
characterization. Submitted for publication.
Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from
unlabeled data for event detection. Submitted for publication.
Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Penalized
fast subset scanning. Submitted for publication.
Kenton Murray, Chris Dyer, Yandong Liu, and Daniel B. Neill. A semantic scan statistic for novel disease outbreak
detection. Submitted for publication.
Seth Flaxman, Daniel B. Neill, and Alexander J. Smola. Gaussian
processes for independence tests with non-iid data in causal inference.
Submitted for publication.
Presentations:
Daniel B. Neill. Fast subset sums for multivariate Bayesian scan
statistics. International Society for Disease Surveillance Annual
Conference, Miami, FL, December 2009. (pdf)
Skyler Speakman and Daniel B. Neill. Fast graph scan for scalable
detection of arbitrary connected clusters. International Society for
Disease Surveillance Annual Conference, Miami, FL, December 2009. (pdf)
Daniel B. Neill. Fast subset scanning for multivariate event detection.
ENAR 2010 Annual Meeting, New Orleans, LA, March 2010. (pdf)
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast
generalized subset scan for anomalous pattern detection. Sixteenth
Conference for African American Researchers in the Mathematical Sciences,
Baltimore, MD, June 2010. (pdf)
Daniel B. Neill. Fast subset sums for scalable Bayesian detection and
visualization. Fifth International Workshop on Applied Probability,
Madrid, Spain, July 2010. (pdf)
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable
detection of anomalous patterns with connectivity constraints. INFORMS
Annual Conference, Austin, TX, November 2010. (pdf)
Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast
generalized subset scan for anomalous pattern detection. INFORMS Annual
Conference, Austin, TX, November 2010. (pdf)
Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset
scan for multivariate spatial biosurveillance. International Society for
Disease Surveillance Annual Conference, Park City, UT, December
2010. (pdf)
Daniel B. Neill and Yandong Liu. Generalized fast subset sums for
Bayesian detection and visualization. International Society for Disease
Surveillance Annual Conference, Park City, UT, December 2010. (pdf)
Daniel B. Neill. Research challenges for biosurveillance: the next ten
years (invited plenary). International Society for Disease Surveillance
Annual Conference, Park City, UT, December 2010. (pdf)
Daniel B. Neill. Spatial and subset scanning for multivariate health
surveillance. Data Fusion Research Meeting, Ottawa, ON, March
2011. (pdf)
Daniel B. Neill. Machine learning for population health and disease
surveillance. Advanced Analytics Workshop, Washington, DC, April
2011. (pdf)
Edward McFowland III and Daniel B. Neill. Fast generalized subset scan
for anomalous pattern detection in mixed data sets. 17th Conference for
African American Researchers in the Mathematical Sciences, Los Angeles,
CA, June 2011.
Daniel B. Neill. Fast multivariate subset scanning for scalable cluster
detection. Joint Statistical Meetings 2011, Miami, FL, August
2011. (pdf)
Edward McFowland III and Daniel B. Neill. Efficient methods for anomalous
pattern detection in general datasets. INFORMS Annual Conference,
Charlotte, NC, November 2011. (pdf)
Sriram Somanchi and Daniel B. Neill. Fast learning of graph structure from
unlabeled data for anomalous pattern detection. INFORMS Annual Conference,
Charlotte, NC, November 2011. (pdf)
Skyler Speakman and Daniel B. Neill. Dynamic pattern detection with connectivity and
temporal consistency constraints. INFORMS Annual Conference, Charlotte, NC, November
2011. (pdf)
Daniel B. Neill. Analytical methods for large scale surveillance of unstructured data.
International Conference on Digital Disease Detection, Boston, MA, February 2012.
(pdf)
Daniel B. Neill and Edward McFowland III. Fast generalized subset scan for anomalous
pattern detection. Sixth International Workshop on Applied Probability, Jerusalem, Israel,
June 2012. (pdf)
Daniel B. Neill, Skyler Speakman, Edward McFowland III, and Sriram Somanchi. Efficient
subset scanning with soft constraints. Sixth International Workshop on Applied
Probability, Jerusalem, Israel, June 2012. (pdf)
Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of
anomalous patterns with connectivity constraints. 29th Quality and Productivity Research
Conference, Long Beach, CA, June 2012. (pdf)
Daniel B. Neill and Seth Flaxman. Detecting spatially localized subsets of leading
indicators for event prediction. 32nd International Symposium on Forecasting, Boston, MA,
June 2012. (pdf)
Daniel B. Neill. Predicting and preventing emerging outbreaks of crime. CMU Workshop on
Machine Learning and Social Sciences, Pittsburgh, PA, October 2012. (pdf)
Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from unlabeled data for
event detection. INFORMS Annual Conference, Phoenix, AZ, October 2012.
Skyler Speakman, Yating Zhang, and Daniel B. Neill. Tracking dynamic water-borne outbreaks
with temporal consistency constraints. International Society for Disease Surveillance
Annual Conference, San Diego, CA, December 2012. (pdf)
Daniel B. Neill and Tarun Kumar. Fast multidimensional subset scan for outbreak detection
and characterization. International Society for Disease Surveillance Annual Conference, San
Diego, CA, December 2012. (pdf)
Daniel B. Neill. Fast subset scanning for scalable event and pattern detection. Stony
Brook University, Stony Brook, NY, May 2013. (pdf)
Seth Flaxman and Daniel B. Neill. New tests for space-time interaction
in spatio-temporal point processes. 2nd Spatial Statistics Conference,
Columbus, OH, June 2013. (pdf)
Daniel B. Neill. Machine learning and event detection for the public
good. Data Science for the Social Good Summer Fellowship Program,
Chicago, IL, July 2013. (pdf)
Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event
detection and forecasting in heterogeneous social media graphs. INFORMS
Annual Meeting, Minneapolis, MN, October 2013. (pdf)
Feng Chen and Daniel B. Neill. Non-parametric scan statistics for
disease outbreak detection on Twitter. International Society for
Disease Surveillance Annual Conference, New Orleans, LA, December
2013. (pdf)
Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Penalized
fast subset scanning. 6th International Conference on Computational and Methodological
Statistics, London, UK, December 2013. (pdf)
Daniel B. Neill. Scaling up event and pattern detection to big data. MIT Workshop on Challenges in Big Data for Data Mining, Machine
Learning and Statistics, Cambridge, MA, March 2014. (pdf)
Daniel B. Neill. Scaling up event and pattern detection to big data. NYU Stern School of Business, Information Systems Seminar, New
York, NY, April 2014. (pdf)
Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and
forecasting in heterogeneous social media graphs. Seventh International Workshop on Applied
Probability, Antalya, Turkey, June 2014. (pdf)
Sriram Somanchi and Daniel B. Neill. A star-shaped scan statistic for detecting
irregularly-shaped spatial clusters. Seventh International Workshop on Applied Probability,
Antalya, Turkey, June 2014. (pdf)
Edward McFowland III and Daniel B. Neill. Discovering novel anomalous patterns in general data.
Statistical Learning and Data Mining Meeting on Data Mining in Business and Industry, Durham, NC,
June 2014. (pdf)
Seth Flaxman, Alex Smola, and Daniel B. Neill. Kernel space-time interaction tests for
identifying leading indicators of crime. Joint Statistical Meetings, Boston, MA, August 2014.
(pdf)
Mallory Nobles, Seth Flaxman, and Daniel B. Neill. Urban predictive analytics. INFORMS Annual Meeting, San Francisco, CA, November 2014.
(pdf)
Sriram Somanchi, David Choi, and Daniel B. Neill. StarScan: a novel scan
statistic for irregularly-shaped spatial clusters. International Society
for Disease Surveillance Annual Conference, Philadelphia, PA, December
2014. (pdf)
Mallory Nobles, Lana Deyneka, Amy Ising, and Daniel B. Neill.
Identifying emerging novel outbreaks in textual emergency department
data. International Society for Disease Surveillance Annual Conference,
Philadelphia, PA, December 2014. (pdf)
Broader Impacts: The Machine Learning and Policy (MLP) Initiative
With the critical importance of addressing global policy problems
ranging from disease pandemics to crime and terrorism, and the
continuously increasing size and complexity of policy data, the use of
machine learning has become increasingly essential for data-driven
policy analysis and for development of new, practical information
technologies that can be directly applied for the public good. The
numerous challenges facing our world will require broad, successful
innovations at the intersection of machine learning and public policy.
This endeavor will require widespread collaboration between machine
learning and policy researchers, increased emphasis on the education of
future researchers with in-depth knowledge of both disciplines, and a
broadly shared research focus on developing novel machine learning
methods which directly address critical policy challenges. We are
working to build a multi-pronged curricular program, the Machine
Learning and Policy (MLP) initiative. This program will facilitate the
widespread use of machine learning methods for the public good by
incorporating machine learning throughout the public policy curriculum.
Key components of this program include a new Joint
Ph.D. program in Machine Learning and Public Policy, an introductory
course in machine learning ("Large Scale Data Analysis for Public Policy")
geared toward public policy students, a Ph.D.-level research seminar in
Machine Learning and Policy, and a course series in "Special Topics in
Machine Learning and Policy", with courses including "Event and Pattern
Detection" (Spring 2010, Spring 2014), "Machine Learning for the Developing World"
(Spring 2011), "Harnessing the Wisdom of Crowds" (Spring 2012), and
"Mining Massive Datasets" (Spring 2013). Project PI Daniel Neill was
also involved in the creation of a CMU workshop and seminar series in
"Machine Learning and Social Sciences" and in creating a new "Policy
Analytics" track for Heinz College's MS in Public Policy and Management
program.
Tutorials and Educational Material:
Daniel B. Neill. Lecture slides for the course, Large Scale Data Analysis
for Public Policy. Last taught Fall 2014. (link)
Daniel B. Neill. Machine learning and event detection for the public good.
Guest lecture, April 2011.
(pdf)
Daniel B. Neill and Weng-Keen Wong. A tutorial on event
detection. Presented at the 15th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, 2009. (pdf)
Daniel B. Neill. Spatial scan tips and tricks for practical outbreak
detection. Invited webinar for the International Society for Disease
Surveillance, January 2011. (pdf)
Awards:
The Project PI, Dr. Neill, was named one of "AI's 10 to Watch" by IEEE
Intelligent Systems, Jan/Feb 2011. (link)
Edward McFowland III was awarded an NSF Graduate Research Fellowship
(link) and an AT&T Labs Research
Fellowship, 2011. (link)
Edward McFowland III was the 2012 winner of the Suresh Konda Award, presented yearly for
Heinz College's best First Heinz Research Paper.
Seth Flaxman was the 2013 winner of the Suresh Konda Award,
presented yearly for Heinz College's best First Heinz Research
Paper.
Sriram Somanchi was the 2013 winner of the George Duncan Award,
presented yearly for Heinz College's best Second Heinz Research
Paper.
This material is based upon work supported by the National Science
Foundation, grants IIS-0916345 (primary funding source), IIS-0911032, and
IIS-0953330. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation.
Contact the PI: Daniel Neill, neill (at) cs (dot) cmu (dot) edu
Final update: March 10, 2015.