National Science Foundation, Award IIS-0916345
III: Small: Fast Subset Scan for Anomalous Pattern Detection
PI: Daniel B. Neill (neill @ cs.cmu.edu)

Funding duration: August 1, 2009 - July 31, 2013
Funding amount: $499,991

NOTE: This project has been completed as of July 2013, but we continued updating the lists of publications and presentations arising from this project through the end of 2014.

Project personnel:

Daniel B. Neill (Dean's Career Development Professor and Associate Professor of Information Systems, Heinz College, CMU)
Michael Baysek (research programmer and system administrator, CMU)
Feng Chen (Postdoctoral fellow, Heinz College, CMU)
Sayantan Das (M.S., Information Systems Management, CMU)
Seth Flaxman (Ph.D. student, Joint Ph.D. in Machine Learning and Public Policy, Heinz College and School of Computer Science, CMU)
Tarun Kumar (M.S., Very Large Information Systems, CMU)
Kai Liu (M.S., Very Large Information Systems, CMU)
Yandong Liu (M.S., Language Technologies, CMU)
Rajas Lonkar (M.S., Information Systems Management, CMU)
Edward McFowland III (Ph.D. student, Heinz College, CMU)
Kenton Murray (M.S., Language Technologies, CMU)
Amrut Nagasunder (M.S., Very Large Information Systems, CMU)
Kan Shao (Ph.D., Engineering and Public Policy, and M.S., Machine Learning, CMU)
Sriram Somanchi (Ph.D. student, Heinz College, CMU)
Skyler Speakman (Ph.D. student, Heinz College, CMU)
Donghan (Jarod) Wang (research programmer and system administrator, CMU)
Xin Wu (M.S., Very Large Information Systems, CMU)
Yating Zhang (M.S., Information Systems Management, CMU)
Huanian Zheng (M.S., Information Technology, CMU)

Project description:

This project focuses on new methods for fast and scalable detection of anomalous patterns in massive, multivariate datasets. We focus on real-world application domains where we must detect complex, subtle, and probabilistic patterns that are difficult to spot with existing techniques, such as an emerging disease outbreak or a pattern of smuggling activity. Our work is based on two key insights. First, the pattern detection problem can be framed as a search over all subsets of the data, in which we define a measure of the "anomalousness" of a subset and maximize this measure over all potentially relevant subsets. We have incorporated this insight into a general "subset scan" framework for pattern detection. Second, and more surprisingly, we have discovered that, for many useful detection methods (including Kulldorff's spatial scan statistic and many recently proposed variants), we can perform an exact search which efficiently maximizes the measure of anomalousness over all subsets of the data. We are exploring this new optimization method, investigating how it can be extended to constrained subset scans and to more general multivariate pattern detection problems, and examining how it can be incorporated into our subset scan framework, enabling us to create a variety of fast, scalable, and useful methods for anomalous pattern detection.
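
The second insight can be illustrated with a small sketch. This page does not specify a score function or a search procedure, so the example makes two assumptions: it uses the expectation-based Poisson log-likelihood ratio as the measure of anomalousness, and it relies on the ordering property from the fast subset scan literature, under which the best subset is always one of the "top-k" sets when records are sorted by the ratio of count to baseline. The function names and data are hypothetical. For a small input, the sketch checks the sorted search against an exhaustive search over all subsets.

```python
import math
from itertools import combinations


def ebp_score(count, baseline):
    """Expectation-based Poisson log-likelihood ratio (assumed score function).

    Returns 0 when the aggregate count does not exceed the aggregate baseline.
    """
    if baseline <= 0 or count <= baseline:
        return 0.0
    return count * math.log(count / baseline) + baseline - count


def brute_force_scan(counts, baselines):
    """Score every non-empty subset: exponential time, for checking only."""
    n = len(counts)
    best_score, best_subset = 0.0, set()
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            c = sum(counts[i] for i in subset)
            b = sum(baselines[i] for i in subset)
            score = ebp_score(c, b)
            if score > best_score:
                best_score, best_subset = score, set(subset)
    return best_score, best_subset


def fast_subset_scan(counts, baselines):
    """Score only the n 'top-k' subsets after sorting by count/baseline.

    Assumes all baselines are positive and that the score function
    satisfies the ordering property described in the lead-in.
    """
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i],
                   reverse=True)
    best_score, best_subset = 0.0, set()
    c_total = b_total = 0.0
    for k, i in enumerate(order, start=1):
        c_total += counts[i]
        b_total += baselines[i]
        score = ebp_score(c_total, b_total)
        if score > best_score:
            best_score, best_subset = score, set(order[:k])
    return best_score, best_subset


# Hypothetical observed counts and expected baselines for six locations.
counts = [12, 5, 30, 8, 20, 3]
baselines = [10, 6, 15, 8, 12, 4]

fast = fast_subset_scan(counts, baselines)
slow = brute_force_scan(counts, baselines)
print("fast:", fast)
print("brute force:", slow)
```

The exhaustive search scores 2^n - 1 subsets, while the sorted search scores only n of them after an O(n log n) sort, which is what makes the exact search practical for large datasets under these assumptions.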

Detailed descriptions of our research and educational activities and results are available here.



Publications:

Daniel B. Neill. Fast subset sums for multivariate Bayesian scan statistics. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference, 2010. (pdf)

Skyler Speakman and Daniel B. Neill. Fast graph scan for scalable detection of arbitrary connected clusters. Proceedings of the 2009 International Society for Disease Surveillance Annual Conference, 2010. (pdf)

Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization. Machine Learning 79: 261-282, 2010. (pdf)

Daniel Oliveira, Daniel B. Neill, James H. Garrett Jr., and Lucio Soibelman. Detection of patterns in water distribution pipe breakage using spatial scan statistics for point events in a physical network. Journal of Computing in Civil Engineering 25(1): 21-30, 2011. (pdf)

Daniel B. Neill. Fast Bayesian scan statistics for multivariate event detection and visualization. Statistics in Medicine 30(5): 455-469, 2011. (pdf)

Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset scan for multivariate spatial biosurveillance. Emerging Health Threats Journal 4: s42, 2011. (pdf)

Daniel B. Neill and Yandong Liu. Generalized fast subset sums for Bayesian detection and visualization. Emerging Health Threats Journal 4: s43, 2011. (pdf)

Kan Shao, Yandong Liu, and Daniel B. Neill. A generalized fast subset sums framework for Bayesian event detection. Proceedings of the 11th IEEE International Conference on Data Mining, 617-625, 2011. (pdf)

Yandong Liu and Daniel B. Neill. Detecting previously unseen outbreaks with novel symptom patterns. Emerging Health Threats Journal 4: 11074, 2011. (pdf)

Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from unlabeled data for outbreak detection. Emerging Health Threats Journal 4: 11017, 2011. (pdf)

Skyler Speakman, Edward McFowland III, Sriram Somanchi, and Daniel B. Neill. Scalable detection of irregular disease clusters using soft compactness constraints. Emerging Health Threats Journal 4: 11121, 2011. (pdf)

Daniel B. Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society (Series B: Statistical Methodology) 74(2): 337-360, 2012. (pdf)

Daniel B. Neill. New directions in artificial intelligence for public health surveillance. IEEE Intelligent Systems 27(1): 56-59, 2012. (pdf)

Skyler Speakman, Yating Zhang, and Daniel B. Neill. Tracking dynamic water-borne outbreaks with temporal consistency constraints. Online Journal of Public Health Informatics 5(1), 2013. (pdf)

Daniel B. Neill and Tarun Kumar. Fast multidimensional subset scan for outbreak detection and characterization. Online Journal of Public Health Informatics 5(1), 2013. (pdf)

Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset scan for multivariate event detection. Statistics in Medicine 32: 2185-2208, 2013. (pdf)

Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection. Journal of Machine Learning Research 14: 1533-1561, 2013. (pdf)

Daniel B. Neill. Using artificial intelligence to improve hospital inpatient care. IEEE Intelligent Systems 28(2): 92-95, 2013. (pdf)

Skyler Speakman, Yating Zhang, and Daniel B. Neill. Dynamic pattern detection with temporal consistency and connectivity constraints. Proceedings of the 13th IEEE International Conference on Data Mining, 697-706, 2013. (pdf)

Sriram Somanchi and Daniel B. Neill. Discovering anomalous patterns in large digital pathology images. Proceedings of the 8th INFORMS Workshop on Data Mining and Health Informatics, 2013. (pdf)

Feng Chen and Daniel B. Neill. Non-parametric scan statistics for disease outbreak detection on Twitter. Online Journal of Public Health Informatics 6(1): e155, 2014. (pdf)

Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Disease surveillance, case study. In R. Alhajj and J. Rokne, eds., Encyclopedia of Social Network Analysis and Mining, pp. 380-385. Springer, 2014. (pdf)

Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. Proceedings of the 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1166-1175, 2014. (pdf)

Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of anomalous patterns with connectivity constraints. Journal of Computational and Graphical Statistics, 2014, in press. (accepted author version)



Working papers (status as of December 2014):

Sriram Somanchi, David Choi, and Daniel B. Neill. StarScan: a novel scan statistic for irregularly-shaped spatial clusters. Accepted to 2014 International Society for Disease Surveillance Annual Conference.

Mallory Nobles, Lana Deyneka, Amy Ising, and Daniel B. Neill. Identifying emerging novel outbreaks in textual emergency department data. Accepted to 2014 International Society for Disease Surveillance Annual Conference.

Daniel B. Neill. Bayesian scan statistics. Book chapter submitted for publication.

Seth Flaxman and Daniel B. Neill. Detecting spatially localized subsets of leading indicators for event prediction. Submitted for publication.

Tarun Kumar and Daniel B. Neill. Fast tensor scan for event detection and characterization. Submitted for publication.

Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from unlabeled data for event detection. Submitted for publication.

Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Penalized fast subset scanning. Submitted for publication.

Kenton Murray, Chris Dyer, Yandong Liu, and Daniel B. Neill. A semantic scan statistic for novel disease outbreak detection. Submitted for publication.

Seth Flaxman, Daniel B. Neill, and Alexander J. Smola. Gaussian processes for independence tests with non-iid data in causal inference. Submitted for publication.



Presentations:

Daniel B. Neill. Fast subset sums for multivariate Bayesian scan statistics. International Society for Disease Surveillance Annual Conference, Miami, FL, December 2009. (pdf)

Skyler Speakman and Daniel B. Neill. Fast graph scan for scalable detection of arbitrary connected clusters. International Society for Disease Surveillance Annual Conference, Miami, FL, December 2009. (pdf)

Daniel B. Neill. Fast subset scanning for multivariate event detection. ENAR 2010 Annual Meeting, New Orleans, LA, March 2010. (pdf)

Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection. Sixteenth Conference for African American Researchers in the Mathematical Sciences, Baltimore, MD, June 2010. (pdf)

Daniel B. Neill. Fast subset sums for scalable Bayesian detection and visualization. Fifth International Workshop on Applied Probability, Madrid, Spain, July 2010. (pdf)

Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of anomalous patterns with connectivity constraints. INFORMS Annual Conference, Austin, TX, November 2010. (pdf)

Edward McFowland III, Skyler Speakman, and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection. INFORMS Annual Conference, Austin, TX, November 2010. (pdf)

Daniel B. Neill, Edward McFowland III, and Huanian Zheng. Fast subset scan for multivariate spatial biosurveillance. International Society for Disease Surveillance Annual Conference, Park City, UT, December 2010. (pdf)

Daniel B. Neill and Yandong Liu. Generalized fast subset sums for Bayesian detection and visualization. International Society for Disease Surveillance Annual Conference, Park City, UT, December 2010. (pdf)

Daniel B. Neill. Research challenges for biosurveillance: the next ten years (invited plenary). International Society for Disease Surveillance Annual Conference, Park City, UT, December 2010. (pdf)

Daniel B. Neill. Spatial and subset scanning for multivariate health surveillance. Data Fusion Research Meeting, Ottawa, ON, March 2011. (pdf)

Daniel B. Neill. Machine learning for population health and disease surveillance. Advanced Analytics Workshop, Washington, DC, April 2011. (pdf)

Edward McFowland III and Daniel B. Neill. Fast generalized subset scan for anomalous pattern detection in mixed data sets. 17th Conference for African American Researchers in the Mathematical Sciences, Los Angeles, CA, June 2011.

Daniel B. Neill. Fast multivariate subset scanning for scalable cluster detection. Joint Statistical Meetings 2011, Miami, FL, August 2011. (pdf)

Edward McFowland III and Daniel B. Neill. Efficient methods for anomalous pattern detection in general datasets. INFORMS Annual Conference, Charlotte, NC, November 2011. (pdf)

Sriram Somanchi and Daniel B. Neill. Fast learning of graph structure from unlabeled data for anomalous pattern detection. INFORMS Annual Conference, Charlotte, NC, November 2011. (pdf)

Skyler Speakman and Daniel B. Neill. Dynamic pattern detection with connectivity and temporal consistency constraints. INFORMS Annual Conference, Charlotte, NC, November 2011. (pdf)

Daniel B. Neill. Analytical methods for large scale surveillance of unstructured data. International Conference on Digital Disease Detection, Boston, MA, February 2012. (pdf)

Daniel B. Neill and Edward McFowland III. Fast generalized subset scan for anomalous pattern detection. Sixth International Workshop on Applied Probability, Jerusalem, Israel, June 2012. (pdf)

Daniel B. Neill, Skyler Speakman, Edward McFowland III, and Sriram Somanchi. Efficient subset scanning with soft constraints. Sixth International Workshop on Applied Probability, Jerusalem, Israel, June 2012. (pdf)

Skyler Speakman, Edward McFowland III, and Daniel B. Neill. Scalable detection of anomalous patterns with connectivity constraints. 29th Quality and Productivity Research Conference, Long Beach, CA, June 2012. (pdf)

Daniel B. Neill and Seth Flaxman. Detecting spatially localized subsets of leading indicators for event prediction. 32nd International Symposium on Forecasting, Boston, MA, June 2012. (pdf)

Daniel B. Neill. Predicting and preventing emerging outbreaks of crime. CMU Workshop on Machine Learning and Social Sciences, Pittsburgh, PA, October 2012. (pdf)

Sriram Somanchi and Daniel B. Neill. Fast graph structure learning from unlabeled data for event detection. INFORMS Annual Conference, Phoenix, AZ, October 2012.

Skyler Speakman, Yating Zhang, and Daniel B. Neill. Tracking dynamic water-borne outbreaks with temporal consistency constraints. International Society for Disease Surveillance Annual Conference, San Diego, CA, December 2012. (pdf)

Daniel B. Neill and Tarun Kumar. Fast multidimensional subset scan for outbreak detection and characterization. International Society for Disease Surveillance Annual Conference, San Diego, CA, December 2012. (pdf)

Daniel B. Neill. Fast subset scanning for scalable event and pattern detection. Stony Brook University, Stony Brook, NY, May 2013. (pdf)

Seth Flaxman and Daniel B. Neill. New tests for space-time interaction in spatio-temporal point processes. 2nd Spatial Statistics Conference, Columbus, OH, June 2013. (pdf)

Daniel B. Neill. Machine learning and event detection for the public good. Data Science for the Social Good Summer Fellowship Program, Chicago, IL, July 2013. (pdf)

Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. INFORMS Annual Meeting, Minneapolis, MN, October 2013. (pdf)

Feng Chen and Daniel B. Neill. Non-parametric scan statistics for disease outbreak detection on Twitter. International Society for Disease Surveillance Annual Conference, New Orleans, LA, December 2013. (pdf)

Skyler Speakman, Sriram Somanchi, Edward McFowland III, and Daniel B. Neill. Penalized fast subset scanning. 6th International Conference on Computational and Methodological Statistics, London, UK, December 2013. (pdf)

Daniel B. Neill. Scaling up event and pattern detection to big data. MIT Workshop on Challenges in Big Data for Data Mining, Machine Learning and Statistics, Cambridge, MA, March 2014. (pdf)

Daniel B. Neill. Scaling up event and pattern detection to big data. NYU Stern School of Business, Information Systems Seminar, New York, NY, April 2014. (pdf)

Feng Chen and Daniel B. Neill. Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs. Seventh International Workshop on Applied Probability, Antalya, Turkey, June 2014. (pdf)

Sriram Somanchi and Daniel B. Neill. A star-shaped scan statistic for detecting irregularly-shaped spatial clusters. Seventh International Workshop on Applied Probability, Antalya, Turkey, June 2014. (pdf)

Edward McFowland III and Daniel B. Neill. Discovering novel anomalous patterns in general data. Statistical Learning and Data Mining Meeting on Data Mining in Business and Industry, Durham, NC, June 2014. (pdf)

Seth Flaxman, Alex Smola, and Daniel B. Neill. Kernel space-time interaction tests for identifying leading indicators of crime. Joint Statistical Meetings, Boston, MA, August 2014. (pdf)

Mallory Nobles, Seth Flaxman, and Daniel B. Neill. Urban predictive analytics. INFORMS Annual Meeting, San Francisco, CA, November 2014. (pdf)

Sriram Somanchi, David Choi, and Daniel B. Neill. StarScan: a novel scan statistic for irregularly-shaped spatial clusters. International Society for Disease Surveillance Annual Conference, Philadelphia, PA, December 2014. (pdf)

Mallory Nobles, Lana Deyneka, Amy Ising, and Daniel B. Neill. Identifying emerging novel outbreaks in textual emergency department data. International Society for Disease Surveillance Annual Conference, Philadelphia, PA, December 2014. (pdf)



Broader Impacts: The Machine Learning and Policy (MLP) Initiative

With the critical importance of addressing global policy problems ranging from disease pandemics to crime and terrorism, and the continuously increasing size and complexity of policy data, the use of machine learning has become increasingly essential for data-driven policy analysis and for development of new, practical information technologies that can be directly applied for the public good. The numerous challenges facing our world will require broad, successful innovations at the intersection of machine learning and public policy. This endeavor will require widespread collaboration between machine learning and policy researchers, increased emphasis on the education of future researchers with in-depth knowledge of both disciplines, and a broadly shared research focus on developing novel machine learning methods which directly address critical policy challenges. We are working to build a multi-pronged curricular program, the Machine Learning and Policy (MLP) initiative. This program will facilitate the widespread use of machine learning methods for the public good by incorporating machine learning throughout the public policy curriculum. Key components of this program include a new Joint Ph.D. program in Machine Learning and Public Policy, an introductory course in machine learning ("Large Scale Data Analysis for Policy") geared toward public policy students, a Ph.D.-level research seminar in Machine Learning and Policy, and a course series in "Special Topics in Machine Learning and Policy", with courses including "Event and Pattern Detection" (Spring 2010, Spring 2014), "Machine Learning for the Developing World" (Spring 2011), "Harnessing the Wisdom of Crowds" (Spring 2012), and "Mining Massive Datasets" (Spring 2013). Project PI Daniel Neill was also involved in the creation of a CMU workshop and seminar series in "Machine Learning and Social Sciences" and in creating a new "Policy Analytics" track for Heinz College's MS in Public Policy and Management program.



Tutorials and Educational Material:

Daniel B. Neill. Lecture slides for the course, Large Scale Data Analysis for Public Policy. Last taught Fall 2014. (link)

Daniel B. Neill. Machine learning and event detection for the public good. Guest lecture, April 2011. (pdf)

Daniel B. Neill and Weng-Keen Wong. A tutorial on event detection. Presented at the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2009. (pdf)

Daniel B. Neill. Spatial scan tips and tricks for practical outbreak detection. Invited webinar for the International Society for Disease Surveillance, January 2011. (pdf)



Awards:

The Project PI, Dr. Neill, was named one of "AI's 10 to Watch" by IEEE Intelligent Systems, Jan/Feb 2011. (link)

Edward McFowland III was awarded an NSF Graduate Research Fellowship (link) and an AT&T Labs Research Fellowship, 2011. (link)

Edward McFowland III was the 2012 winner of the Suresh Konda Award, presented yearly to Heinz College's best First Heinz Research Paper.

Seth Flaxman was the 2013 winner of the Suresh Konda Award, presented yearly to Heinz College's best First Heinz Research Paper.

Sriram Somanchi was the 2013 winner of the George Duncan Award, presented yearly to Heinz College's best Second Heinz Research Paper.



This material is based upon work supported by the National Science Foundation, grants IIS-0916345 (primary funding source), IIS-0911032, and IIS-0953330. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Contact the PI: Daniel Neill, neill (at) cs (dot) cmu (dot) edu
Final update: March 10, 2015.