SPIRITSPIRIT: Streaming Pattern Discovery

[Overview | Technique | People | Publication | Download | Demo]


Data stream is an important model for many different applications: network traffic analysis, sensor network monitoring, moving object tracking, financial data analysis. The main challenge is that 1) speed challenge:  the data are coming into the system for processing in real-time (e.g. stock index quotes, sensor measurements, network traffic);  2) space challenge: the data are usually unbounded which posts another challenge on how to efficiently store and process the data.  A lot of applications on data streams are monitoring applications, where huge volume of real-time data streaming into the system need to be monitored for trend analysis and anomaly detection. It is crucial to detect patterns and correlations that may exist in co-evolving data streams. Streams often are inherently correlated (e.g., temperatures in the same building, host traffic in the same network, prices in the same market, etc.) and it is possible to reduce hundreds of numerical streams into just a handful of hidden variables that compactly describe the key trends and dramatically reduce the complexity of further data processing.

Figure on left illustrates the idea: the top panel plots the original data streams (potentially a large number of them); the middle panel shows the hidden variables computed from the original streams; the bottom panel plots the reconstruction of original stream based on the hidden variables. The idea is that it is hard to monitor a large number of data streams (in the top panel), but it is probably OK to look at the hidden variables (in the middle panel). Furthermore, how good the hidden variables capture the original data streams is reflected in the reconstruction. The closer the original and reconstruction are, the better the hidden variables capture the streams. This is a very important observation since it gives a hint on 1) how to determine the number of hidden variables and 2) how to detect abnormal behavior.


The ultimate goal is to have online algorithms that does the following for a large number of streams:

This material is based upon work supported by the National Science Foundation under Grants No. IIS-0209107, IIS-0205224, INT-0318547, SENSOR-0329549, IIS-0326322, This work is also supported in part by the Pennsylvania
Infrastructure Technology Alliance (PITA),a partnership of Carnegie Mellon, Lehigh University and the Commonwealth of Pennsylvania's Department of Community and Economic Development (DCED). Additional funding was provided by donations from Intel, NTT and Hewlett-Packard. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, or other funding parties.


Centralized SPIRIT

Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally n-dimensional correlations and hidden variables, which summarize the key trends in the entire stream collection. Related publication [2,3]

Distributed SPIRIT

Related publication [1]



  1. Sun,  J., Papadimitriou,  S., Faloutsos,  C. Distributed Pattern Discovery in Multiple Streams, To appear in the Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Singapore, Apr 9-12, 2006 [short version/full version][bib]

    • Abstract: Given m groups of streams which consist of n_1, ...,  n_m coevolving streams in each group, we want to: (i) incrementally n_d local patterns within a single group, (ii) efficiently obtain global patterns across groups, and more importantly, (iii) efficiently do that in real time while limiting shared information across groups. In this paper, we present a distributed, hierarchical algorithm addressing these problems. Our experimental case study confirms that the proposed method can perform hierarchical correlation detection efficiently and effectively

  2.  Papadimitriou,  S., Sun,  J., Faloutsos,  C. Streaming Pattern Discovery in Multiple Time-Series, Proceedings of the Very Large Data Bases Conference (VLDB), Trondheim, Norway, 2005[pdf][bib]

    • Abstract: In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally nd correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is any-time, single pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing. Our experimental evaluation and case studies show that SPIRIT can incrementally capture correlations and discover trends, efficiently and effectively.

  3. Sun,  J., Papadimitriou,  S.,  Faloutsos,  C.  Online  latent  variable detection in  sensor   networks,  To appear in Proceedings of 21th  IEEE International Conference on Data Engineering (ICDE), 2005 (demo).[pdf][bib]

    • Abstract: Sensor networks attract increasing interest, for a broad range of applications. Given a sensor network, one key issue becomes how to utilize it efficiently and effectively. In particular, how can we detect the underlying correlations (latent variables) among many co-evolving sensor measurements? Can we do it incrementally? We present a system that can (1) collect the measurements from the real wireless sensors; (2) process them in real-time; and (3) determine the correlations (latent variables) among the sensor streams on the fly.

  4. Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006 [pdf][bib]

    • Abstract: Modern data centers have a large number of components that must be monitored, including servers, switches/routers, and environmental control systems. This paper describes InteMon, a prototype monitoring and mining system for data centers. It uses the SNMP protocol to monitor a new data center at Carnegie Mellon. It stores the monitoring data in a MySQL database, allowing visualization of the time-series data using a JSP web-based frontend interface for system administrators. What sets InteMon apart from other cluster monitoring systems is its ability to automatically analyze correlations in the monitoring data in real time and alert administrators of potential anomalies. It uses efficient, state of the art stream mining methods to report broken correlations among input streams. It also uses these methods to intelligently compress historical data and avoid the need for administrators to configure threshold-based monitoring bands.

  5. Evan Hoke, Jimeng Sun, Christos Faloutsos. InteMon: Intelligent System Monitoring on Large Clusters, Proceedings of VLDB, 2006 (demo).



Simple Web Demo (it requires Sun JVM installed):

IntMon: intelligent monitoring system for large clusters [currently under construction, please come back later]