Finding Patterns and Anomalies in Large Time-Evolving Graphs

 
Christos Faloutsos Phone: (412)-268.1457
Department of Computer Science Fax : (412)-268.5576
Carnegie Mellon Univ. Email: christos AT cs.cmu.edu
Pittsburgh, PA 15213 WWW page: http://www.cs.cmu.edu/~christos

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534205. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

1. GENERAL INFORMATION

1.1. Abstract

Link to NSF abstract

Social networks can be represented as time-evolving graphs where the nodes are the members/entities of the social network and the edges represent connections/relationships between the nodes. This project tries to answer questions such as: How does a "normal" social network look like? How will it evolve over time? How can we spot "abnormal" interactions (e.g., spam), in a time-evolving e-mail graph? The approach developed in this project is to look for time-evolution "laws", and to design fast, scalable data mining tools for real graphs with millions and billions of nodes. The approach consist of two efforts: (1) discovery of patterns that hold when graphs evolve over time and (2) tools to analyze, visualize and mine such graphs to discover anomalies. The resulting tools will have a broad applicability. They will be vital for mining and outlier detection in numerous settings, such as money-laundering rings, mis-configured routers on the Internet, suspicious user accesses to database records, surprising protein-protein interactions in a gene regulatory network, and many applications involving large-scale evolving social networks. The project Web site (http://www.cs.cmu.edu/~christos/PROJECTS/GRAPH-MINING/) provides additional information and will be used for results dissemination.

1.2. Keywords

Data mining, network traffic, time evolution.

1.3. Funding agency

2. PEOPLE INVOLVED

In addition to the PI, the following graduate students worked on the project.

3. RESEARCH

3.1. Project goals

The goal of the project is to find patterns in large static and time-evolving graphs. The activities were the following:

The first was the use of tensors to analyze time-evolving graphs. Tensors are multi-dimensional extensions of arrays, and there is a wealth of decomposition methods for them, analogous to the SVD decomposition for arrays. We used tensors to find correlations and hidden variables in IP traffic matrices, as well as other graphs, like the DBLP author-paper-keyword datasets, evolving over time. The effort led to several publications [Sun et al, KDD'06], [Sun et al  Siam DM'07], and a tutorial on tensors (with Sun and Kolda, Siam DM'07).

The second activity was the analysis of influence propagation in blogs. We found several power-law patterns, in real blog data, and we publish the results in [Leskovec, Siam DM 2007]. The third activity was the development of fast algorithms to find the so-called 'CenterPiece SubGraphs' (CePS). Given a social network, and $k$ query nodes (eg., criminals), we want to find the most central node or nodes for that set (eg., the master-mind of the crime-gang). Our algorithms are 2 orders of magnitude faster than the naive implementation, and received the 'best paper' award in ICDM [Tong Faloutsos, ICDM'06].

3.2. Current Results

The first major finding was the introduction of tensors for the analysis of time-evolving graphs. Tensors provide powerful tools, which, in our opinion, will greatly benefit the graph mining community. The Ph.D. dissertation of Dr. Jimeng Sun focused exactly on that, and showed that tensors can extract the 'normal' patterns in sensor data, as well as spot outliers.

The second finding was that cascades in blogs follow several power laws. Analysis of a large body of blogs and postings shows that (a) the probability to point to a post drops as a *power law* with time (as opposed to exponentially, that one might guess) (b) the most typical shapes of cascades are 'star-like' cascades, instead of random (c) the activity in cascades is bursty over time, as we showed by using tools from chaotic time series analysis.

The third finding was the 150-fold speed-up of a graph mining algorithm ('CePS'), which tries to find the most central nodes, for a given set of query nodes.

We developed 'NetProbe', a system for fraud detection in electronic auctions. Fraudsters give false feedback  to  their alter egos, to deceive honest users. We observed that a common pattern in that case is the 'bi-partite core': fraudsters get good feedbacks, from 'accomplices' who behave like honest ones, except that they give good feedbacks to many fraudsters-to-be. Our algorithm, using Belief Propagation, is able to detect such bi-partite cores. The system attracted a lot of media attention, as we mention later.

In 2007 we also developed CELF, a fast algorithm to find near-optimal placement for sensors in a water distribution networ, to protect against poisoning attacks. The algorithm is much faster than the usual 'greedy' algorithm; it has provable performance bounds, and it also attracted media attention and a 'best student paper' award [Leskovec+, KDD'07]

In 2008, we worked on weighted graphs and the 'butterfly' generator. We found that the total node weight grows super-linearly with the degree of a node, and that the non-largest connected components remain about constant in size (in fact, they oscillate). The 'butterfly' generator obeys all these new patterns.
In 2008 we also worked on the 'Colibri' method. It focuses on summarizing and tracking time-evolving graphs and it achieves orders of magnitude better speed, for zero or tiny loss of accuracy, versus older methods (CUR, SVD etc).

3.3. Publications - refereed conferences

3.4. Publications - refereed journals

4. IMPACT - DISSEMINATION

4.1. Software

4.2. Demos

Live System

Related Pulications:VLDB06, Operating System Review06

4.3. Awards

4.4. Out-reach - presentations and tutorials

4.5. In the newsNetProbe screenshot



Last updated: June 3, 2009,, by Christos Faloutsos.