10-831/90-921 Event and Pattern Detection

Project Final Report

 

Leman Akoglu

March 02, 2010

 

 

 

Project Title:   Change-Point Detection in Time Series of User Behavior in Mobile Communication Graphs

 

 

Introduction 

 

Anomaly detection has been studied widely in many settings, from anomalous point detection in clouds of multi-dimensional points to spatio-temporal anomalous pattern detection, with applications to network intrusion detection, medical insurance claim fraud, credit card fraud, electronic auction fraud and many others. Much less attention has been paid to anomaly detection in graph data.

 

In this project, I study the behavior of users in an anonymized mobile communication network. In this who-texts-whom graph, nodes represent the users and edges represent the SMS interactions between these users. The data covers several months of activity and is therefore time-evolving. The edges can also be weighted, for example with weights denoting the total number of SMSs sent/received between individual pairs. In such a setting of a dynamic series of weighted graphs, the main questions are the following: (1) At what points in time does the behavior of the nodes change? (2) Can we characterize which nodes cause most of that change?

 

 

Data Description

 

The data spans the period from December 1st 2007 to May 31st 2008 (183 days). It contains SMS interactions of around 3 million service users (customers). The data also contains incoming/outgoing edges to/from customers from/to out-of-network users. However, since only the activity of actual customers can be tracked, the interactions among out-of-network users are not observed. Therefore, in this work I focus only on the network of the in-network users.

 

 

Method

 

Feature Extraction from Nodes

 

In order to find patterns that the nodes of a graph follow, I characterize each node with several features, so that each node becomes a multi-dimensional point. In particular, each node is summarized by a set of features extracted from its egonet (the egonet of a node includes the node itself, its neighbors, and all the interactions between these nodes). The 12 features considered in this work are as follows (entries with an in/out split count as two features each):

  1. Indegree/outdegree
  2. Inweight/outweight
  3. Number of neighbors
  4. Number of edges
  5. Number of reciprocal edges
  6. Maximum reciprocal edge ratio
  7. Average in/out weight
  8. Maximum in/out weight
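As a rough illustration of how a few of the features above could be computed, here is a minimal sketch that derives degree, weight and reciprocity counts from a weighted, directed edge list. The function name `node_features`, the tuple layout `(sender, receiver, sms_count)` and the toy edges are assumptions made for illustration. The counts are node-level simplifications and do not cover edges among a node's neighbors, which the full egonet features would need.

```python
from collections import defaultdict

def node_features(edges):
    """Compute simple per-node features from (sender, receiver, sms_count) tuples."""
    # Merge repeated (sender, receiver) pairs into one weighted edge.
    weight = defaultdict(int)
    for src, dst, w in edges:
        weight[(src, dst)] += w

    feats = defaultdict(lambda: {"indegree": 0, "outdegree": 0,
                                 "inweight": 0, "outweight": 0,
                                 "numrecip": 0})
    for (src, dst), w in weight.items():
        feats[src]["outdegree"] += 1
        feats[src]["outweight"] += w
        feats[dst]["indegree"] += 1
        feats[dst]["inweight"] += w
        # An edge is reciprocal if the reverse edge also exists.
        if (dst, src) in weight:
            feats[src]["numrecip"] += 1
    return dict(feats)

# Hypothetical toy graph: a and b text each other, c texts a.
toy_edges = [("a", "b", 3), ("b", "a", 1), ("c", "a", 2)]
features = node_features(toy_edges)
print(features["a"])
```

Running this once per day and per node would fill one slice of a days-by-nodes-by-features array.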

 

Figure 1. Flow of the change-point detection procedure

 

 

Change-Point Detection

 

The flow of the method used in this work to find change-points in the behavior of nodes is illustrated in Figure 1. This method is similar to Ide and Kashima [1], but differs in the construction of the “dependency” matrix C.

 

Here, the data I have looks like the 3-D tensor on the top left of Figure 1, where T=183 days, N=~2M nodes, and F=12 features. To start with, I take one slice of this tensor for a particular feature Fi, say inweight, which is a TxN matrix. Next, I define a window of size W over the time-series of values of nodes for that particular feature. Then, for each window-size time series of a pair of nodes, I compute the correlation using Pearson’s rho,

                           

 

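$$\rho(X,Y) = \frac{\sum_{i=1}^{W} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{W} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{W} (Y_i - \bar{Y})^2}},$$

where X and Y are the length-W vectors for node pair (X,Y), and $\bar{X}$ and $\bar{Y}$ are their means over the window.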

So, for each window I construct a correlation matrix C, where Cx,y = rho(x,y) over window W. Next, I slide the window forward by one day and repeat the computation for the next seven-day window. With T=183 and W=7, this yields 183-7+1 = 177 C matrices (top-right in Figure 1).
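The sliding-window construction can be sketched as follows. This is a minimal illustration rather than the code used for the experiments. The function name `window_correlations` and the toy data are hypothetical. Two series that are both constant within a window are assigned correlation 1, and a constant series paired with a varying one is assigned 0; the second rule is purely an assumption of this sketch.

```python
import numpy as np

def window_correlations(X, W=7):
    """Return one N x N Pearson correlation matrix per W-day window.

    X is a T x N array holding one feature (e.g. inweight) per day and node.
    """
    T, N = X.shape
    matrices = []
    for start in range(T - W + 1):              # slide the window one day at a time
        window = X[start:start + W]             # W x N block of days
        centered = window - window.mean(axis=0)
        norms = np.linalg.norm(centered, axis=0)
        denom = np.outer(norms, norms)
        C = np.zeros((N, N))
        defined = denom > 0                     # pairs where rho is well defined
        C[defined] = (centered.T @ centered)[defined] / denom[defined]
        # Assumed convention: two constant (e.g. all-zero) series get correlation 1.
        flat = norms == 0
        C[flat[:, None] & flat[None, :]] = 1.0
        matrices.append(C)
    return matrices

rng = np.random.default_rng(0)
X = rng.random((183, 5))                        # toy data: 183 days, 5 nodes
X[:, 3:] = 0.0                                  # two nodes with no activity
Cs = window_correlations(X, W=7)
print(len(Cs))                                  # 183 - 7 + 1 = 177 windows
```

For millions of nodes the full N x N matrix would not fit in memory, so this dense version is only meant to make the window arithmetic concrete.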

 

By the Perron-Frobenius theorem, the principal eigenvector of each C matrix (the eigenvector of its largest eigenvalue) can be taken to have non-negative entries, provided the entries of C themselves are non-negative. The value for each node in the eigenvector can be thought of as the “activity” of that node; that is, the more correlated a node is to the majority of the nodes, the higher its “activity” value will be. Here, I call each such eigenvector the “eigenbehavior” of the nodes.

 

After computing the principal eigenvectors of all 177 C matrices, change-points in the “eigenbehavior” of the nodes are found as follows. For the eigenvector u(t) computed at time t, I compute a typical “eigenbehavior” r(t-1) by summarizing the last W eigenvectors back in time (see bottom-right in Figure 1). Next, the “eigenbehavior” at time t is compared to the “typical eigenbehavior” by taking the dot product of these two unit vectors. The change metric is Z = 1 - u(t)^T r(t-1). If the new “eigenbehavior” u(t) is perpendicular to the typical pattern, the dot product is 0 (Z=1), whereas if u(t) is the same as r(t-1), the dot product is 1 (Z=0). Z therefore ranges between 0 and 1, and time points with a high value of Z are flagged as change-points. The bottom row of Figure 1 illustrates the procedure.
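The eigenvector and Z-score steps can be sketched as below, assuming each C matrix is symmetric and that the typical pattern r(t-1) is the normalized average of the previous W eigenvectors. The function names `principal_eigenvector` and `change_scores` are hypothetical, and the sign flip is a convention added here because an eigenvector is only defined up to sign.

```python
import numpy as np

def principal_eigenvector(C):
    """Unit eigenvector of the largest eigenvalue of a symmetric matrix C."""
    vals, vecs = np.linalg.eigh(C)      # eigenvalues come back in ascending order
    u = vecs[:, -1]
    if u.sum() < 0:                     # fix the arbitrary sign
        u = -u
    return u / np.linalg.norm(u)

def change_scores(eigvecs, W=5):
    """Z(t) = 1 - u(t)^T r(t-1) for every t with W earlier eigenvectors."""
    Z = []
    for t in range(W, len(eigvecs)):
        r = np.mean(eigvecs[t - W:t], axis=0)   # average of the last W vectors
        r /= np.linalg.norm(r)                  # renormalize to unit length
        Z.append(1.0 - float(eigvecs[t] @ r))
    return np.array(Z)

# Toy check: a stable pattern followed by a perpendicular one.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
Z = change_scores([e1] * 6 + [e2], W=5)
print(Z)                                # first score is 0 (no change), second is 1
```

In the toy sequence the sixth vector matches the typical pattern (Z=0) and the seventh is perpendicular to it (Z=1), which are the two extreme cases described above.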

 

 

 

Experimental Results

 

Here I start by looking at the distribution of correlation values Ci,j in the C matrices. Figure 2 shows the histogram as well as the CDF of Ci,j values for two different days, Dec 1st and Dec 26th.

 

 

Figure 2. (top) The histogram and (bottom) the CDF of the distribution of correlation scores Ci,j for (left) Dec 1 and (right) Dec 26.

One observation is that the distribution of correlations is skewed, as might be expected. Surprisingly, though, it is skewed towards large values: many pairs have a correlation close to or equal to 1. This happens because over the time window W of 7 days most of the nodes have no activity. Their length-W vectors are all 0s, and the pairwise correlation of two such zero vectors, which is formally undefined, is taken to be 1.

 

Next, I compare SVD against the regular average (AVG) for computing the typical eigenbehavior r over a window of W (bottom-right in Figure 1). Figure 3 shows the Z scores when r is computed with SVD (blue bars) versus when r is computed by simply taking the average (red line) for different values of W: (from left to right, top to bottom) 5, 7, 20 and 50. Notice that the red line almost exactly follows the blue bars. This suggests that SVD effectively gives equal weight to all W past eigenvectors, as the average does. Therefore, since computing the average is less expensive, I use AVG to compute the r vector in the rest of the experiments.
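To make the SVD-versus-AVG comparison concrete, here is a small sketch on synthetic data. It assumes that "computing r with SVD" means taking the principal left singular vector of the N x W matrix whose columns are the past W eigenvectors, as in Ide and Kashima [1]. The function names and the synthetic vectors are hypothetical.

```python
import numpy as np

def typical_avg(past):
    """Normalized average of the columns of the N x W matrix `past`."""
    r = past.mean(axis=1)
    return r / np.linalg.norm(r)

def typical_svd(past):
    """Principal left singular vector of the N x W matrix `past`."""
    U, _, _ = np.linalg.svd(past, full_matrices=False)
    r = U[:, 0]
    return -r if r.sum() < 0 else r     # fix the arbitrary sign

# Synthetic past eigenvectors: one shared pattern plus small noise per day.
rng = np.random.default_rng(0)
base = np.abs(rng.normal(size=50))
base /= np.linalg.norm(base)
past = base[:, None] + 0.01 * rng.normal(size=(50, 5))
past /= np.linalg.norm(past, axis=0)    # each column is a unit vector

agreement = float(typical_avg(past) @ typical_svd(past))
print(agreement)                        # cosine between the two typical patterns
```

When the past eigenvectors are close to one another, the matrix is nearly rank one and both summaries point in almost the same direction, which is one plausible reading of why the two curves in Figure 3 coincide.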

 

Another take-away from Figure 3 is that the Z scores follow a similar trend when different window sizes are considered. In the rest of the experiments, I will use W=5 over which the r vector is computed via AVG.

 

 

Figure 3. Z scores when the typical pattern vector r is computed with SVD (blue bars) versus with the regular average, AVG (red line).

Change-Points

 

Here, after computing the Z scores as was explained above, I use a simple heuristic to flag the high Z scores. Rather than using a threshold value for the Z values, I simply compute the difference between the consecutive Z scores and rank the time points according to |Z(t)-Z(t-1)|.
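The ranking heuristic can be written in a few lines. The function name `top_change_points` and the toy Z series are hypothetical. The returned index t refers to the later of the two consecutive time points being compared.

```python
import numpy as np

def top_change_points(Z, k=10):
    """Rank time points by |Z(t) - Z(t-1)| and return the k largest jumps."""
    diffs = np.abs(np.diff(Z))              # diffs[i] = |Z(i+1) - Z(i)|
    order = np.argsort(diffs)[::-1][:k]     # indices of the largest jumps first
    return order + 1, diffs[order]          # shift so the index refers to t

# Toy series: a spike at t=2 that returns to normal at t=3.
Z = np.array([0.1, 0.1, 0.9, 0.2, 0.2])
times, jumps = top_change_points(Z, k=2)
print(times, jumps)
```

Because the score is an absolute difference, both the onset of a spike and the return to normal are flagged, so a one-off event can occupy two entries in the ranking.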

Figure 4 shows the top 10 time ticks with the highest difference scores. Note that F is taken to be the “inweight”. Experiments with other features such as “number of reciprocated edges” and “outdegree” flag similar time points, as shown later in this report.

 

Here, we observe that the top 2 time points correspond to the weeks of Christmas and New Year (Dec 26, Jan 2). This suggests that even though the data comes from India, where most people are not Christian, users still appear to “celebrate” Christmas. Jan 2nd rather than Jan 1st is flagged because it is the change-point at which things went back to normal.

 

Another surprising finding is the 3rd time tick, Apr 7. Similar to Jan 2, this is a time point at which things returned to normal; the actually interesting day is Apr 6th.

For 2008, http://www.infoplease.com/ipa/A0777465.html lists Apr 6th as the “Hindi New Year”. These results suggest that the method is effective in finding points in time at which the collective behavior of the nodes deviates from the recent past.

 

 


Figure 4. Top 10 time points flagged by the method (red bars) for F:inweight.

 

 

As a sanity check, I ran the method on other features such as “numrecip” (number of reciprocal edges) and “outdegree”. Figure 5 shows that with these features the method flags almost the same time points, including Dec 26th and April 6th. Moreover, the difference/spike in the Z score is even clearer with these features. This is intuitive: even though the “inweight” (number of SMSs received) is expected to increase on days such as Christmas and New Year, the number of reciprocated interactions is expected to increase more (people tend to reply to celebration messages on such days).

 

 

 

Figure 5. Top 10 time points flagged by the method (red bars) for (left) F:numrecip and (right) F:outdegree.

 

 

 

Change-Nodes

 

Here the question is: for a given change-point detected as above, can we go back and detect which node(s) contributed to the change the most?

 

Figure 6 shows the scatter plot of the eigen-scores u(t) versus the typical pattern scores r(t-1) for all the nodes on Dec 26th. We observe that most of the values lie on the diagonal, which shows that the majority of the nodes did not deviate much from their typical behavior. On the other hand, the points far off the diagonal, marked in red, are the ones that contribute to the Z score the most.

 

Similarly, Figure 7 shows the change ratio (%) for 10K nodes (the bottom row shows the values in sorted order). Again, the same top 5 nodes as in Figure 6 are marked in red.
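One way to attribute the change score to individual nodes is sketched below. It relies on the identity 1 - u^T r = ||u - r||^2 / 2 for unit vectors, so the per-node terms (u_i - r_i)^2 / 2 sum to Z. This is offered as an illustration only and is not necessarily the change ratio plotted in Figure 7. The function name `node_contributions` and the toy vectors are hypothetical.

```python
import numpy as np

def node_contributions(u, r):
    """Per-node share of Z = 1 - u^T r, valid when u and r are unit vectors."""
    return 0.5 * (u - r) ** 2

# Toy unit vectors for three nodes.
u = np.array([0.6, 0.8, 0.0])           # eigenbehavior at time t
r = np.array([0.8, 0.6, 0.0])           # typical pattern from the recent past
contrib = node_contributions(u, r)
Z = 1.0 - float(u @ r)

ranked = np.argsort(contrib)[::-1]      # nodes ordered by contribution
direction = np.sign(u - r)              # +1: above the diagonal, -1: below
print(contrib.sum(), Z)                 # the contributions add up to Z
```

The sign of u_i - r_i indicates on which side of the diagonal a node falls in a scatter plot like Figure 6, which separates nodes whose score rose from those whose score dropped.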

 

Since the data does not contain any labels about any type of anomaly, I plot the inweight time series of the top 5 nodes marked in Figures 6 and 7 (one row per node; see the time-series figure that follows Figure 7). We observe that 3 of the nodes (rows 1, 4 and 5) have no activity in the week of Dec 26th. They are flagged because they had some activity over the previous weeks. The other 2 nodes (rows 2 and 3) show the opposite behavior: they start receiving SMSs after the Christmas week. We also observe that these two sets of nodes lie on opposite sides of the diagonal in Figure 6, indicating an opposite change in their behaviors.

  

Figure 6. Scatter plot u(t) versus r(t-1). Each blue dot indicates a node. Nodes far away from the diagonal change in “behavior” the most (top 5 marked in red).

 

 

Figure 7. (top) Change ratios (%) of 10K nodes in u(t) and r(t-1). Each bar indicates a node (top 5 shown in red). (bottom) Ratio values sorted.

 

Figure 7. Time series of inweight values of  top 5 nodes marked in Figures 6&7.

 

 

Figures 8 and 9 show corresponding results for April 7th (3rd most important change point). Note especially the drop in activity over the week of April 7th after high activity on April 6th (Hindi New Year in 2008).

 

Finally, similar results are obtained for other features such as “numrecip” and “outdegree”, but I omit them here for brevity. Comparing the nodes detected with different features is subtle: although one can look at the overlap of nodes in the top-k ranked lists, I only inspect the time series of the top 5 nodes detected for each feature to see whether the results make intuitive sense.

 

 

 

 

Change-Points (for Node)

 

Next, I switch to applying the same method on the other dimension of the data tensor I started with. In particular, here instead of looking at the TxN matrix for a particular feature F, I take the TxF matrix for a particular Node X and try to detect “interesting”/change-points for Node X.
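The two ways of slicing the data can be illustrated on a small synthetic array. This is a sketch under assumptions: the array name `tensor`, its random contents and the small N are hypothetical, and the axis order (days, nodes, features) is chosen to match the T x N x F description.

```python
import numpy as np

T, N, F, W = 183, 4, 12, 7
rng = np.random.default_rng(1)
tensor = rng.random((T, N, F))          # toy stand-in for the real data

# Earlier sections: fix a feature and correlate nodes with each other.
feature_slice = tensor[:, :, 2]         # T x N matrix for one feature

# This section: fix a node and correlate its features with each other.
node_slice = tensor[:, 0, :]            # T x F matrix for one node

# First window of each view; the rest of the pipeline is unchanged.
C_nodes = np.corrcoef(feature_slice[:W].T)      # N x N
C_features = np.corrcoef(node_slice[:W].T)      # F x F
num_windows = T - W + 1
print(C_nodes.shape, C_features.shape, num_windows)
```

In the per-node view each entry of the principal eigenvector corresponds to a feature rather than a node, so the same Z score measures a change in how one user's features move together.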

 

 

 

 

Figure 8. (left)  Scatter plot u(t) versus r(t-1) on April 7th. (right) Change ratios (%) of 10K nodes in u(t) and r(t-1) on April 7th.

 

 

           Figure 9. Time series of inweight values of top 5 nodes marked in Figure 8.

 

 

 

Since applying the method to all 2 million nodes is not practical, I chose the top 2 nodes with the highest number of SMSs received. The first one is

where X denotes the anonymous customer ID, MR is the gender, next comes the listed birthday, and the rest is extra information that is not relevant to this study. The second user is:

Here, we observe that some customers share the same ID, perhaps two people sharing the same phone line/service.

The birthday is conjectured to be important in the sense that these days are expected to be flagged as change-points for these nodes. Unfortunately, for the first node the birthday is in October, while the data only runs until May. Also, for the second user(s), the day fields are both 1 and look fake.

 

Figure 10 shows the top 10 change-points detected for these two users. Unfortunately, it is hard to make any argument about how valid these results are because this time they are more subjective.   

 

 

Figure 10. Top 10 time points flagged by the method (red bars) for the top 2 users with the highest inweight (SMSs received).

 

 

For the first node, the week of Jan 6 is found to be the most important change-point. Figure 11 shows the ratio of change for all 12 features at that time. We observe that inweight, average inweight and maximum inweight contribute to the change the most. Although these three features are correlated, it is surprising that the indegree itself does not change much (first bar). This suggests that the user receives many more SMSs, but from the same set of contacts (high inweight, constant indegree). The 4th most-changing feature is “numrecip”, which suggests that the user is also replying to many more of the received SMSs than usual.

 

Figure 12 shows the time series of all the 12 features for node X=84332250336 on Jan 6th. The start and end of the week are marked with red and green lines, respectively. Unfortunately, it is hard to notice any change compared to the recent past for this week, since the node is highly active over the whole period of 6 months.

 

Results for the second user(s) look similar and are not shown in this report.

Figure 11. (top) Change ratios (%) of 12 features in u(t) and r(t-1). Each bar indicates a feature (top 5 shown in red). (bottom) Ratio values sorted.

 

 

Figure 12. Time series of all the 12 features for Node X=84332250336 on Jan 6th. Red line marks the start whereas the green line marks the end of the week.

 

 

Discussion and Conclusions

 

The goal of this project is to find (1) “change-points” and (2) the nodes that are most related to the cause of a particular change-point in time. Intuitively, the method tries to find those points in time at which the behavior of the nodes changes collectively. Although there exists no ground truth for the SMS data analyzed, the results suggest that the method is effective in detecting interesting time points such as Christmas and the Hindi New Year, as far as one can characterize. On the other hand, identifying the nodes that contribute to a change the most is harder and needs further analysis/ground truth information for evaluation.

       Moreover, since the data is 3-mode, one can perform the same method on the features of a fixed node over time to find change-points per node. For the same reasons as above, making an argument on the accuracy of the findings of the method is hard as the only proper information that can be used in evaluation about users is their birthdays which unfortunately also seem to be forged.

       One possible approach to evaluation is to compare against other existing algorithms for change-point detection on multivariate time series data, such as Spirit [2] and GraphScope [3]; this is left as future work for this project.

 

[1] Ide, T. and Kashima, H. Eigenspace-based anomaly detection in computer systems. KDD, 2004.

[2] Papadimitriou, S., Sun, J. and Faloutsos, C. Streaming pattern discovery in multiple time-series. VLDB, 2005.

[3] Sun, J., Faloutsos, C., Papadimitriou, S. and Yu, P. S. GraphScope: parameter-free mining of large time-evolving graphs. KDD, 2007.