Ph.D. Candidate, Computer Science Department
Graphs are ubiquitous, representing a variety of information, ranging from who follows whom
on online social networks to who reviews what
on e-commerce sites.
Many of these graphs are large (e.g., online social networks with over two billion active users) and dynamic (i.e., nodes and edges can be added and removed over time).
Moreover, they are with rich side information (e.g., e-commerce reviews with timestamps, ratings, and text)
and thus naturally modeled as tensors (i.e., multi-dimensional arrays).
Given large dynamic graphs and tensors, how can we analyze their structure?
How can we detect interesting anomalies?
Lastly, how can we model behaviors of individuals in the data?
My thesis focuses on these closely related questions, all of which are fundamental to understand massive evolving data on user behavior.
That is, we develop scalable algorithms for mining large dynamic graphs and tensors, with a focus on three tasks:
- Structure Analysis:
We build one-pass, sublinear-space algorithms that incrementally estimate the triangle count, which is an important connectivity measure, in large dynamic graphs.
In particular, our distributed algorithm yields up to 39X more accurate estimates faster than a baseline.
We also develop distributed and out-of-core algorithms for succinctly but accurately summarizing the structure of large graphs and tensors.
They summarize over 25X larger data without quality loss than their best competitors.
- Anomaly Detection:
We develop near-linear time approximation algorithms for detecting unusually dense subgraphs and subtensors, which signal notable anomalies such as 'edit wars' on Wikipedia and fake followers on Twitter.
Especially, our tensor algorithm is up to 114X faster without accuracy loss than the previously best heuristic.
We also extend it for distributed or dynamic data with the same approximation guarantee.
- Behavior Modeling:
We design game-theoretic models for purchases of individuals in social networks and a fast algorithm for finding Nash equilibria of the models.
In addition, we develop a stage model for the progression of individuals and a distributed optimization algorithm for fitting the model to behavior logs with trillions of records.
Using our tools, we measure social inefficiency regarding purchase of sharable goods and discover progression patterns of LinkedIn users.
To achieve the highest performance and scalability, our algorithms for the above tasks employ mathematical techniques (e.g., approximation and sampling), use distributed computing frameworks (e.g., MapReduce and MPI), and/or exploit pervasive patterns in real-world data (e.g., power-law degree distribution).
We successfully apply them to massive datasets, including 20.6 billion
social connections on LinkedIn,
follow relations on Twitter,
hyperlinks between web pages,
edits on Wikipedia,
and a synthetic tensor with 1 trillion