Making Sense out of Large Graphs:
Bridging HCI with Data Mining

PI: Christos Faloutsos, Carnegie Mellon University , phone 412-268.1457, FAX: 412-268.5576, email: christos AT cs.cmu.edu
Co-PI: Aniket Kittur, Carnegie Mellon University,  412-268.7505,  nkittur AT cs.cmu.edu
Co-PI: Duen Horng (Polo) Chau, Georgia Tech, 404-385.7682, polo AT gatech.edu

This material is based upon work supported by the National Science Foundation under Grant No. IIS-1217559. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

1. GENERAL INFORMATION

1.1. Abstract ( Link to NSF abstract)

The goal of this research project is to help people make sense of large graphs, ranging from social networks to network traffic. The approach consists of combining two complementary fields that have historically had little interaction -- data mining and human-computer interaction -- to develop interactive algorithms and interfaces that help users gain insights from graphs with hundreds of thousands of nodes and edges. The goal of the project is to develop mixed-initative machine learning, visualization, and interaction techniques in which computers do what they are best at (sifting through huge volumes of data and spotting outliers) while humans do what they are best at (recognizing patterns, testing hypotheses, and inducing schemas). This research addresses two classes of tasks: first, attention routing -- using machine learning to direct an analyst's attention to interesting nodes or subgraphs that do not conform to normal behavior. Second, sensemaking -- helping analysts build in-depth representations and mental models of a specific areas or aspects of a graph. Evaluation of the tools will involve both controlled laboratory studies as well as long-term field deployments.

As large graphs appear in many settings -- national security, intrusion detection, business intelligence (recommendation systems, fraud detection), biology (gene regulation), and academia (scientific literature) -- the potential benefits of new tools for making sense of graphs is far reaching. Project results, including open-source software and annotated data sets, will be disseminated via the project web site (http://kittur.org/large_graphs/) and incorporated into educational activities.

1.2. Keywords

Data mining, HCI, graph mining.

1.3. Funding agency

2. Research Highlights & Major Activities


Figure 1. The CrowdScape interface. (A) is a scatter plot of aggregate behavioral features. Brush on the plot to filter behavioral traces. (B) shows the distribution of each aggregate feature. Brush on the distribution to filter traces based a range of values. (C) shows behavioral traces for each worker/output pair. Mouseover to explore a particular worker's products. (D) encodes the range of worker outputs. Brush on each axis to select a subset of entries. Next to (B) is a control panel where users can switch between parallel coordinates and a textual view of worker outputs (left buttons), put workers into groups (colored right buttons), or find points similar to their colored groups (two middle button sets).

2.1. Interactive Visualization

One area in which it is challenging to make sense of large data is understanding how people interact with an interface: each mouse movement, click, scroll, keypress, etc. generates potentially useful information but can quickly become overwhelming. Mining such data could be especially useful for the growing field of crowdsourcing, in which employers have little control or visibility into how crowd workers are accomplishing tasks and often have quality control issues. We developed a novel system (CrowdScape) which helps researchers to visualize the behavioral traces of crowd workers and interactively group them into clusters using machine learning. Each worker's behavior -- clicks, scrolls, typing, delays, etc. -- is summarized in a compact row, allowing many workers to be easily compared and made sense of, with dynamical queries providing an interactive overview and filtering mechanism. Furthermore, once a worker's behavior has been identified as high or low quality, CrowdScape uses machine learning models to propagate these labels to similar work. This research won the Best Paper award at UIST 2012.

Another significant contribution is our novel decomposition of canonical graph visualization techniques (e.g., PivotGraph, semantic substrate) into reusable, atomic interactive operations/building blocks (e.g., ranking horizontally, grouping nodes into super nodes). The user can then flexibly combine these operations to summon graph visualization techniques on demand and to potentially creating new ones. This investigation has led to an InfoVis'14 paper and formed the foundation of the thesis of CS PhD student Chad Stolper (Georgia Tech).

2.2. Multitouch visualization

In the case of highly multivariate data, it is difficult with desktop-based systems to examine more than two dimensions of variables due to their structured approach. Tablets provide a different set of affordances compared to desktops, and might allow us to interact with data in new ways. We have used multitouch gestures on a tablet combined with physics-driven models to create new interaction techniques that help users to explore more dimensions and make deeper analyses. Figures 1 and 2 below demonstrate some of the utility of these techniques. This research was published as an extended abstract at CHI 2013, where it was also shown as a demo and part of the video showcase.

2.3. Graph mining algorithms

We are also working on automatically determining important structures in a graph. The idea is to use the so-called Minimum Description Language (MDL) to describe 'important' subgraphs. A subgraph is 'important', if we can compress it easily: for example a clique of n nodes can be easily compressed; similarly for a chain, and for a star. Ms. Danai Koutra is working on the topic, developing scalable heuristics to solve the combinatorial problem of subgraph selection. Once such subgraphs are chosen, we plan to show them to the user, in a compressed form, for example, using a 'box' glyph, to represent a clique, or a 'star' glyph to represent a star.

Some of the research highlights include the TimeCrunch paper in kdd'15 for summarizing and understanding time-evolving graphs, and the 'Perseus' graph visualization demo in vldb'15. The over-arching ideas are (a) the use of MDL (minimum description language) to summarize and understand large static and time evolving graphs and (b) to present only a few items to the user each time ('Perseus') and (c) to strive for attention routing: show the user a quick summary (through MDL), and draw attention to the most blatant outliers ('Perseus').

Another highlight is our 2015 survey paper that summarizes the latest sensemaking challenges and opportunities in scalable graph exploration and visualization [Pienta'15], pinpointing the state of the art accomplishments (including some of our own) at the intersection of data mining and HCI and the open problems that the data mining and HCI research communities may want to join forces to solve.

Through this project, we also discover a new, simple, and practical way to scale up interactive analytics to billion-scale graphs on a single PC that uses the universally available virtual memory / memory mapping (MMap) capability on all operating systems to store graph data that would otherwise be too big to fit in RAM. Our approach significantly outperforms other existing approaches. We published our initial results at IEEE BigData'14. It could become a promising way to scale up general machine learning and data mining techniques without requiring developers and users to "re-learn" or reimplementing them using custom software frameworks.

Figure 3. Wikipedia editors in a "edit war" on the spelling of "Kiev". Figure 4: The 'Perseus' graph visualization system.


2.4. IDEA workshop series and journal

We have co-organized the IDEA workshop (Interactive Data & Visual Analytics) at KDD'15, '14, and '13 (co-PIs Chau and Faloutsos, http://poloclub.gatech.edu/idea2015/) as a concrete, highly visible way to bridge the data mining and HCI communities. IDEA is one of the largest workshops at KDD, attended by 100-200 attendees annually. Notably, HCI and visualization key figures like Ben Shneiderman (UMD) and Marti Hearst (UC Berkeley) gave keynotes at IDEA. We are also organizing our first TKDD special issue on IDEA.

3. PEOPLE INVOLVED

In addition to the PIs, the following graduate students work on the project.

4. Outreach

Research results have been disseminated via lectures in undergraduate and graduate classes at Georgia Tech and CMU and other universities. Over 10 talks were given nation wide and internationally at top venues (e.g., NIPS, CHI, SDM, distinguished lectures by Faloutsos at U of Toronto, USC, UIC), a tutorial on graph mining in ICDM'14, and outreach events for undergraduate students, and the general public.

Dissertation

  1. Exploring and Making Sense of Large Graphs Danai Koutra, Ph.D. Dissertation, August 2015. Winner of the ACM SIGKDD dissertation award, 2016.

Publications

  1. EdgeCentric: Anomaly Detection in Edge-Attributed Networks Neil Shah, Alex Beutel, Bryan Hooi, Leman Akoglu, Stephan Gunnemann, Disha Makhija, Mohit Kumar, Christos Faloutsos. International Conference on Data Mining (ICDM) Cybersecurity Workshop, Barcelona, Spain, December 2016
  2. Linearized and single-pass belief propagation, Wolfgang Gatterbauer, Stephan Guennemann, Danai Koutra, and Christos Faloutsos. PVLDB, 8(5), p. 581-592, 2015
  3. TimeCrunch: Interpretable Dynamic Graph Summarization Neil Shah, Danai Koutra, Tianmin Zou, Brian Gallagher, Christos Faloutsos. Conference on Knowledge Discovery and Data Mining (KDD), Sydney, Australia, August 2015
  4. Perseus: An Interactive Large Scale Graph Mining and Visualization Tool Danai Koutra, Di Jin, Yuanchi Ning, Christos Faloutsos. VLDB 2015, Kohala Coast, Hawaii, Aug. 2015
  5. Interactive Querying over Large Network Data: Scalability, Visualization, and Interaction Design. Robert Pienta, Acar Tamersoy, Hanghang Tong, Alex Endert, Duen Horng (Polo) Chau. ACM Conference on Intelligent User Interfaces (IUI). Atlanta, GA, USA. March 29 - April 1, 2015.
  6. Scalable Graph Exploration and Visualization: Sensemaking Challenges and Opportunities. Robert Pienta, James Abello, Minsuk Kahng, Duen Horng Chau. International Conference on Big Data and Smart Computing (BigComp). Jeju Island, Korea. February 9-12, 2015.
  7. Spotting Suspicious Link Behavior with fBox: An Adversarial Perspective Neil Shah, Alex Beutel, Brian Gallagher, Christos Faloutsos. ICDM 2014, Shenzhen, CN, Dec. 2014.
  8. GLO-STIX: Graph-Level Operations for Specifying Techniques and Interactive eXploration. Charles D. Stolper, Minsuk Kahng, Zhiyuan Lin, Florian Foerster, Aakash Goel, John Stasko, and Duen Horng Chau. IEEE InfoVis 2014, November 9-14, Paris, France.
  9. MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping. Zhiyuan Lin, Minsuk Kahng, Kaeser Md. Sabrin, Duen Horng Chau, Ho Lee, and U Kang. Proceedings of IEEE BigData 2014 conference. Oct 27-30, Washington DC, USA.
  10. Towards Scalable Graph Computation on Mobile Devices. Yiqi Chen, Zhiyuan Lin, Robert Pienta, Minsuk Kahng, Duen Horng (Polo) Chau. IEEE BigData 2014 Workshop on Scalable Machine Learning: Theory and Applications. Oct 27, 2014. Washington DC, USA.
  11. Net-Ray: Visualizing and Mining Billion-Scale Graphs. U Kang, Jay-Yoon Lee, Danai Koutra, and Christos Faloutsos. PAKDD 2014, Tainan, Taiwan.
  12. Influence Propagation: Patterns, Model and Case Study Yibin Lin, Agha Ali Raza, Jay-Yoon Lee, Danai Koutra, Roni Rosenfeld, Christos Faloutsos. PAKDD 2014, Tainan, Taiwan, May 2014.
  13. Com2: Fast Automatic Discovery of Temporal (Comet) Communities Miguel Araujo, Spiros Papadimitriou, Stephan Guennemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos E. Papalexakis, Danai Koutra. PAKDD 2014, Tainan, Taiwan, May 2014.
  14. VoG:Summarizing and Understanding Large Graphs Danai Koutra, U Kang, Jilles Vreeken, Christos Faloutsos. SDM 2014, Philadelphia, PA, April 2014.
  15. Glance: Rapidly Coding Behavioral Video with the Crowd Walter S. Lasecki, Mitchell Gordon, Danai Koutra, Malte Jung, Steven P. Dow and Jeff P. Bigham. ACM Symposium on User Interface Science and Technology (UIST'14), Oct. 2014.
  16. Graph-based Anomaly Detection and Description: A Survey. Leman Akoglu, Hanghang Tong, Danai Koutra. Data Mining and Knowledge Discovery (DAMI), April 2014.
  17. BiG-ALIGN: Fast Bipartite Graph Alignment. Danai Koutra, Hanghang Tong, David Lubensky. ICDM 2013: 389-398
  18. Are all brains wired equally? Danai Koutra, Yu Gong, Sephira Ryman, Rex Jung, Joshua Vogelstein, Christos Faloutsos. OHBM 2013, Seattle, WA, June 2013.
  19. TouchViz: (Multi)Touching multivariate data. Jeffrey Rzeszotarski, Aniket Kittur. CHI 2013 Video showcase.
  20. TouchViz: (Multi)Touching multivariate data. Jeffrey Rzeszotarski, Aniket Kittur. CHI 2013 Demo.
  21. TouchViz: (Multi)Touching multivariate data. Jeffrey Rzeszotarski, Aniket Kittur. CHI 2013 Extended Abstracts
  22. CrowdScape: Interactively visualizing user behavior and output. Jeffrey Rzeszotarski, Aniket Kittur. UIST 2012: Proceedings of the ACM Symposium on User Interface Software and Technology. New York: ACM Press.

Last updated: Jan. 30, 2017, by Christos Faloutsos.