PEGASUS (Peta Graph Mining Project)


Pegasus logo designed by Maria Tsiarli

Goals

The goal of this project is to create a software library using HADOOP that performs typical graph mining tasks in big graphs, currently handling hundreds of giga-bytes and heading towards tera- and peta-byte sized graphs. For some information click here. Currently, PEGASUS provides software for 0) degree distributions and PageRank, 1) diameter estimation, 2) connected components and 3) triangle counting.

Code

You can get the binaries from here. If you want the triangle counting code (JAVA, HADOOP and MATLAB) accompanying the paper "Doulion: Counting Triangles in Massive Graphs with a Coin", KDD '09, send an email to graphminingtoolbox@gmail.com.

Related Publications

Collaborators

U Kang and Christos Faloutsos.



Triangles

Goals

My personal interest in this problem, i.e., computing the number of triangles in the graph, began in the context of social networks where triangles have a natural interpretation: "friends of friends tend to become friends themselves". Two frequently computed metrics in complex network analysis, involve computing the number of triangles in a graph, namely the clustering coefficient and the transitivity ratio. My goal was to develop algorithms that compute triangles fast approximately in large graphs.

Short Description of Contributions


The first contribution was effectively an observation: we can take advantage of the properties of the spectrum of the adjacency matrix of a typical "power-law" graph and make a good estimate of the number of triangles using few eigenvalues and eigenvectors. Furthermore, in that same paper, two power laws are observed and a theorem on the number of triangles of the so-called Kronecker graphs (see Leskovec et.al.) is proved.

Given the experimental nature of the spectral approach (to the best of my knowledge, there is no paper explaining why the bulk of the eigenvalues is almost symmetric around 0) the goal was to develop an algorithm for the triangle counting problem that works for all graphs. The first contribution, was a simple sampling idea, i.e., choose a random subset of the edges, let's say in expectation of size pm, where m is the number of edges p is the percentage of edges we keep, reweight each edge with 1/p and count weighted triangles. This idea appeared in the KDD '09 paper, with experimental evidence that the estimator is good even if the graph is very sparse with respect to the number of triangles. The natural question of what is the minimum sparsification value for the Doulion sparsification idea was answered by M.N. Kolountzakis, Gary L. Miller and me and can be found in our preprint in Arxiv.

For more details, take a look in the related publications.

Related Publications



Tensor Mining

Goals

The goal of this project is to new develop tensor mining algorithms.

Related Publications

  • Two heads better than one: pattern discovery in time-evolving multi-aspect data
    Jimeng Sun, Charalampos E. Tsourakakis, Evan Hoke, Christos Faloutsos and Tina Eliassi-Rad
    In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2008)
    Invited Journal paper
  • Two heads better than one: pattern discovery in time-evolving multi-aspect data
    Jimeng Sun, Charalampos E. Tsourakakis, Evan Hoke, Christos Faloutsos and Tina Eliassi-Rad
    In Data Mining and Discovery Journal
  • MACH: Fast Randomized Tensor Decompositions
    Charalampos E. Tsourakakis
    Preprint (all preprints are subject to changes)

Collaborators

Jimeng Sun, Christos Faloutsos

Graph Mining: Detecting "Bridges"

Related Publications

  • BridgeFind: Fast Detection of Top-k Bridges in Large Graphs
    Hanghang Tong, Aditya Prakash, Charalampos E. Tsourakakis Tina Eliassi-Rad, Christos Faloutsos and Duen Horng Chau
    Under submission


  • Graph Similarity

    Ongoing work with Ioannis Koutis, Gary Miller and Christos Faloutsos.

    Manifold Learning

    Ongoing work. Check our reading group by clicking at Don's web page .

    Spectral Rounding for Vision and Machine Learning


    Image Segmentation produced using the Spectral Rounding Algorithm

    Ongoing work with Dave Tolliver. A recent paper by Dave Tolliver and Gary Miller called Spectral Rounding introduced a way to segment an image by reweighting iteratively the edges of the graph in a specific way so that it "breaks up" in a "natural" way into k components, i.e. the multiplicity of the eigenvalue 0 is k (see paper to learn how this idea actually works). We are investigating further this method.







Past Projects / Working Experience

Web Wrapper Verification




Description of Problem

Web wrappers are specialized programs that extract data from web sources automatically and bring that data into a structured form. Wrappers extract information accordingly to a set of rules based on the layout of the web pages. This works because web pages of a certain web site have usually the same layout since they are created by combining data coming from a database with a specific template. Web wrappers are important since they are used for integrating data coming from different web sources. In this way, the user can have a unified view of the information contained in independent web sources. For example, let's assume that we are interested in finding special offers from different travel offices. Probably we would use a search machine (Google, Yahoo) to find several travel offices and then check their offers. This process can become tedious and inefficient. As an alternative we could use web wrappers. If we would like to automate the aforementioned tedious process, we would create different web wrappers, one per each web source (in general) and then the wrappers would extract automatically the desirable information and bring it to a structured format (for example XML). Because the extraction process is based on the layout of the web pages, when a site changes its layout or even its content, the wrapper will (probably) not extract the desirable information. Thus, the need of a system that detects automatically the failure of a wrapper (wrapper verification task) and generates a new wrapper, adapted to the changed web source that extracts the desired information (wrapper reinduction task) is raised. These two tasks are known as the wrapper maintenance problem.

The main focus of my undergraduate thesis -conducted under the supervision of Timos Sellis and Georgios Paliouras- was to build a content-based wrapper verification system. Building up on existing work, several improvements were achieved. For the details see the technical report.

Related Technical Report

VeWRA: An Algorithm for Wrapper Verification
Charalampos E. Tsourakakis, Georgios Paliouras
CMU ML Tech Report CMU-ML-09-100, 2009

SHARE PROJECT



Before coming for graduate studies in USA, I worked for in the National Foundation of Research Demokritos in Athens on the SHARE Project. My main task was to create an ontology based web service for rescue and search operations. Tasks involved JAVA programming, using the Protege ontology editor and learning about Description Logics. I was happy to work during that period with Georgios Paliouras and Stasinos Konstantopoulos.

Click here to see the film trailer.