Project Proposal:

Parallel Triangle Counting Solver

Yicheng Qin; Shu-Hao Yu

Summary

We plan to create optimized implementations of triangle counting of big data on GPU/multi-core CPU platforms, and develop multiple-machine version if possible. Moreover, we will make a detailed analysis on performance characteristics.

Background

Triangle counting is a famous challenge in graph. One of its applications is the study on social network. Through it, researchers learn more about the characteristics of social networks, and how people group together.

Triangle is a basic geometric figure. A triangle in the graph means that three points on the triangle are adjacent to everyone else. Based on that, triangle counting means that counting the number of different triangles in a graph.

Above picture is referenced from http://www.vertica.com/2011/09/21/counting-triangles/. You can easily learn that there are three triangles in the graph: {Pachu, Andrew, Matt}, {Ben, Chuck, Stephen} and {Chuck, Stephen, Rajat}.

Challenge

It seems to be a easy question, but when it scales up, it becomes rather difficult. The first challenge is its big scale. Normally, it can be used on the graph which consists of large amounts of points, which means that we need to consider the cost of transfer time and memory size. The second challenge is to find a better algorithm. The most native way is to find all point triples and check their correctness. The way to improve it tends to use extra space, which should be balanced between time and cost, especially when taking the previous challenge into consideration. The third challenge is that to verify the existence of a triangle, it needs to traverse the whole data set. Whenever you split the problem, it may cause some difficulties on synchronization.

Resources

There have been many researches on it. Siddharth Suri and Sergei Vassilvitskii tried to solve the problem through MapReduce. They have published a paper for it, named 'Counting Triangles and the Curse of the Last Reducer'.
Moreover, Stephen Walkauskas compares three approaches on it, including Hadoop, PIG and Vertica. He presented the result on his blog: http://www.vertica.com/2011/09/21/counting-triangles/.

GraphLab uses its own infrastructure to develop the toolkit for triangle counting, which tends to regard it as a graph.

The code of the project should mainly be written by ourselves, and all these resources give us many thoughts.

For multiple machines, we may use OpenMPI for communication. Alex published an article about how to utilize multiple cores on different machines, and it broadens our sightseeing.Goals/Deliverables

PLAN TO ACHIEVE: Use GPU computation to gain 10X speedup than original algorithm. Build competitive GPU implementation against GraphLab on single machine.

HOPE TO ACHIEVE: Port to multiple machines to outperform GraphLab on multiple machines.

We will show the computational analysis in the end of this project. First we will see the speedup using GPU. Then we will compare with existing libraries.

Platform

We mainly use GHC machines which have NVIDIA GPU on it.
After that, we will try to use multiple machines on GHC.
Proposed Schedule

[Please do not modify the schedule on this page
after the proposal process (it is your proposed schedule, and it will
be useful to compare it to your actually project logs at the end of
the project). Update the schedule on your project main page if your
goals or timeline changes over the course of the project.]

Week | What We Plan To Do |

Apr 1-7 | Do topic choice |

Apr 8-14 | Research related work on Triangle Counting problem and submit the proposal |

Apr 15-21 | Multicore/GPU parallel solution on small data set |

Apr 22-28 | Improve the algorithm to support large data set |

Apr 29-May 5 | Speed up the algorithm by multiple machines |

May 6-11 | Leave for additional work |