Different proximity measurements on graphs

From ScribbleWiki: Analysis of Social Media

Jump to: navigation, search

On the very detailed level (i.e., node level) of link analysis, we want to figure out the relationship between two nodes on the graph, such as proximity, association, correlation and causality. For proximity, the goal is to measure the closeness (a.k.a, relevance, or similarity) between two nodes.

For many types of real graphs (i.e. social network), traditional graph theoretic distance (such as shortest path, maximum flow) is often not enough to describle the multiple-facet relationship between two nodes.

One of the most widely used proximity measurement is probably the so-called random walk with restart, the main idea behind Google's PageRank algorithm. Given a weighted directed graph, we can define a random walk on this paper. In order to measure the proximity for each node j wrt a quer node i, we can perform a random walk with restart to the query node i: image a random particle starting from node i and doing random surfing on the graph. At each step, it will either randomly jump to its neighborhood with the probability that is proportional to the edge weights between them, or returns to the starting point i. The proximity for each node j wrt the query node i is defined as the stead-state probablity that random particle will finally stay at node j. This measurement will naturally incorporate multiple connection between two nodes of interest; and it also considers the "quality" for each individual connection (punishing longer connection, higher degree nodes along the connection etc).

We can also define proximity by treating the graph as a big electric network, such sink-augmented effective conductance, circle free efficive conductance, hitting time/commute time/resistance distance and so on. For circle free efficive conductance, it turns out the it can be computed the k-shortest simple paths in the graph. For resistance, the math is beautiful and neat, i.e. all the resistance distances are determined the pseudo-inverse of the graph laplacian. Note that, if we augment the graph with a universal sink, we can compute all these scores by random walk with restart.

Another type of proximity measurement is based on so-called "reinforcement" assumption. The basic assumption is two objects are similar with eath other if their friendes are similar with each other. SimRank is the earliest one in this category, followed by SimFusion and Link and Attribute Reinforcement. One potential problem with this line of work is the compuational issue. Take SimRank as an example, the algorithm needs to iterate sparse matrix-matrix multiplication untile convergence. Also, no matter what we ask, the algorithm is always giving all possible proximity (n^2) scores. Therefore, SimRank is more suitable for clustering task, rather than querying. In LinkClus, the authors try to solve this issue by introducing a hierarchical structure (SimTree) and propose a multi-resolutions alg.

Other possible condidates includes Katz, which can be viewed as an un-normalized version of random walk with restart; and matrix-forest-based algorithm which has a nice interpretation from matrix-forest theorem.

In statistical machine learning community. Note that by so-called eigenspectrum transformation (either parametric or non-parametric) for the graph laplacian, we can also get a good proximity measurement. Furthermore, random walk with restart can be viewed as a special parametric transformation (1/(1-c\rimda)) on the eigenvalues.

Views
Personal tools
  • Log in / create account