Bansal et al., VLDB 2007
The full pdf may be found here.
In this paper, the authors make the following contributions:
- They present fast algorithms for identifying sets of correlated keywords. Since there are O(10^6) keywords, speed is important when identifying associated keywords.
- They formalize the idea of stable clusters of keywords, that is, clusters that persist over long periods of time.
- They present streaming (on-line) versions of these algorithms, since timely analysis of blogs without recomputing from scratch is important.
- They evaluate these algorithms.
They generate clusters from keyword co-occurrence, and measure correlation to gauge the strength of each co-occurrence. They find clusters quickly by identifying biconnected components, that is, components that cannot be disconnected by removing any single vertex. Keywords that form a biconnected component may be considered strongly correlated.
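To make the idea concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes a Jaccard overlap with threshold `min_jaccard` as a stand-in for the correlation measure (this summary does not specify the measure used in the paper), and it finds biconnected components with the standard depth-first (Hopcroft-Tarjan) method. The function names and the toy documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(docs, min_jaccard=0.5):
    """Link two keywords when the Jaccard overlap of the documents that
    contain them reaches min_jaccard (an assumed correlation measure)."""
    postings = defaultdict(set)          # keyword -> ids of docs containing it
    for doc_id, words in enumerate(docs):
        for w in set(words):
            postings[w].add(doc_id)
    graph = defaultdict(set)
    for words in docs:
        for u, v in combinations(sorted(set(words)), 2):
            both = len(postings[u] & postings[v])
            either = len(postings[u] | postings[v])
            if both / either >= min_jaccard:
                graph[u].add(v)
                graph[v].add(u)
    return graph


def biconnected_components(graph):
    """Return the biconnected components of an undirected graph as vertex sets."""
    depth, low = {}, {}
    edge_stack, components = [], []

    def dfs(u, parent, d):
        depth[u] = low[u] = d
        for v in graph[u]:
            if v == parent:
                continue
            if v not in depth:               # tree edge
                edge_stack.append((u, v))
                dfs(v, u, d + 1)
                low[u] = min(low[u], low[v])
                if low[v] >= depth[u]:       # u separates v's subtree
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif depth[v] < depth[u]:        # back edge to an ancestor
                edge_stack.append((u, v))
                low[u] = min(low[u], depth[v])

    for start in list(graph):
        if start not in depth:
            dfs(start, None, 0)
    return components


# Toy posts (hypothetical data): each post is a list of keywords.
docs = [
    ["apple", "iphone", "mac"],
    ["apple", "iphone", "mac"],
    ["apple", "iphone", "jobs"],
    ["somalia", "ethiopia"],
    ["somalia", "ethiopia", "apple"],
]
clusters = biconnected_components(cooccurrence_graph(docs))
print(sorted(sorted(c) for c in clusters))
```

Note that under the standard definition a single edge also counts as a biconnected component, so a real pipeline might keep only components with three or more keywords. The recursive search is fine for a toy graph; a graph with millions of keywords would need an iterative version.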
They decide whether a cluster is stable by looking at the paths that link clusters across time. Each path carries timestamps, so a cluster that lies on many such paths persists over time and is considered stable. They present a breadth-first algorithm for computing this.
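One way to picture stability across time is the simplified sketch below. It is an illustration under stated assumptions, not the algorithm from the paper. It assumes that clusters in consecutive time intervals are linked when their Jaccard overlap reaches a hypothetical threshold `min_affinity`, and it processes the intervals layer by layer (breadth-first over time) to record the longest chain of linked clusters ending at each cluster.

```python
def jaccard(a, b):
    """Jaccard overlap of two non-empty keyword sets."""
    return len(a & b) / len(a | b)


def longest_chains(clusters_by_time, min_affinity=0.5):
    """For each cluster, return the number of consecutive intervals spanned
    by the longest chain of overlapping clusters that ends at it."""
    chain_len = []                       # chain_len[t][i]: cluster i at interval t
    for t, clusters in enumerate(clusters_by_time):
        current = []
        for c in clusters:
            best = 1                     # a cluster alone spans one interval
            if t > 0:
                for j, prev in enumerate(clusters_by_time[t - 1]):
                    if jaccard(c, prev) >= min_affinity:
                        best = max(best, chain_len[t - 1][j] + 1)
            current.append(best)
        chain_len.append(current)
    return chain_len


# Hypothetical clusters for three consecutive intervals.
clusters_by_time = [
    [{"apple", "iphone", "mac"}, {"somalia", "ethiopia"}],
    [{"apple", "iphone", "jobs"}, {"oscars", "film"}],
    [{"apple", "iphone", "mac", "jobs"}],
]
print(longest_chains(clusters_by_time))  # prints [[1, 1], [2, 1], [3]]
```

Under these assumptions, the Apple cluster in the last interval ends a chain spanning all three intervals, so it would be reported as stable, while the other clusters appear in only one interval each.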
They show that their algorithms run faster than other graph approximation algorithms that could have been used. Qualitatively, they show some emergent clusters that seem to make sense (for example, Apple products and Somalian politics).
This paper provides fast algorithms for keyword analysis. Furthermore, these can be computed on-line, which is an important property for blog analysis.