Kleinberg, KDD 2002
From ScribbleWiki: Analysis of Social Media
Bursty and Hierarchical Structure in Streams
Summarized by - Yichia
In document streams such as email and news articles, frequency of a topic may increase sharply and span for a period of time; such phenomenon is called burst of activity. This paper develops an approach for modeling such “bursts” in document streams.
Under the assumption that document streams consist of bursts of low intensities interleaving with bursts of high intensities, the author of this paper proposed an infinite-state automaton to model bursty streams. In the automaton, states correspond to different bursts of intensities, whereas state transitions correspond to burst of activity. Thus, transition to a high intensity state is associated with the generation of bursty events. Each transition is assigned a cost which controls the sensitivity of bursty detection of the model. Then, given a document stream, our goal is to find a low cost state sequence; this can be accomplished by dynamic programming. Finally, under this framework the identified bursts are naturally organized into a nested and hierarchical structure which provides a kind of visualization on the overall stream.
Experiments show that this “automata” approach for capturing burst of activity can make sense for many datasets, including emails, titles of conference papers, web logs, as well as U.S. Presidential State of the Union Addresses. In general, the results suggest that the framework can be applied to identify certain messages as key events in streams. For example, the analysis of email archive exhibits that bursts of high intensities in streams are highly correlated with significant events such as deadlines and scheduled events.
Although the author argued the bursty words recognized by the framework are very different from the words with high frequency, I have a feeling that the set of bursty words will be really similar to the set of words with high TFIDF. Nonetheless, this paper does introduce a well-defined model for structuring and analyzing communicating streams between people. Indeed, this is an authoritative paper in pioneering an approach to model bursty structure in document streams. Many papers extend Kleinberg’s framework and idea of bursts to other kinds of document streams. Some of these are listed below:
- Kumar et al, WWW 2003
- Gruhl et al, KDD 2005
- Dubinko et al, WWW 2006
- Leskovec, Kleinberg, Faloutsos, KDD 2005
(I will work on this paper as one of my contributions - Yichia) - done