Glance et al., WWW 2004
The paper describes the application of data mining, information extraction, and NLP algorithms to discover trends across a subset of approximately 100,000 weblogs. The resulting analysis, covering key persons, key phrases, and key paragraphs, is published daily on BlogPulse.com.
This is one of the first papers to study the blogosphere and trends within it. It describes the characteristics of blogs and how they vary in content, features, format, etc. It also describes in some detail the crawling, differencing, and indexing process used to generate the corpus of 100,000 blogs.
The pipeline for finding key phrases is as follows: find key bigrams based on a combined 'informativeness' and 'phraseness' score; filter the phrase list to the top N; use these top N phrases as input to a seeded phrase finder to get more candidates; filter the candidates by constituent type (e.g. noun phrase); and rerank the candidates by 'burstiness'.
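The first, second, and last steps of the pipeline above can be sketched in Python. This is only an illustration under stated assumptions: the summary does not give the scoring formulas, so the pointwise KL-style terms for 'phraseness' and 'informativeness' and the smoothed ratio used for 'burstiness' are stand-ins, and the function names (score_bigrams, burstiness, key_phrases) are hypothetical. The seeded phrase finder and the constituent-type filter are omitted.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs from a list of tokens."""
    return list(zip(tokens, tokens[1:]))


def score_bigrams(fg_docs, bg_docs, smoothing=1.0):
    """Score foreground bigrams by phraseness + informativeness.

    Assumption: both scores are pointwise KL-style terms; the paper's
    exact formulas are not given in this summary.
    """
    fg_uni, fg_bi, bg_bi = Counter(), Counter(), Counter()
    for doc in fg_docs:
        fg_uni.update(doc)
        fg_bi.update(bigrams(doc))
    for doc in bg_docs:
        bg_bi.update(bigrams(doc))

    n_uni = sum(fg_uni.values())
    n_fg = sum(fg_bi.values())
    n_bg = sum(bg_bi.values())
    vocab = len(set(fg_bi) | set(bg_bi))

    scores = {}
    for pair, count in fg_bi.items():
        x, y = pair
        p_fg = count / n_fg
        # Probability of the pair if its two words were independent.
        p_indep = (fg_uni[x] / n_uni) * (fg_uni[y] / n_uni)
        # Additively smoothed probability of the pair in the background.
        p_bg = (bg_bi[pair] + smoothing) / (n_bg + smoothing * vocab)
        phraseness = p_fg * math.log(p_fg / p_indep)
        informativeness = p_fg * math.log(p_fg / p_bg)
        scores[pair] = phraseness + informativeness
    return scores


def burstiness(today_count, history_counts, smoothing=1.0):
    """Assumed burstiness: today's count relative to the historical mean."""
    mean = sum(history_counts) / len(history_counts) if history_counts else 0.0
    return (today_count + smoothing) / (mean + smoothing)


def key_phrases(fg_docs, bg_docs, history, top_n=10):
    """Keep the top N scored bigrams, then rerank them by burstiness."""
    scores = score_bigrams(fg_docs, bg_docs)
    top = sorted(scores, key=scores.get, reverse=True)[:top_n]
    today = Counter(pair for doc in fg_docs for pair in bigrams(doc))
    return sorted(
        top,
        key=lambda pair: burstiness(today[pair], history.get(pair, [])),
        reverse=True,
    )
```

Here fg_docs would be the tokenized posts for the current day, bg_docs a tokenized background sample, and history a mapping from each bigram to its counts on earlier days.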
Key paragraphs are extracted by clustering the key phrases and then selecting, for each cluster, the paragraph that uses the largest number of the cluster's phrases. These key paragraphs, called 'BlogBites', provide context for the clusters of key phrases.
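The paragraph selection step can be sketched as follows. This is a minimal illustration, assuming the phrase cluster is already given (the clustering method is not described in this summary) and that a phrase counts as 'used' when it appears as a case-insensitive substring of the paragraph. The function name select_key_paragraph is hypothetical.

```python
def select_key_paragraph(paragraphs, cluster_phrases):
    """Return the paragraph that contains the most phrases from the cluster.

    Returns None when there are no paragraphs or no paragraph contains
    any phrase. Ties go to the earliest paragraph.
    """
    def coverage(paragraph):
        text = paragraph.lower()
        return sum(1 for phrase in cluster_phrases if phrase.lower() in text)

    best = max(paragraphs, key=coverage, default=None)
    if best is None or coverage(best) == 0:
        return None
    return best
```

For example, given the cluster ["blog pulse", "trend search"], a paragraph that mentions both phrases would be chosen over one that mentions only "blog pulse".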
Key persons are extracted by following the same pipeline, with the candidates in the key phrase selection process restricted to person names.
The authors also mention blog search and search-query trending features, which are available as a direct by-product of their indexing task. These tools can be used for a limited form of market intelligence, since they give an indication of the 'buzz' around a topic or product in the blogosphere.
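As a rough illustration of how a 'buzz' trend for a query could be derived from an indexed corpus, the sketch below computes the share of each day's posts that mention the query. This is an assumption about how such a trend might be measured, not the authors' implementation. The function name trend and the (date, text) post representation are hypothetical.

```python
from collections import defaultdict
from datetime import date


def trend(posts, query):
    """Return {day: fraction of that day's posts mentioning the query}.

    posts is an iterable of (date, text) pairs; matching is a simple
    case-insensitive substring test.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    needle = query.lower()
    for day, text in posts:
        total[day] += 1
        if needle in text.lower():
            hits[day] += 1
    return {day: hits[day] / total[day] for day in sorted(total)}
```

Plotting the returned values by day gives a simple trend line for the query. A real system would query an inverted index rather than scan every post.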