Kolari, AAAI 2006
From ScribbleWiki: Analysis of Social Media
SVMs for the Blogosphere: Blog Identification and Splog Detection
The paper addresses two tasks for the blogosphere: (1) blog identification, which classifies a web page as 'blog' vs. 'non-blog'; and (2) spam blog (splog) detection, which classifies a blog as 'normal' vs. 'spam'. Most blogs can be recognized by simple URL matching, since the majority are hosted by well-known hosting services. For task (1), the authors therefore focus on self-hosted blogs and blogs on much less popular hosting services. For task (2), they focus on splogs (as opposed to comment spam and trackback spam).
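As an illustration of the URL-matching baseline mentioned above, here is a minimal sketch. The host list and the function name is_hosted_blog are assumptions made for this example; the paper's actual list of hosting services is not given on this page.

```python
from urllib.parse import urlparse

# Hypothetical list of well-known blog hosting domains (illustration only).
KNOWN_BLOG_HOSTS = ("blogspot.com", "livejournal.com", "typepad.com")

def is_hosted_blog(url: str) -> bool:
    """Return True if the URL's host is, or is a subdomain of, a known blog host."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in KNOWN_BLOG_HOSTS)
```

A self-hosted blog such as http://www.example.org/blog/ is not caught by this check, which is why such pages need a learned classifier.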
The authors treat both tasks as classification problems. Besides bag-of-words features, they suggest bag-of-anchors, bag-of-urls, and bag-of-ngrams. Each feature is represented in two ways: as normalized term frequency (TF) or as a binary value (presence or absence). The raw features are filtered by feature selection based on a mutual information criterion and then fed into an SVM, for which the authors suggest a linear kernel.
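The two feature representations and the mutual information filter can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function names, the use of token sets per page, and the alphabetical tie-breaking are assumptions. Tokens may be words, anchor text, URLs, or n-grams. The final step, training a linear-kernel SVM on the selected features, is not shown.

```python
import math
from collections import Counter

def binary_features(tokens):
    """Binary representation: the set of tokens present on the page."""
    return set(tokens)

def normalized_tf(tokens):
    """Normalized TF: each token's count divided by the page's total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def mutual_information(docs, labels, term):
    """Mutual information (in bits) between a term's presence and the class label.

    docs is a list of token sets, labels is a parallel list of class labels.
    """
    n = len(docs)
    joint = Counter((term in d, y) for d, y in zip(docs, labels))
    px = Counter(term in d for d in docs)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def select_features(docs, labels, k):
    """Keep the k terms with the highest mutual information (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-mutual_information(docs, labels, t), t))
    return ranked[:k]
```

For example, with four toy pages where "comments" appears only on blogs and "products" only on non-blogs, both terms score 1 bit, while a term such as "post" that appears equally in both classes scores 0 and is filtered out.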
The authors tested the algorithms on datasets from MeMeta and reported very encouraging results. The misclassified samples include (1) blogs in languages other than English, (2) blogs that do not allow direct comments, and (3) blogs without posts.