Mishne et al., WWW 2005


Blocking blog spam with language model disagreement

Authors: Gilad Mishne, David Carmel, and Ronny Lempel

Paper: Blocking Blog Spam with Language Model Disagreement (pdf)

The authors propose a simple method for determining whether a blog comment is spam. Because comment spam isn't intended to fool humans, it generally violates the language of its context. They therefore compare the language models of the comment, and of the page it links to, against the language model of the blog post itself; if the divergence is large enough, the comment is classified as spam. Considering that the method requires no training, no rule updates, and no knowledge of the link network, it performs reasonably well (83% accuracy, against a 68% baseline).

To calculate the distance between two language models, they use the Kullback-Leibler divergence:

$$KL(\Theta_1 \parallel \Theta_2) = \sum_{w} p(w \mid \Theta_1) \log \frac{p(w \mid \Theta_1)}{p(w \mid \Theta_2)}$$
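
As a concrete illustration (not the authors' code), a minimal Python sketch of this computation over unigram models might look like the following; representing a model as a dict from word to probability is an assumption of the sketch:

    import math

    def kl_divergence(p, q):
        # KL(p || q) between two unigram language models, each a dict
        # mapping word -> probability. Assumes q is smoothed, so it
        # assigns non-zero probability to every word that p does.
        return sum(p_w * math.log(p_w / q[w])
                   for w, p_w in p.items() if p_w > 0)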

This uses a maximum-likelihood model smoothed by interpolation with a background distribution of word frequencies found on the Internet (source: Berkeley/Stanford):

$$p(w \mid \Theta) = \lambda \cdot p_{ML}(w \mid \Theta) + (1 - \lambda) \cdot p_{ML}(w \mid GE)$$

where $p_{ML}$ is the maximum-likelihood estimate and $GE$ is the general-English background model.
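
A hedged sketch of that interpolation, assuming "background" is a dict giving the general-English probability of every vocabulary word, and "lam" is a hypothetical mixing weight (the summary doesn't state the paper's value):

    from collections import Counter

    def smoothed_model(tokens, background, lam=0.9):
        # Maximum-likelihood unigram model over the tokens, interpolated
        # with the general-English background distribution. lam is a
        # hypothetical mixing weight; the vocabulary is the background's.
        counts = Counter(tokens)
        total = len(tokens)
        ml = {w: c / total for w, c in counts.items()}
        return {w: lam * ml.get(w, 0.0) + (1 - lam) * p_bg
                for w, p_bg in background.items()}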

They assume KL-divergence scores are drawn from one of two distributions: one for spam and one for legitimate text. A threshold placed between the two distributions serves as the classification cutoff: moving it left (lower) reduces false negatives (unidentified spam), while moving it right (higher) reduces false positives (legitimate comments flagged as spam).
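
Putting the pieces together, a minimal sketch of the resulting classifier, built on the hypothetical helpers from the sketches above (the paper also scores the page a comment links to, which the same function could handle given that page's text):

    def is_spam(comment_tokens, post_tokens, background, threshold):
        # Flag the comment as spam if its smoothed language model diverges
        # from the blog post's model by more than the threshold.
        p_comment = smoothed_model(comment_tokens, background)
        p_post = smoothed_model(post_tokens, background)
        return kl_divergence(p_comment, p_post) > threshold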

Pros and cons of method

Pros:

  • No training
  • No hard-coded rule sets that need updating
  • Doesn't require full web connectivity (unlike network analysis techniques)
  • Can be deployed retrospectively
  • Hard for a spammer to choose comment language similar to both the blog and the spam site

Cons:

  • Spammers can just copy blog text (although this is detectable by search engines when they do it on multiple sites)
  • Doesn't work well on short posts unless the post's language model is expanded with text from the pages it links to (but this introduces model drift)

Experiment

The authors hand-labeled a corpus of 1,024 comments on 50 random blog posts, 68% of which were spam and 32% clean. They varied the threshold by multiplying it by a factor between 0.75 and 1.25; the best performance, 83% accuracy, came at a multiplier of 1.1. The method performed worst on short blog posts. Expanding a post's language model to include the pages it links to helped classify comments on those posts, but hurt overall performance by 2-5%.
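
A rough sketch of that threshold sweep, assuming a list of (comment_tokens, label) pairs for a post and the hypothetical helpers above; the names and evaluation loop are illustrative, not the authors' code:

    def sweep_thresholds(labeled_comments, post_tokens, background, base_threshold):
        # Accuracy at each threshold multiplier 0.75, 0.80, ..., 1.25,
        # mirroring the sweep described in the experiment.
        results = {}
        for i in range(11):
            mult = 0.75 + 0.05 * i
            threshold = base_threshold * mult
            correct = sum(
                is_spam(tokens, post_tokens, background, threshold) == label
                for tokens, label in labeled_comments
            )
            results[round(mult, 2)] = correct / len(labeled_comments)
        return results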
