Boosting Methods in Spam Detection

From ScribbleWiki: Analysis of Social Media

Asymmetric Gradient Boosting with Application to Spam Filtering

In this paper, the authors propose a new asymmetric boosting method, Boosting with Different Costs (BDC). Traditional boosting methods assume the same cost for misclassified instances of every class and aim for good overall accuracy. BDC is more general: it is designed for problems where the main concern is a low false positive (or false negative) rate, such as spam filtering.

BDC is based on the MarginBoost framework, a special case of a more general class of boosting algorithms that perform gradient descent in function space. Many important boosting algorithms, such as AdaBoost and LogitBoost, can be reformulated in the MarginBoost framework. Within this framework, BDC uses two cost functions, one for ham and one for spam. When the margin is positive, which corresponds to a correct classification, both functions output a small cost. When the margin is negative, the cost for ham keeps increasing as the margin decreases, whereas the cost for spam levels off and remains almost unchanged at very negative margins. This design has two advantages: 1) Because BDC assigns small weights to extremely misclassified spam, it is able to discard some noisy spam. 2) The weight of misclassified ham is always high, which ensures that the combined classifier has a low false positive rate.
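The asymmetry described above can be illustrated with a small sketch. The summary does not give the exact cost functions used in the paper, so the two functions below are assumptions chosen only to reproduce the described behavior: a logistic cost for ham, which keeps growing at negative margins, and a sigmoid cost for spam, which saturates. In a MarginBoost-style method, the weight of an example is taken here to be the negative derivative of its cost with respect to the margin. All function names are hypothetical.

```python
import math

# Illustrative cost functions (assumptions, not the paper's exact definitions).

def ham_cost(margin):
    # Logistic cost: small for positive margins, keeps growing as the
    # margin becomes more negative.
    return math.log1p(math.exp(-margin))

def spam_cost(margin):
    # Sigmoid cost: small for positive margins, levels off near 1 for
    # very negative margins.
    return 1.0 / (1.0 + math.exp(margin))

def ham_weight(margin):
    # Negative derivative of ham_cost: approaches 1 for badly
    # misclassified ham, so such examples stay heavily weighted.
    return 1.0 / (1.0 + math.exp(margin))

def spam_weight(margin):
    # Negative derivative of spam_cost: approaches 0 for badly
    # misclassified spam, so likely-noisy spam is effectively ignored.
    s = 1.0 / (1.0 + math.exp(margin))
    return s * (1.0 - s)

if __name__ == "__main__":
    for m in (-10.0, -5.0, -1.0, 0.0, 1.0, 5.0):
        print(f"margin={m:6.1f}  ham_w={ham_weight(m):.4f}  "
              f"spam_w={spam_weight(m):.4f}")
```

With these choices, a ham message at a very negative margin keeps a weight close to 1, while a spam message at the same margin gets a weight close to 0, which matches the two advantages listed above.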

BDC has two parameters whose values must be set. An empirical study on a synthetic data set shows that the performance of BDC is not very sensitive to the specific parameter values as long as they are within a reasonable range. To demonstrate the performance of BDC, the authors used the Hotmail Feedback Loop data to compare BDC with state-of-the-art spam filtering techniques. BDC with decision stumps as the base classifiers performs the best of all the methods reported in the paper. The improvement of BDC over LogitBoost, with or without stratification, is more pronounced with complicated decision trees than with simple ones, which suggests that BDC is more resistant to overfitting than LogitBoost-based methods.

[Jingrui's Slides]
