From ScribbleWiki: Analysis of Social Media
Techniques Used on Image Spam
This paper aims to detect spam images by classification. The features are average color, color saturation, edges found by a Sobel detector, prevalent color coverage, and a random pixel test; the learning algorithms are logistic regression, naive Bayes, and decision trees.
The first finding is that many effective features are time-consuming to extract. The authors therefore propose a feature selection scheme that balances effectiveness and efficiency. More specifically, the score function for a feature is defined as a linear combination of its mutual information (which measures effectiveness) and the average time needed to extract it (which measures efficiency). The resulting algorithm retains high accuracy but is much faster than the original one.
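The scoring idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not give the weights or signs of the linear combination, so `alpha`, `beta`, the subtraction of the time term, and the helper names `feature_score` and `select_features` are all hypothetical, and features are assumed to be already discretized.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        # p(x, y) / (p(x) * p(y)) written with raw counts
        mi += p_joint * math.log2(count * n / (px[x] * py[y]))
    return mi


def feature_score(mi, avg_time, alpha=1.0, beta=1.0):
    # Assumed form: reward informativeness, penalize extraction time.
    return alpha * mi - beta * avg_time


def select_features(features, labels, k, alpha=1.0, beta=1.0):
    """features maps a name to (discretized values, average extraction time)."""
    scores = {
        name: feature_score(mutual_information(values, labels), t, alpha, beta)
        for name, (values, t) in features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


labels = [1, 1, 0, 0]
features = {
    "slow_but_informative": ([1, 1, 0, 0], 0.5),
    "fast_but_useless": ([1, 0, 1, 0], 0.01),
    "fast_and_informative": ([1, 1, 0, 0], 0.01),
}
top_two = select_features(features, labels, k=2)
```

With these toy numbers the fast, informative feature ranks first, and the slow one still beats the uninformative one; changing `beta` shifts how heavily extraction time is penalized.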
Another interesting point is that they suggest so-called just-in-time feature extraction. It is based on the observation that not all features are needed to classify a single image. For example, if we use a decision tree and a high-precision feature already identifies an image as spam, then we do not need to worry about the other features for THAT image. In this way, a further speed-up is achieved.
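A minimal sketch of just-in-time extraction with a decision tree follows. The `LazyFeatures` class, the tuple-based tree encoding, the feature names, and the thresholds are all hypothetical and chosen only for illustration; the toy "image" is a dictionary standing in for real pixel data.

```python
class LazyFeatures:
    """Computes a feature only when it is first requested, then caches it."""

    def __init__(self, image, extractors):
        self.image = image
        self.extractors = extractors
        self.cache = {}

    def __getitem__(self, name):
        if name not in self.cache:
            self.cache[name] = self.extractors[name](self.image)
        return self.cache[name]


def classify(node, feats):
    """Walk a tree whose internal nodes are (feature, threshold, low, high)
    tuples and whose leaves are label strings."""
    while not isinstance(node, str):
        name, threshold, low, high = node
        node = low if feats[name] <= threshold else high
    return node


# Toy extractors; real ones would analyze pixel data.
extractors = {
    "avg_saturation": lambda img: img["saturation"],
    "edge_density": lambda img: img["edges"],  # stands in for a costly feature
}

tree = ("avg_saturation", 0.8,
        ("edge_density", 0.3, "ham", "spam"),
        "spam")

feats = LazyFeatures({"saturation": 0.9, "edges": 0.1}, extractors)
label = classify(tree, feats)
computed = sorted(feats.cache)
```

Here the first test already sends the image to a spam leaf, so `edge_density` is never extracted; only the features on the path actually taken through the tree are paid for.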
As for the experimental part, one point is worth noting. Since there is no formal definition of what a spam image is (some legitimate emails may also contain images for advertising purposes), the authors suggest treating all images from spam emails as spam images, and all images from ham emails as normal images.
The paper describes an interesting problem: given a few detected spam images, how can their near-duplicates be found?
The authors observe that, by the nature of image spam, the spam-generating machine usually creates a template image and then applies randomization techniques to produce many similar images, which are sent to end users. Because of the randomization, each spam image is unique, and traditional OCR-based methods do not work.
The authors suggest using a set of filters. Each filter uses one type of feature (such as a color histogram, Haar wavelets, or an orientation histogram). A filter measures the minimum distance between a detected spam image and all the normal images, and takes this minimum distance as the threshold r. If a test image then falls inside the sphere centered at the known spam image with radius r, it is flagged as a spam image. The authors also suggest several ways to combine the filters, including AND, OR, and VOTE. The authors test the algorithms on a benchmark dataset and report very promising results, including a 0.001% false positive rate.
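The filter and its combinations can be sketched as follows. Several details are assumptions rather than anything stated in the summary: Euclidean distance as the metric, a strict inequality at the sphere boundary (so that the nearest normal image itself is not flagged), VOTE interpreted as a strict majority, and the names `DistanceFilter` and `combine`. The toy images are dictionaries holding precomputed feature vectors.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class DistanceFilter:
    """One filter per feature type, centered on a known spam image."""

    def __init__(self, extract, spam_image, normal_images):
        self.extract = extract
        self.center = extract(spam_image)
        # Threshold r: distance from the spam image to the nearest normal image.
        self.radius = min(
            euclidean(self.center, extract(img)) for img in normal_images
        )

    def is_spam(self, image):
        # Assumed strict inequality: inside the sphere, not on its surface.
        return euclidean(self.center, self.extract(image)) < self.radius


def combine(filters, image, mode="VOTE"):
    votes = [f.is_spam(image) for f in filters]
    if mode == "AND":
        return all(votes)
    if mode == "OR":
        return any(votes)
    if mode == "VOTE":
        return 2 * sum(votes) > len(votes)  # assumed: strict majority
    raise ValueError(f"unknown mode: {mode}")


# Toy data: each "image" carries two precomputed feature vectors.
spam = {"color": [0.0, 0.0], "edge": [0.0, 0.0]}
normals = [{"color": [3.0, 4.0], "edge": [1.0, 0.0]}]
filters = [
    DistanceFilter(lambda img: img["color"], spam, normals),  # r = 5
    DistanceFilter(lambda img: img["edge"], spam, normals),   # r = 1
]

test_image = {"color": [1.0, 1.0], "edge": [2.0, 0.0]}
results = {m: combine(filters, test_image, m) for m in ("AND", "OR", "VOTE")}
```

In this toy case the color filter flags the test image and the edge filter does not, so OR flags it while AND and VOTE do not. The example shows how the choice of aggregation rule trades off false positives against missed spam.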
Other interesting points include (1) different ways to create a template spam image; (2) different randomization techniques; and (3) an online benchmark image spam dataset.