Teng et al, ICMLC 2004
E-Mail Authorship Mining Based On SVM For Computer Forensic
Authors: Gui-Fa Teng, Mao-Sheng Lai, Jian-Bin Ma, Ying Li
The ability to identify the original author of a misused e-mail can help to prosecute an offender, and the authors of this paper focus on this particular application of authorship attribution. Various e-mail features (e.g., linguistic features, header features, and structural characteristics) are used as input to an SVM, with a collocation-based kernel, to attribute e-mail messages to an author.
The authors adopted the Vector Space Model (VSM) to store document information, representing each document as a vector of term-weight pairs. Each weight is calculated in the standard fashion, as term frequency times inverse document frequency (TF-IDF). Little is said about their feature selection process, except that they adopted chi-squared as the feature selection criterion.
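To make the weighting and selection steps concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not say which TF-IDF variant or chi-squared formulation was used, so the sketch assumes length-normalized term frequency, an unsmoothed logarithmic IDF, and the usual 2x2 contingency-table chi-squared. The function names `tfidf_vectors` and `chi_squared` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def chi_squared(docs, labels, term, target):
    """Chi-squared score of `term` for author `target` (2x2 table)."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        in_class = label == target
        if present and in_class:
            a += 1  # term present, target author
        elif present:
            b += 1  # term present, other author
        elif in_class:
            c += 1  # term absent, target author
        else:
            d += 1  # term absent, other author
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# Toy corpus: each e-mail is a list of tokens, labelled by author.
docs = [
    ["hello", "regards", "bob"],
    ["hello", "cheers", "alice"],
    ["hello", "regards", "bob"],
    ["hello", "cheers", "alice"],
]
labels = ["bob", "alice", "bob", "alice"]
weights = tfidf_vectors(docs)
score = chi_squared(docs, labels, "regards", "bob")
```

In this toy corpus a term that occurs in every message ("hello") gets a weight of zero and a chi-squared score of zero, while a term used by only one author ("regards") scores highly, which is the behavior a selection criterion of this kind is meant to reward.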
Particular Attributes Used
- The From message
- The To message
- Whether or not the message has a title
- Whether or not the message has attachments
- Whether or not the message has a reply
- Uses a greeting acknowledgement
- Uses a farewell acknowledgement
- Contains signature text
- Mean sentence length
- Mean paragraph length
- Number of blank lines / total number of lines
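Several of the structural attributes listed above can be computed directly from the message body. The following Python sketch illustrates one way to do so; it is an assumption-laden example, not the authors' extractor. The cue-word sets, the choice to measure sentence and paragraph length in words, and the names `structural_features`, `first_word`, and `mean_words` are all hypothetical.

```python
import re

# Hypothetical cue words; the paper does not list the ones it used.
GREETINGS = {"hi", "hello", "dear"}
FAREWELLS = {"regards", "cheers", "thanks"}

def first_word(line):
    """Lower-cased first token of a line, stripped of punctuation."""
    tokens = line.split()
    return tokens[0].strip(",.!:;").lower() if tokens else ""

def mean_words(chunks):
    """Average number of whitespace-separated words per chunk."""
    return sum(len(c.split()) for c in chunks) / len(chunks) if chunks else 0.0

def structural_features(body):
    """Compute a few structural attributes of an e-mail body."""
    lines = body.split("\n")
    non_blank = [ln for ln in lines if ln.strip()]
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", body) if s.strip()]
    return {
        "has_greeting": bool(non_blank) and first_word(non_blank[0]) in GREETINGS,
        # Look for a farewell cue in the last three non-blank lines.
        "has_farewell": any(first_word(ln) in FAREWELLS for ln in non_blank[-3:]),
        "mean_sentence_len": mean_words(sentences),
        "mean_paragraph_len": mean_words(paragraphs),
        "blank_line_ratio": (len(lines) - len(non_blank)) / len(lines),
    }

email_body = "Hi Bob,\n\nThe report is done. Please review it.\n\nRegards,\nAlice"
features = structural_features(email_body)
```

The resulting dictionary could be appended to the term-weight vector so that the classifier sees both lexical and structural evidence; how the paper combines the two is not described in this summary.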
The paper gives a brief overview of SVMs. The authors used LibSVM for their evaluation, in a 'one against all' binary classification scheme. They do not publish results, saying only that their preliminary work is promising. Proposed further research on the topic includes combining SVM with other ML algorithms, additional feature extraction, and authorship characterization.
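The 'one against all' scheme trains one binary classifier per candidate author (that author versus everyone else) and attributes a message to the author whose classifier gives the highest score. The Python sketch below illustrates the idea only. It does not use LibSVM: it swaps in a simple linear SVM trained by subgradient descent on the hinge loss, and the learning rate, regularization strength, epoch count, feature vectors, and author names are all assumptions made for the example.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Binary linear SVM (labels +1/-1) via hinge-loss subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            margin = target * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                # Point violates the margin: step along the hinge subgradient.
                w = [wi + lr * target * xi for wi, xi in zip(w, x)]
                b += lr * target
    return w, b

def train_one_vs_rest(X, labels):
    """Train one 'this author vs. all others' model per author."""
    models = {}
    for author in sorted(set(labels)):
        y = [1 if label == author else -1 for label in labels]
        models[author] = train_linear_svm(X, y)
    return models

def predict(models, x):
    """Attribute x to the author whose model gives the highest score."""
    def score(author):
        w, b = models[author]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

# Toy feature vectors (hypothetical), two messages per author.
X = [
    [1.0, 0.0, 0.0], [1.0, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.1, 1.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 1.0],
]
labels = ["alice", "alice", "bob", "bob", "carol", "carol"]
models = train_one_vs_rest(X, labels)
guess = predict(models, [0.9, 0.1, 0.0])
```

A production setup would more likely call LibSVM itself, which also supports non-linear kernels; the wrapper logic (one model per author, highest score wins) stays the same.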