Language and Gender Author Cohort Analysis of E-mail for Computer Forensics

Authors: Olivier de Vel, Malcolm Corney, Alison Anderson, and George Mohay

The authors of this paper focus on determining two characteristics of an author: gender and language use.

They state that women's language makes more frequent use of emotionally intensive adverbs and adjectives, and that their language is more punctuated with attenuated assertions, apologies, questions, personal orientation and support. Conversely, while women tend to react to the contributions of others in this manner, men take a more proactive stance, directing their speech at solving problems. They tend to make strong assertions, rhetorical questions, and challenges. The authors sum up their assumptions by saying that men's on-line conversation resembles "report talk", while women favour "rapport talk".

The main reason the authors chose an SVM classifier was to avoid having to reduce the number of features in order to prevent over-fitting. Since they are working in a feature space with a great number of dimensions, this is very helpful.
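To make the point about high dimensionality concrete, here is a minimal sketch of a linear SVM trained on toy data with more features than samples. It is not the authors' setup: this review does not say which SVM implementation or kernel they used. The training routine is a Pegasos-style stochastic sub-gradient method chosen for illustration, and the toy data, the function names and the parameters (lam, epochs, seed) are all assumptions. The L2 penalty on the weights is what keeps the many uninformative dimensions from dominating.

```python
import random


def make_toy_data(n=20, dim=50, seed=1):
    """Build n samples with dim features: one informative feature plus small noise.

    Purely synthetic data for illustration; it is unrelated to the e-mail corpus.
    """
    rng = random.Random(seed)
    samples, labels = [], []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1
        # Feature 0 carries the class signal; the rest are low-amplitude noise.
        x = [float(y)] + [rng.uniform(-0.1, 0.1) for _ in range(dim - 1)]
        samples.append(x)
        labels.append(y)
    return samples, labels


def train_linear_svm(samples, labels, lam=0.01, epochs=200, seed=0):
    """Train a linear SVM (no bias term) on the L2-regularised hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Shrink all weights (gradient of the L2 penalty).
            w = [(1.0 - eta * lam) * wj for wj in w]
            # If the sample violates the margin, step towards it.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


X, y = make_toy_data()
w = train_linear_svm(X, y)
accuracy = sum(predict(w, xi) == yi for xi, yi in zip(X, y)) / len(X)
print("features:", len(w), "samples:", len(X), "training accuracy:", accuracy)
```

No feature selection is performed here; all 50 dimensions go into training, which mirrors the convenience the authors cite.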

The e-mail corpus for this study came from an academic institution with over 15,000 users. The authors narrowed their source down to 342 authors, and confirmed the gender and language background of each author. A cohort of about 5000 messages was then generated for gender characterization, and another for language usage characterization. They evaluated the text based on 221 different stylistic and structural attributes. There is not much discussion as to why they selected this feature set.
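As a rough illustration of what stylistic and structural attributes can look like in practice, the sketch below computes a handful of measurements from a single message. The attributes, the FUNCTION_WORDS list and the function name are hypothetical and chosen only for the example; they are not the 221 attributes used in the paper, which this review does not list.

```python
import re

# Hypothetical, tiny function-word list used only for this example.
FUNCTION_WORDS = ("the", "and", "of", "to", "i", "you")


def stylistic_features(text):
    """Return a dict of simple stylistic and structural measurements for one message."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    n_words = len(words)
    n_chars = len(text)
    features = {
        # Stylistic: vocabulary and punctuation habits.
        "word_count": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "question_marks_per_char": text.count("?") / n_chars if n_chars else 0.0,
        # Structural: layout of the message body.
        "blank_line_count": sum(1 for line in text.splitlines() if not line.strip()),
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words if n_words else 0.0
    return features


sample = "Hi Sam,\n\nAre you free to meet? I think the plan is good.\n"
for name, value in stylistic_features(sample).items():
    print(name, value)
```

Each message becomes a fixed-length numeric vector in this way, which is the form a classifier such as an SVM expects as input.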

The conclusions the authors draw are expected, and a bit shallow. They find that as the minimum word count and the number of e-mails increase, so does their performance in correctly identifying the background of the author. They also found that using the full set of features gives the best performance, and that removing feature sets has a significant negative impact. The authors plan to continue their research by extending their process to cover more cohort characteristics.

