Sanderson et al, EMNLP 2006
Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation
Authors: Conrad Sanderson and Simon Guenter
This paper evaluates the usefulness of sequence kernel based approaches for the task of authorship attribution, compares their performance with two probabilistic approaches based on Markov chains of characters and words, and discusses the application of each approach to short text analysis.
They use Markov chains to estimate the likelihood that a text was written by a given author A and by a generic author G. Comparing the two gives a measure of confidence in attributing the text to A, as well as a measure of how far the text deviates from the writing style of all authors. One issue with their Markov approaches is that they may be heavily biased towards a particular author when working with a small corpus: only a small set of chains is observed, so a previously unseen chain may have too much influence when it is encountered.
The authors' discussion of SVMs is brief. They state that binary SVMs are traditional, but that sequence-based kernels are emerging, and they propose one kernel belonging to this family. Its weights depend only on the length of each sequence and are given by one of four functions: specific length, bounded range, bounded linear decay, and bounded linear growth. To allow comparison of texts of different lengths, they also propose a normalized version of the kernel.
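A length-weighted sequence kernel of this kind can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the sequences are taken to be contiguous character n-grams, the exact formulas for the four weight functions are guesses based on their names, normalization is assumed to be the usual cosine-style division by the self-kernels, and all function and parameter names (such as `tau` for the length bound) are hypothetical.

```python
from collections import Counter
from math import sqrt

SCHEMES = ("specific_length", "bounded_range",
           "bounded_linear_decay", "bounded_linear_growth")


def ngram_counts(text, n):
    """Count every contiguous character sequence of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def weight(n, tau, scheme):
    """Weight for sequences of length n (assumed forms, see lead-in)."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown weighting scheme: {scheme}")
    if scheme == "specific_length":
        return 1.0 if n == tau else 0.0
    if n > tau:
        return 0.0  # every bounded scheme ignores sequences longer than tau
    if scheme == "bounded_range":
        return 1.0
    if scheme == "bounded_linear_decay":
        return 1.0 + (1 - n) / tau
    return n / tau  # bounded_linear_growth


def kernel(x, y, tau, scheme):
    """Sum of weight * count_x(s) * count_y(s) over shared sequences s."""
    total = 0.0
    for n in range(1, tau + 1):
        w = weight(n, tau, scheme)
        if w == 0.0:
            continue
        cx, cy = ngram_counts(x, n), ngram_counts(y, n)
        total += w * sum(c * cy[s] for s, c in cx.items() if s in cy)
    return total


def normalized_kernel(x, y, tau, scheme):
    """Length-independent similarity in [0, 1] (assumed cosine-style form)."""
    kxx, kyy = kernel(x, x, tau, scheme), kernel(y, y, tau, scheme)
    if kxx == 0.0 or kyy == 0.0:
        return 0.0
    return kernel(x, y, tau, scheme) / sqrt(kxx * kyy)
```

The normalized form returns 1 for identical texts regardless of their length, which is what makes texts of different lengths comparable; a kernel matrix built this way could then be passed to any SVM implementation that accepts precomputed kernels.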
They conclude that character-based sequences are more valuable than word-based sequences, as the latter can have much higher dimensionality and be far sparser.
The authors conduct a large amount of evaluation, showing that Markov character chains generally outperform the other methods discussed, with character-based sequence kernels coming in at a close second. In their studies, Markov character chains were particularly useful for small numbers of characters, making them valuable for short text authorship attribution.
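The Markov character chain scoring described earlier, comparing the likelihood under author A with the likelihood under the generic author G, can be sketched as below. This is a minimal illustration and not the paper's exact formulation: the class name, the additive smoothing (`alpha`), the fixed `alphabet_size`, and the per-character normalization of the score are all assumptions made here. The smoothing is included because it is one simple way to stop a previously unseen chain from dominating the score.

```python
from collections import Counter
from math import log


class CharMarkovModel:
    """Character-level Markov chain with additive smoothing (assumed)."""

    def __init__(self, order=1, alphabet_size=256, alpha=1.0):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size of the symbol set
        self.alpha = alpha                  # smoothing strength (assumption)
        self.context_counts = Counter()
        self.chain_counts = Counter()

    def train(self, text):
        """Accumulate counts of each context and each (context, next char)."""
        for i in range(self.order, len(text)):
            context = text[i - self.order:i]
            self.context_counts[context] += 1
            self.chain_counts[(context, text[i])] += 1

    def log_likelihood(self, text):
        """Log-probability of text, conditioning on the previous characters."""
        total = 0.0
        for i in range(self.order, len(text)):
            context = text[i - self.order:i]
            num = self.chain_counts[(context, text[i])] + self.alpha
            den = self.context_counts[context] + self.alpha * self.alphabet_size
            total += log(num / den)
        return total


def attribution_score(text, author_model, generic_model):
    """Per-character log-likelihood ratio: positive values favour author A."""
    n = max(len(text) - author_model.order, 1)
    ratio = author_model.log_likelihood(text) - generic_model.log_likelihood(text)
    return ratio / n
```

In use, `author_model` would be trained only on texts by A and `generic_model` on texts from all authors; a score above some threshold then supports attributing the text to A.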