TAŞ et al, 2007

From ScribbleWiki: Analysis of Social Media


Author Identification for Turkish Texts

Authors: Tufan TAŞ, Abdul Kadir GÖRÜR

Paper: [1]



This paper presents a fully automated approach to author identification in unrestricted text using stylometry. The paper's authors apply 35 style markers to each writer in a group of 20 different writers.
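To make the idea of a style marker concrete, here is a minimal sketch in Python. The four markers below (average word length, average sentence length, type-token ratio, comma rate) and the function name `style_markers` are assumptions chosen for illustration; this summary does not list the paper's 35 markers, and these may or may not be among them.

```python
import re

def style_markers(text):
    """Return a few illustrative style markers for one text.

    The markers here are hypothetical examples, not the paper's list.
    """
    # Split into sentences on runs of ., ! or ? and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # In Python 3, \w matches Unicode letters, so precomposed Turkish
    # characters such as ş, ğ and ı count as word characters.
    words = re.findall(r"\w+", text)
    n_words = len(words)
    if n_words == 0:
        return {"avg_word_length": 0.0, "avg_sentence_length": 0.0,
                "type_token_ratio": 0.0, "comma_rate": 0.0}
    return {
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Mean number of words per sentence.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Distinct words divided by total words. Note that str.lower() is
        # not locale-aware, so Turkish dotted/dotless I is not handled.
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # Commas per word, a simple punctuation-usage marker.
        "comma_rate": text.count(",") / n_words,
    }
```

Each text would then be represented by a vector of such values, which is what a classifier compares across writers.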


The author-based corpus for their experiments contained two sets: a training set and a test set. The training set consisted of 20 different texts for each author, while the test set had 5 different texts for each author. The articles were not selected consecutively, in an attempt to eliminate some of the context.


As the texts to be analysed were in Turkish, the authors faced some interesting challenges. For example, Turkish has a very different morphological and grammatical structure from Indo-European languages. To overcome issues like this, the system attempts to identify the part of speech and other attributes by applying 8 different rules in a specific order. They use k-fold cross-validation, with k = 10, to measure classification performance.
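The 10-fold cross-validation procedure can be sketched as follows. This is a minimal pure-Python illustration, not the authors' setup: the function names `k_fold_splits` and `cross_validate`, the random shuffling, and the `train_fn` / `predict_fn` callbacks are all assumptions made for the example.

```python
import random

def k_fold_splits(n_items, k=10, seed=0):
    """Return k (train_indices, test_indices) pairs covering n_items."""
    if n_items < k:
        raise ValueError("need at least k items")
    indices = list(range(n_items))
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

def cross_validate(items, labels, train_fn, predict_fn, k=10, seed=0):
    """Return the mean accuracy over k folds.

    train_fn(items, labels) -> model and predict_fn(model, item) -> label
    are hypothetical callbacks supplied by the caller.
    """
    accuracies = []
    for train, test in k_fold_splits(len(items), k, seed):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict_fn(model, items[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Each item is used for testing exactly once and for training in the other nine folds, so the reported accuracy is an average over ten train/test rounds rather than the result of a single split.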


The authors apply a large number of algorithms to the data and compile a table of their results. The basic trend is that using statistical data greatly improves performance compared with the baseline versions. The most successful model was Naive Bayes Multinomial, which achieved a success rate of 80% in conjunction with attribute elimination using the CFS Subset Evaluator with the Rank Search method.
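For readers unfamiliar with the multinomial Naive Bayes model, the following is a minimal sketch of the general technique in pure Python. It is not the authors' implementation: the function names, the dictionary-of-counts input format, and the add-one (Laplace) smoothing parameter `alpha` are assumptions made for illustration, and the attribute-elimination step is not shown.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model.

    docs is a list of dicts mapping a feature name to a non-negative
    count; labels holds the class (writer) of each doc.
    """
    class_docs = Counter(labels)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        feature_counts[label].update(doc)  # adds the counts per feature
        vocab.update(doc)
    model = {}
    for label, n in class_docs.items():
        # Smoothed total count for this class over the whole vocabulary.
        total = sum(feature_counts[label].values()) + alpha * len(vocab)
        log_likelihood = {
            f: math.log((feature_counts[label][f] + alpha) / total)
            for f in vocab
        }
        model[label] = (math.log(n / len(docs)), log_likelihood)
    return model

def predict_multinomial_nb(model, doc):
    """Return the class with the highest log posterior for doc."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, log_likelihood) in model.items():
        # Features never seen in training are ignored.
        score = log_prior + sum(
            count * log_likelihood[f]
            for f, count in doc.items() if f in log_likelihood
        )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The model scores each candidate writer by a log prior plus a count-weighted sum of log feature probabilities, and predicts the writer with the highest score.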
