Luyckx et al, CLIN 2004

From ScribbleWiki: Analysis of Social Media


Shallow text analysis and machine learning for authorship analysis

Authors: Kim Luyckx and Walter Daelemans

Paper: [1]



The authors work with a corpus of newspaper articles about national current affairs, written by different journalists. Because the corpus is this narrow, many factors are held roughly constant, which lets them focus on syntax-based and token-based features as predictors of an author's style. The main idea is that these stylistic characteristics are not under the author's conscious control and are therefore good clues for authorship attribution.


In this paper, authorship attribution is viewed as a text categorization problem. Applications based on specific features of the authors are not explored. The authors classify documents using four categories of features: token-level features (e.g. word length, syllables, n-grams), syntax-based features (e.g. part-of-speech tags, rewrite rules), features based on vocabulary richness (e.g. type-token ratio, hapax legomena), and common word frequencies. In particular, they compare the usefulness of token-level, lexical, and syntax-based features.
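To make the token-level and vocabulary-richness categories concrete, here is a minimal Python sketch. The function name `token_features`, the regular-expression tokenizer, and the choice to normalize hapax legomena by the number of tokens are all assumptions made for illustration; the paper's exact definitions may differ.

```python
import re
from collections import Counter


def token_features(text):
    """Compute a few simple token-level and vocabulary-richness features.

    Illustrative only: the tokenizer and normalizations are assumptions,
    not the definitions used in the paper.
    """
    # Runs of letters count as tokens; digits and punctuation are ignored.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return {"avg_word_length": 0.0, "type_token_ratio": 0.0, "hapax_ratio": 0.0}

    counts = Counter(tokens)
    # Hapax legomena are word types that occur exactly once in the text.
    hapaxes = sum(1 for c in counts.values() if c == 1)

    return {
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_ratio": hapaxes / len(tokens),
    }


# Toy sentence invented for this example (13 tokens, 9 types, 7 hapaxes).
features = token_features("de kat zit op de mat en de hond ligt op de bank")
print(features)
```

Each text yields one small numeric vector, which is the form a text categorization learner expects.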


The paper describes each of the feature sets in detail.


Feature sets used:

  • pos: the frequency distribution of parts-of-speech (POS)
  • verb B: the frequency distribution of basic verb forms
  • verb: the frequency distribution of verb forms
  • pat num: the frequency distribution of specific Noun Phrase patterns
  • function: the frequency distribution of the forty most frequent function words
  • lex: the frequency distribution of the twenty most informative words according to the Rainbow program
  • read: the readability score
  • all: a combination of all features
  • syntax: a combination of all syntax-based features and the token-level feature read
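Most of the feature sets above are frequency distributions over a fixed inventory (POS tags, verb forms, function words). The sketch below shows the idea for the function set. The seven-word list `FUNCTION_WORDS` is a hypothetical stand-in for the forty words used in the paper, and dividing by text length is an assumption; the paper may use raw counts or another normalization.

```python
from collections import Counter

# Hypothetical mini inventory; the paper uses the forty most frequent
# function words of its corpus, which are not listed in this summary.
FUNCTION_WORDS = ["de", "het", "een", "en", "van", "in", "op"]


def function_word_vector(tokens, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in `tokens`."""
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    # One dimension per vocabulary word, in a fixed order, so that vectors
    # from different texts are directly comparable.
    return [counts[w] / total for w in vocabulary]


# Toy sentence invented for this example.
tokens = "De kat zit op de mat en de hond ligt op de bank".split()
vector = function_word_vector(tokens)
print(dict(zip(FUNCTION_WORDS, vector)))
```

The same pattern would apply to the pos set by counting tags from a tagger's output instead of words.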


Example list of POS tags in the feature set, with mean frequency per text in the different author classes:

POS tag   Explanation     A-class   B-class   O-class
ADJ       adjectives      35        39        41
BW        adverbs         35        30        34
LET       punctuation     79        64        73
LID       articles        59        63        66
N         nouns           121       118       137
SPEC      proper nouns    24        23        20
TSW       interjections   0.3       0.1       0.14
TW        numerals        8         7         14
VG        conjunctions    20        18        25
VNW       pronouns        50        38        48
VZ        prepositions    66        68        78
WW        verbs           81        76        89


The authors evaluate the different feature sets with the memory-based learner TiMBL. Among the individual feature categories, the syntax-based features perform best, with a mean F-score of 57.3%. Combining all feature sets gives a mean F-score of 72.6%.
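As a rough illustration of this evaluation setup, the sketch below stores training vectors and classifies a new text by its single nearest neighbour, then computes a per-class F-score. It is a simplified stand-in, not TiMBL: the Manhattan distance, the absence of feature weighting, and all data values are assumptions invented for the example.

```python
def nearest_neighbor_predict(train, query):
    """Return the label of the stored example closest to `query`.

    Stand-in for memory-based learning: 1-nearest-neighbour with Manhattan
    distance. TiMBL's own metrics and feature weighting are not reproduced.
    """
    def distance(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    _, label = min(train, key=lambda example: distance(example[0], query))
    return label


def f_score(gold, predicted, target):
    """Harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented two-dimensional feature vectors for the three author classes.
train = [([0.30, 0.10], "A"), ([0.10, 0.30], "B"), ([0.20, 0.20], "O")]
test = [([0.28, 0.12], "A"), ([0.12, 0.29], "B"),
        ([0.22, 0.17], "O"), ([0.24, 0.16], "A")]

gold = [label for _, label in test]
predicted = [nearest_neighbor_predict(train, vec) for vec, _ in test]

for author_class in ("A", "B", "O"):
    print(author_class, round(f_score(gold, predicted, author_class), 3))
```

In this toy run the last test text, whose true class is A, lies closer to the O example, so both the A and O scores drop below 1.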


F-scores per feature set and author class:

Feature set   A-class   B-class   O-class   Average
pos           43.3%     54.9%     44.9%     47.7%
verb B        53.8%     43.8%     27.6%     41.7%
verb          43.6%     46.9%     34.5%     41.7%
pat num       53.2%     50.0%     35.6%     46.3%
function      65.7%     55.7%     43.1%     54.8%
lex           44.4%     59.4%     51.2%     51.7%
read          62.9%     53.3%     36.4%     50.9%
all           77.6%     74.7%     65.5%     72.6%
syntax        59.4%     61.7%     50.9%     57.3%
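The Average column appears to be the unweighted mean of the three per-class F-scores. The short check below recomputes it from the numbers in the results table and ranks the feature sets. The names `RESULTS` and `macro_average` are chosen for this example, and the 0.05 tolerance assumes the reported values were rounded to one decimal.

```python
# F-scores (%) copied from the results table:
# (A-class, B-class, O-class, reported average)
RESULTS = {
    "pos": (43.3, 54.9, 44.9, 47.7),
    "verb B": (53.8, 43.8, 27.6, 41.7),
    "verb": (43.6, 46.9, 34.5, 41.7),
    "pat num": (53.2, 50.0, 35.6, 46.3),
    "function": (65.7, 55.7, 43.1, 54.8),
    "lex": (44.4, 59.4, 51.2, 51.7),
    "read": (62.9, 53.3, 36.4, 50.9),
    "all": (77.6, 74.7, 65.5, 72.6),
    "syntax": (59.4, 61.7, 50.9, 57.3),
}


def macro_average(class_scores):
    """Unweighted mean over author classes."""
    return sum(class_scores) / len(class_scores)


# Recompute each average from the three per-class scores.
recomputed = {name: macro_average(row[:3]) for name, row in RESULTS.items()}

# A reported value matches if it is within rounding distance of the mean.
matches = {
    name: abs(recomputed[name] - row[3]) <= 0.05 + 1e-9
    for name, row in RESULTS.items()
}

# Feature sets ordered from best to worst reported average.
ranking = sorted(RESULTS, key=lambda name: RESULTS[name][3], reverse=True)

print(all(matches.values()))
print(ranking)
```

Because the mean is unweighted, each author class counts equally regardless of how many texts it contains.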