Pang et al, EMNLP 2002
Thumbs up? Sentiment Classification using Machine Learning Techniques.
An influential paper, the first to propose applying supervised machine learning techniques to sentiment classification without using any prior knowledge about sentiment. The idea is simple: treat sentiment classification as ordinary topic-based text classification, with positive sentiment and negative sentiment as the two "topics".
Three classification algorithms were studied: Naive Bayes, maximum entropy, and support vector machines, each within a standard bag-of-features framework. Pang et al. tried different features and combinations of them, e.g. unigrams, bigrams, part-of-speech tags, and word position. Experiments on classifying movie reviews show that unigram presence features (whether a word occurs, as opposed to how often it occurs) work best.
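The difference between presence and count features can be illustrated with a minimal sketch. It is not the authors' implementation: the names `featurize` and `NaiveBayes`, the whitespace tokenization, the add-one smoothing, and the four toy training sentences are all assumptions made for illustration, and only Naive Bayes (the simplest of the three classifiers) is shown.

```python
import math
from collections import Counter


def featurize(text, presence=True):
    """Map a document to unigram features: 0/1 presence or raw counts."""
    counts = Counter(text.lower().split())
    if presence:
        return {word: 1 for word in counts}
    return dict(counts)


class NaiveBayes:
    """Naive Bayes over bag-of-features values, with add-one smoothing."""

    def fit(self, docs, labels, presence=True):
        self.presence = presence
        self.labels = sorted(set(labels))
        self.doc_counts = Counter(labels)
        # Per-class totals of each feature value across training documents.
        self.feat_totals = {c: Counter() for c in self.labels}
        for text, c in zip(docs, labels):
            self.feat_totals[c].update(featurize(text, presence))
        self.vocab = set().union(*self.feat_totals.values())
        return self

    def predict(self, text):
        feats = featurize(text, self.presence)
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.labels:
            denom = sum(self.feat_totals[c].values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / n_docs)  # class prior
            for word, value in feats.items():
                if word in self.vocab:  # ignore words unseen in training
                    score += value * math.log((self.feat_totals[c][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy data, invented for this sketch (not from the movie-review corpus).
train_docs = [
    "a great great moving film",
    "wonderful acting and a great script",
    "a dull boring plot",
    "boring and terrible acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

presence_model = NaiveBayes().fit(train_docs, train_labels, presence=True)
count_model = NaiveBayes().fit(train_docs, train_labels, presence=False)
```

With presence features, the repeated "great" in the first training sentence contributes once; with count features it contributes twice. On a corpus this small both variants give the same predictions, so the sketch shows only how the two representations differ, not the accuracy gap reported in the paper.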
Despite the simplicity of the approach, the results reported in this paper are much better than those in Turney_ACL_2001. The improved classification accuracy demonstrates the power of supervised machine learning, but is also attributable to the availability of labeled data in Pang et al.'s setting. On the other hand, the standard techniques do not perform as well on sentiment classification as they do on traditional topic-based categorization, which suggests that sentiment classification is the harder problem.
Later work (Pang & Lee, 2004) extends this paper by classifying each document using only its subjective sentences, and by exploiting pairwise interaction information between nearby sentences. Pang & Lee (2005) further extend binary sentiment classification to a multi-point scale (multi-class classification).