Mullen and Malouf, AAAI 2006
From ScribbleWiki: Analysis of Social Media
This page maintained by: Mahesh Joshi
A Preliminary Investigation into Sentiment Analysis of Informal Political Discourse
Tony Mullen, Rob Malouf
As the title suggests, this paper presents an initial data analysis and results of some classification experiments for sentiment analysis in the domain of online political discussions, such as those in discussion boards on the web.
Sentiment analysis is still an emerging research area. In particular, sentiment analysis of conversational data, such as online discussion boards or spoken dialog transcripts (see Thomas et al., EMNLP 2006), has been explored only very recently. Traditionally, sentiment analysis has focused mainly on non-interactive datasets, such as reviews of a movie or a product. It is therefore good to see a discussion of the issues and challenges in dealing with conversational data. Politics in social media is also a fairly recent domain, and the authors discuss the challenges along that dimension as well.
The authors use an online discussion board dataset from the website http://www.politics.com/ (now apparently changed to http://www.justplainpolitics.com/), where users discuss a range of political issues. The dataset is attractive from the perspective of supervised machine learning because each user profile contains a self-described political affiliation, which can be used as a label for classification. In this paper the authors manually merge the different labels into a broader “left” vs “right” taxonomy for classification. The discussions are organized in threads, one thread per topic.
Similar to traditional sentiment analysis, one challenge in processing this data is the complex nature of language that people use in expressing different opinions.
The key differences that the authors point out are the following:
- Users express opinion about several different issues or topics of political interest
- Determining political affiliation is not just a favorability judgment, but spans a whole spectrum of political ideologies
The authors discuss how they chose the classes for classifying users and the problems this choice raises. At the fine-grained level of political affiliations, the skewed distribution of labels makes supervised learning difficult, so the authors only perform classification into the broader “left” vs “right” taxonomy.
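The label merging described above can be sketched roughly as follows. This is only an illustration: the fine-grained label names and the `COARSE` mapping are hypothetical placeholders, not the label set or the mapping used in the paper, and dropping users whose label has no obvious side is likewise an assumption.

```python
from collections import Counter

# Hypothetical mapping from self-described labels to two coarse classes.
# The label names are placeholders, not the ones found in the dataset.
COARSE = {
    "liberal": "left",
    "democrat": "left",
    "conservative": "right",
    "republican": "right",
}


def merge_labels(user_labels):
    """Map each user's fine-grained label to a coarse class.

    Users whose label is not in COARSE are dropped (an assumption
    made for this sketch).
    """
    merged = {}
    for user, label in user_labels.items():
        coarse = COARSE.get(label.strip().lower())
        if coarse is not None:
            merged[user] = coarse
    return merged


# Toy profiles, invented for the example.
users = {"a": "Liberal", "b": "republican", "c": "centrist", "d": "Democrat"}
merged = merge_labels(users)
class_sizes = Counter(merged.values())
print(merged)
print(class_sizes)
```

Counting the classes before and after merging is a quick way to see how skewed the fine-grained distribution is compared with the two-way split.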
On the language side, the problems that worsen for such data are:
- Spelling errors
- This is complicated further by use of what the authors refer to as “pointed re-spellings”, such as Raygun for Reagan (as a comment on the former U.S. president’s support for a futuristic missile defense program)
- Lack of grammatical structure
Given these challenges, the classification results using a simple naïve Bayes classifier with bag-of-words features are fairly modest: approximately 60% accuracy on the entire set of users in the “left” or “right” category.
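For readers unfamiliar with the setup, a naïve Bayes classifier over bag-of-words features can be sketched as below. This is a minimal multinomial variant with add-one smoothing and whitespace tokenization; those details, the function names, and the toy posts are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Naive whitespace tokenizer (an assumption of this sketch).
    return text.lower().split()


def train_nb(docs, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training posts, invented for the example.
docs = [
    "cut taxes and shrink government",
    "lower taxes strong defense",
    "expand healthcare and protect unions",
    "fund public healthcare and schools",
]
labels = ["right", "right", "left", "left"]
model = train_nb(docs, labels)
print(predict_nb(model, "taxes and defense"))
print(predict_nb(model, "healthcare schools"))
```

In the setting described here, each "document" would be the text posted by one user, with that user's merged affiliation as the label.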
In the discussion section, the authors mention one possible cause for this low performance: topical words used as features may be causing problems, since many users will tend to talk about common issues such as “abortion” or “Iraq” irrespective of their political affiliation.
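One way to see why topical words make weak features is to compare how often a word appears on each side. The sketch below computes a smoothed log-odds ratio; this diagnostic is not from the paper, and the token lists are invented. A value near zero means both sides use the word about equally, so it says little about affiliation.

```python
import math
from collections import Counter


def log_odds(word, left_tokens, right_tokens):
    """Smoothed log-odds of a word appearing in left vs. right text.

    Positive values lean left, negative lean right, zero means the
    word is equally frequent on both sides.
    """
    left, right = Counter(left_tokens), Counter(right_tokens)
    p_left = (left[word] + 1) / (len(left_tokens) + 2)
    p_right = (right[word] + 1) / (len(right_tokens) + 2)
    return math.log(p_left / p_right)


# Toy token lists, invented for the example.
left_tokens = "iraq war is wrong iraq withdrawal now".split()
right_tokens = "iraq war is necessary iraq victory now".split()

for word in ("iraq", "withdrawal", "victory"):
    print(word, round(log_odds(word, left_tokens, right_tokens), 3))
```

Here the topic word "iraq" is used equally by both sides, while the words that reveal a stance on the topic are the ones that separate them.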
Finally, the authors perform a preliminary analysis of the conversational aspect of the data and calculate how often users of each affiliation quote users of the other. They find that users quote others with an opposing affiliation more often than they quote others with the same affiliation:
“Left users quote right users 62.2% of the time, and right users quote left users 77.5% of the time.”
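A statistic of this kind can be computed from a list of (quoter, quoted) pairs, as in the sketch below. The data layout, the function name `cross_quote_rates`, and the toy users are assumptions for illustration; the numbers it prints are unrelated to the figures reported in the paper.

```python
from collections import Counter


def cross_quote_rates(affiliation, quotes):
    """For each side, the fraction of its quotes that target the other side.

    affiliation: dict mapping user -> "left" or "right"
    quotes: iterable of (quoter, quoted) user pairs
    """
    total = Counter()
    cross = Counter()
    for quoter, quoted in quotes:
        a, b = affiliation.get(quoter), affiliation.get(quoted)
        if a is None or b is None:
            continue  # skip quotes involving unlabeled users
        total[a] += 1
        if a != b:
            cross[a] += 1
    return {side: cross[side] / total[side] for side in total}


# Toy data, invented for the example.
affiliation = {"u1": "left", "u2": "left", "u3": "right", "u4": "right"}
quotes = [
    ("u1", "u3"), ("u1", "u2"), ("u2", "u4"), ("u2", "u1"),
    ("u3", "u1"), ("u4", "u2"), ("u4", "u3"), ("u3", "u2"),
    ("u1", "u9"),  # u9 has no known affiliation and is ignored
]
rates = cross_quote_rates(affiliation, quotes)
print(rates)
```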
The authors make use of this quoting behavior in another paper: Graph-based User Classification for Informal Online Political Discourse. [ Malouf and Mullen, WICOW 2007 ]