Polling Made Easy: Aggregating Political Sentiments Using Topic Modeling

Carnegie Mellon University
11-742: Information Retrieval Lab
Fall 2008
Tae Yano

Abstract:

People's sentiments toward candidates are key information for predicting the outcome of elections. Traditionally, such information is approximated by professional pollsters, who interview human subjects and conduct statistical analysis on the gathered data. Although the resulting polls are useful and for the most part trustworthy, the procedure is costly and in many cases susceptible to manipulation. In this paper I propose an alternative: an automated means of collecting such "aggregated" political sentiment without human intervention, thereby eliminating both the traditional surveying cost and the human bias.

In recent years, many online media outlets have come to maintain comment sections as part of their public content. I propose that the problem posed above can be approached effectively by leveraging this free resource. Casual observation reveals that these comments are often subjective, because readers can express their opinions relatively free of consequences. Furthermore, because they are composed in a short span of time, they tend to be direct and casual. I therefore hypothesize that readers' sentiments on current political issues should be observable in the comment sections of political articles in a relatively straightforward manner. I further make the following claim:

Readers' aggregated sentiment toward an entity is captured by the pattern in which mentions of that entity co-occur with sentiment-bearing words within the pool of comments.

In other words, I conjecture that words with negative connotations systematically co-occur with mentions of a candidate's name if the readers, as a group, lean against the candidate, and words with positive connotations if they lean toward the candidate. In this project, I attempt to test the validity of this claim and its applicability to political opinion polling. The crucial point is that we are after an "aggregated" sentiment, not sentiment values at a finer granularity, for which previous studies suggest that more linguistically sophisticated means are apt.
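As a rough illustration of the conjecture, the sketch below counts lexicon words that fall within a fixed token window of a candidate mention. The candidate names ("smith", "jones"), the toy comments, the tiny word lists, and the window size are all hypothetical; this is a simple proximity count, not the topic-model method proposed later.

```python
from collections import Counter

def cooccurrence_counts(comments, target, positive, negative, window=5):
    """Count lexicon words within `window` tokens of each mention of `target`."""
    counts = Counter()
    for comment in comments:
        tokens = comment.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                if tokens[j] in positive:
                    counts["positive"] += 1
                elif tokens[j] in negative:
                    counts["negative"] += 1
    return counts

# Hypothetical toy data, for illustration only.
comments = [
    "smith is a dishonest and reckless candidate",
    "i think smith gave a great speech",
    "jones is honest but smith is dishonest",
]
positive = {"great", "honest"}
negative = {"dishonest", "reckless"}

smith = cooccurrence_counts(comments, "smith", positive, negative, window=3)
```

In the third toy comment, "honest" describes "jones" but still lands inside the window around "smith". This kind of misattribution is one reason to look at corpus-wide patterns rather than individual windows.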

I propose a corpus-wide sentiment scoring method based on the above hypothesis, apply it to reader comments on newspaper articles collected anew for this study, and compare the results against conventional political polling results from RealClearPolitics (http://www.realclearpolitics.com/), which publishes week-by-week updates of polling data gathered from various sources. The performance of the method is assessed by how closely its results follow the changes in the conventional polling data.
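The proposal does not fix a particular measure of "how closely" the scores follow the polls. One possible operationalization, sketched here as an assumption, is to compare week-to-week changes using Pearson correlation and the fraction of weeks in which both series move in the same direction. The numbers below are made up purely to exercise the code and are not results.

```python
import math

def deltas(series):
    """Week-to-week changes of a series."""
    return [b - a for a, b in zip(series, series[1:])]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def direction_agreement(xs, ys):
    """Fraction of steps in which both series rise, or both do not."""
    same = sum(1 for x, y in zip(xs, ys) if (x > 0) == (y > 0))
    return same / len(xs)

# Made-up weekly values for one hypothetical candidate (not real data).
poll_share = [46.0, 47.5, 47.0, 49.0, 50.5]
sentiment_score = [0.10, 0.14, 0.12, 0.15, 0.21]

poll_changes = deltas(poll_share)
score_changes = deltas(sentiment_score)
correlation = pearson(poll_changes, score_changes)
agreement = direction_agreement(poll_changes, score_changes)
```

Comparing changes rather than raw levels matches the stated goal of following the movement of the polls, since the sentiment scores and the polling percentages are not on the same scale.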

In designing a sentiment scoring method based on the above hypothesis, two questions arise immediately: first, how to identify the subjective words, and second, how to induce word co-occurrence patterns from a pool of comments. The first problem is addressed by using an existing subjectivity lexicon. There are several such dictionaries, most of them free of charge. I chose the one from the NLP group at the University of Pittsburgh (http://www.cs.pitt.edu/mpqa/), since it has been used in several reputable sentiment analysis studies in past years.
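A minimal sketch of turning such a lexicon into positive and negative word sets follows. It assumes each entry is a line of space-separated key=value fields that include `word1` and `priorpolarity`; that layout and the sample lines are assumptions for illustration and should be checked against the actual lexicon file.

```python
def parse_lexicon(lines):
    """Split lexicon entries into positive and negative word sets.

    Assumes each line holds space-separated key=value fields, including
    'word1' (the word) and 'priorpolarity' (its polarity label).
    """
    positive, negative = set(), set()
    for line in lines:
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        word = fields.get("word1")
        polarity = fields.get("priorpolarity")
        if word is None:
            continue
        if polarity == "positive":
            positive.add(word)
        elif polarity == "negative":
            negative.add(word)
        # Entries with any other polarity label are skipped.
    return positive, negative

# Hypothetical sample entries in the assumed format.
sample = [
    "type=strongsubj len=1 word1=dishonest pos1=adj stemmed1=n priorpolarity=negative",
    "type=weaksubj len=1 word1=great pos1=adj stemmed1=n priorpolarity=positive",
    "type=weaksubj len=1 word1=actually pos1=adverb stemmed1=n priorpolarity=neutral",
]
positive_words, negative_words = parse_lexicon(sample)
```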

To approximate the word co-occurrence patterns in the reader comment collection, I propose a text profiling method based on Latent Dirichlet Allocation (LDA). Topic models such as LDA or pLSA are more often used as dimensionality reduction tools, but the technique applies naturally here because, in essence, "topics" in topic modeling are nothing but patterns of word co-occurrence specific to the given corpus. LDA is also an unsupervised method and therefore incurs no annotation cost for training. A basic LDA model is implemented and trained on the corpus with several combinations of hyperparameters.
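The proposal does not say which inference procedure the basic LDA implementation uses. The sketch below assumes collapsed Gibbs sampling, one common choice, and returns each topic as a word distribution. The hyperparameter values (`alpha`, `beta`), the iteration count, and the toy documents are placeholders, not the settings used in this study.

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one {word: prob} dict per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    ndk = [[0] * num_topics for _ in docs]       # document-topic counts
    nkw = [[0] * V for _ in range(num_topics)]   # topic-word counts
    nk = [0] * num_topics                        # tokens per topic
    z = []                                       # topic assignment per token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = wid[w], z[d][i]
                # Remove the token's current assignment from the counts.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                # Resample from the full conditional p(z = t | everything else).
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Smoothed estimate of each topic's word distribution.
    return [
        {w: (nkw[k][wid[w]] + beta) / (nk[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]

# Hypothetical toy corpus: each comment is a list of tokens.
docs = [
    ["smith", "dishonest", "reckless", "smith"],
    ["smith", "dishonest", "vote"],
    ["jones", "great", "honest", "jones"],
    ["jones", "great", "vote"],
]
topics = lda_gibbs(docs, num_topics=2, iters=50)
```

Each returned topic is a proper probability distribution over the vocabulary, which is the form the scoring step needs: the mass a topic gives to a candidate's name and the mass it gives to lexicon words can be read off directly.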

Given a set of topics (i.e., multinomial distributions over words discovered via LDA training), sentiment scores are computed in the following manner. First, the probability mass of the known positive and negative words is summed for each topic. Then, for each candidate, the aggregated mass of the subjective words in each topic is weighted by the probability mass that the topic assigns to the candidate's name. These weighted values are summed across all topics to obtain the final scores. I believe that if my initial hypothesis has any utility, the method will produce a reasonable approximation of how a candidate is viewed within the community of readers. The method will be tested on reader comments from a highly partisan political discussion group (as a sanity check of the concept), as well as on the aforementioned newspaper corpus. I will also conduct these experiments with a proximity-based sentiment scoring method as a baseline.
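The scoring steps above can be sketched as follows. The sketch reads "assigned probability mass" as the probability a topic gives to the candidate's name, and it keeps the positive and negative scores separate; both choices are assumptions, as are the hand-written toy topics and word lists.

```python
def sentiment_scores(topics, candidate, positive, negative):
    """Return (positive_score, negative_score) for one candidate.

    topics: list of {word: probability} dicts, one per topic.
    For each topic k:  score += p(candidate | k) * sum of p(w | k) over lexicon words w.
    """
    pos_score = 0.0
    neg_score = 0.0
    for topic in topics:
        # Step 1: lexicon mass within this topic.
        pos_mass = sum(p for w, p in topic.items() if w in positive)
        neg_mass = sum(p for w, p in topic.items() if w in negative)
        # Step 2: weight by the mass the topic gives to the candidate's name.
        weight = topic.get(candidate, 0.0)
        # Step 3: accumulate across topics.
        pos_score += weight * pos_mass
        neg_score += weight * neg_mass
    return pos_score, neg_score

# Hand-written toy topics (hypothetical, not LDA output).
topics = [
    {"smith": 0.30, "dishonest": 0.40, "great": 0.10, "vote": 0.20},
    {"smith": 0.05, "jones": 0.35, "great": 0.40, "dishonest": 0.05, "vote": 0.15},
]
positive = {"great"}
negative = {"dishonest"}

smith_pos, smith_neg = sentiment_scores(topics, "smith", positive, negative)
jones_pos, jones_neg = sentiment_scores(topics, "jones", positive, negative)
```

In this toy setup "smith" has most of its mass in the topic dominated by "dishonest", so its negative score exceeds its positive score, while the reverse holds for "jones".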

Timeline:

Item                                              Start date   Target date   Completed date
Project Design and Scheduling                     Sep/12       Oct/01        TBC
Literature Review                                 Sep/10       Oct/01        TBC
Corpus Preparation                                Sep/20       Oct/01        TBC
Experiment: design                                Oct/05       Oct/10        TBC
Software: design and implementation               Oct/07       Oct/20        TBC
Experiment: implementation                        Oct/19       Nov/04        TBC
Final Report Draft                                Oct/25       Nov/05        TBC
Follow-up Experiment: design and implementation   Nov/04       Nov/15        TBC
Final Result and Analysis                         Nov/15       Nov/22        TBC
Final Report                                      Nov/22       Nov/30        TBC
Presentation                                      Nov/25       TBC           TBC


Notes on relevant studies (TBC):

Sentiment analysis and opinion mining have attracted considerable scientific and commercial research interest in recent years. The majority of this work deals with user reviews and comments on commercial products and services such as movies, appliances, or restaurants. By comparison, the domain of political discourse is relatively understudied by the community, despite its seemingly obvious potential. Opinion polling professionals such as Gallup have been gathering public "sentiment" on political events for decades, in basically the same manner. Even with the advent of the Internet, which reduced communication overhead significantly, poll taking is still a costly business. Automated tools that conduct text analysis for this purpose would be of great utility.

Political text in general is not easy material for text analysis. Mullen and Malouf ("A Preliminary Investigation into Sentiment Analysis of Informal Political Discourse") summarize the difficulty of this domain well, especially for sentiment analysis: "Word-based models succeed to a surprising extent but fall short in predictable ways when attempting to measure favorability toward entities. Pragmatic considerations, sarcasm, comparisons, rhetorical reversals and other rhetorical devices tend to undermine much of the direct relationship between the words used and the opinion expressed."

Several interesting ideas on sentiment summarization may fare well against such adversity. A few of them apply topic models, approaching the problem within an unsupervised topic classification paradigm. Particularly inspiring are the works by Ivan Titov and Ryan McDonald ("A Joint Model of Text and Aspect Ratings for Sentiment Summarization"), Mei, Ling, et al. ("Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs"), and Branavan, Chen, Eisenstein, and Barzilay ("Learning Document-Level Semantic Properties from Free-text Annotations"). Although their approaches are quite diverse, the underlying assumption common to all of these papers is that aggregated sentiment (or a sensible approximation of it) can be discovered by leveraging word-level co-occurrence patterns, which can be captured by generative probabilistic models.

There are many approaches to defining and approximating "co-occurrence" in text corpora, but no single definitive answer. Many earlier studies have tried a variety of methods based on distributional evidence. In supervised settings, mutual information or the chi-square independence test is used to measure the cohesiveness between words.
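As a concrete reference for the two measures just mentioned, the sketch below computes them for a word pair from document-level co-occurrence. It uses pointwise mutual information (the per-pair form of mutual information) and the standard chi-square statistic for a 2x2 table; treating each comment as the co-occurrence unit, and the toy documents themselves, are assumptions for illustration.

```python
import math

def contingency(docs, a, b):
    """2x2 document counts: (both, only a, only b, neither)."""
    n11 = n10 = n01 = n00 = 0
    for doc in docs:
        words = set(doc)
        has_a, has_b = a in words, b in words
        if has_a and has_b:
            n11 += 1
        elif has_a:
            n10 += 1
        elif has_b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def pmi(n11, n10, n01, n00):
    """Pointwise mutual information (base 2) of the two words."""
    n = n11 + n10 + n01 + n00
    if n11 == 0:
        return float("-inf")  # the words never co-occur
    p_ab = n11 / n
    p_a = (n11 + n10) / n
    p_b = (n11 + n01) / n
    return math.log2(p_ab / (p_a * p_b))

def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for independence in a 2x2 table."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if den == 0:
        return 0.0  # a row or column is empty, so the test is undefined
    return n * (n11 * n00 - n10 * n01) ** 2 / den

# Hypothetical toy documents (token lists).
docs = [
    ["smith", "dishonest"],
    ["smith", "dishonest", "vote"],
    ["jones", "great"],
    ["jones", "vote"],
]
table = contingency(docs, "smith", "dishonest")
pmi_value = pmi(*table)
chi2_value = chi_square(*table)
```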

Progress Report:

Interim Discussion