Zhang and Yu, TREC 2007
From ScribbleWiki: Analysis of Social Media
UIC at TREC 2007 Blog Track
UIC was one of the top-performing teams in the TREC 2007 Blog Track opinion detection task. Their method is described below.
Opinion Detection Task
For the opinion detection task, they applied a three-step approach: (1) topical information retrieval, (2) opinion identification, and (3) document re-ranking.
The UIC topical retrieval module first attempts to identify entities mentioned in the query. This is done through a combination of techniques, including dictionary lookups of phrase candidates in WordNet or Wikipedia and extracting noun phrases by parsing the query with the Collins parser.
After these entities are extracted, the original query is expanded in three ways. The first attempts to find a Wikipedia page dedicated to each entity and extracts synonyms from that page. The second is standard pseudo-relevance (blind) feedback: the original query is run on the blog corpus and discriminative terms are extracted from the top documents returned. The third is pseudo-relevance feedback using documents returned from Google instead of the blog corpus. The terms gathered from these expansion methods are combined with the original query for the document retrieval step.
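The pseudo-relevance feedback step can be illustrated with a minimal sketch. The summary above does not say how UIC scored "discriminative" terms, so the scoring used here (document frequency in the feedback set divided by smoothed document frequency in the whole corpus) is an assumption, as are the function names and the whitespace tokenizer.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [t for t in text.lower().split() if t.isalpha()]


def expand_query(query, feedback_docs, corpus_docs, num_terms=3):
    """Return the query terms plus the top-scoring feedback terms.

    Assumed scoring: share of feedback documents containing the term,
    divided by a smoothed share of corpus documents containing it.
    """
    query_terms = tokenize(query)
    feedback_df = Counter(t for d in feedback_docs for t in set(tokenize(d)))
    corpus_df = Counter(t for d in corpus_docs for t in set(tokenize(d)))

    scores = {}
    for term, df in feedback_df.items():
        if term in query_terms:
            continue  # already part of the query
        feedback_rate = df / len(feedback_docs)
        corpus_rate = (corpus_df[term] + 1) / (len(corpus_docs) + 1)
        scores[term] = feedback_rate / corpus_rate

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return query_terms + ranked[:num_terms]
```

In this sketch, feedback_docs stands for the top documents returned for the original query, whether they come from the blog corpus or from a web search engine.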
UIC also integrated a document filtering step into the retrieval phase to eliminate likely spam blog (splog) pages and other pages unlikely to be relevant. This was a rule-based filter that removed documents containing sentences deemed too long (> 300 words), documents containing pornographic terms, and non-English documents.
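A rule-based filter of this kind might look like the sketch below. Only the 300-word sentence limit comes from the description above. The blocked-term list is a placeholder, and the stopword-ratio test for non-English text (with its 0.05 threshold) is an assumed stand-in for whatever language check UIC actually used.

```python
import re

MAX_SENTENCE_WORDS = 300          # limit stated in the description
BLOCKED_TERMS = {"xxx"}           # placeholder; the real term list is not given
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "a", "in", "is", "that", "it", "for"}
MIN_STOPWORD_RATIO = 0.05         # assumed threshold for "looks like English"


def keep_document(text):
    """Return True if the document passes all three filtering rules."""
    # Rule 1: drop documents containing an overly long sentence.
    sentences = re.split(r"[.!?]+", text)
    if any(len(s.split()) > MAX_SENTENCE_WORDS for s in sentences):
        return False

    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False

    # Rule 2: drop documents containing blocked terms.
    if any(w in BLOCKED_TERMS for w in words):
        return False

    # Rule 3 (assumed heuristic): English text contains common stopwords.
    stopword_ratio = sum(w in ENGLISH_STOPWORDS for w in words) / len(words)
    return stopword_ratio >= MIN_STOPWORD_RATIO
```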
UIC's opinion identification module was a sentence-level binary (subjective/objective) SVM classifier with a carefully chosen feature set. If a document has at least one sentence labeled subjective, the document is given that label.
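The document-level rule (one subjective sentence is enough) can be sketched as follows. The sentence splitter and the keyword-based is_subjective_toy function are hypothetical stand-ins used only to make the example runnable; in the UIC system this decision came from the trained SVM classifier.

```python
import re


def split_sentences(text):
    """Naive splitter: break after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_subjective_toy(sentence):
    """Hypothetical stand-in for the sentence-level SVM classifier."""
    return any(word in sentence.lower() for word in ("love", "hate", "awful"))


def label_document(text, is_subjective=is_subjective_toy):
    """Label a document subjective if at least one sentence is subjective."""
    sentences = split_sentences(text)
    return "subjective" if any(is_subjective(s) for s in sentences) else "objective"
```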
The features used to train the SVM classifier were unigrams and bigrams extracted from the previous year's opinion identification task data and from the web. Online review sites like rateitall.com and epinions.com were crawled to extract opinionated language associated with product reviews with a high (very positive) or low (very negative) rating. Wikipedia was used as a source of objective language. From this dataset, the chi-squared statistic was calculated to measure the strength of association between each term and the opinionated training set. All terms significant at p < 0.001 were kept as features for the SVM classifier.
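The chi-squared selection step can be sketched with the standard 2x2 contingency-table statistic. The per-document counting and the function names are assumptions; the cutoff 10.828 is the chi-squared critical value for p = 0.001 with one degree of freedom.

```python
CRITICAL_VALUE_P001 = 10.828  # chi-squared critical value, 1 d.o.f., p = 0.001


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 table.

    a: opinionated documents containing the term
    b: objective documents containing the term
    c: opinionated documents without the term
    d: objective documents without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def select_terms(term_counts, n_opinionated, n_objective):
    """Keep terms whose association with the class split is significant.

    term_counts maps each term to (opinionated_count, objective_count).
    """
    selected = []
    for term, (in_op, in_obj) in term_counts.items():
        stat = chi_squared(in_op, in_obj, n_opinionated - in_op, n_objective - in_obj)
        if stat > CRITICAL_VALUE_P001:
            selected.append(term)
    return sorted(selected)
```

Note that the statistic is symmetric: a term strongly tied to the objective set also passes the threshold in this sketch, so a fuller version would additionally check the direction of the association.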
After opinion identification, the retrieved documents were re-ranked to favor those that expressed a strong opinion. First, sentences classified as opinionated are eliminated if the expanded query terms do not occur within a 5-sentence window around them. Next, documents are re-ranked according to the following formula, which combines the original document relevance score with the confidences assigned by the SVM classifier:
OSim = a IR-Sim(d, Q) + (1-a) \sum_s Opinion-Score(s)
where Q is the query, d is the document, s ranges over the sentences in the document classified as subjective, IR-Sim(d, Q) is the document score from the original retrieval, and Opinion-Score(s) is the confidence assigned by the subjective/objective SVM classifier. For their submission, a was set to 0.5, weighting the two components equally.
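The two re-ranking steps can be sketched as below. The formula follows the definition above directly. The window handling is an assumption: the description does not say whether the 5-sentence window is centered, so this sketch takes the opinionated sentence plus two sentences on each side, and it keeps a sentence if any expanded query term appears in that window.

```python
import re


def has_query_term_nearby(sentences, index, query_terms, window=5):
    """Check whether any query term occurs in a window centered on `index`.

    Assumption: a window of 5 means the sentence itself plus two neighbors
    on each side.
    """
    half = window // 2
    start = max(0, index - half)
    nearby_text = " ".join(sentences[start:index + half + 1]).lower()
    nearby_words = set(re.findall(r"\w+", nearby_text))
    return any(term.lower() in nearby_words for term in query_terms)


def opinion_similarity(ir_score, opinion_scores, a=0.5):
    """OSim = a * IR-Sim(d, Q) + (1 - a) * sum of Opinion-Score(s)."""
    return a * ir_score + (1 - a) * sum(opinion_scores)


def rerank(documents, a=0.5):
    """Sort documents by OSim, highest first.

    documents: list of (doc_id, ir_score, opinion_scores), where
    opinion_scores holds the classifier confidences of the subjective
    sentences that survived the window filter.
    """
    scored = [(doc_id, opinion_similarity(ir, scores, a))
              for doc_id, ir, scores in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the opinion term is a sum rather than a mean, a document with many confidently subjective sentences can outrank a more topically relevant document that has few.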
Polarity Classification Subtask
The UIC system treated the polarity classification task as a standard text classification problem, using two binary SVM classifiers, one to identify positive opinions and one to identify negative opinions. The polarity classification system is similar to the opinion detection system, differing only in how the training data is constructed. Instead of selecting all the text from strongly opinionated reviews (high or low), they select only the text from strongly positive reviews to train the positive opinion classifier and only the text from strongly negative reviews to train the negative opinion classifier. Both classifiers use the same set of objective data as the opinion detection classifier.
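The training-set construction can be sketched as follows. The 1 to 5 rating scale and the cutoffs for "strongly positive" and "strongly negative" are hypothetical, since the description does not give them; the shared objective set mirrors the text above.

```python
def build_training_sets(reviews, objective_texts, high=5, low=1):
    """Build labeled examples for the two polarity classifiers.

    reviews: list of (text, rating) pairs; `high` and `low` are assumed
    cutoffs on an assumed 1-5 scale.
    Label 1 marks the target polarity, label 0 marks objective text.
    """
    positive = [(text, 1) for text, rating in reviews if rating >= high]
    negative = [(text, 1) for text, rating in reviews if rating <= low]
    objective = [(text, 0) for text in objective_texts]
    return {
        "positive": positive + objective,  # data for the positive classifier
        "negative": negative + objective,  # data for the negative classifier
    }
```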
The final classification of documents as positive, negative, or mixed opinion is done through simple rules and thresholding of the polarity classification confidence scores assigned by the SVM classifiers.
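The exact rules and thresholds are not given in this summary, so the following is only one plausible sketch of such a decision step. The 0.5 threshold, the fallback to the larger score, and the function name are all assumptions.

```python
def classify_polarity(pos_score, neg_score, threshold=0.5):
    """Combine two classifier confidences into a three-way label.

    Assumed rules: both scores above the threshold means mixed, one above
    means that polarity, and otherwise the larger score wins (ties are mixed).
    """
    pos_hit = pos_score >= threshold
    neg_hit = neg_score >= threshold
    if pos_hit and neg_hit:
        return "mixed"
    if pos_hit:
        return "positive"
    if neg_hit:
        return "negative"
    if pos_score == neg_score:
        return "mixed"
    return "positive" if pos_score > neg_score else "negative"
```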