Stoyanov and Cardie, EMNLP 2006
Partially Supervised Coreference Resolution for Opinion Summarization through Structured Rule Learning
The paper can be found here.
This paper tackles source coreference resolution as a way to improve opinion mining from text. The authors argue that source coreference resolution, i.e. clustering the noun phrases that refer to the same opinion source, is essential for creating high-quality opinion summaries from documents. They approach the problem with partially supervised techniques: in the training data, coreference chains are annotated only for the noun phrases that correspond to the entities of interest (the opinion sources), not for all noun phrases.
The task of source coreference resolution is to decide which mentions of opinion sources refer to the same entity. The training task can be divided into three parts:
- Source to NP Mapping: In this step, each document is preprocessed to POS tag each word and find the NPs. The NPs are then augmented with named entity information.
- Feature Vector Creation: For every pair of NPs, a feature vector is created. The features are typical for the task of coreference resolution and are described in previous work.
- Classifier Construction: Using the feature vectors from the above step, a training set is created that contains one training example per document. Each training example contains feature vectors for all pairs of NPs in the document, including the ones that do not map to the sources, together with available coreference information for the source noun phrases, i.e. the noun phrases to which the sources of opinion are mapped. The training instances are provided as input to a classifier which learns to produce a clustering of noun phrases for unseen documents.
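The three steps above can be sketched as follows. This is only an illustration of how a partially labeled training example might be laid out: the function name `build_training_example`, the NP fields (`head`, `ne`) and the three features are assumptions made for the sketch, not the feature set used in the paper.

```python
from itertools import combinations


def build_training_example(nps, source_chain):
    """Build one training example (one document) of NP-pair feature vectors.

    nps          -- list of dicts describing each NP in document order
                    (hypothetical fields: "head", "ne")
    source_chain -- dict mapping the index of each *source* NP to its
                    coreference chain id; non-source NPs are absent
    """
    pairs = []
    for i, j in combinations(range(len(nps)), 2):
        features = {
            "same_head": nps[i]["head"].lower() == nps[j]["head"].lower(),
            "same_ne_type": nps[i]["ne"] is not None
            and nps[i]["ne"] == nps[j]["ne"],
            "distance": j - i,
        }
        if i in source_chain and j in source_chain:
            # Both NPs are opinion sources, so coreference is known.
            label = source_chain[i] == source_chain[j]
        else:
            # At least one NP is not a source: the pair stays unlabeled.
            label = None
        pairs.append(((i, j), features, label))
    return pairs
```

Every pair of NPs gets a feature vector, but only pairs in which both NPs map to sources carry a label; the remaining pairs are kept in the example as unlabeled data.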
The authors note that partially supervised coreference resolution differs from semi-supervised clustering, which can be defined as clustering items in the presence of limited supervised information such as pairwise constraints or labeled points. In addition, semi-supervised clustering has no training phase, whereas the present framework does. The paper draws on recent work on semi-supervised clustering and takes advantage of the rich structural dependencies introduced by the clustering problem.
The authors' system uses the RIPPER algorithm for rule learning. In the ruleset creation phase, each rule is grown until its accuracy reaches 1.0; it is then applied to the pruning data, and if it reduces accuracy there, the rule is removed. In a second phase, the entire training data is used to grow a replacement rule and a revised rule for each rule in the ruleset. For each rule, the algorithm considers the original rule, the replacement rule and the revised rule, and keeps the one that yields the smallest description length in the context of the ruleset.
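A much simplified sketch of the grow-then-check step described above is given below. It is not RIPPER itself: the real algorithm selects conditions by FOIL information gain and prunes individual conditions, whereas this sketch greedily maximizes rule precision and only keeps or drops the whole rule. All function names and the feature representation are assumptions made for illustration.

```python
def covers(rule, x):
    """A rule is a list of (feature, value) conditions; all must hold."""
    return all(x[feature] == value for feature, value in rule)


def precision(rule, data):
    """Fraction of covered instances that are positive (0.0 if none covered)."""
    covered = [y for x, y in data if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0


def grow_rule(grow_data, candidates):
    """Greedily add conditions until the rule covers only positives."""
    rule = []
    while precision(rule, grow_data) < 1.0:
        remaining = [c for c in candidates if c not in rule]
        if not remaining:
            break
        best = max(remaining, key=lambda c: precision(rule + [c], grow_data))
        if precision(rule + [best], grow_data) <= precision(rule, grow_data):
            break  # no condition improves the rule any further
        rule.append(best)
    return rule


def accuracy(rules, data):
    """Accuracy when an instance is positive iff some rule covers it."""
    correct = sum(any(covers(r, x) for r in rules) == bool(y) for x, y in data)
    return correct / len(data)


def keep_rule(rule, prune_data):
    """Keep the rule only if it does not reduce accuracy on the pruning data."""
    return accuracy([rule], prune_data) >= accuracy([], prune_data)
```

Here `grow_data` and `prune_data` are lists of `(features, label)` pairs with 0/1 labels, and `candidates` is the list of conditions the learner may add.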
The authors modify RIPPER and name the result StRIP, short for Structured RIPPER. RIPPER is extended so that every time it makes a decision about a rule, it considers the rule's effect on the overall clustering of items, rather than considering in isolation the instances that the rule classifies as positive or negative. More details are provided in the paper. In the partially supervised case, the unlabeled data are not used when computing the performance of the rules.
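One way to picture the difference is to score a rule by the clustering it induces rather than by the pairs it fires on. The sketch below assumes the clustering is the transitive closure of the predicted links (computed with union-find) and counts errors only over labeled source NPs; the function names and this exact scoring scheme are illustrative assumptions, not the paper's formulation.

```python
from itertools import combinations


def clusters_from_links(n_items, links):
    """Return a cluster representative per item via union-find over links."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return [find(i) for i in range(n_items)]


def clustering_aware_counts(n_items, predicted_links, gold_chain):
    """Count (tp, fp, fn) over pairs of labeled items after clustering.

    gold_chain maps labeled item index -> chain id. Pairs that involve
    an unlabeled item are skipped when scoring, although unlabeled items
    can still connect labeled ones through the transitive closure.
    """
    cluster_of = clusters_from_links(n_items, predicted_links)
    tp = fp = fn = 0
    for i, j in combinations(sorted(gold_chain), 2):
        predicted = cluster_of[i] == cluster_of[j]
        actual = gold_chain[i] == gold_chain[j]
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
    return tp, fp, fn
```

For example, with four NPs of which 0, 2 and 3 are labeled sources (`{0: "A", 2: "A", 3: "B"}`), the predicted links `[(0, 1), (1, 2)]` never link 0 and 2 directly, yet the unlabeled NP 1 places them in the same cluster, so the call returns `(1, 0, 0)`.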
For evaluation, the MPQA corpus is used. It contains 535 documents from the world press. Phrase-level opinions are marked in these documents, and the annotation scheme is documented in previous work. The authors split the data into a training set of 400 documents and a test set of 135 documents. The system is compared with three competitive supervised baseline systems. The B^3 measure is used along with F1 to compute the performance. The results show that StRIP outperforms the baselines, which suggests that unlabeled NPs can help improve coreference resolution.
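For reference, the B^3 measure averages a per-mention precision and recall: for each mention, precision is the fraction of its predicted cluster that also belongs to its gold cluster, and recall is the fraction of its gold cluster that was recovered. The sketch below is a minimal implementation of that standard definition; the function name `b_cubed` and the assumption that both clusterings cover exactly the same set of mentions are choices made for this illustration, and the paper's exact evaluation setup may differ.

```python
def b_cubed(gold, predicted):
    """Compute B^3 precision, recall and F1.

    gold, predicted -- dicts mapping each mention to a cluster id;
                       both are assumed to have the same keys.
    """
    def group(assignment):
        groups = {}
        for mention, cluster in assignment.items():
            groups.setdefault(cluster, set()).add(mention)
        return groups

    gold_groups = group(gold)
    pred_groups = group(predicted)

    precision_sum = recall_sum = 0.0
    for mention in gold:
        gold_cluster = gold_groups[gold[mention]]
        pred_cluster = pred_groups[predicted[mention]]
        overlap = len(gold_cluster & pred_cluster)
        precision_sum += overlap / len(pred_cluster)
        recall_sum += overlap / len(gold_cluster)

    precision = precision_sum / len(gold)
    recall = recall_sum / len(gold)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With gold clusters {a, b} and {c}, merging all three mentions into one predicted cluster gives precision 5/9, recall 1 and F1 5/7, which shows how B^3 penalizes over-merging through precision.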