Opinion Mining Dataset - ACL-IJCNLP 2009 - Joshi & Rosé

Download Op-Hu-Liu-Subset-v1.0.tar.bz2. Please refer to the README file below for details (README also included in the compressed tarball).

README

Opinion Mining Dataset - ACL-IJCNLP 2009 - Joshi & Rosé
Version 1.0

0. Introduction

This dataset is a subset of the opinion mining datasets released by Dr. Bing Liu's group at the University of Illinois at Chicago. Their dataset is available from http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

This subset consists of 200 review comments (sentences, in most cases) for each of 11 different products, 2,200 comments in total. This subset was used for the experiments conducted in the following paper:

Mahesh Joshi and Carolyn Rosé. (2009) Generalizing Dependency Features for Opinion Mining. In Proceedings of ACL-IJCNLP 2009, Short Papers track.

If you use this dataset in any of your experiments, please cite the above paper in your work.

1. Format

This dataset (HL-11prods-2200comments.xml) has classification labels (from the manual annotation process done by Dr. Bing Liu's group) for the "opinion" class, which marks whether or not a review comment contains any subjective evaluation of the product itself or of one or more of its features.

The format of the file is pseudo-XML. Each review comment is represented by an <instance>...</instance> tag in the file. The complete set of 2,200 instances is enclosed in a single outermost <instances>...</instances> tag.

The <instance> tag has two attributes: "id" and "subpop". "id" is a unique identifier given to each instance. "subpop" is a string that identifies the product for which the review comment was written. Within each <instance> tag, the "cname" attribute of the <class> tag contains the classification label: POS stands for the opinion class, and NEG for the non-opinion class. The <text>...</text> tag contains the actual text of the review comment.
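
As an illustration of the format described above, the following Python sketch extracts the "id", "subpop", "cname" and text of each instance. It is not part of the dataset release. Because the file is pseudo-XML and may not load in a strict XML parser, the sketch uses regular expressions. The exact nesting and attribute order are assumptions inferred from the description, and the function name parse_instances is hypothetical.

```python
import re
from typing import Dict, List

# Assumed layout, inferred from the README text (not verified against the file):
#   <instance id="..." subpop="..."> <class cname="POS"/> <text>...</text> </instance>
# The \b after "instance" keeps the pattern from matching the outer <instances> tag.
INSTANCE_RE = re.compile(r"<instance\b([^>]*)>(.*?)</instance>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')
CLASS_RE = re.compile(r"<class\b([^>]*)>")
TEXT_RE = re.compile(r"<text>(.*?)</text>", re.DOTALL)


def parse_instances(raw: str) -> List[Dict[str, str]]:
    """Return one dict per <instance> with keys id, subpop, label, text."""
    records = []
    for match in INSTANCE_RE.finditer(raw):
        # Attributes on the <instance> tag itself: id and subpop.
        attrs = dict(ATTR_RE.findall(match.group(1)))
        body = match.group(2)

        # The label lives in the cname attribute of the nested <class> tag.
        class_match = CLASS_RE.search(body)
        class_attrs = dict(ATTR_RE.findall(class_match.group(1))) if class_match else {}

        text_match = TEXT_RE.search(body)
        records.append({
            "id": attrs.get("id", ""),
            "subpop": attrs.get("subpop", ""),
            "label": class_attrs.get("cname", ""),  # POS = opinion, NEG = non-opinion
            "text": text_match.group(1).strip() if text_match else "",
        })
    return records
```

To use it, read HL-11prods-2200comments.xml into a string and pass it to parse_instances. The file's character encoding is not stated in this README, so it may need to be specified when opening the file.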

2. Regarding evaluation in the paper

For the experiments reported in Joshi and Rosé [1], leave-one-product-out evaluation was performed. In effect, this was an 11-fold cross-validation in which the entire set of 200 comments for each product served as the test fold once, while the 2,000 review comments for the other ten products formed the training fold.
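
The leave-one-product-out splits described above can be sketched in Python as follows. This is an illustration only, not the code used in the paper. It assumes each instance has already been loaded as a dict with a "subpop" key holding the product name. The function name leave_one_product_out and the record layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def leave_one_product_out(
    records: List[Record],
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (held_out_product, train, test), one fold per distinct product."""
    # Group instances by product name (the "subpop" attribute).
    by_product = defaultdict(list)
    for rec in records:
        by_product[rec["subpop"]].append(rec)

    # Each product is the test fold exactly once; all other products form training.
    for held_out in sorted(by_product):
        test = by_product[held_out]
        train = [
            rec
            for product, recs in by_product.items()
            if product != held_out
            for rec in recs
        ]
        yield held_out, train, test
```

With the full dataset, this should produce 11 folds, each with 200 test instances and 2,000 training instances.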

3. Contact Information

If you have any questions, comments or concerns, please contact Mahesh Joshi (maheshj cs cmu edu --- insert an AT and two DOTs at the right places to complete the email address).

* References

[1] Mahesh Joshi and Carolyn Rosé. (2009) Generalizing Dependency Features for Opinion Mining. In Proceedings of ACL-IJCNLP 2009, Short Papers track.