Restaurant Reviews Dataset

I collected this data (in a project with Noemie Elhadad) from http://newyork.citysearch.com/ in August 2006. Out of 17843 restaurants, only 5531 had reviews, which gives a total of 52077 reviews. The maximum number of reviews for a single restaurant is 242 (to give a better idea of the distribution: 25 restaurants >=100 reviews, 103 restaurants >=10 reviews). Here is the distribution of ratings (Columns = 1: Rating, 2: Review counts, 5: Percent) and cuisines (Columns = 1: Cuisine, 2: Restaurant Count, 4: Review Count - note that one restaurant can have multiple cuisines).
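As a quick sanity check on the counts quoted above, the short Python sketch below derives the share of restaurants that have reviews (roughly 31%) and the mean number of reviews per reviewed restaurant (about 9.4). The variable names are illustrative only and are not part of the dataset.

```python
# Counts quoted in the dataset description above.
total_restaurants = 17843
reviewed_restaurants = 5531
total_reviews = 52077

# Fraction of restaurants with at least one review.
share_reviewed = reviewed_restaurants / total_restaurants

# Mean number of reviews among restaurants that have reviews.
mean_reviews = total_reviews / reviewed_restaurants

print(f"{share_reviewed:.1%} of restaurants have at least one review")
print(f"{mean_reviews:.1f} reviews per reviewed restaurant on average")
```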

Data

*.xml files (example) are plain extracted data in XML based on this schema
*.pos files (example) are *.xml files tokenized and part-of-speech tagged using OpenNLP
*.cnk files (example) are *.pos files chunk parsed using OpenNLP
*.gnk files (example) are *.cnk files after some tokenization is performed. More details here
Filenames are the identifiers that Citysearch uses, so the reviews from http://newyork.citysearch.com/review/41955207 are in 41955207.*
A few example Perl scripts to manipulate these files are here.

And you can download all of the above here.
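The filename convention described above can be sketched in Python as follows. This is a minimal illustration, assuming the identifier is the last path segment of the review URL and that all four file types sit together in one directory; the function names and the data_dir parameter are hypothetical, and the actual layout of the downloaded archive may differ.

```python
from pathlib import Path
from urllib.parse import urlparse

# The four file types listed in the Data section.
EXTENSIONS = ("xml", "pos", "cnk", "gnk")


def review_id(url):
    """Return the last path segment of a review URL (assumed to be the identifier)."""
    return urlparse(url).path.rstrip("/").split("/")[-1]


def data_files(url, data_dir="."):
    """Build the expected file paths for one restaurant.

    data_dir is a hypothetical location for the unpacked dataset.
    """
    rid = review_id(url)
    return [Path(data_dir) / f"{rid}.{ext}" for ext in EXTENSIONS]


if __name__ == "__main__":
    for path in data_files("http://newyork.citysearch.com/review/41955207"):
        print(path)
```

The sketch only builds paths; it does not check that the files exist, so a caller would still want to test each path before opening it.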

References

OpenNLP (http://opennlp.sourceforge.net/), which is built on top of the Maximum Entropy tools (http://maxent.sourceforge.net/), inspired by the thesis work of Adwait Ratnaparkhi (UPenn 1998). The models are trained on the Penn Treebank.