This is the README file for the "movies-data v1.0" dataset, which consists
of movie metadata and critics' reviews of 1,718 movies released during the
years 2005 to 2009, from the following sources:

    Austin Chronicle (www.austinchronicle.com)  --  462 reviews
    Boston Globe (www.boston.com)               --  731 reviews
    LA Times (www.calendarlive.com)             --  625 reviews
    Entertainment Weekly (www.ew.com)           -- 1039 reviews
    New York Times (www.nytimes.com)            -- 1375 reviews
    Variety (www.variety.com)                   -- 1454 reviews
    Village Voice (www.villagevoice.com)        -- 1396 reviews

The data collection process involved scraping www.metacritic.com for movie
metadata and URLs of reviews, www.the-numbers.com for financial information
(opening weekend gross revenue, number of screens on which the movie
opened, and budget), and the above web sites for the actual reviews. The
process is detailed in the file doc/documentation.txt.

File and directory layout
=========================

After extracting movies-data-v1.0.tar.gz, you should have the following
directory structure:

* movies-data-v1.0/README
  -- this file that you are currently reading
* movies-data-v1.0/doc
* movies-data-v1.0/doc/documentation.txt
  -- file describing the details of the data collection and cleaning
  process
* movies-data-v1.0/metacritic+starpower+holiday
  -- directory containing 2,080 XML files, one for each movie, for the
  entire set of movies which we scraped from www.metacritic.com
* movies-data-v1.0/metacritic+starpower+holiday+revenue+screens+reviews
  -- directory containing 1,718 XML files (a subset of the 2,080 files
  above), one for each movie that had all the required information from
  www.the-numbers.com that we used in our experiments (opening weekend
  revenue and number of screens), as well as at least one critic review
  on one of the seven review websites we crawled
* movies-data-v1.0/reviews
    movies-data-v1.0/reviews/www.austinchronicle.com
    movies-data-v1.0/reviews/www.boston.com
    movies-data-v1.0/reviews/www.calendarlive.com
    movies-data-v1.0/reviews/www.ew.com
    movies-data-v1.0/reviews/www.nytimes.com
    movies-data-v1.0/reviews/www.variety.com
    movies-data-v1.0/reviews/www.villagevoice.com
  -- These directories contain the text of the critics' reviews for the
  movies (not all review sites have reviews for all of the 1,718 movies).
* movies-data-v1.0/traindevtest_splits
    movies-data-v1.0/traindevtest_splits/train
      -- file containing list of movies used as the training set
    movies-data-v1.0/traindevtest_splits/dev
      -- file containing list of movies used as the development set
    movies-data-v1.0/traindevtest_splits/test
      -- file containing list of movies used as the test set
* movies-data-v1.0/7domains-train-dev.tl.xml
  movies-data-v1.0/7domains-train-test.tl.xml
  -- XML formatted files containing the movie reviews and opening weekend
  revenue for each movie. The file 7domains-train-dev.tl.xml contains the
  training and the development set movies, and the file
  7domains-train-test.tl.xml contains the training and test set movies.
  Please refer to doc/documentation.txt for information on the format of
  these XML files.
* movies-data-v1.0/perscreen-7domains-train-dev.tl.xml
  movies-data-v1.0/perscreen-7domains-train-test.tl.xml
  -- Similar to the above XML files, except these contain the "per screen"
  opening weekend revenue as the target variable to be predicted.

If you make use of this data, please cite the following publication:

    Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith.
    Movie Reviews and Revenues: An Experiment in Text Regression.
    In Proceedings of NAACL-HLT (Short paper track), Los Angeles, CA,
    June 2010.
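
As a quick sanity check after extraction, the file counts stated in the
directory layout above can be compared against what is on disk. The
following is a minimal Python sketch, not part of the release: the helper
names count_files and check_layout are hypothetical, and it assumes that
the two metadata directories contain nothing but the per-movie XML files,
since every regular file is counted regardless of extension.

```python
from pathlib import Path

# Expected counts, copied from the directory descriptions in this README.
EXPECTED_XML_COUNTS = {
    "metacritic+starpower+holiday": 2080,
    "metacritic+starpower+holiday+revenue+screens+reviews": 1718,
}


def count_files(directory):
    """Count regular files anywhere under `directory` (0 if it is missing)."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.rglob("*") if p.is_file())


def check_layout(root):
    """Return {directory name: (found, expected)} for the metadata directories.

    Assumption: every regular file in these directories is one movie's XML
    file, so any stray file would inflate the "found" count.
    """
    root = Path(root)
    return {
        name: (count_files(root / name), expected)
        for name, expected in EXPECTED_XML_COUNTS.items()
    }


if __name__ == "__main__":
    # "movies-data-v1.0" is the top-level directory created by the tarball.
    for name, (found, expected) in check_layout("movies-data-v1.0").items():
        status = "ok" if found == expected else "MISMATCH"
        print(f"{name}: found {found}, expected {expected} [{status}]")
```

A mismatch most likely indicates an incomplete extraction or extra files
left in the directory; the format of the XML files themselves is described
in doc/documentation.txt.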

This data set release was prepared by:

    Mahesh Joshi  - maheshj@cs.cmu.edu
    Dipanjan Das  - dipanjan@cs.cmu.edu
    Kevin Gimpel  - kgimpel@cs.cmu.edu

    Language Technologies Institute
    Carnegie Mellon University

    05/03/2010

DISCLAIMER
==========

We have made every effort on our part to create a clean dataset for
academic research purposes. However, WE PROVIDE NO GUARANTEES WHATSOEVER
ABOUT THE QUALITY OR CONTENT OF THE DATA. If you use this data in your
research, please bear in mind that the results and conclusions you draw
using this data are entirely your own responsibility. We welcome
corrections to the dataset; please send them to Mahesh Joshi
(maheshj@cs.cmu.edu).

USES
====

This dataset is provided for academic research purposes only. Any
commercial use of this dataset is prohibited.