This is the README file for the "movies-data v1.0" dataset, which consists
of movie metadata and critics' reviews of 1718 movies released during the 
years 2005 to 2009, from the following sources:

Austin Chronicle (www.austinchronicle.com) -- 462 reviews
Boston Globe (www.boston.com) -- 731 reviews
LA Times (www.calendarlive.com) -- 625 reviews
Entertainment Weekly (www.ew.com) -- 1039 reviews
New York Times (www.nytimes.com) -- 1375 reviews
Variety (www.variety.com) -- 1454 reviews
Village Voice (www.villagevoice.com) -- 1396 reviews

The data collection process involved scraping data from www.metacritic.com 
to get movie metadata and URLs of reviews, www.the-numbers.com for 
financial information (opening weekend gross revenue, number of screens 
on which the movie opened, and budget), and the above web sites 
for actual reviews. The process is detailed in the file doc/documentation.txt.

File and directory layout
=========================

After extracting movies-data-v1.0.tar.gz, you should have the following
directory structure:

* movies-data-v1.0/README
  -- this file that you are currently reading

* movies-data-v1.0/doc
* movies-data-v1.0/doc/documentation.txt
  -- file describing the details about the data collection and cleaning process

* movies-data-v1.0/metacritic+starpower+holiday
  -- directory containing 2,080 XML files, one for each movie, for the entire
     set of movies which we scraped from www.metacritic.com

* movies-data-v1.0/metacritic+starpower+holiday+revenue+screens+reviews
  -- directory containing 1,718 XML files (subset of the 2,080 files above), 
     one for each movie that had all the required information from 
     www.the-numbers.com that we used in our experiments (opening weekend 
     revenue and number of screens), as well as at least one critic review
     on one of the seven review websites we crawled.

* movies-data-v1.0/reviews
  movies-data-v1.0/reviews/www.austinchronicle.com
  movies-data-v1.0/reviews/www.boston.com
  movies-data-v1.0/reviews/www.calendarlive.com
  movies-data-v1.0/reviews/www.ew.com
  movies-data-v1.0/reviews/www.nytimes.com
  movies-data-v1.0/reviews/www.variety.com
  movies-data-v1.0/reviews/www.villagevoice.com

  -- directories containing the text of the critics' reviews for the
     movies (not all review sites have reviews for all of the 1,718 movies).

* movies-data-v1.0/traindevtest_splits
* movies-data-v1.0/traindevtest_splits/train
  -- file containing the list of movies used as the training set

* movies-data-v1.0/traindevtest_splits/dev
  -- file containing the list of movies used as the development set

* movies-data-v1.0/traindevtest_splits/test
  -- file containing the list of movies used as the test set

* movies-data-v1.0/7domains-train-dev.tl.xml
  movies-data-v1.0/7domains-train-test.tl.xml
  -- XML-formatted files containing the movie reviews and opening
     weekend revenue for each movie. The file 7domains-train-dev.tl.xml
     contains the training and development set movies, and
     the file 7domains-train-test.tl.xml contains the training
     and test set movies. Please refer to doc/documentation.txt
     for information on the format of these XML files.

* movies-data-v1.0/perscreen-7domains-train-dev.tl.xml
  movies-data-v1.0/perscreen-7domains-train-test.tl.xml
  -- Similar to the above XML files, except that these contain the
     "per screen" opening weekend revenue as the target variable
     to be predicted.
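
The layout above can be sanity-checked after extraction with a short
script. The following is only a minimal sketch, not part of the release.
It assumes that the per-movie files end in ".xml", that a movie has the
same file name in both metadata directories, that each file under
reviews/ is one review, and that the split files are UTF-8 text with one
movie per line. None of these assumptions is confirmed by this README;
doc/documentation.txt is the authority on the actual formats. The
function names are hypothetical.

```python
from pathlib import Path

# Directory names are taken from the layout described in this README.
ALL_MOVIES_DIR = "metacritic+starpower+holiday"
EXPERIMENT_DIR = "metacritic+starpower+holiday+revenue+screens+reviews"


def xml_names(directory):
    """Return the set of *.xml file names directly inside `directory`."""
    return {p.name for p in Path(directory).glob("*.xml")}


def check_layout(root):
    """Count the per-movie XML files in the two metadata directories.

    ASSUMPTION: files use the .xml extension and a movie keeps the same
    file name in both directories.
    """
    root = Path(root)
    all_movies = xml_names(root / ALL_MOVIES_DIR)
    experiment = xml_names(root / EXPERIMENT_DIR)
    return {
        "all_movies": len(all_movies),
        "experiment": len(experiment),
        # Names in the experiment set that are absent from the full set.
        "missing_from_all": sorted(experiment - all_movies),
    }


def count_reviews(root):
    """Map each review site directory name to its number of files.

    ASSUMPTION: every regular file under reviews/<site> is one review.
    """
    reviews = Path(root) / "reviews"
    return {
        site.name: sum(1 for p in site.rglob("*") if p.is_file())
        for site in sorted(reviews.iterdir())
        if site.is_dir()
    }


def read_split(root, name):
    """Read one split file ("train", "dev" or "test").

    ASSUMPTION: UTF-8 text with one movie identifier per non-empty line.
    """
    path = Path(root) / "traindevtest_splits" / name
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]
```

If the assumptions hold, check_layout("movies-data-v1.0") on an intact
copy should report the 2,080 and 1,718 files described above with nothing
listed under "missing_from_all", and count_reviews("movies-data-v1.0")
can be compared against the per-site review counts at the top of this
file.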


If you make use of this data, please cite the following publication:

Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A. Smith.
Movie Reviews and Revenues: An Experiment in Text Regression.
In Proceedings of NAACL-HLT (Short paper track), Los Angeles, CA, June 2010.

This data set release was prepared by:

Mahesh Joshi - maheshj@cs.cmu.edu
Dipanjan Das - dipanjan@cs.cmu.edu
Kevin Gimpel - kgimpel@cs.cmu.edu
Language Technologies Institute
Carnegie Mellon University
05/03/2010

DISCLAIMER
==========

We have made every effort on our part to create a clean dataset for
academic research purposes. However, WE PROVIDE NO GUARANTEES
WHATSOEVER ABOUT THE QUALITY OR CONTENT OF THE DATA. If you use
this data in your research, please bear in mind that the results
and conclusions you draw using this data are entirely your own
responsibility.

We welcome corrections to the dataset; please send them to Mahesh Joshi
(maheshj@cs.cmu.edu).

USES
====

This dataset is provided for academic research purposes only;
any commercial use of this dataset is prohibited.