blog-data Corpus
-----------
http://www.ark.cs.cmu.edu/blog-data
Version 1.0 released May 29, 2009.

If you publish research based on these data, please cite the following
paper:

 Predicting Response to Political Blog Posts with Topic Models 
 Tae Yano, William W. Cohen, and Noah A. Smith 
 In Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference, Boulder, CO, May/June 2009.

@InProceedings{yano_cohen_smith_09,
  author = {Yano, Tae and Cohen, William W. and Smith, Noah A.},
  title = {Predicting response to political blog posts with topic models},
  booktitle = {Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference},
  year = {2009}
}
 
 http://www.cs.cmu.edu/~nasmith/papers/yano+cohen+smith.naacl09.pdf

The details are in Section 3.1 of the paper. This data set is provided "AS IS" for research purposes.

The distribution contains tokenized and standardized texts from blog posts and comments
used for the topic modeling experiments reported in the paper above.

The texts are organized by site:

my -- Matthew Yglesias
dk -- Daily Kos
cb -- Carpetbagger Report
rs -- Red State
rwn -- Right Wing News

For each site, there are two sets of data, "distilled" and "hbc_data":

The "distilled" directory contains the blog texts (posts and comments) extracted from the raw HTML. We removed all HTML markup (such as links and images) and replaced punctuation marks and digits with unique meta tags.

"hbc_data" directory contains the above material converted for HBC (Hierarchical Bayesian Compiler). Words are converted to unique word ids. Each post (and comment section) is represented in one line in the output file. Certain word pruning is applied at this point, for the purpose of our experiments. See the published paper for the detail of what pruning was applied at this stage.

The included Python scripts convert the "distilled" data set to the "hbc_data" data set. Refer to the included readme_first file for how to run the scripts and in which order.
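
For readers who want a feel for the conversion before opening the scripts, here is a minimal sketch of mapping words to integer IDs and writing one document per line. It is not the included scripts' code. The function names, the 1-based IDs, the space-separated output, and the min_count frequency cutoff are assumptions for illustration; the pruning actually applied is described in the paper.

```python
from collections import Counter


def build_vocab(docs, min_count=1):
    """Assign a 1-based integer ID to every word kept after pruning.

    docs is a list of whitespace-tokenized strings. Words seen fewer
    than min_count times in total are dropped (a stand-in for the
    pruning described in the paper, not a reproduction of it).
    """
    counts = Counter(tok for doc in docs for tok in doc.split())
    vocab = {}
    for doc in docs:
        for tok in doc.split():
            if counts[tok] >= min_count and tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode(docs, vocab):
    """Return one line of space-separated word IDs per document."""
    return [
        " ".join(str(vocab[tok]) for tok in doc.split() if tok in vocab)
        for doc in docs
    ]


if __name__ == "__main__":
    posts = ["the war in iraq", "the war vote", "vote the"]
    vocab = build_vocab(posts, min_count=2)
    for line in encode(posts, vocab):
        print(line)
```

IDs are assigned in order of first appearance, so the mapping is stable for a fixed input order. Pruned words simply disappear from the encoded lines.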

Please direct any questions to taey@cs.cmu.edu.