blog-data Corpus
----------------

http://www.ark.cs.cmu.edu/blog-data

Version 1.0 released May 29, 2009.

If you publish research based on these data, please cite the following paper:

  Predicting Response to Political Blog Posts with Topic Models
  Tae Yano, William W. Cohen, and Noah A. Smith
  In Proceedings of the North American Association for Computational
  Linguistics Human Language Technologies Conference, Boulder, CO,
  May/June 2009.

  @InProceedings{yano_cohen_smith_09,
    author = {Yano, Tae. and Cohen, William. and Smith, Noah A.},
    title = {Predicting response to political blog posts with topic models},
    booktitle = {Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference},
    year = {2009}
  }

  http://www.cs.cmu.edu/~nasmith/papers/yano+cohen+smith.naacl09.pdf

The details are in Section 3.1 of the paper.

This data set is provided "AS IS" for research purposes.

The distribution contains tokenized and standardized texts from blog posts
and comments used for the topic modeling experiments reported in the paper
above. The texts are organized by site:

  my  -- Matthew Yglesias
  dk  -- Daily Kos
  cb  -- Carpetbagger Report
  rs  -- Red State
  rwn -- Right Wing News

For each site, there are two sets of data, distilled and hbc_data:

"distilled" directory contains the blog texts (posts and comments) extracted
from the raw HTML. We eliminated all HTML markup (such as links or images).
Punctuation marks and digits are replaced with unique meta tags.

"hbc_data" directory contains the above material converted for HBC
(Hierarchical Bayesian Compiler). Words are converted to unique word ids.
Each post (and comment section) is represented as one line in the output
file. Some word pruning is applied at this point, for the purposes of our
experiments. See the published paper for details of the pruning applied at
this stage.

The included Python scripts convert the "distilled" data set to the "hbc"
data set.
Refer to the included readme_first file for how to run those scripts and in
which order.

Please direct any questions to taey@cs.cmu.edu.
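
For readers who want a feel for the kind of conversion described for
"hbc_data" (words mapped to unique word ids, one document per line), the
following is a minimal sketch only. It is not the included scripts and does
not reproduce their file formats or the pruning used in the paper: the
function names, the 1-based ids, and the min_count frequency cutoff are all
assumptions made for illustration.

```python
from collections import Counter


def build_vocab(docs, min_count=1):
    """Map each kept word to an integer id (1-based, first-seen order).

    ASSUMPTION: a simple frequency cutoff stands in for the pruning
    described in the paper; the real criteria may differ.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = {}
    for doc in docs:
        for tok in doc:
            if counts[tok] >= min_count and tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode(docs, vocab):
    """Return one line per document: space-separated word ids.

    Words that were pruned from the vocabulary are silently dropped.
    """
    return [" ".join(str(vocab[tok]) for tok in doc if tok in vocab)
            for doc in docs]


# Hypothetical toy input: each document is a list of tokens.
docs = [["the", "senate", "vote"], ["the", "vote", "passed"]]
vocab = build_vocab(docs, min_count=2)
lines = encode(docs, vocab)
```

Here each element of lines corresponds to one document, mirroring the
one-line-per-post layout described above. For the actual conversion, use the
included scripts as directed in readme_first.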