May 16, 2008

Dear Little Green Footballers,

We are Noah Smith and William Cohen, two faculty members at Carnegie Mellon University. In collaboration with our Ph.D. student Tae Yano, we have been studying text on the political blogosphere. Tae has been posting various intermediate results of her research on her web page for discussion among ourselves, and forgot to modify the read permissions on the directory to make it invisible outside CMU. There's nothing sensitive here - just research code and some statistics we've calculated on blog data, including Little Green Footballs and many other political blogs. We took the data down this morning while deciding how to respond to you, but will probably put it back up soon (perhaps with more understandable comments).

We are researchers in the fields of Machine Learning and Natural Language Processing (that's what "NLP" stands for in this context, as itellu3times guessed, not Neurolinguistic Programming). In a nutshell, this means we are computer scientists who develop algorithms that use text data in order to automate tasks involving text in languages like English. As many of you guessed, this often involves statistical analysis of text data. If the idea that computer programs are "watching" your posts and comments and counting words is disconcerting, consider that this sort of automated statistical analysis is what makes search engines work (and probably what led you to find our files).

We would like to apologize for the confusion, but it looks like you certainly had fun. Your thread was fascinating.

We were indeed originally motivated by the Adamic and Glance paper (http://www.blogpulse.com/papers/2005/AdamicGlanceBlogWWW.pdf) that Dan G uncovered.
Natalie Glance, the second author of that paper, is a good friend of one of us (William) - we taught a seminar last fall on the vast academic literature on analyzing blogs and other social media (http://socialmedia.scribblewiki.com/Main_Page) - and we got interested in the question of understanding how political communities form and evolve. Dad o Blondes nailed some of the financial reasons for being interested in blogs in general (http://littlegreenfootballs.com/showc/520/5294884), and these are part of the reasons why we hope funding agencies will eventually get interested in our work - if it progresses far enough! - but we picked this problem mostly because we read political blogs and it seemed like fun. Our work might lead to better tools for summarizing what's happening in the blogosphere (think Memeorandum) or directing readers to stories that are most interesting to them (think Netflix's recommendation scheme for movies, applied to blogs).

Everything we crawled was public, and we crawled in accordance with robots.txt, so we didn't think to ask permission from LGF or any of the other sites (some liberal, some conservative) we crawled. We apologize to anyone who feels their privacy was violated, and if you send us an email we'll ask Tae to strip your comments from her data. (On the other hand, you guys did find pictures of Noah's cats and a PDF describing Tae's knitting project, so maybe we're even? :-)

We weren't looking for sockpuppets, although that is a cool idea. We aren't trying to build an automated army of message-machine robot commenters, although we can see the financial value of one. We did spend some time looking at using fancy programs to predict why people post - e.g., to direct attention to another posting vs. giving original commentary. We also did some work on how the sorts of things people say differ in comments vs. blog postings, and in red vs. blue blogs.
Some of the raw data is in Tae's directory, and there are also some tag clouds for RedState vs. DailyKos (but not LGF, sorry!) at http://www.cs.cmu.edu/~wcohen/cloud/ if you're interested. We did look at "comment prediction" - trying to predict which particular posts individuals will comment on (as a proxy for predicting what they'd actually be interested in reading).

Some of you have pondered whether your comments about our work affect its scientific validity. Because this isn't a social science experiment, and the participants are not "subjects," we don't think your reference to our work will have much of an effect on what we learn, especially since we're looking at very large amounts of data gathered over a long period of time and from many blogs.

None of the work Tae did, and none of the LGF data, has been published anywhere yet. One of William's other students actually published a paper on something technically similar to comment prediction - finding what posts a blogger would link to - which Charles dug out a reference to (http://www.cs.cmu.edu/~nmramesh/icwsm.pdf - look at Table 5 for a mockup of the sort of tool one could build with this). Yet another student of William's has looked at trying to predict the political affiliation of a blogger from who he/she links to (http://www.cs.cmu.edu/~wcohen/postscript/icwsm-2007-frank-abstract.pdf).

So - apologies again if our work seemed invasive. Like those of most academics, our motivations are obscure, but not evil. Except of course that we happen to be liberals. :-)

Feel free to drop by our home pages (http://www.cs.cmu.edu/~wcohen and http://www.cs.cmu.edu/~nasmith) or William's blog (http://wcohen.blogspot.com/) to hear about how our work progresses.

Best regards,

Noah A. Smith, Assistant Professor
William W. Cohen, Associate Research Professor
School of Computer Science
Carnegie Mellon University