Tae Yano

Graduate Research Assistant
Language Technologies Institute
Carnegie Mellon University

About me:

I am a Ph.D. candidate in the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University. I research natural language processing (NLP), statistical text analysis, probabilistic topic modeling, and their interaction with social science with my advisors, Noah Smith and William Cohen. Much of our work falls into the area of text-driven prediction. In recent work we have focused mostly on problems arising in American politics and government.

Prior to Carnegie Mellon, I was at Columbia University, where I obtained an MS in Computer Science. There I worked with Becky Passonneau at Columbia's CCLS on the CLiMB Project. Before becoming a full-time graduate student I was a software engineer, building system applications for large-scale document devices.

Research Interests:

Understanding a large volume of text is difficult for humans, which poses a unique challenge in the face of the recent flood of information. There seems to be so much information, yet we seem not to know where to begin reading.

I think statistical NLP is uniquely equipped to make a social impact in this context. Its fundamental pursuit is, in short, to understand linguistic phenomena and language artifacts (e.g., documents) by taking advantage of evidence in large quantities. We hope our research will be of both practical and scholarly importance.

Some of our attempts are outlined in my dissertation proposal: Text as Actuator: Text-Driven Response Modeling and Prediction in Politics

Refereed Publications:

Tae Yano, Noah A. Smith, and John D. Wilkerson.
Textual Predictors of Bill Survival in Congressional Committees
In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL 2012). Montreal, Quebec. July 2012

Jacob Eisenstein, Tae Yano, William W. Cohen, Noah A. Smith, and Eric P. Xing.
Structured Databases of Named Entities from Bayesian Nonparametrics
In Proceedings of the EMNLP Workshop on Unsupervised Learning in NLP. Edinburgh, UK. July 2011

Justin Cranshaw and Tae Yano.
Seeing a Home away from the Home: Distilling proto-Neighborhood from Incidental Data with Topic Modeling
In Proceedings of the Workshop on Computational Social Science and the Wisdom of Crowds, Annual Conference on Neural Information Processing Systems (NIPS). Vancouver, B.C., Canada. Dec 2010

Tae Yano, Philip Resnik, and Noah A. Smith.
Shedding (a Thousand Points of) Light on Biased Language
In Proceedings of the NAACL-HLT Workshop on Creating Speech and Language Data With Mechanical Turk. Los Angeles, CA. June 2010

Tae Yano and Noah A. Smith.
What's Worthy of Comment? Content and Comment Volume in Political Blogs with Topic Models
In Proceedings of the International AAAI Conference on Weblogs and Social Media 2010. Washington D.C. May, 2010

Tae Yano, William W. Cohen, and Noah A. Smith.
Predicting Response to Political Blog Posts with Topic Models
In Proceedings of the North American Chapter of the Association for Computational Linguistics Human Language Technologies Conference (NAACL). Boulder, CO. May/June, 2009

Rebecca Passonneau, Tom Lippincott, Tae Yano, and Judith Klavans.
Relation between Agreement Measures on Human Labeling and Machine Learning Performance: Results from an Art History Domain
In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC). Marrakesh, Morocco. May/Jun, 2008

Judith Klavans, Carolyn Sheffield, Eileen Abels, Joan Beaudoin, Laura Jenemann, Jimmy Lin, Tom Lippincott, Rebecca Passonneau, Tandeep Sidhu, Dagobert Soergel, and Tae Yano.
Computational Linguistics for Metadata Building: Aggregating Text Processing Technologies for Enhanced Image Access
In Proceedings of the LREC Workshop on Language Resources for Content-Based Image Retrieval (OntoImage 2008). Marrakesh, Morocco. May/Jun, 2008

Rebecca Passonneau, Tae Yano, Tom Lippincott, and Judith Klavans.
Functional Semantic Categories for Art History Text: Human Labeling and Preliminary Machine Learning
In Proceedings of the Workshop on Metadata Mining for Image Understanding, 3rd International Conference on Computer Vision Theory and Applications (VISAPP). Funchal, Madeira, Portugal. Jan, 2008


Term Project Reports:

Tae Yano
KP: A knitting language
Term project report, Programming Languages and Translators (COMS4115)
Columbia University, New York, NY. Fall 2005

Tae Yano and Moonyoung Kang
Taking advantage of Wikipedia in Natural Language Processing
Term project report, Language and Statistics II (11-762)
Carnegie Mellon University, Pittsburgh, PA. Fall 2008


I have released some of the data I collected along the way. Please follow the terms of use, and cite our papers if you use the data in your own work.

Political Blog Corpora
Data from five American political blogs, collected from 2007 to 2008. README.txt

Congressional Bill Corpus
51,762 U.S. Congressional bills from the 103rd to 111th Congresses (1993 to 2010), each annotated with whether it survived (i.e., was recommended by) the Congressional committee process. README.txt