William Yang Wang

Software and data

1. The earnings calls dataset:

* William Yang Wang and Zhenhao Hua, "A Semiparametric Gaussian Copula Regression Model for Predicting Financial Risks from Earnings Calls", in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), long paper, Baltimore, MD, June 22-27, ACL. Preprint version: PDF BIB

Download: dataset.

2. The Yelp computational branding analytics (CBA) data:

* William Yang Wang, Ed Lin, and John Kominek, "This Text has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics", to appear in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), full paper, Seattle, WA, USA, Oct. 18-21, ACL. PDF BIB

Download: dataset.

3. The Columbia Summarization Corpus (CSC):

* William Yang Wang, Kapil Thadani, and Kathleen R. McKeown, "Identifying Event Descriptions using Co-training with Online News Summaries", in Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, Nov. 8-13, ACL-AFNLP. PDF BIB


The Columbia Summarization Corpus (CSC) was retrieved from the output of Newsblaster, an online news summarization system that crawls the Web for news articles, clusters them by topic, and produces a multidocument summary for each cluster. We collected a total of 166,435 summaries containing 2.5 million sentences and covering 2,129 days in the 2003-2011 period. Additional references on the Columbia Newsblaster summarizer can be found on the Columbia NLP group's publications page. Uses of the CSC include, but are not limited to, the following areas:

* Event mining
* Language generation
* Summarization
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)

Click here to download the CSC corpus.