William Yang Wang

Software and data

0. ProPPR: a scalable probabilistic first-order logic.

* William Yang Wang, Kathryn Mazaitis, Ni Lao, and William W. Cohen, "Efficient Inference and Learning in a Large Knowledge Base: Reasoning with Extracted Information using a Locally Groundable First-Order Probabilistic Logic", to appear in Machine Learning Journal (MLJ 2015), Springer. Preprint version: PDF BIB


1. The meme descriptions dataset:

* William Yang Wang and Miaomiao Wen, "I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions", to appear in the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), long paper, Denver, CO, USA, May 31-June 5, ACL. Preprint version: PDF BIB

Download: dataset.

2. The earnings calls dataset:

* William Yang Wang and Zhenhao Hua, "A Semiparametric Gaussian Copula Regression Model for Predicting Financial Risks from Earnings Calls", in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), long paper, Baltimore, MD, June 22-27, ACL. Preprint version: PDF BIB

Download: dataset.

3. The Yelp computational branding analytics (CBA) data:

* William Yang Wang, Ed Lin, and John Kominek, "This Text has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics", to appear in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), full paper, Seattle, WA, USA, Oct. 18-21, ACL. PDF BIB

Download: dataset.

4. The Columbia Summarization Corpus (CSC):

* William Yang Wang, Kapil Thadani, and Kathleen R. McKeown, "Identifying Event Descriptions using Co-training with Online News Summaries", in Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, Nov. 8-13, ACL-AFNLP. PDF BIB

The Columbia Summarization Corpus (CSC) was retrieved from the output of the Newsblaster online news summarization system, which crawls the Web for news articles, clusters them by topic, and produces multidocument summaries for each cluster. We collected a total of 166,435 summaries containing 2.5 million sentences and covering 2,129 days in the 2003-2011 period. Additional references for the Columbia Newsblaster summarizer can be found on the Columbia NLP group's publications page. The CSC corpus can be used in, but is not limited to, the following areas:

* Event mining
* Language generation
* Summarization
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)

Click here to download the CSC corpus.