Software and data
9. Annotated Annoying Behaviors
* William Yang Wang and Diyi Yang, "That's So Annoying!!!: A Lexical and Frame-Semantic Embedding Based Data Augmentation Approach to Automatic Categorization of Annoying Behaviors using #petpeeve Tweets", to appear in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), short paper, Lisbon, Portugal, Sept. 17-21, ACL. PDF BIB DATA
8. Information Extraction Tutorial at Peking University.
* CIPS Summer School IE Course Homepage. Slides: PPTX PDF. July 25, 2015
7. Three Wikipedia Datasets for Joint IE and Reasoning.
* William Yang Wang and William W. Cohen, "Joint Information Extraction and Reasoning: A Scalable Statistical Relational Learning Approach", to appear in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and The 7th International Joint Conference of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2015), long paper for oral presentation, Beijing, China, July 26-31, ACL. PDF BIB
6. ProPPR: a scalable probabilistic first-order logic.
* William Yang Wang, Kathryn Mazaitis, Ni Lao, and William W. Cohen, "Efficient Inference and Learning in a Large Knowledge Base: Reasoning with Extracted Information using a Locally Groundable First-Order Probabilistic Logic", to appear in Machine Learning Journal (MLJ 2015), Springer. Preprint version: PDF BIB
5. A large European family dataset for relational learning.
* William Yang Wang, Kathryn Mazaitis, and William W. Cohen, "A Soft Version of Predicate Invention Based on Structured Sparsity", to appear in Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI 2015), full paper for oral presentation, Buenos Aires, Argentina, July 25-31, IJCAI. Preprint version: PDF BIB
4. The meme descriptions dataset:
* William Yang Wang and Miaomiao Wen, "I Can Has Cheezburger? A Nonparanormal Approach to Combining Textual and Visual Information for Predicting and Generating Popular Meme Descriptions", to appear in the 2015 Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL HLT 2015), long paper, Denver, CO, USA, May 31-June 5, ACL. Preprint version: PDF BIB
3. The earnings calls dataset:
* William Yang Wang and Zhenhao Hua, "A Semiparametric Gaussian Copula Regression Model for Predicting Financial Risks from Earnings Calls", in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), long paper, Baltimore, MD, June 22-27, ACL. Preprint version: PDF BIB
2. The Yelp computational branding analytics (CBA) data:
* William Yang Wang, Ed Lin, and John Kominek, "This Text has the Scent of Starbucks: A Laplacian Structured Sparsity Model for Computational Branding Analytics", to appear in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), full paper, Seattle, WA, USA, Oct. 18-21, ACL. PDF BIB
1. The Columbia Summarization Corpus (CSC):
* William Yang Wang, Kapil Thadani, and Kathleen R. McKeown, "Identifying Event Descriptions using Co-training with Online News Summaries", in Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, Nov. 8-13, ACL-AFNLP. PDF BIB
The Columbia Summarization Corpus (CSC) was retrieved from the output of the Newsblaster online news summarization system, which crawls the Web for news articles, clusters them on specific topics, and produces multidocument summaries for each cluster. We collected a total of 166,435 summaries containing 2.5 million sentences and covering 2,129 days in the 2003-2011 period. Additional references on the Columbia Newsblaster summarizer can be found on the publication page of the Columbia NLP group website. The CSC corpus can be used in, but is not limited to, the following areas:
* Event mining
* Language generation
* Information retrieval
* Information extraction
* Sentiment analysis and opinion mining
* Question answering
* Text mining and natural language processing applications
* Language modeling for text processing
* Lexicon and ontology development
* Machine learning (supervised, semi-supervised, and unsupervised learning)
Click here to download the CSC corpus.