Research-Related Software, Code, and Data

This page lists the code, tools, and other research data that I make publicly available.
If you have any questions about this content, or if there's any other data you would like to use, please contact me at: dma [at] cs.cmu.edu

Ngram-based Comment Completion Eclipse Plugin

This Eclipse plugin enables word completion within comments, based on an n-gram model trained on multiple open-source Java projects and data from StackOverflow. Word completion works much like the code completion tools built into standard code editors: while you write a comment, the plugin prompts you with suggestions based on the implementation of the class you are currently commenting.
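
To illustrate the general idea of n-gram word completion, here is a minimal sketch of a bigram (n = 2) model that suggests the next word from the previous word and the prefix typed so far. It is not the plugin's implementation: the class name BigramCompleter, its methods, and the training sentences are all hypothetical, and a realistic model would use longer contexts and smoothing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy bigram model (hypothetical, for illustration only). */
class BigramCompleter {
    // counts.get(prev).get(next) = how often "prev next" occurred in the training text
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Adds every adjacent word pair of one training sentence to the counts. */
    void train(String sentence) {
        String[] words = sentence.toLowerCase().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.computeIfAbsent(words[i], k -> new HashMap<>())
                  .merge(words[i + 1], 1, Integer::sum);
        }
    }

    /** Returns up to k words seen after previousWord that start with prefix, most frequent first. */
    List<String> suggest(String previousWord, String prefix, int k) {
        Map<String, Integer> next =
            counts.getOrDefault(previousWord.toLowerCase(), Collections.emptyMap());
        List<String> candidates = new ArrayList<>();
        for (String word : next.keySet()) {
            if (word.startsWith(prefix.toLowerCase())) {
                candidates.add(word);
            }
        }
        // Sort by descending count; break ties alphabetically so the order is deterministic.
        candidates.sort((a, b) -> {
            int byCount = Integer.compare(next.get(b), next.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        BigramCompleter model = new BigramCompleter();
        // Made-up training comments, not taken from the actual corpus.
        model.train("returns the number of elements");
        model.train("returns the name of the class");
        model.train("returns the number of rows");
        // The user has typed "the n" inside a comment.
        System.out.println(model.suggest("the", "n", 2));
    }
}
```

In this sketch, ranking by raw bigram counts stands in for the model's probability estimates; the plugin itself is described as also conditioning on the class being commented, which this toy example does not attempt.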

Plugin update site:

http://www.cs.cmu.edu/~dmovshov/software/commentCompletionPlugin/

Corpus:

The tokenized open-source code used to create the model can be found on GitHub.
StackOverflow data was downloaded from ClearBits.

Relevant Paper:

Dana Movshovitz-Attias and William Cohen, Natural Language Models for Predicting Programming Comments, ACL, 2013

Sep 4, 2013

Abbreviation Alignment HMM

This is an abbreviation extractor based on a Hidden Markov Model (HMM). It extracts abbreviations and their definitions from a text corpus. The Abbreviation Alignment HMM code is part of the second-string open-source package.

Main Classes:

  • com.wcohen.ss.AbbreviationAlignment is an implementation of the abbreviation alignment metric.
  • com.wcohen.ss.expt.ExtractAbbreviations is a utility for extracting abbreviations from a text corpus using our method.

Code:

GitHub code (within the second-string package).

Data: Abbreviations Extracted from PubMed

Using this method, we extracted 1.4 million abbreviations from a corpus of 200K full-text PubMed articles. The extracted abbreviations are available here. Each line in this file contains the following fields:
  1. The probability of the abbreviation as given by the HMM
  2. The ID of the document in our corpus
  3. The Medline ID of the original text
  4. The Short Form of the abbreviation
  5. The Long Form of the abbreviation
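
As a rough sketch of how one line of this file could be read, the following Java class parses the five fields in the order listed above. The field delimiter is not stated here, so the code assumes tab-separated values; the class name AbbreviationRecord and the sample line are hypothetical and should be adjusted to the actual file.

```java
/** One line of the extracted-abbreviations file (hypothetical helper, for illustration only). */
class AbbreviationRecord {
    final double probability;  // 1. probability of the abbreviation as given by the HMM
    final String documentId;   // 2. ID of the document in the corpus
    final String medlineId;    // 3. Medline ID of the original text
    final String shortForm;    // 4. short form of the abbreviation
    final String longForm;     // 5. long form of the abbreviation

    AbbreviationRecord(double probability, String documentId, String medlineId,
                       String shortForm, String longForm) {
        this.probability = probability;
        this.documentId = documentId;
        this.medlineId = medlineId;
        this.shortForm = shortForm;
        this.longForm = longForm;
    }

    /** Parses one line. ASSUMPTION: fields are tab-separated; the real delimiter may differ. */
    static AbbreviationRecord parse(String line) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (fields.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields, got " + fields.length);
        }
        return new AbbreviationRecord(
            Double.parseDouble(fields[0]), fields[1], fields[2], fields[3], fields[4]);
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the actual data file.
        AbbreviationRecord r = parse("0.93\tdoc42\t12345678\tHMM\tHidden Markov Model");
        System.out.println(r.shortForm + " = " + r.longForm + " (p=" + r.probability + ")");
    }
}
```

Keeping the probability as a numeric field makes it straightforward to filter the list, for example by retaining only records above a chosen confidence threshold.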

Relevant Paper:

Dana Movshovitz-Attias and William Cohen, Alignment-HMM-based Extraction of Abbreviations from Biomedical Text, BioNLP in NAACL, 2012

Sep 18, 2012