Dana Movshovitz-Attias

List of Publications

PhD Thesis: Grounded Knowledge Bases for Scientific Domains
Dana Movshovitz-Attias, August 2015
Committee: William Cohen, Tom Mitchell, Roni Rosenfeld, Alon Halevy
[pdf] [Thesis oral presentation]

KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts
Dana Movshovitz-Attias and William Cohen, 2015, Association for Computational Linguistics (ACL)
[pdf] [data] [ACL presentation] [bibtex]

Discovering Subsumption Relationships for Web-Based Ontologies
Dana Movshovitz-Attias, Steven Euijong Whang, Natalya Noy, and Alon Halevy, 2015, Proc. 18th International Workshop on the Web and Databases (WebDB) at ACM SIGMOD
Winner of the WebDB Best Paper Award.
[pdf] [WebDB presentation] [bibtex]

Grounded Discovery of Coordinate Term Relationships between Software Entities
Dana Movshovitz-Attias and William Cohen, 2015, arXiv preprint arXiv:1505.00277
[pdf] [arXiv link] [bibtex]

Natural Language Models for Predicting Programming Comments
Dana Movshovitz-Attias and William Cohen, 2013, Association for Computational Linguistics (ACL)
[pdf] [corpus] [code (as Eclipse plugin)] [ACL presentation] [bibtex]

Analysis of the Reputation System and User Contributions on a Question Answering Website: StackOverflow
Dana Movshovitz-Attias*, Yair Movshovitz-Attias*, Peter Steenkiste and Christos Faloutsos, 2013, ASONAM
[pdf] [bibtex]

Alignment-HMM-based Extraction of Abbreviations from Biomedical Text
Dana Movshovitz-Attias and William Cohen, 2012, BioNLP in NAACL
[pdf] [GitHub code (within the second-string package)] [code description and downloadable data] [abbreviations extracted from PubMed] [BioNLP presentation] [bibtex]

Bootstrapping Biomedical Ontologies for Scientific Text using NELL
Dana Movshovitz-Attias and William Cohen, 2012, BioNLP in NAACL
[pdf] [tech report] [BioNLP presentation] [bibtex]

Detection of Peptide‐Binding Sites on Protein Surfaces: The First Step Towards the Modeling and Targeting of Peptide‐Mediated Interactions
Assaf Lavi, Chi Ho Ngan, Dana Movshovitz‐Attias, Tanggis Bohnuud, Christine Yueh, Dmitri Beglov, Ora Schueler‐Furman, Dima Kozakov, 2013, Proteins: Structure, Function and Bioinformatics
[pdf] [bibtex]

Can Self-Inhibitory Peptides Be Derived from the Interfaces of Globular Protein-Protein Interactions?
Nir London, Barak Raveh, Dana Movshovitz-Attias and Ora Schueler-Furman, 2010, Proteins: Structure, Function and Bioinformatics
[pubmed] [bibtex]

On The Use of Structural Templates for High-Resolution Docking
Dana Movshovitz-Attias, Nir London and Ora Schueler-Furman, 2010, Proteins: Structure, Function and Bioinformatics
[pdf] [pubmed] [bibtex]
Poster presented at the 11th Israeli Bioinformatics Symposium at Tel-Aviv University, Israel, 4/2008.

The Structural Basis of Peptide-Protein Binding Strategies
Nir London, Dana Movshovitz-Attias and Ora Schueler-Furman, 2010, Structure
[pdf] [pubmed] [bibtex]
Poster presented at the 12th Israeli Bioinformatics Symposium at Weizmann Institute, Israel, 4/2009.

Software, Code, and Data

Code, tools, and research-related data.
If you have questions about this content, or if there is other data you would like to use, please contact me at: dma [at] cs.cmu.edu

KnowledgeBase-LDA (KB-LDA) Data

Dataset based on StackOverflow that was used to train the KB-LDA model from our ACL 2015 paper.

Training Data:

The KB-LDA dataset [tar] was extracted from StackOverflow, then parsed and cleaned. Subject-verb-object (SVO) and concept-instance relations were extracted from the full data. We also include a clean list of the noun and verb tokens used, drawn from a sample of ~60K documents. The included files are:

so2013_svo_clean.csv: Subject-verb-object tuples (37k) extracted from the StackOverflow corpus. Format: id,subject,verb,object,count
so2013_hypernyms_clean.csv: Concept-instance pairs (17k) extracted from the StackOverflow corpus. Format: id,concept,instance,count
so2013_document_nouns_clean.csv: Document nouns (1.3m) from a sample of the StackOverflow corpus. Format: quetion_id,noun,count
so2013_document_verbs_clean.csv: Document verbs (880k) from a sample of the StackOverflow corpus. Format: quetion_id,verb,count
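
As a minimal sketch of reading these files, the following Python snippet parses rows in the id,subject,verb,object,count layout listed above and totals the counts per verb. It assumes the files are plain comma-separated text; whether they carry a header row is not stated here, so the reader skips any row whose last field is not an integer. The sample rows and the function names read_svo and verb_totals are made up for illustration and are not part of the dataset or its tooling.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows in the id,subject,verb,object,count layout.
# These values are invented for illustration; they are not from the dataset.
SAMPLE_SVO = """1,user,click,button,42
2,method,return,value,17
3,user,open,file,8
"""


def read_svo(handle):
    """Yield (subject, verb, object, count) tuples from an open CSV handle."""
    for row in csv.reader(handle):
        # Skip malformed rows and a possible header row (non-numeric count).
        if len(row) != 5 or not row[4].strip().isdigit():
            continue
        _id, subject, verb, obj, count = row
        yield subject, verb, obj, int(count)


def verb_totals(tuples):
    """Sum the counts for each verb across all tuples."""
    totals = Counter()
    for _subject, verb, _obj, count in tuples:
        totals[verb] += count
    return totals


svo = list(read_svo(io.StringIO(SAMPLE_SVO)))
totals = verb_totals(svo)
```

To run this against the real file, replace io.StringIO(SAMPLE_SVO) with an open file handle, for example open("so2013_svo_clean.csv", newline="", encoding="utf-8"); the encoding is also an assumption.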

Learned Software Knowledge Base:

The following software knowledge base [zip] was learned with KB-LDA, and is evaluated in the paper. The data includes the learned topics, topic names (concepts), a topic hierarchy, and the top 100 learned relations. The README file details the format of the files and how the data was extracted.

Paper:

Dana Movshovitz-Attias and William Cohen, KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts, ACL, 2015

Update: July 23, 2015

Ngram-based Comment Completion Eclipse Plugin

This Eclipse plugin enables word completion within comments, based on an n-gram model trained on multiple open-source Java projects and data from StackOverflow. Word completion works much like the code completion tools built into standard code editors. While writing a comment, you are prompted with suggestions based on the implementation of the class you are currently commenting.
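
To illustrate the general idea of n-gram word completion, here is a minimal Python sketch that uses a bigram model, the simplest n-gram case. It is not the plugin's implementation: the plugin is an Eclipse (Java) component, and this sketch does not condition on the class being commented. The function names train_bigrams and suggest and the toy training comments are all hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each word, which words follow it in the training text."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following


def suggest(following, prev_word, prefix="", k=3):
    """Return up to k likely next words after prev_word that start with prefix."""
    candidates = following.get(prev_word.lower(), Counter())
    matches = [(w, c) for w, c in candidates.items()
               if w.startswith(prefix.lower())]
    # Sort by descending count, then alphabetically so ties are deterministic.
    matches.sort(key=lambda wc: (-wc[1], wc[0]))
    return [w for w, _ in matches[:k]]


# Toy training comments, invented for illustration.
comments = [
    "returns the number of elements",
    "returns the name of the file",
    "sets the name of the field",
]
model = train_bigrams(comments)
suggestions = suggest(model, "the", prefix="n")
```

Here the previous word "the" and the typed prefix "n" narrow the candidates to words seen after "the" in the training comments, ranked by how often they occurred.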

Plugin update site:

Corpus:

The model was created by tokenizing a corpus of open-source code. The tokenized corpus can be found on GitHub. The dataset covers 9 open-source projects, with the original source files and tokenizations of the code and comments.

StackOverflow data was downloaded from clearbits.

Paper:

Dana Movshovitz-Attias and William Cohen, Natural Language Models for Predicting Programming Comments, ACL, 2013

Software Update: Sep 4, 2013

Abbreviation Alignment HMM

This is an abbreviation extractor based on a Hidden Markov Model. With this code you can extract abbreviations and their definitions from a text corpus. The Abbreviation Alignment HMM code is part of the second-string open-source package.

Main Classes:

com.wcohen.ss.AbbreviationAlignment is an implementation of the abbreviation alignment metric.
com.wcohen.ss.expt.ExtractAbbreviations is a utility for extracting abbreviations from a text corpus using our method.

Code:

GitHub code (within the second-string package).

Data: Abbreviations Extracted from PubMed

Using this method, we extracted 1.4 million abbreviations from a corpus of 200K full-text PubMed articles. The extracted abbreviations are available here.

Each line in this file contains:

The probability of the abbreviation as given by the HMM
The ID of the document in our corpus
The Medline ID of the original text
The Short Form of the abbreviation
The Long Form of the abbreviation
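
A minimal Python sketch of parsing one such line into named fields follows. The field order matches the list above, but the tab separator is an assumption, since the delimiter is not stated here. The Abbreviation type, the parse_line function, and the example line are hypothetical and not taken from the released file.

```python
from typing import NamedTuple


class Abbreviation(NamedTuple):
    probability: float  # probability of the abbreviation as given by the HMM
    doc_id: str         # ID of the document in the corpus
    medline_id: str     # Medline ID of the original text
    short_form: str     # short form of the abbreviation
    long_form: str      # long form of the abbreviation


def parse_line(line, sep="\t"):
    """Parse one record. The tab separator is an assumption, not documented."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    prob, doc_id, medline_id, short_form, long_form = fields
    return Abbreviation(float(prob), doc_id, medline_id, short_form, long_form)


# Invented example line; not taken from the released file.
example = "0.93\tdoc_17\t12345678\tHMM\tHidden Markov Model\n"
record = parse_line(example)
```

If the file turns out to use a different delimiter, pass it through the sep argument rather than changing the parsing logic.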

Paper:

Dana Movshovitz-Attias and William Cohen, Alignment-HMM-based Extraction of Abbreviations from Biomedical Text, BioNLP in NAACL, 2012

Software Update: Sep 18, 2012

CMU

Courses and TA experience

TA at CMU

Spring 2014

11-761: Language and Statistics, given by Roni Rosenfeld

Fall 2012

15-381: Artificial Intelligence: Representation and Problem Solving, given by Ariel Procaccia and Emma Brunskill

Courses at CMU

Fall 2012

Computer Networks with Peter Steenkiste

Spring 2012

Semantics of Programming Languages with Steve Brookes
Language and Statistics with Roni Rosenfeld
Machine Learning with Large Datasets with William Cohen
