Dana Movshovitz-Attias

Ph.D., Computer Science Department
Carnegie Mellon University

Welcome!

I'm now at Google Research. I received a Ph.D. from the Computer Science Department, in the School of Computer Science at Carnegie Mellon University. My Ph.D. adviser was William Cohen.

I am interested in the intersection of Natural Language Processing, Information Retrieval, and Machine Learning. My research experience includes the following topics:
grounded language learning, learning semantic relations, topic models, mining software repositories and software-focused corpora, bootstrapping on biomedical ontologies, knowledge base population, bootstrap learning and semantic drift, seed set refinement, text alignment with Hidden Markov Models, social media analysis, and computational biology.

Before coming to CMU, I got my M.Sc. and B.Sc. degrees in the Computer Science and Computational Biology program at the School of Computer Science and Engineering of The Hebrew University of Jerusalem. During that time, I did research at the Furman Lab (Dept. of Molecular Genetics and Biotechnology), and my adviser was Prof. Ora Schueler-Furman. In this group, we used computational methods to understand protein-protein interactions from a structural bioinformatics perspective. More specifically, we made predictions of the structural changes that take place in proteins during docking.

Apart from doing research, I had a chance to get some great industry experience working for IBM, Facebook, and Google.

You can find my full research and work history in my CV.

About me

One of the things I enjoy most is hiking and traveling around the world. So far, one of my favorite hiking locations has been New Zealand, and I plan to return! I have an awesome husband, who was also a CSD Ph.D. student at CMU.

List of Publications

Google Scholar profile


PhD Thesis: Grounded Knowledge Bases for Scientific Domains
Dana Movshovitz-Attias, August 2015
Committee: William Cohen, Tom Mitchell, Roni Rosenfeld, Alon Halevy
[pdf] [Thesis oral presentation]

KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts
Dana Movshovitz-Attias and William Cohen, 2015, Association for Computational Linguistics (ACL)
[pdf] [data] [ACL presentation] [bibtex]

Discovering Subsumption Relationships for Web-Based Ontologies
Dana Movshovitz-Attias, Steven Euijong Whang, Natalya Noy, and Alon Halevy, 2015, Proc. 18th International Workshop on the Web and Databases (WebDB) at ACM SIGMOD
Winner of the WebDB Best Paper Award.
[pdf] [WebDB presentation] [bibtex]

Grounded Discovery of Coordinate Term Relationships between Software Entities
Dana Movshovitz-Attias and William Cohen, 2015, arXiv preprint arXiv:1505.00277
[pdf] [arXiv link] [bibtex]

Natural Language Models for Predicting Programming Comments
Dana Movshovitz-Attias and William Cohen, 2013, Association for Computational Linguistics (ACL)
[pdf] [corpus] [code (as Eclipse plugin)] [ACL presentation] [bibtex]

Analysis of the Reputation System and User Contributions on a Question Answering Website: StackOverflow
Dana Movshovitz-Attias*, Yair Movshovitz-Attias*, Peter Steenkiste and Christos Faloutsos, 2013, ASONAM
[pdf] [bibtex]

Alignment-HMM-based Extraction of Abbreviations from Biomedical Text
Dana Movshovitz-Attias and William Cohen, 2012, BioNLP in NAACL
[pdf] [GitHub code (within the second-string package)] [code description and downloadable data] [abbreviations extracted from PubMed] [BioNLP presentation] [bibtex]

Bootstrapping Biomedical Ontologies for Scientific Text using NELL
Dana Movshovitz-Attias and William Cohen, 2012, BioNLP in NAACL
[pdf] [tech report] [BioNLP presentation] [bibtex]

Detection of Peptide‐Binding Sites on Protein Surfaces: The First Step Towards the Modeling and Targeting of Peptide‐Mediated Interactions
Assaf Lavi, Chi Ho Ngan, Dana Movshovitz‐Attias, Tanggis Bohnuud, Christine Yueh, Dmitri Beglov, Ora Schueler‐Furman, Dima Kozakov, 2013, Proteins: Structure, Function and Bioinformatics
[pdf] [bibtex]

Can Self-Inhibitory Peptides Be Derived from the Interfaces of Globular Protein-Protein Interactions?
Nir London, Barak Raveh, Dana Movshovitz-Attias and Ora Schueler-Furman, 2010, Proteins: Structure, Function and Bioinformatics
[pubmed] [bibtex]

On The Use of Structural Templates for High-Resolution Docking
Dana Movshovitz-Attias, Nir London and Ora Schueler-Furman, 2010, Proteins: Structure, Function and Bioinformatics
[pdf] [pubmed] [bibtex]
Poster presented at the 11th Israeli Bioinformatics Symposium at Tel-Aviv University, Israel, 4/2008.

The Structural Basis of Peptide-Protein Binding Strategies
Nir London, Dana Movshovitz-Attias and Ora Schueler-Furman, 2010, Structure
[pdf] [pubmed] [bibtex]
Poster presented at the 12th Israeli Bioinformatics Symposium at Weizmann Institute, Israel, 4/2009.

Software, Code, and Data

Code, tools, and research-related data.
If you have questions about this content, or if there is other data you would like to use, please contact me at: dma [at] cs.cmu.edu

KnowledgeBase-LDA (KB-LDA) Data

A dataset based on StackOverflow that was used to train the KB-LDA model from our ACL 2015 paper.

Training Data:

The KB-LDA dataset [tar] was extracted from StackOverflow, parsed and cleaned. SVO and concept-instance relations were extracted based on the full data. We also include a clean list of the noun and verb tokens used from a sample of ~60K documents. The included files are:
so2013_svo_clean.csv
Subject-verb-object tuples (37k) extracted from StackOverflow corpus.
Format: id,subject,verb,object,count
so2013_hypernyms_clean.csv
Concept-instance pairs (17k) extracted from StackOverflow corpus.
Format: id,concept,instance,count
so2013_document_nouns_clean.csv
Document nouns (1.3m) from a sample of the StackOverflow corpus.
Format: quetion_id,noun,count
so2013_document_verbs_clean.csv
Document verbs (880k) from a sample of the StackOverflow corpus.
Format: quetion_id,verb,count
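
As a rough illustration of how the files listed above could be read, here is a minimal Python sketch for the SVO file. It assumes the rows are plain comma-separated values in the column order given by the "Format:" line. Whether the file has a header row is not stated here, so the sketch skips any row whose count is not a number. The function name and the sample rows are made up for illustration and do not come from the dataset.

```python
import csv
import io

# Column order copied from the "Format:" line for so2013_svo_clean.csv.
SVO_FIELDS = ["id", "subject", "verb", "object", "count"]


def read_rows(lines, fieldnames):
    """Parse CSV lines into dicts, converting the count column to int.

    Assumption: plain comma-separated rows. Rows whose count is missing or
    non-numeric (for example a header row) are skipped.
    """
    rows = []
    for row in csv.DictReader(lines, fieldnames=fieldnames):
        count = row.get("count")
        if count is None or not count.strip().isdigit():
            continue
        row["count"] = int(count)
        rows.append(row)
    return rows


# Hypothetical sample rows, for illustration only.
sample = io.StringIO(
    "id,subject,verb,object,count\n"
    "1,user,call,method,12\n"
    "2,function,return,value,7\n"
)
svo_rows = read_rows(sample, SVO_FIELDS)
total_count = sum(row["count"] for row in svo_rows)
```

The same function should work for the other three files by passing their column lists instead, and an open file handle can be used in place of the in-memory sample.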

Learned Software Knowledge Base:

The following software knowledge base [zip] was learned with KB-LDA, and is evaluated in the paper. The data includes the learned topics, topic names (concepts), a topic hierarchy, and the top 100 learned relations. The README file details the format of the files and how the data was extracted.

Paper:

Dana Movshovitz-Attias and William Cohen, KB-LDA: Jointly Learning a Knowledge Base of Hierarchy, Relations, and Facts, ACL, 2015

Update: July 23, 2015

Ngram-based Comment Completion Eclipse Plugin

This Eclipse plugin enables word completion within comments, based on an n-gram model trained on multiple open-source Java projects and data from StackOverflow. Word completion works in a similar way to the code completion tools built into standard code editors. While you write a comment, you are prompted with suggestions based on the implementation of the class you are currently commenting.
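
To give a feel for n-gram-based word completion, here is a toy Python sketch of a bigram model that suggests the next word of a comment. It is not the plugin's implementation: the plugin is trained on a much larger corpus, and the function names, the whitespace tokenization, and the three training comments below are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(comments):
    """Count which word follows each word across the training comments."""
    following = defaultdict(Counter)
    for comment in comments:
        tokens = comment.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following


def suggest(following, prev_word, prefix="", k=3):
    """Return up to k frequent next words after prev_word that start with prefix."""
    candidates = following.get(prev_word.lower(), Counter())
    ranked = [w for w, _ in candidates.most_common() if w.startswith(prefix)]
    return ranked[:k]


# Hypothetical training comments, for illustration only.
training = [
    "returns the number of elements",
    "returns the name of the class",
    "sets the name of the field",
]
model = train_bigrams(training)
print(suggest(model, "the", prefix="n"))  # prints ['name', 'number']
```

Here the user has typed "the n", and the model ranks "name" above "number" because "the name" occurs more often in the training comments.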

Plugin update site:

Corpus:

The model was created by tokenizing a corpus repository of open-source code. The tokenized corpus can be found on GitHub. The dataset includes 9 open-source projects, including original source files and tokenizations of the code and comments.

StackOverflow data was downloaded from ClearBits.

Paper:

Dana Movshovitz-Attias and William Cohen, Natural Language Models for Predicting Programming Comments, ACL, 2013

Software Update: Sep 4, 2013

Abbreviation Alignment HMM

This is an abbreviation extractor based on a Hidden Markov Model. With this code you can extract abbreviations and their definitions from a text corpus. The Abbreviation Alignment HMM code is part of the second-string open-source package.

Main Classes:

  • com.wcohen.ss.AbbreviationAlignment is an implementation of the abbreviation alignment metric.
  • com.wcohen.ss.expt.ExtractAbbreviations is a utility for extracting abbreviations from a text corpus using our method.

Code:

GitHub code (within the second-string package).

Data: Abbreviations Extracted from PubMed

Using this method, we extracted 1.4 million abbreviations from a corpus of 200K full-text PubMed articles. The extracted abbreviations are available here.

Each line in this file contains:
  1. The probability of the abbreviation as given by the HMM
  2. The ID of the document in our corpus
  3. The Medline ID of the original text
  4. The Short Form of the abbreviation
  5. The Long Form of the abbreviation
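
A minimal Python sketch of reading such lines might look like the following. The field separator is not stated above, so the sketch assumes tab-separated fields. The class name, the function names, and the sample lines are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Abbreviation:
    probability: float  # probability of the abbreviation as given by the HMM
    doc_id: str         # ID of the document in the corpus
    medline_id: str     # Medline ID of the original text
    short_form: str
    long_form: str


def parse_line(line, sep="\t"):
    """Parse one line into an Abbreviation (assumes `sep`-separated fields)."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    prob, doc_id, medline_id, short_form, long_form = parts
    return Abbreviation(float(prob), doc_id, medline_id, short_form, long_form)


def group_by_short_form(lines):
    """Map each short form to the set of long forms seen for it."""
    expansions = defaultdict(set)
    for line in lines:
        abbr = parse_line(line)
        expansions[abbr.short_form].add(abbr.long_form)
    return expansions


# Hypothetical sample lines, for illustration only.
sample_lines = [
    "0.97\tdoc1\t12345\tHMM\tHidden Markov Model\n",
    "0.85\tdoc2\t67890\tPPI\tprotein-protein interaction\n",
]
expansions = group_by_short_form(sample_lines)
```

Grouping by short form is one simple way to see when the same abbreviation is defined differently across documents.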

Paper:

Dana Movshovitz-Attias and William Cohen, Alignment-HMM-based Extraction of Abbreviations from Biomedical Text, BioNLP in NAACL, 2012

Software Update: Sep 18, 2012

CMU

Courses and TA experience

TA at CMU

Spring 2014

  • 11-761: Language and Statistics, given by Roni Rosenfeld

Fall 2012

  • 15-381: Artificial Intelligence: Representation and Problem Solving, given by Ariel Procaccia and Emma Brunskill

Courses at CMU

Fall 2012

  • Computer Networks with Peter Steenkiste

Spring 2012

  • Semantics of Programming Languages with Steve Brookes
  • Language and Statistics with Roni Rosenfeld
  • Machine Learning with Large Datasets with William Cohen

Fall 2011

Spring 2011

Fall 2010

Contact

Email
dma [at] cs.cmu.edu
Office
GHC 7513
Office Phone
+1-412-268-3066
Address
Computer Science Department,
Carnegie Mellon University,
5000 Forbes Avenue,
Pittsburgh, PA 15213
CV
PDF