Emma Strubell
I earned my Ph.D. from UMass Amherst working in the Information Extraction and Synthesis Laboratory with Andrew McCallum. Previously, I earned a B.S. in Computer Science from the University of Maine with a minor in math, where I applied models from mathematical biology to the spread of internet worms with Professor David Hiebeler in his Spatial Population Ecological and Epidemiological Dynamics Lab. I've also spent time as an intern and visiting researcher at Amazon, IBM, Google and Facebook AI Research.
Research Interests
I am interested in developing new machine learning techniques to facilitate fast and robust natural language processing.
Core natural language processing (NLP) tasks such as part-of-speech tagging, syntactic parsing and entity recognition have come of age thanks to advances in machine learning. For example, the task of semantic role labeling (annotating who did what to whom) has seen nearly 40% error reduction over the past decade. NLP has reached a level of maturity long awaited by domain experts who wish to leverage natural language analysis to inform better decisions and effect social change. By deploying these systems at scale on billions of documents across many domains, practitioners can consolidate raw text into structured, actionable data. These cornerstone NLP tasks are also crucial building blocks for higher-level natural language understanding (NLU) that our field has yet to achieve, such as whole-document understanding and human-level dialog.
For NLP to effectively process raw text across many domains, we need models that are both robust to different styles of text and computationally efficient. The success described above has been achieved in the limited domains for which we have expensive annotated data; models that obtain state-of-the-art accuracy in these data-rich settings are typically neither trained nor evaluated for accuracy out of domain. Users also have practical concerns about model responsiveness, turnaround time in large-scale analyses, electricity costs and, consequently, environmental impact, yet the highest-accuracy systems also make the highest computational demands. As hardware advances, NLP researchers tend to increase model complexity in step.
My research enables a diversity of domain experts to leverage NLU at large scale, with the goal of informing decision-making and practical solutions to far-reaching problems. Toward this end, I pursue fundamental advances in computational efficiency and robustness. To improve computational efficiency, I design new training and inference algorithms that exploit the strengths of the latest tensor processing hardware, and I eliminate redundant computation through joint modeling across many tasks. To achieve high accuracy across diverse natural language domains, I develop joint models in which parameter sharing improves generalization, paired with novel methods for adversarial training that enable transfer to new domains and languages without labeled data. I apply this research broadly, to low-level NLP as well as high-level NLU tasks, and in conjunction with these new machine learning techniques I collaborate with domain experts to make a positive mark on society.
Publications
2019
- Energy and Policy Considerations for Deep Learning in NLP. Emma Strubell, Ananya Ganesh and Andrew McCallum. Annual Meeting of the Association for Computational Linguistics (ACL short). Florence, Italy. July 2019.
- The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures. Sheshera Mysore, Zach Jensen, Edward Kim, Kevin Huang, Haw-Shiuan Chang, Emma Strubell, Jeffrey Flanigan, Andrew McCallum, Elsa Olivetti. LAW XIII 2019: The 13th Linguistic Annotation Workshop (ACL WS). Florence, Italy. July 2019.
- Inorganic Materials Synthesis Planning with Literature-Trained Neural Networks. Edward Kim, Zach Jensen, Alexander van Grootel, Kevin Huang, Matthew Staib, Sheshera Mysore, Haw-Shiuan Chang, Emma Strubell, Andrew McCallum, Stefanie Jegelka, and Elsa Olivetti. arXiv preprint 1901.00032, in submission.
2018
- Linguistically-Informed Self-Attention for Semantic Role Labeling. Emma Strubell, Patrick Verga, Daniel Andor, David Weiss and Andrew McCallum. Conference on Empirical Methods in Natural Language Processing (EMNLP). Brussels, Belgium. October 2018. Best long paper award. [bibtex] [code] [slides] [video]
- Syntax Helps ELMo Understand Semantics: Is Syntax Still Relevant in a Deep Neural Architecture for SRL? Emma Strubell and Andrew McCallum. Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP (ACL WS). Melbourne, Australia. July 2018. [bibtex]
- Simultaneously Self-attending to All Mentions for Full-Abstract Biological Relation Extraction. Patrick Verga, Emma Strubell and Andrew McCallum. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). New Orleans, Louisiana. June 2018. [bibtex] [code]
- Multi-Task Learning For Parsing The Alexa Meaning Representation Language. Vittorio Perera, Tagyoung Chung, Thomas Kollar and Emma Strubell. Thirty-Second AAAI Conference on Artificial Intelligence (AAAI). New Orleans, Louisiana. February 2018. [bibtex]
2017
- Automatically Extracting Action Graphs From Materials Science Synthesis Procedures. Sheshera Mysore, Edward Kim, Emma Strubell, Ao Liu, Haw-Shiuan Chang, Srikrishna Kompella, Kevin Huang, Andrew McCallum and Elsa Olivetti. NIPS Workshop on Machine Learning for Molecules and Materials (NIPS WS). Long Beach, California. December 2017. Spotlight talk. [bibtex] [poster] [slides]
- Attending to All Mention Pairs for Full Abstract Biological Relation Extraction. Patrick Verga, Emma Strubell, Ofer Shai, and Andrew McCallum. 6th Workshop on Automated Knowledge Base Construction (AKBC). Long Beach, California. December 2017. [bibtex]
- Fast and Accurate Entity Recognition with Iterated Dilated Convolutions. Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. Conference on Empirical Methods in Natural Language Processing (EMNLP). Copenhagen, Denmark. September 2017. [bibtex] [code] [poster]
- Dependency Parsing with Dilated Iterated Graph CNNs. Emma Strubell and Andrew McCallum. 2nd Workshop on Structured Prediction for Natural Language Processing (EMNLP WS). Copenhagen, Denmark. September 2017. [bibtex] [slides]
- Machine-learned and codified synthesis parameters of oxide materials. Edward Kim, Kevin Huang, Alex Tomala, Sara Matthews, Emma Strubell, Adam Saunders, Andrew McCallum and Elsa Olivetti. Nature Scientific Data. 4. 2017. [bibtex]
- An epidemiological model of internet worms with hierarchical dispersal and spatial clustering of hosts. David E. Hiebeler, Andrew Audibert, Emma Strubell and Isaac J. Michaud. Journal of Theoretical Biology. 418: 8–15. 2017. [bibtex]
2016
- Extracting Multilingual Relations under Limited Resources: TAC 2016 Cold-Start KB construction and Slot-Filling using Compositional Universal Schema. Haw-Shiuan Chang, Abdurrahman Munir, Ao Liu, Johnny Tian-Zheng Wei, Aaron Traylor, Ajay Nagesh, Nicholas Monath, Patrick Verga, Emma Strubell and Andrew McCallum. Text Analysis Conference (Knowledge Base Population Track) '16 Workshop (TAC KBP). Gaithersburg, Maryland, USA. November 2016. [bibtex]
- Multilingual Relation Extraction using Compositional Universal Schema. Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth and Andrew McCallum. Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT). San Diego, California. June 2016. [bibtex] [code]
2015
- Building Knowledge Bases with Universal Schema: Cold Start and Slot-Filling Approaches. Benjamin Roth, Nicholas Monath, David Belanger, Emma Strubell, Patrick Verga and Andrew McCallum. Text Analysis Conference (Knowledge Base Population Track) '15 Workshop (TAC KBP). Gaithersburg, Maryland, USA. November 2015. [bibtex]
- Learning Dynamic Feature Selection for Fast Sequential Prediction. Emma Strubell, Luke Vilnis, Kate Silverstein and Andrew McCallum. Annual Meeting of the Association for Computational Linguistics (ACL). Beijing, China. July 2015. Outstanding paper award. [video] [slides] [poster] [bibtex]
2014
- Training for Fast Sequential Prediction Using Dynamic Feature Selection. Emma Strubell, Luke Vilnis, and Andrew McCallum. NIPS Workshop on Modern Machine Learning and NLP (NIPS WS). Montreal, Quebec, Canada. December 2014. [bibtex]
- Minimally Supervised Event Argument Extraction using Universal Schema. Benjamin Roth, Emma Strubell, Katherine Silverstein and Andrew McCallum. 4th Workshop on Automated Knowledge Base Construction (AKBC). At NIPS '14, Montreal, Quebec, Canada. December 2014. [bibtex]
- Universal Schema for Slot-Filling, Cold-Start KBP and Event Argument Extraction: UMassIESL at TAC KBP 2014. Benjamin Roth, Emma Strubell, John Sullivan, Lakshmi Vikraman, Katherine Silverstein, and Andrew McCallum. Text Analysis Conference (Knowledge Base Population Track) '14 Workshop (TAC KBP). Gaithersburg, Maryland, USA. November 2014. [bibtex]
2012
- Modeling the Spread of Biologically-Inspired Internet Worms. Emma Strubell. Undergraduate honors thesis. University of Maine Honors College, Orono, Maine, USA. May 2012. [bibtex]
/etc
In my spare time, I enjoy cooking (with a focus on making vegetables delicious), fermenting (kombucha, kimchi, yogurt, sourdough), enjoying the outdoors (backpacking and rock climbing), and training my dog.
In search of a fast Scala lexer, I forked JFlex and added the ability to emit Scala code. JFlex-scala, and its corresponding maven and sbt plugins, are available on Maven Central. For an example of its use, check out the tokenizer in FACTORIE.
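For a taste of what the fork enables, here is a minimal sketch of a lexer specification whose action bodies are Scala rather than Java. The three-section layout (user code, directives, rules, separated by %%) and calls like yytext() are standard JFlex; the exact option names and action conventions accepted by JFlex-scala are my assumptions here, so treat this as illustrative rather than definitive:

    // User code section: copied verbatim into the generated lexer.
    // Token types for the toy lexer (assumed names, for illustration only).
    sealed trait Token
    case class Word(text: String) extends Token
    case class Num(value: Int) extends Token

    %%

    %class ToyLexer
    %unicode
    %type Token

    %%

    // Rules section: each action body is Scala code.
    [a-zA-Z]+    { return Word(yytext()) }
    [0-9]+       { return Num(yytext().toInt) }
    [ \t\r\n]+   { /* skip whitespace; no token emitted */ }

Assuming the Scala emitter preserves JFlex's usual generated interface, you would construct the lexer with a java.io.Reader and call yylex() repeatedly until end of input; the FACTORIE tokenizer linked above is the real-world reference.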
I am also co-author of Plant Jones. He is a semi-intelligent plant who tweets negatively about water when he's thirsty, and positively when he's not. His code is available here.
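For flavor, here is a minimal Scala sketch of that thirsty/not-thirsty tweet logic, using the twitter4j library. This is my illustrative reconstruction, not Plant Jones's actual implementation: readMoisture() stands in for whatever soil-moisture sensor the real bot polls, and the threshold and phrasing are invented.

    // Hypothetical reconstruction of the tweet loop, not the real Plant Jones code.
    import twitter4j.TwitterFactory

    object PlantJonesSketch {
      // Stand-in for the soil-moisture sensor: 0.0 (bone dry) to 1.0 (soaked).
      def readMoisture(): Double = scala.util.Random.nextDouble()

      def main(args: Array[String]): Unit = {
        val twitter = TwitterFactory.getSingleton  // credentials from twitter4j.properties
        val status =
          if (readMoisture() < 0.3)  // invented dryness threshold
            "So thirsty. Water is all I think about."
          else
            "Well watered and feeling great about hydration."
        twitter.updateStatus(status)
      }
    }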
In my junior year of college I wrote and presented a tutorial on quantum algorithms aimed at undergraduate students in computer science, available here, along with slides part 1 and part 2.
Gentoo Linux user since 2005.