Papers
Refereed
-
Nathan Schneider, Behrang Mohit, Kemal Oflazer,
and Noah A. Smith (2012). Coarse lexical semantic annotation with supersenses: an Arabic case study.
ACL.
[paper]
“Lightweight” semantic annotation of text calls for a simple representation,
ideally without requiring a semantic lexicon to achieve good coverage in the language and domain.
In this paper, we repurpose WordNet’s supersense tags for annotation, developing specific guidelines
for nominal expressions and applying them to Arabic Wikipedia articles in four topical domains.
The resulting corpus has high coverage and was completed quickly with reasonable inter-annotator agreement.
@InProceedings{arabic-sst-annotation,
author = {Schneider, Nathan and Mohit, Behrang and Oflazer, Kemal and Smith, Noah A.},
title = {Coarse Lexical Semantic Annotation with Supersenses: An {A}rabic Case Study},
booktitle = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics},
month = {July},
year = {2012},
address = {Jeju, South Korea},
publisher = {Association for Computational Linguistics}
}
[data]
-
Behrang Mohit, Nathan Schneider, Rishav Bhowmick, Kemal Oflazer,
and Noah A. Smith (2012). Recall-oriented learning of named entities in Arabic Wikipedia.
EACL.
[paper] [supplement]
We consider the problem of NER in Arabic Wikipedia, a semisupervised domain adaptation setting
for which we have no labeled training data in the target domain. To facilitate evaluation, we obtain annotations for
articles in four topical groups, allowing annotators to identify domain-specific entity types in addition to standard categories.
Standard supervised learning on newswire text leads to poor target-domain recall. We train a sequence model and show that a
simple modification to the online learner—a loss function encouraging it to “arrogantly” favor recall over precision—substantially improves recall and F1.
We then adapt our model with self-training on unlabeled target-domain data; enforcing the same recall-oriented bias in the self-training stage yields marginal gains.
@InProceedings{mohit-arabic-ner,
author = {Mohit, Behrang and Schneider, Nathan and Bhowmick, Rishav and Oflazer, Kemal and Smith, Noah A.},
title = {Recall-Oriented Learning of Named Entities in {A}rabic {W}ikipedia},
booktitle = {Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics},
month = {April},
year = {2012},
address = {Avignon, France},
publisher = {Association for Computational Linguistics},
pages = {162--173},
url = {http://www.aclweb.org/anthology/E12-1017}
}
[data]
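The recall-oriented loss mentioned in the abstract above can be illustrated with a small token-level cost function. This is only a sketch under stated assumptions: the function name, the `beta` parameter, and the convention that `"O"` marks a non-entity token are illustrative choices, not the code or exact formulation used in the paper.

```python
from typing import Sequence


def recall_oriented_cost(gold: Sequence[str], predicted: Sequence[str],
                         beta: float = 2.0) -> float:
    """Asymmetric per-token cost (hypothetical formulation for illustration).

    A missed entity token (gold is an entity tag, prediction is "O")
    costs `beta`; every other mismatch costs 1; matches cost 0.
    Setting beta > 1 makes false negatives more expensive, which pushes
    a cost-sensitive learner toward higher recall.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    cost = 0.0
    for g, p in zip(gold, predicted):
        if g == p:
            continue
        cost += beta if (g != "O" and p == "O") else 1.0
    return cost


# Two missed entity tokens (2 * beta) plus one spurious entity token (1).
print(recall_oriented_cost(["B-PER", "I-PER", "O"], ["O", "O", "B-LOC"]))
```

With `beta = 1` this reduces to ordinary Hamming cost, so the single parameter controls how strongly recall is favored.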
-
Desai Chen, Nathan Schneider, Dipanjan Das,
and Noah A. Smith (2010). SEMAFOR: Frame argument resolution with log-linear models.
SemEval. [paper] [slides]
This paper describes the SEMAFOR system’s performance in the SemEval 2010 task on linking events and their
participants in discourse. Our entry is based upon SEMAFOR 1.0 (Das et al., 2010), a frame-semantic probabilistic parser built from
log-linear models. The extended system models null instantiations, including non-local argument reference. Performance is evaluated
on the task data with and without gold-standard overt arguments. In both settings, it fares the best of the submitted systems with
respect to recall and F1.
@inproceedings{chen-schneider-das-smith-10,
author = {Desai Chen and Nathan Schneider and Dipanjan Das and Noah A. Smith},
title = {{SEMAFOR}: Frame Argument Resolution with Log-Linear Models},
booktitle = {Proceedings of the Fifth International Workshop on Semantic Evaluation (SemEval-2010)},
month = {July},
year = {2010},
address = {Uppsala, Sweden},
publisher = {Association for Computational Linguistics},
pages = {264--267},
url = {http://www.aclweb.org/anthology/S10-1059}
}
-
Dipanjan Das,
Nathan Schneider,
Desai Chen, and
Noah A. Smith (2010). Probabilistic frame-semantic parsing.
NAACL-HLT. [paper] [slides]
This paper contributes a formalization of frame-semantic parsing as a structure prediction problem and
describes an implemented parser that transforms an English sentence into a frame-semantic representation.
It finds words that evoke FrameNet frames, selects frames for them, and locates the arguments for each frame.
The system uses two feature-based, discriminative probabilistic (log-linear) models, one with latent variables to permit
disambiguation of new predicate words. The parser is demonstrated to significantly outperform previously published results.
@inproceedings{das-schneider-chen-smith-10,
author = {Dipanjan Das and Nathan Schneider and Desai Chen and Noah A. Smith},
title = {Probabilistic Frame-Semantic Parsing},
booktitle = {Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics},
month = {June},
year = {2010},
address = {Los Angeles, California},
publisher = {Association for Computational Linguistics},
pages = {948--956},
url = {http://www.aclweb.org/anthology/N10-1138}
}
[software]
-
Nathan Schneider (2010). English morphology in construction grammar.
CSDL. [poster]
-
Nathan Schneider (2010). Computational cognitive morphosemantics: modeling morphological compositionality in Hebrew verbs with Embodied Construction Grammar.
BLS. [slides] [paper]
This paper brings together the theoretical framework of construction grammar and studies of verbs in Modern Hebrew
to furnish an analysis integrating the form and meaning components of morphological structure. In doing so, this work employs and extends
Embodied Construction Grammar (ECG; Bergen and Chang 2005), a computational formalism developed to study grammar from a cognitive linguistic
perspective. In developing a formal analysis of Hebrew verbs, I adapt ECG—until now a lexical/syntactic/semantic formalism—to
account for the compositionality of morphological constructions, accommodating idiosyncrasy while encoding generalizations at multiple
levels. Similar to syntactic constructions, morpheme constructions are related in an inheritance network, and can be productively
composed to form words. With the expanded version of ECG, constructions can readily encode nonconcatenative root-and-pattern morphology and
associated (compositional or noncompositional) semantics, cleanly integrated with syntactic constructions. This formal, cognitive study
should pave the way for computational models of morphological learning and processing in Hebrew and other languages.
Reports & Presentations
-
Nathan Schneider (5 October 2011). Casting a wider ’Net: NLP for the Social Web.
Invited talk, CMU Qatar Computer Science. [slides]
Natural language text dominates the information available on the Web.
Yet the language of online expression often differs substantially, in
both style and substance, from the language found in more traditional
sources such as news. Making natural language processing techniques
robust to this sort of variation is thus important for applications to
behave intelligently when presented with Web text.
This talk presents new research applying two sequence prediction
tasks—part-of-speech tagging and named entity detection—to text from
online social media platforms (Twitter and Wikipedia). For both tasks,
we adapt standard forms of annotation to better suit the linguistic
and topical characteristics of the data. We also propose techniques to
elicit more accurate statistical taggers, including linguistic
features inspired by the domain (for part-of-speech tagging of Twitter
messages) as well as modifications to the learning algorithm (for
named entity detection in Arabic Wikipedia).
-
Behrang Mohit,
Nathan Schneider,
Rishav Bhowmick,
Kemal Oflazer, and
Noah A. Smith (August 2011). Recall-oriented learning for named entity recognition in Wikipedia.
Technical Report CMU-LTI-11-012. [paper]
We consider the problem of NER in Arabic Wikipedia, a semi-supervised
domain adaptation setting for which we have no labeled training data in the target domain.
To facilitate evaluation, we obtain annotations for articles in four topical groups,
allowing annotators to identify domain-specific entity types in addition to standard categories.
Standard supervised learning on newswire text leads to poor target-domain recall.
We train a sequence model and show that a simple modification to the online learner—a loss function
encouraging it to “arrogantly” favor recall over precision—substantially improves recall and F1.
We then employ self-training on unlabeled target-domain data in order to adapt our model;
enforcing the same recall-oriented bias in the self-training stage yields additional gains.
@techreport{mohit-11-tr,
author = {Behrang Mohit and Nathan Schneider and Rishav Bhowmick and Kemal Oflazer and Noah A. Smith},
institution = {Carnegie Mellon University},
address = {Pittsburgh, Pennsylvania},
type = {Technical Report},
number = {CMU-LTI-11-012},
title = {Recall-Oriented Learning for Named Entity Recognition in {W}ikipedia},
year = {2011},
month = {aug},
url = {http://www.cs.cmu.edu/~nschneid/aner-tr.pdf}
}
-
Nathan Schneider,
Rebecca Hwa,
Philip Gianfortoni,
Dipanjan Das,
Michael Heilman,
Alan W. Black,
Frederick L. Crabbe, and
Noah A. Smith (July 2010). Visualizing topical quotations over time to understand news discourse.
Technical Report CMU-LTI-10-013. [paper]
We present the Pictor browser, a visualization designed to facilitate
the analysis of quotations about user-specified topics in large collections of news text.
Pictor focuses on quotations because they are a major vehicle of communication in the
news genre. It extracts quotes from articles that match a user’s text query, and groups these quotes into “threads” that illustrate
the development of subtopics over time. It allows users to rapidly explore the space of relevant quotes by viewing their content and
speakers, to examine the contexts in which quotes appear, and to tune how threads are constructed. We offer two case studies
demonstrating how Pictor can support a richer understanding of news events.
@techreport{schneider-etal-10-tr,
author = {Nathan Schneider and Rebecca Hwa and Philip Gianfortoni and Dipanjan Das and Michael Heilman and Alan W. Black and Frederick L. Crabbe and Noah A. Smith},
institution = {Carnegie Mellon University},
address = {Pittsburgh, Pennsylvania},
type = {Technical Report},
number = {CMU-LTI-10-013},
title = {Visualizing Topical Quotations Over Time to Understand News Discourse},
year = {2010},
month = {jul},
url = {http://www.cs.cmu.edu/~nasmith/papers/schneider+etal.tr10.pdf}
}
-
Dipanjan Das,
Nathan Schneider,
Desai Chen, and
Noah A. Smith (April 2010). SEMAFOR 1.0: A probabilistic frame-semantic parser.
Technical Report CMU-LTI-10-001. [paper]
An elaboration on Das et al. (2010), this report formalizes frame-semantic parsing as
a structure prediction problem and describes an implemented parser
that transforms an English sentence into a frame-semantic
representation. SEMAFOR 1.0 finds words that evoke FrameNet frames, selects
frames for them, and locates the arguments for each frame. The
system uses two feature-based, discriminative probabilistic
(log-linear) models, one with latent variables to permit
disambiguation of new predicate words. The parser is demonstrated to significantly outperform previously published
results and is released for public use.
@techreport{das-schneider-chen-smith-10-tr,
author = {Dipanjan Das and Nathan Schneider and Desai Chen and Noah A. Smith},
institution = {Carnegie Mellon University},
address = {Pittsburgh, Pennsylvania},
type = {Technical Report},
number = {CMU-LTI-10-001},
title = {{SEMAFOR} 1.0: A Probabilistic Frame-Semantic Parser},
year = {2010},
month = {apr},
url = {http://www.ark.cs.cmu.edu/SEMAFOR/das+schneider+chen+smith.tr10.pdf}
}
[software]
-
Reza Bosagh Zadeh and Nathan Schneider (December 2008). Unsupervised approaches to sequence tagging, morphology induction, and lexical resource acquisition.
LS2 course literature review. [paper] [slides]
We consider unsupervised approaches to three types of problems involving the prediction of
natural language information at or below the level of words: sequence labeling (including part-of-speech tagging);
decomposition (morphological analysis and segmentation); and lexical resource acquisition (building dictionaries
to encode linguistic knowledge about words within and across languages). We highlight the strengths and weaknesses
of these approaches, including the extent of labeled data/resources assumed as input, the robustness of modeling
techniques to linguistic variation, and the semantic richness of the output relative to the input.
Other reports to appear.
Research
Overview
My research interests lie at the intersection of linguistics, cognitive science, and computer science/artificial intelligence. Fundamentally, I want to be able to describe and simulate human and artificial language learning, understanding, and use.

Research goals:
- understanding how languages convey meaning
- using computers to model, analyze, and reason about human language
- designing computer interfaces to exploit aspects of human cognition and artificial intelligence, including language processing
Specific problems of interest include:
- Statistical NLP: morphological, syntactic, and semantic parsing; machine translation; grammar learning; figurative language processing
- Cognitive linguistics: Construction Grammar; frame semantics; metaphor and metonymy; conceptual blending and mental spaces; usage-based theories of language learning
- Technology for linguistics: Use of technology to assist linguistic discovery and language revitalization
- Human-computer interaction: NLP-enabled user interfaces and information visualization
Frame-Semantic Parsing
The goal of this project was to build models to predict a sentence's frame-semantic structure. Predicting a frame-semantic parse involves finding and disambiguating frame-evoking expressions and matching roles of the evoked frames to arguments in the sentence. We have implemented a probabilistic frame parser for English which outperforms the previous state of the art.
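As a rough illustration of the disambiguation step described above, the sketch below scores candidate frames for a target word with a log-linear (softmax) model. Everything specific here is an assumption made for the example: the feature templates, the weights, and the candidate list are invented, and this is not SEMAFOR's feature set or API.

```python
import math
from typing import Dict, List


def features(target: str, frame: str) -> Dict[str, float]:
    """Hypothetical feature templates pairing a target word with a frame."""
    return {
        f"word={target}|frame={frame}": 1.0,
        f"frame={frame}": 1.0,
    }


def frame_distribution(target: str, candidates: List[str],
                       weights: Dict[str, float]) -> Dict[str, float]:
    """Return p(frame | target) under a log-linear model.

    Each candidate's score is the dot product of its features with the
    weight vector; scores are normalized with a numerically stable softmax.
    """
    scores = [
        sum(weights.get(name, 0.0) * value
            for name, value in features(target, frame).items())
        for frame in candidates
    ]
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {frame: e / total for frame, e in zip(candidates, exps)}


# Invented weights: one feature favoring a particular word/frame pairing.
example_weights = {"word=bank|frame=Businesses": 2.0}
print(frame_distribution("bank",
                         ["Businesses", "Relational_natural_features"],
                         example_weights))
```

In a real parser the weights would be learned from annotated data, and a separate model would then match each frame's roles to argument spans.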
Arabic NLP
The AQMAR project (a collaboration with CMU's Qatar campus) aims to advance the state of the art in NLP for Arabic text. We will develop tools for linguistic structure analysis, especially named entity recognition (NER) and semantic tagging, for use in the NLP community, with emphasis on domains other than news (namely, topics found in Arabic Wikipedia).
Exploring News Text
I am currently working on the RAVINE project, an effort combining NLP and information visualization technologies to build an interface facilitating efficient exploration and analysis of content from a large database of news articles. Our system scans articles to extract quotations (and their speakers) for display in an interactive graph. I have been primarily involved in designing the interface and in organizing a user study to evaluate its effectiveness.
Hebrew Morphology
Hebrew verbs use a root-and-pattern system, where a three-consonant root is lexicalized in one or more of seven verbal paradigms. Each verb, then, is a pairing of a root, a paradigm, and a meaning. An inflected verb's form is quite predictable, the meaning less so; many verbs have idiosyncratic meanings, but there are some regularities and tendencies which need to be accounted for, e.g. certain frequent alternations between paradigms for a common root. My analysis addresses the following questions:
- What are the forms and meanings of the morphological components of verbs—roots, paradigms, stems, and inflectional affixes?
- How do the forms and meanings of these constructions combine to yield actual verbs in sentences?
- How can these constructions be formalized in a structured representation that can be used for computational analysis?
I argue that construction grammar is an appropriate theoretical framework capable of accounting for the complexities of such a system. In particular, I use the Embodied Construction Grammar formalism to represent the necessary constructions in a manner suitable for automated analysis and simulation. Moreover, I argue that many features of the system are consistent with the notion of language as a best-fit cognitive phenomenon.
As part of an honors thesis under the supervision of Jerry Feldman, I designed a morphological extension to the Embodied Construction Grammar formalism and implemented this extension in the ECG parser.
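To make the root-and-pattern idea concrete, here is a minimal sketch of how a three-consonant root can be slotted into paradigm templates. It is plain string manipulation, not Embodied Construction Grammar, and it says nothing about meaning. The template notation (`C` for a root consonant), the function name, and the simplified transliterated past-tense forms are assumptions made for this example; phonological details such as spirantization are ignored.

```python
from typing import Sequence

SLOT = "C"  # placeholder for one root consonant in a template


def interdigitate(root: Sequence[str], template: str) -> str:
    """Fill the consonant slots of a template with the root's consonants."""
    if template.count(SLOT) != len(root):
        raise ValueError("template slots must match the number of root consonants")
    consonants = iter(root)
    return "".join(next(consonants) if ch == SLOT else ch for ch in template)


# Simplified templates for three of the seven verbal paradigms
# (third person masculine singular past, transliterated).
TEMPLATES = {
    "pa'al": "CaCaC",
    "pi'el": "CiCeC",
    "hif'il": "hiCCiC",
}

root = ("g", "d", "l")
for paradigm, template in TEMPLATES.items():
    print(paradigm, interdigitate(root, template))
```

For the root g-d-l this produces gadal ("grew"), gidel ("raised"), and higdil ("enlarged"), an instance of the kind of alternation between paradigms for a common root mentioned above.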
Other projects
Picurís Tagger
As a machine learning course project, in Fall 2007 I worked with fellow student Will Chang to develop a statistical model that would aid linguistic analysis of texts in Picurís, a Northern Tiwa language of New Mexico. A database of 28 stories in the language was compiled, and students in a recent linguistics course began the painstaking process of identifying the meanings of morphemes (meaning-bearing word fragments) in the texts.
Our model is a Hidden Markov Model over syllables; it predicts (a) the grouping of syllables within each word into morphemes (segmentation), and (b) a tag for each morpheme indicating its category/"part of speech" (classification). Trained with the EM algorithm, the model makes reasonable predictions with just a few labeled examples.
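The joint segmentation and classification idea can be sketched with Viterbi decoding over syllables, where each hidden state encodes both a boundary decision (`B` begins a morpheme, `I` continues one) and a category. This is an illustration only: the tag scheme, the probability tables, and the syllables are invented placeholders (not Picurís data), and EM training is not shown.

```python
import math
from typing import Dict, List, Tuple


def viterbi(syllables: List[str], states: List[str],
            start: Dict[str, float],
            trans: Dict[Tuple[str, str], float],
            emit: Dict[Tuple[str, str], float],
            floor: float = 1e-6) -> List[str]:
    """Return the most probable state sequence for a word's syllables."""
    if not syllables:
        return []

    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # Each column maps state -> (best log-probability, backpointer).
    best = [{s: (lp(start, s) + lp(emit, (s, syllables[0])), None)
             for s in states}]
    for syl in syllables[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans, (p, s))) for p in states),
                key=lambda pair: pair[1])
            column[s] = (score + lp(emit, (s, syl)), prev)
        best.append(column)

    # Follow backpointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


def to_morphemes(syllables: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group syllables into (morpheme, category) pairs using B-/I- tags."""
    morphemes: List[Tuple[str, str]] = []
    for syl, tag in zip(syllables, tags):
        boundary, category = tag.split("-", 1)
        if boundary == "B" or not morphemes:
            morphemes.append((syl, category))
        else:
            form, cat = morphemes[-1]
            morphemes[-1] = (form + syl, cat)
    return morphemes


# Invented toy model and placeholder syllables.
STATES = ["B-STEM", "I-STEM", "B-AFFIX"]
START = {"B-STEM": 0.9, "B-AFFIX": 0.1}
TRANS = {("B-STEM", "I-STEM"): 0.6, ("B-STEM", "B-AFFIX"): 0.4,
         ("I-STEM", "I-STEM"): 0.2, ("I-STEM", "B-AFFIX"): 0.8,
         ("B-AFFIX", "B-STEM"): 0.5, ("B-AFFIX", "B-AFFIX"): 0.5}
EMIT = {("B-STEM", "ta"): 0.8, ("I-STEM", "ko"): 0.7, ("B-AFFIX", "mi"): 0.9}

word = ["ta", "ko", "mi"]
tags = viterbi(word, STATES, START, TRANS, EMIT)
print(to_morphemes(word, tags))
```

Folding the boundary decision into the state label lets a single sequence model handle both subtasks, at the cost of a larger state space.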
Metonymy Classification
For a course project in Spring 2007, I worked with Srini Narayanan on the problem of identifying whether a given verb is being used metonymically. I developed and tested a classifier for metonymic vs. literal sentences. Further work is needed to determine the semantic categories of a particular verb’s literal arguments.
Software
NLP Tools
Education
Graduate
Ph.D. student since Fall 2008, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania. My research is in statistical natural language processing; I am advised by Noah Smith. I have completed the following courses:
- Language & Statistics II (Noah Smith, Fall 2008)
- Grammar Formalisms (Lori Levin, Spring 2009)
- Information Extraction (William Cohen, Fall 2009)
- Reading the Web (Tom Mitchell, Fall 2009)
- Advanced NLP Seminar (Noah Smith, Spring 2009, Spring 2010, Spring 2011)
- NLP Lab (Spring 2009)
- Introduction to Computer Science Education (Leigh Ann Sudol, Jan.–Feb. 2010)
- Alternative Syntactic Theories: Construction Grammar (University of Pittsburgh, Yasuhiro Shirai, Fall 2008)
I have served as the TA for:
Undergraduate
In 2008 I graduated from the University of California, Berkeley with a double major in Computer Science and Linguistics. Courses include:
Computer Science
Linguistics
- The Mind and Language
- Advanced Cognitive Linguistics
- Modern Hebrew Linguistics
- Syntax and Semantics
- Comparative and Historical Linguistics
- Phonology and Morphology
- Phonetics
- The Neural Basis of Thought and Language
- Neural Theory of Language Seminar
Languages
- עברית מודרנית (Modern Hebrew) – 4 semesters' worth
- français (French) – 1 semester
- العربِيّة (Arabic) – 1 semester
Other
High School
I attended Sycamore High School in Cincinnati, Ohio, where I studied computer science for four years and Modern Hebrew for three, and participated in the orchestra (violin), spring musicals (stage crew), academic quiz team, world affairs council, and Scrabble Club.
Potpourri
Activities
Academic
Groups
Current:
Undergraduate:
Programming
My programming languages of choice are Python and Java; I've also used C, C++, C#, and Scheme. For the web I use JavaScript and PHP.
Hobbies
I play the violin and enjoy table tennis and photography.
Typography
Firefox Extensions
Firefox extensions that I find indispensable to the browsing experience include:
- Adblock Plus: Hide ads on web pages
- All-in-One Gestures: Enables simple mouse movements for common browser actions
- Firebug: Invaluable tool for web designers; includes a DOM browser, style information, and a JavaScript console
- Zotero: This citation manager is priceless (good thing it's free!); it imports citation information from online catalogs and electronic journals, stores snapshots of web pages, organizes notes and other metadata, and generates bibliographies/exports to BibTeX. I contributed a script which makes papers on the ACL Anthology visible to Zotero.
Links