
Nathan Schneider

Abstract

I am a Ph.D. candidate in the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University. My primary field of study is statistical Natural Language Processing (NLP), which develops techniques for computers to do intelligent things with human language. Under the supervision of Noah Smith I’m currently working on the RAVINE project, looking for ways to extract information from quotes in news articles.

Before coming to CMU I did my undergrad at UC Berkeley, where I majored in Computer Science and Linguistics. My research focus then was on computational cognitive linguistics, the study of human language understanding using computational methods and models.

Academic Interests

Computer Science

Natural language processing (NLP) and computational linguistics—including:

I would also like to learn more about:

Linguistics

I prefer the theoretical framework of cognitive linguistics; topics of interest include:

Research

Hebrew Verb Morphology & ECG

Hebrew verbs use a root-and-pattern system, where a three-consonant root is lexicalized in one or more of seven verbal paradigms. Each verb, then, is a pairing of a root, a paradigm, and a meaning. An inflected verb's form is quite predictable, the meaning less so; many verbs have idiosyncratic meanings, but there are some regularities and tendencies which need to be accounted for, e.g. certain frequent alternations between paradigms for a common root. My analysis addresses the following questions:

  1. What are the forms and meanings of the morphological components of verbs—roots, paradigms, stems, and inflectional affixes?
  2. How do the forms and meanings of these constructions combine to yield actual verbs in sentences?
  3. How can these constructions be formalized in a structured representation that can be used for computational analysis?

I argue that construction grammar is an appropriate theoretical framework capable of accounting for the complexities of such a system. In particular, I use the Embodied Construction Grammar formalism to represent the necessary constructions in a manner suitable for automated analysis and simulation. Moreover, I argue that many features of the system are consistent with the notion of language as a best-fit cognitive phenomenon.

As part of an honors thesis under the supervision of Jerry Feldman, I designed a morphological extension to the Embodied Construction Grammar formalism and implemented this extension in the ECG parser.
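
As a rough illustration of the root-and-pattern idea described above (and not of the ECG formalization itself), here is a minimal Python sketch. The function name, the `C`-slot template notation, the simplified transliterations, and the three example paradigms are assumptions chosen for illustration; forms are past tense, third person masculine singular, and phonological details are ignored.

```python
# Illustrative sketch only: a verb is a pairing of a root, a paradigm, and a
# meaning. Templates and glosses are simplified assumptions, not the ECG
# constructions from the thesis.

# Each uppercase "C" is a slot for one root consonant; other letters are fixed.
TEMPLATES = {
    "pa'al": "CaCaC",
    "pi'el": "CiCeC",
    "hif'il": "hiCCiC",
}

# The meaning is stored per (root, paradigm) pair, since it is not fully
# predictable from the parts.
LEXICON = {
    ("gdl", "pa'al"): "grew",
    ("gdl", "pi'el"): "raised",
    ("gdl", "hif'il"): "enlarged",
}


def interdigitate(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template slots and root consonants do not match")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


def verbs_for_root(root):
    """Return (paradigm, form, gloss) for every lexicalized paradigm of a root."""
    return [
        (paradigm, interdigitate(root, TEMPLATES[paradigm]), gloss)
        for (r, paradigm), gloss in LEXICON.items()
        if r == root
    ]


for paradigm, form, gloss in verbs_for_root("gdl"):
    print(f"{paradigm:7s} {form:7s} {gloss}")
```

The form follows mechanically from the root and template, while the gloss has to be looked up, which mirrors the point that form is predictable and meaning less so.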

Picurís Tagger

In Fall 2007 I worked with fellow student Will Chang to develop a statistical model that would aid linguistic analysis of texts in Picurís, a Northern Tiwa language of New Mexico. A database of 28 stories in the language was compiled, and students in a recent linguistics course began the painstaking process of identifying the meanings of morphemes (meaning-bearing word fragments) in the texts.

Our model is a Hidden Markov Model over syllables; it predicts (a) the grouping of syllables within each word into morphemes (segmentation), and (b) a tag for each morpheme indicating its category/"part of speech" (classification). Trained with the EM algorithm, the model makes reasonable predictions with just a few labeled examples.
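
One way to picture joint segmentation and classification with an HMM over syllables is sketched below. This is an assumed encoding, not necessarily the one used in the project: each hidden state is `B-TAG` (syllable begins a morpheme of category TAG) or `I-TAG` (syllable continues it), and decoding uses the standard Viterbi algorithm. The syllables, tags `X` and `Y`, and all probabilities are made up for the example and are not Picurís data; parameters are set by hand here rather than learned with EM.

```python
import math


def lg(p):
    """Log of a probability, with log(0) mapped to -inf."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(obs, states, start, trans, emit):
    """Most probable state sequence for obs under an HMM (log-space Viterbi)."""
    # best[t][s] = (log-probability of the best path ending in s at t, backpointer)
    best = [{s: (lg(start[s]) + lg(emit[s][obs[0]]), None) for s in states}]
    for t in range(1, len(obs)):
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] + lg(trans[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            column[s] = (score + lg(emit[s][obs[t]]), prev)
        best.append(column)
    # Follow backpointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path


def to_morphemes(syllables, path):
    """Group syllables into (morpheme, tag) pairs using the B-/I- states."""
    morphemes = []
    for syl, state in zip(syllables, path):
        kind, tag = state.split("-")
        if kind == "B" or not morphemes:
            morphemes.append([syl, tag])
        else:
            morphemes[-1][0] += syl
    return [tuple(m) for m in morphemes]


# Hypothetical toy parameters (hand-set, not learned).
STATES = ["B-X", "I-X", "B-Y"]
START = {"B-X": 0.6, "I-X": 0.0, "B-Y": 0.4}
TRANS = {
    "B-X": {"B-X": 0.1, "I-X": 0.6, "B-Y": 0.3},
    "I-X": {"B-X": 0.2, "I-X": 0.2, "B-Y": 0.6},
    "B-Y": {"B-X": 0.5, "I-X": 0.0, "B-Y": 0.5},
}
EMIT = {
    "B-X": {"ka": 0.7, "lo": 0.2, "mi": 0.1},
    "I-X": {"ka": 0.1, "lo": 0.8, "mi": 0.1},
    "B-Y": {"ka": 0.1, "lo": 0.1, "mi": 0.8},
}

word = ["ka", "lo", "mi"]
path = viterbi(word, STATES, START, TRANS, EMIT)
print(path)                      # ['B-X', 'I-X', 'B-Y']
print(to_morphemes(word, path))  # [('kalo', 'X'), ('mi', 'Y')]
```

Folding the boundary into the state means a single decoding pass yields both the segmentation and the per-morpheme tags; the zero-probability entries rule out sequences where a continuation state has nothing to continue.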

Metonymy Classification

In Spring 2007 I worked with Srini Narayanan on the problem of identifying whether a given verb was being used metonymically or not. I developed and tested a classifier for metonymic vs. literal sentences, and found that—at least for a particular category of verbs—deciding whether the subject is literal or metonymic essentially reduces to determining whether the subject refers to a person or not.

In future work I hope to look for a semi-supervised approach to identifying whether a noun is being used literally or metonymically, which would scale better than a fully-supervised classifier and thus be useful for a variety of natural language understanding applications. As identifying and categorizing the subject, object, and any other arguments can be done with existing lexical resources and tools, the chief difficulty is in determining the semantic categories for a particular verb’s literal arguments.

I plan to investigate whether this can be done with an active learning approach, wherein the system attempts to cluster a novel verb with known ones (using a lexical resource as an ontology of verb senses), and asks the user to specify selectional restrictions if it can’t.

Selected Coursework

Computer Science

Linguistics

Languages

Other

Potpourri

Activities

Clubs

Programming

My programming languages of choice are Python and Java; I've also used C, C++, C#, and Scheme. For the web I use JavaScript and PHP.

Hobbies

I play the violin and enjoy table tennis and photography.

Typography

My favorite fonts include: Zapf Humanist/Optima, Segoe UI, Georgia, and Lucida Bright.


Firefox Add-ons

IMHO, these are indispensable enhancements to the Firefox browsing experience:

CV

Curriculum Vitae[+PDF]

last updated 10 october 2008