I am a Ph.D. candidate in the Language Technologies Institute of the School of Computer Science at Carnegie Mellon University. My primary field of study is statistical Natural Language Processing (NLP), which develops techniques for computers to do intelligent things with human language. Under the supervision of Noah Smith I’m currently working on the RAVINE project, looking for ways to extract information from quotes in news articles.
Before coming to CMU I did my undergrad at UC Berkeley, where I majored in Computer Science and Linguistics. My research focus then was on computational cognitive linguistics, the study of human language understanding using computational methods and models.
Hebrew verbs use a root-and-pattern system, where a three-consonant root is lexicalized in one or more of seven verbal paradigms. Each verb, then, is a pairing of a root, a paradigm, and a meaning. An inflected verb's form is quite predictable, the meaning less so; many verbs have idiosyncratic meanings, but there are some regularities and tendencies which need to be accounted for, e.g. certain frequent alternations between paradigms for a common root. My analysis addresses the following questions:
I argue that construction grammar is an appropriate theoretical framework capable of accounting for the complexities of such a system. In particular, I use the Embodied Construction Grammar formalism to represent the necessary constructions in a manner suitable for automated analysis and simulation. Moreover, I argue that many features of the system are consistent with the notion of language as a best-fit cognitive phenomenon.
As part of an honors thesis under the supervision of Jerry Feldman, I designed a morphological extension to the Embodied Construction Grammar formalism and implemented this extension in the ECG parser.
In Fall 2007 I worked with fellow student Will Chang to develop a statistical model that would aid linguistic analysis of texts in Picurís, a Northern Tiwa language of New Mexico. A database of 28 stories in the language was compiled, and students in a recent linguistics course began the painstaking process of identifying the meanings of morphemes (meaning-bearing word fragments) in the texts.
Our model is a Hidden Markov Model over syllables; it predicts (a) the grouping of syllables within each word into morphemes (segmentation), and (b) a tag for each morpheme indicating its category/"part of speech" (classification). Trained with the EM algorithm, the model makes reasonable predictions with just a few labeled examples.
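To make the idea of joint segmentation and classification concrete, here is a minimal sketch of Viterbi decoding for an HMM whose hidden states pair a boundary flag with a category. It is an illustration under stated assumptions, not the project's implementation: the state inventory (`B-PFX`, `B-STEM`, `I-STEM`), the probabilities, and the syllables `ta`, `ki`, `lo` are all invented for the example and are not Picurís data. EM training is not shown; the parameters are simply set by hand.

```python
import math

# Hypothetical state inventory: "B-X" starts a morpheme of category X,
# "I-X" continues it. These labels are illustrative, not the project's tag set.
STATES = ["B-PFX", "B-STEM", "I-STEM"]

# Hand-set toy parameters (assumptions); a real model would estimate them with EM.
START = {"B-PFX": 0.6, "B-STEM": 0.4}
TRANS = {
    ("B-PFX", "B-STEM"): 0.9, ("B-PFX", "B-PFX"): 0.1,
    ("B-STEM", "I-STEM"): 0.5, ("B-STEM", "B-STEM"): 0.5,
    ("I-STEM", "I-STEM"): 0.5, ("I-STEM", "B-STEM"): 0.5,
}
EMIT = {("B-PFX", "ta"): 0.8, ("B-STEM", "ki"): 0.5, ("I-STEM", "lo"): 0.5}


def viterbi(syllables, start, trans, emit, floor=1e-6):
    """Return the most probable state sequence for one word's syllables."""
    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # Each column maps state -> (best log score ending in that state, backpointer).
    best = [{s: (lp(start, s) + lp(emit, (s, syllables[0])), None) for s in STATES}]
    for syl in syllables[1:]:
        column = {}
        for s in STATES:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans, (p, s))) for p in STATES),
                key=lambda pair: pair[1],
            )
            column[s] = (score + lp(emit, (s, syl)), prev)
        best.append(column)

    # Follow backpointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path


def segment(syllables, path):
    """Group syllables into (morpheme, category) pairs from a B/I state path."""
    morphemes = []
    for syl, state in zip(syllables, path):
        flag, tag = state.split("-")
        if flag == "B" or not morphemes or morphemes[-1][1] != tag:
            morphemes.append([syl, tag])   # open a new morpheme
        else:
            morphemes[-1][0] += syl        # extend the current morpheme
    return [tuple(m) for m in morphemes]


word = ["ta", "ki", "lo"]
path = viterbi(word, START, TRANS, EMIT)
print(segment(word, path))
```

Encoding the boundary flag in the state lets a single decoding pass recover both the segmentation and the category of each morpheme, which is one common way to set up this kind of joint prediction.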
In Spring 2007 I worked with Srini Narayanan on the problem of identifying whether a given verb was being used metonymically or not. I developed and tested a classifier for metonymic vs. literal sentences, and found that—at least for a particular category of verbs—deciding whether the subject is literal or metonymic essentially reduces to determining whether the subject refers to a person or not.
In future work I hope to find a semi-supervised approach to identifying whether a noun is being used literally or metonymically, which would scale better than a fully supervised classifier and thus be useful for a variety of natural language understanding applications. Since identifying and categorizing the subject, object, and any other arguments can be done with existing lexical resources and tools, the chief difficulty is in determining the semantic categories for a particular verb’s literal arguments.
I plan to investigate whether this can be done with an active learning approach, wherein the system attempts to cluster a novel verb with known ones (using a lexical resource as an ontology of verb senses) and asks the user to specify selectional restrictions if it can’t.
My programming languages of choice are Python and Java; I've also used C, C++, C#, and Scheme. For the web I use JavaScript and PHP.
I play the violin and enjoy table tennis and photography.
My favorite fonts include: Zapf Humanist/Optima, Segoe UI, Georgia, and Lucida Bright.
Curriculum Vitae (PDF)