Research
The linguistic structure inherent in natural language ought to facilitate solutions to natural language processing tasks from machine translation to speech recognition. Surprisingly, many modern approaches to natural language problems treat language as merely a sequence of meaningless tokens. I seek to improve natural language processing systems by infusing them with appropriate linguistic structure.
In my day-to-day work, I draw on linguistic structure to develop a language-independent system for morphological analysis. The morphology of a natural language is the internal structure of its words. In English, for example, the word jumped contains two internal pieces:
- jump, a root morpheme, signifies the type of action (this person didn't fly but rather jumped), and
- -ed, a suffix morpheme, marks that the jumping action occurred in the past.
But morphological structure varies from language to language. Morphemes that carry the same piece of meaning differ between languages: the English root morpheme jump translates as saltar in Spanish. More fundamentally, one language may use morphology to mark features that another language marks with a completely separate word. Verbs in Spanish change form (that is, morphology) to mark both past and future tense: saltó for she jumped, saltará for she will jump. Verbs in English change morphology only in the past tense, jumped, and use a base form with a separate helping word in the future: will jump.
Since forms and features change with the language, a language-independent morphological analysis system must be able to examine a new language and automatically infer the internal structure of its words. To infer morphological structure, my system leverages the inherent organizational structure of morphology: the paradigm.
Paradigms are simply the conjugation tables that students memorize when learning a new language. An adult learning English knows that regular verbs in English appear in one of four forms: a base form like jump; a past tense form, jumped (also often used for passive constructions); a present tense third person singular form, (he) jumps; and a progressive form, jumping. Each time the student produces an English verb it must be in one of these four forms: there are no other forms of the word jump, and distinct forms cannot be somehow merged (if someone was jumping yesterday, we can't say *jumpeding or *jumpinged). And so the verbal paradigm of jump consists of the four mutually exclusive word endings NULL.ed.ing.s, where NULL indicates the lack of a suffix in the base form.
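To make the notation concrete, here is a minimal Python sketch of a paradigm as a set of mutually exclusive endings. The names NULL, VERB_PARADIGM, and inflect are hypothetical, chosen only for this illustration, and NULL is represented as the empty string.

```python
# Illustrative only: these names are assumptions, not part of any real system.
NULL = ""  # the missing suffix of the base form

# The regular English verb paradigm NULL.ed.ing.s
VERB_PARADIGM = (NULL, "ed", "ing", "s")


def inflect(stem, paradigm=VERB_PARADIGM):
    """Attach exactly one ending from the paradigm to the stem, once per ending."""
    return [stem + suffix for suffix in paradigm]


forms = inflect("jump")
print(forms)  # ['jump', 'jumped', 'jumping', 'jumps']
```

Because each form carries exactly one ending, merged forms such as *jumpeding never appear in the list.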
My language-independent morphological analysis system, ParaMor, takes paradigms as the structure of morphology. From unannotated text in any language, ParaMor first searches for likely candidate paradigms and then employs the discovered paradigms to segment words into morphemes. To search for candidate paradigms, ParaMor tracks words that allow identical sets of morphemes to attach. For instance, ParaMor might notice that the word walk takes exactly the same suffixes as jump, NULL.ed.ing.s: walk, walked, walking, and walks. Thus the evidence for the NULL.ed.ing.s paradigm grows.
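The two steps can be illustrated with a toy Python sketch. This is not ParaMor's actual algorithm, only a simplified illustration of the idea: it tries every split of every word into a stem and a short suffix, groups stems that take an identical suffix set, and then segments a word using only suffix sets supported by several stems. The function names and the max_suffix_len and min_stems parameters are assumptions made for this example.

```python
from collections import defaultdict

NULL = ""  # the empty suffix, written NULL in paradigm names


def candidate_paradigms(words, max_suffix_len=3):
    """Group candidate stems by the exact set of suffixes they occur with."""
    suffixes_by_stem = defaultdict(set)
    for word in set(words):
        # Try every split that leaves a non-empty stem and a short suffix.
        first_cut = max(1, len(word) - max_suffix_len)
        for cut in range(first_cut, len(word) + 1):
            suffixes_by_stem[word[:cut]].add(word[cut:])

    stems_by_paradigm = defaultdict(set)
    for stem, suffixes in suffixes_by_stem.items():
        if len(suffixes) > 1:  # a single ending is no evidence of a paradigm
            stems_by_paradigm[frozenset(suffixes)].add(stem)
    return stems_by_paradigm


def paradigm_name(suffixes):
    """Render a suffix set in the NULL.ed.ing.s notation."""
    return ".".join(sorted(s or "NULL" for s in suffixes))


def segment(word, stems_by_paradigm, min_stems=2):
    """Split a word into (stem, suffix) using paradigms backed by enough stems."""
    for suffixes, stems in stems_by_paradigm.items():
        if len(stems) < min_stems:
            continue
        for suffix in suffixes:
            if suffix and word.endswith(suffix) and word[: -len(suffix)] in stems:
                return word[: -len(suffix)], suffix
    return word, NULL


corpus = ["jump", "jumped", "jumping", "jumps",
          "walk", "walked", "walking", "walks"]
paradigms = candidate_paradigms(corpus)
best = max(paradigms, key=lambda p: len(paradigms[p]))
print(paradigm_name(best), sorted(paradigms[best]))
print(segment("walked", paradigms))
```

On this tiny corpus, spurious candidates such as the stem jum with the endings p, ped, and ps are each supported by only one stem, while NULL.ed.ing.s is supported by both jump and walk. A real system needs far more careful search and filtering than this stem-count threshold.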
ParaMor placed well in Morpho Challenge 2007, a competition pitting language-independent morphological analysis systems head-to-head. I entered ParaMor in the English and the German tracks. In English, ParaMor outperformed a state-of-the-art baseline system, Morfessor, placing 4th overall. In German, a combined ParaMor-Morfessor system placed 1st.
If language-independent morphological analysis intrigues you, you can find more details on my work and references to other research in this area in the paper I presented at the SIGMORPHON workshop at ACL 2007.
* An asterisk denotes an ungrammatical construction.