Data collection and analysis of Mapudungun morphology for spelling correction

Lori Levin, Rodolfo Vega, Christian Monson, Ralf Brown, Ariadna Font Llitjos, Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University

Eliseo Cañulef, Rosendo Huisca
Instituto de Estudios Indígenas, Universidad de La Frontera

Carolina Huenchullán, Claudio Millacura
Programa de Educación Intercultural Bilingüe, Ministerio de Educación de Chile

This paper describes part of a three-year collaboration between Carnegie Mellon University's Language Technologies Institute, the Programa de Educación Intercultural Bilingüe of the Chilean Ministry of Education, and Universidad de La Frontera (Temuco, Chile). In a previous paper (Levin et al. 2002) we provided an overview of the project. In this paper, we focus on the preparation of corpora and lexica that will support a spelling corrector for Mapudungun.

Mapudungun is the language of over 900,000 Mapuche people in Chile and Argentina. The language is polysynthetic with noun and verb incorporation. While the morphology of other parts of speech is relatively simple, Mapudungun has a complex agglutinative verbal morphology. Although some analyses divide Mapudungun verbal morphology into as many as 36 slots (Smeets, 1989), a typical complex verb form occurring in our corpus of spoken Mapudungun might consist of seven or eight morphemes. A verb begins with a stem and ends with an obligatory morpheme sequence marking, in the case of finite clauses, the person and number of the subject together with the mood of the verb or, in the case of non-finite clauses, adverbialization or nominalization. A number of morphemes may occur between the verb stem and the final verb morpheme cluster, including applicative, directional, aspectual, and tense markers. If incorporation occurs, the incorporated noun or verb is placed immediately following the verb stem.
The relative order of the verbal morphemes is generally fixed, and the few morphophonemic changes that occur at morpheme boundaries are simple.

Our project has scientific and social significance. The scientific novelty of the project is in the application of computational tools (such as morphological analysis, Example-Based Machine Translation, and Transfer-Based MT) to a polysynthetic language. We are also working on new techniques for automatically learning transfer rules from word-aligned bilingual data (Carbonell et al. 2002; Probst et al. 2001, 2002a, 2002b, 2003; Lavie et al. to appear). The social significance of the project stems from the Chilean Ministry of Education's commitment to bilingual education in Spanish and Mapudungun for Mapuche children, where computer-based tools are a welcome part of the bilingual education program. Chile's electronic education network project, ENLACES, for example, provides computers and networking to all Chilean schools, including those in rural areas.

The CMU-Chile project, Avenue-Mapudungun, is planning two tools for the near future: an on-line lexicon with examples of usage from the corpus of spoken Mapudungun, and a spelling checker for Mapudungun based on MySpell, the spell checking system used by the open-source office suite OpenOffice. Details of the corpora and lexica follow.

The Corpus of Spoken Mapudungun: In the last three years, the Chilean Ministry of Education and CMU's Avenue project have supported the collection of 170 hours of spoken Mapudungun. The recordings (all on the topic of health care) have been transcribed and translated into Spanish at the Instituto de Estudios Indígenas at Universidad de La Frontera. The corpus covers three dialects of Mapudungun: 120 hours of Nguluche, 30 hours of Lafkenche, and 20 hours of Pewenche. The corpus is described in more detail in Levin et al. 2002.
Full Form Word List: The 70,000 most frequent full form Mapudungun words (stem plus inflections) were extracted from the corpus of spoken Mapudungun. The 70,000 words were checked for spelling using spelling conventions that were devised by Mapuche linguists at Universidad de La Frontera. (There is not yet a universally accepted orthography for Mapudungun.)

Bilingual Lexicon: Each entry in the bilingual lexicon consists of a full form Mapudungun word, a segmentation of the word into morphemes, a gloss for each morpheme, a Spanish translation of the word, a Mapudungun sentence containing the word from the corpus of spoken Mapudungun, and a Spanish translation of the sentence. Currently, there are about 3000 words in the lexicon. The lexicon is in a very general text-only format that can be re-configured for any computer-based lexicon interface.

Spelling Checker: A spelling checker must strike a delicate balance: if it fails to mark incorrectly spelled words it is useless, and if it marks correctly spelled words as incorrect it annoys the user. To create a helpful spelling checker for Mapudungun, we will necessarily incorporate knowledge of Mapudungun morphology. As a first pass at the spelling checker, we will treat each possible sequence of suffixes as one suffix. In order to obtain a list of possible suffix sequences, we will have native speakers separate the stem from the remainder of the verbal morphemes in our Full Form Word List of 70,000 Mapudungun word forms. We will then take the list of valid morpheme combinations and use it to interactively spell-check new word forms written by the user in OpenOffice.

We are not building a comprehensive theoretical model of Mapudungun morphology at this stage for two reasons. First, while morpheme order in Mapudungun is generally fixed and while morphophonemic changes are few, there are exceptions to both of these rules.
As this spelling corrector is intended to be actively used by native speakers, we do not wish to be guilty of prescriptive linguistics by telling the user that a word form is incorrectly spelled merely because it does not conform to our theoretical model of Mapudungun morphology. Second, we wish to create a spelling checker for a major word processor, and because commercial word processors use proprietary spelling correction systems, we are required to use MySpell, the OpenOffice spelling corrector, which is limited to appending a single affix (or affix group) to a stem.

Jaime Carbonell, Katharina Probst, Erik Peterson, Christian Monson, Alon Lavie, Ralf Brown, and Lori Levin. Automatic Rule Learning for Resource-Limited MT. In: Proceedings of AMTA 2002. (Copyright Springer Verlag)
Alon Lavie, Stephan Vogel, Lori Levin, Erik Peterson, Katharina Probst, Ariadna Font Llitjos, Rachel Reynolds, Jaime Carbonell, Richard Cohen. Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario. To appear in TALIP.
Lori Levin, Alon Lavie, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Eliseo Canulef, Carolina Huenchullan. Data Collection and Language Technologies for Mapudungun. International Workshop on Resources and Tools in Field Linguistics. LREC. 2002.
Katharina Probst, Ralf Brown, Jaime Carbonell, Alon Lavie, Lori Levin, and Erik Peterson. 2001. Design and Implementation of Controlled Elicitation for Machine Translation of Low-density Languages. In: Proceedings of the MT 2010 Workshop at MT Summit 2001.
Katharina Probst. Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages. In: Proceedings of the ESSLLI 2002 Student Session.
Katharina Probst, Lori Levin. Challenges in Automated Elicitation of a Controlled Bilingual Corpus. In: Proceedings of TMI 2002.
Katharina Probst, Lori Levin, Erik Peterson, Alon Lavie, Jaime Carbonell. MT for Resource-Poor Languages Using Elicitation-Based Learning of Syntactic Transfer Rules. To appear in: Machine Translation, Special Issue on Embedded MT. 2003.
Ineke Smeets. A Mapuche Grammar. Ph.D. Dissertation. University of Leiden. 1989.
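
As a rough illustration of the first-pass design described under Spelling Checker, the Python sketch below collapses each suffix sequence into a single suffix group, accepts a new word form when it splits into a known stem plus a known group, and emits dictionary and affix fragments in the MySpell layout as it is commonly documented (`SFX` rules in the affix file, `stem/flag` entries in the dictionary file). Everything specific here is an assumption made for illustration: the sample word forms and their simplified segmentations were chosen by the editor and do not come from the project's Full Form Word List, and the function names and the flag `A` are hypothetical. It is not the project's implementation.

```python
# Toy segmentations: (full form, stem, suffix morphemes).
# ASSUMPTION: illustrative forms only, not entries from the project's word list.
SEGMENTED = [
    ("amun", "amu", ["n"]),
    ("amuan", "amu", ["a", "n"]),
    ("amulan", "amu", ["la", "n"]),
    ("küpan", "küpa", ["n"]),
]


def collect_suffix_groups(segmented):
    """Return (stems, groups), joining each suffix sequence into one group."""
    stems, groups = set(), set()
    for full_form, stem, morphemes in segmented:
        group = "".join(morphemes)
        if stem + group != full_form:
            raise ValueError(f"segmentation does not rebuild {full_form!r}")
        stems.add(stem)
        groups.add(group)
    return stems, groups


def is_known(word, stems, groups):
    """True if the word splits into a known stem plus a known suffix group."""
    return any(
        word.startswith(stem) and word[len(stem):] in groups for stem in stems
    )


def myspell_fragments(stems, groups, flag="A"):
    """Build affix-file and dictionary-file text (fragments, no SET/TRY header)."""
    aff = [f"SFX {flag} Y {len(groups)}"]
    # One rule per group: strip nothing ("0"), append the group, any context (".").
    aff += [f"SFX {flag} 0 {group} ." for group in sorted(groups)]
    # The dictionary file starts with an entry count, then one stem per line.
    dic = [str(len(stems))] + [f"{stem}/{flag}" for stem in sorted(stems)]
    return "\n".join(aff), "\n".join(dic)


if __name__ == "__main__":
    stems, groups = collect_suffix_groups(SEGMENTED)
    print(is_known("küpalan", stems, groups))  # stem and group seen separately
    aff_text, dic_text = myspell_fragments(stems, groups)
    print(aff_text)
    print(dic_text)
```

Because suffix groups are attached to a shared flag, a group observed with one stem is accepted with every stem carrying that flag, which is what lets the checker recognize forms absent from the word list. One consequence worth checking against real data is over-acceptance: a dictionary entry is normally accepted on its own as well, so bare stems would pass unless handled separately.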