Data collection and analysis of Mapudungun morphology for spelling correction

Lori Levin, Rodolfo Vega, Christian Monson, Ralf Brown, Ariadna Font Llitjos, Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University

Eliseo Canulef, Rosendo Huisca
Instituto de Estudios Indigenas, Universidad de la Frontera

This paper describes part of a three-year collaboration between Carnegie Mellon University's Language Technologies Institute, the Chilean Ministry of Education, and Universidad de la Frontera (Temuco, Chile). In a previous paper (Levin et al., 2001) we provided an overview of the project. In this paper, we focus on the preparation of corpora and lexica that will support a spelling corrector for Mapudungun.

Mapudungun is the language of over 900,000 Mapuche people in Chile and Argentina. The language is polysynthetic with noun and verb incorporation. While the morphology of other parts of speech is relatively simple, Mapudungun has a complex agglutinative verbal morphology; some analyses divide verbal morphology into as many as 36 slots (!!!!!!!!!!SMEETS!!!!!!!!!). A typical complex verb form occurring in our corpus of spoken Mapudungun might consist of seven or eight morphemes. A verb begins with a stem and ends with an obligatory morpheme sequence marking, in the case of finite clauses, the person and number of the subject together with the mood of the verb or, in the case of non-finite clauses, adverbialization or nominalization. A number of morphemes may occur between the verb stem and the final verb morpheme cluster, including applicative, directional, aspectual, and tense markers. If incorporation occurs, the incorporated noun or verb immediately follows the verb stem. The relative order of the verbal morphemes is generally fixed, and the morphophonemic changes at morpheme boundaries are few and simple.

Our project has scientific and social significance.
The scientific novelty of the project is in the application of computational tools (such as morphological analysis, Example-Based Machine Translation, and Transfer-Based MT) to a polysynthetic language. We are also working on new techniques for automatically learning transfer rules from word-aligned bilingual data. (!!!!!!!!!CITE!!!!!!!!!!!!!!) The social significance of the project is related to the Chilean Ministry of Education's commitment to bilingual education in Spanish and Mapudungun for Mapuche children. Computer-based tools are a welcome part of the bilingual education program. (Chile's ENLACES project provides computers and networking to all Chilean schools, even those in rural areas.)

The CMU-Chile project, {\sc Avenue}-Mapudungun, is planning two tools for the near future: an on-line lexicon with examples of usage from the corpus of spoken Mapudungun, and a spelling checker for Mapudungun based on MySpell, the spell checking system used by OpenOffice, an open source office suite.

A spelling checker must strike a delicate balance: it is useless if it fails to mark incorrectly spelled words, and annoying if it marks correctly spelled words as incorrect. To create a helpful spelling checker for Mapudungun we will necessarily incorporate knowledge of Mapudungun morphology. In particular, we will have native speakers separate the stem from the rest of the verbal morphemes in our manually spell-checked Full Form Word List of 80,000 Mapudungun word forms. We will then take the resulting list of valid morpheme combinations and use it to interactively spell-check new word forms written by the user.

We are not building a comprehensive theoretical model of Mapudungun morphology at this stage for two reasons. First, while morpheme order is generally fixed and morphophonemic changes are few, there are exceptions to both of these rules in Mapudungun.
As this spelling corrector is intended to be actively used by native speakers, we do not wish to be guilty of prescriptive linguistics by telling a user that a form is incorrectly spelled simply because it does not conform to our theoretical model of Mapudungun morphology. Second, because we wish to create a spelling checker for a major word processor, and because commercial word processors use proprietary spelling correction systems, we are required to use MySpell, the OpenOffice spelling corrector, which is limited to appending a single affix (or affix group) to a stem.

Following are details of the corpora and lexica:

The Corpus of Spoken Mapudungun: In the last three years, the Chilean Ministry of Education and CMU's {\sc Avenue} project (!!!!CITATIONS!!!!) have supported the collection of 170 hours of spoken Mapudungun. The recordings (all on the topic of health care) have been transcribed and translated into Spanish at the Instituto de Estudios Indigenas at Universidad de la Frontera. The corpus covers two dialects of Mapudungun, 120 hours of ?????? and 50 hours of ?????. The corpus is described in more detail in Levin et al. (2001).

Full Form Word List: The 80,000 most frequent full form Mapudungun words (stem plus inflections) were extracted from the corpus of spoken Mapudungun. The 80,000 words were checked for spelling using spelling conventions that were devised by Mapuche linguists at Universidad de la Frontera. (There is not yet a universally accepted orthography for Mapudungun.)

Bilingual Lexicon: Each entry in the bilingual lexicon consists of a full form Mapudungun word, a segmentation of the word into morphemes, a gloss for each morpheme, a Spanish translation of the word, a Mapudungun sentence containing the word from the corpus of spoken Mapudungun, and a Spanish translation of the sentence. Currently, there are about ??????? words in the lexicon.
The lexicon is stored in a very general text-only format that can be reconfigured for any computer-based lexicon interface.
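
As a rough illustration of the spelling-checker design described above (a stem followed by a single affix group, both taken from hand-checked word forms), the following Python sketch may help. It is not the project's implementation and does not use MySpell. The names `SegmentedWord`, `build_tables`, and `is_accepted` are hypothetical, the word forms are toy placeholders rather than corpus data, and the sketch makes the simplifying assumption that any attested affix group may follow any attested stem.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple


@dataclass(frozen=True)
class SegmentedWord:
    """A checked full form word, split by an annotator into two parts."""
    stem: str
    affix_group: str  # every morpheme after the stem, treated as one unit


def build_tables(words: Iterable[SegmentedWord]) -> Tuple[Set[str], Set[str]]:
    """Collect the attested stems and the attested affix groups."""
    words = list(words)
    stems = {w.stem for w in words}
    affix_groups = {w.affix_group for w in words}
    return stems, affix_groups


def is_accepted(word: str, stems: Set[str], affix_groups: Set[str]) -> bool:
    """Accept a word if some split yields a known stem plus a known affix group.

    Assumption (not from the source): stems and affix groups combine freely.
    """
    for i in range(1, len(word) + 1):
        if word[:i] in stems and word[i:] in affix_groups:
            return True
    return False


# Toy placeholder data, NOT taken from the project's word list.
TOY_WORDS = [
    SegmentedWord("amu", "n"),
    SegmentedWord("amu", "lan"),
    SegmentedWord("kim", "n"),
]
STEMS, AFFIX_GROUPS = build_tables(TOY_WORDS)

for candidate in ("amun", "kimlan", "amux"):
    verdict = "accepted" if is_accepted(candidate, STEMS, AFFIX_GROUPS) else "flagged"
    print(f"{candidate}: {verdict}")
```

Under these assumptions the checker accepts combinations that never occur together in the word list, such as kim + lan. Restricting each stem to the affix groups attested for it would be stricter, at the cost of flagging more valid but unseen forms.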