Data collection and analysis of Mapudungun morphology for spelling correction

Lori Levin, Rodolfo Vega, Christian Monson, Ralf Brown, Ariadna Font Llitjos, Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University

Eliseo Cañulef, Rosendo Huisca
Instituto de Estudios Indígenas, Universidad de La Frontera

Carolina Huenchullán, Claudio Millacura
Programa de Educación Intercultural Bilingüe, Ministerio de Educación de Chile

This paper describes part of a three-year collaboration between Carnegie Mellon University's Language Technologies Institute, the Programa de Educación Intercultural Bilingüe of the Chilean Ministry of Education, and Universidad de La Frontera (Temuco, Chile). A previous paper (Levin et al., 2001) gave an overview of the project. Here we focus on the preparation of corpora and lexica that will support a spelling corrector for Mapudungun.

Mapudungun is the language of over 900,000 Mapuche people in Chile and Argentina. The language is polysynthetic with noun incorporation. A verb consists of a stem plus .....

Our project has both scientific and social significance. Its scientific novelty lies in the application of computational tools (such as morphological analysis, Example-Based Machine Translation, and Transfer-Based MT) to a polysynthetic language. We are also working on new techniques for automatically learning transfer rules from word-aligned bilingual data. (!!!!!!!!!CITE!!!!!!!!!!!!!!)

The social significance of the project stems from the Chilean Ministry of Education's commitment to bilingual education in Spanish and Mapudungun for Mapuche children. Computer-based tools are a welcome part of the bilingual education program. Chile's ENLACES project, an electronic network for education, provides computers and networking to all Chilean schools, even those in rural areas.
The CMU-Chile project, {\sc Avenue}-Mapudungun, is planning two tools for the near future: an on-line lexicon with examples of usage from the corpus of spoken Mapudungun, and a spelling checker for Mapudungun based on MYSPELL??? Details of the corpora and lexica follow.

The Corpus of Spoken Mapudungun: Over the last three years, the Chilean Ministry of Education and CMU's {\sc Avenue} project (!!!!CITATIONS!!!!) have supported the collection of 170 hours of spoken Mapudungun. The recordings (all on the topic of health care) have been transcribed and translated into Spanish at the Instituto de Estudios Indígenas at Universidad de La Frontera. The corpus covers three dialects of Mapudungun: 120 hours of Nguluche, 30 hours of Lafkenche, and 20 hours of Pewenche. The corpus is described in more detail in Levin et al. (2001).

Full Form Word List: The 80,000 most frequent full form Mapudungun words (stem plus inflections) were extracted from the corpus of spoken Mapudungun. These words were checked for spelling against conventions devised by Mapuche linguists at Universidad de La Frontera. (There is not yet a universally accepted orthography for Mapudungun.)

Bilingual Lexicon: Each entry in the bilingual lexicon consists of a full form Mapudungun word, a segmentation of the word into morphemes, a gloss for each morpheme, a Spanish translation of the word, a Mapudungun sentence from the corpus of spoken Mapudungun that contains the word, and a Spanish translation of that sentence. Currently, there are about ??????? words in the lexicon. The lexicon is stored in a very general text-only format that can be reconfigured for any computer-based lexicon interface.
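
To make the two data preparation steps above more concrete, the following Python sketch shows one way a frequency-ranked full form word list could be extracted from transcripts, and one way a six-field lexicon entry could be stored as plain text. It is an illustration under stated assumptions, not the project's actual tooling: the function `most_frequent_full_forms`, the class `LexiconEntry`, the tab-separated layout, and the `+` separator between morphemes are all hypothetical, and the strings in the usage lines are placeholders rather than real Mapudungun data.

```python
import re
from collections import Counter
from dataclasses import dataclass


def most_frequent_full_forms(transcripts, limit):
    """Return the `limit` most frequent word forms (hypothetical helper).

    Assumes transcripts are plain strings and that a word is any run of
    letters; the real corpus may need orthography-specific tokenization.
    """
    counts = Counter()
    for line in transcripts:
        # Unicode-aware: letters only, no digits or underscores.
        counts.update(re.findall(r"[^\W\d_]+", line.lower()))
    return [word for word, _ in counts.most_common(limit)]


@dataclass
class LexiconEntry:
    """One bilingual lexicon entry with the six fields described above."""
    full_form: str          # full form Mapudungun word
    morphemes: list         # segmentation into morphemes
    glosses: list           # one gloss per morpheme
    spanish_word: str       # Spanish translation of the word
    example_sentence: str   # corpus sentence containing the word
    spanish_sentence: str   # Spanish translation of the sentence

    def to_line(self):
        # Assumed layout: tab-separated fields, '+' between morphemes.
        return "\t".join([
            self.full_form,
            "+".join(self.morphemes),
            "+".join(self.glosses),
            self.spanish_word,
            self.example_sentence,
            self.spanish_sentence,
        ])

    @classmethod
    def from_line(cls, line):
        form, morphs, glosses, es_word, sent, es_sent = (
            line.rstrip("\n").split("\t")
        )
        return cls(form, morphs.split("+"), glosses.split("+"),
                   es_word, sent, es_sent)


# Placeholder data only; these are not real Mapudungun words.
top = most_frequent_full_forms(["aa bb aa", "cc aa bb"], limit=2)
entry = LexiconEntry("FORM", ["STEM", "SUFFIX"], ["gloss1", "gloss2"],
                     "palabra", "SENTENCE", "oración")
restored = LexiconEntry.from_line(entry.to_line())
```

A line-oriented layout of this kind has the property the text asks for: each entry can be parsed without special software and loaded into whatever lexicon interface is chosen later.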