Language Technologies Institute
Student Research Symposium 2006

Improving Example Based Machine Translation through Morphological Analysis

Aaron Phillips

Example Based Machine Translation (EBMT) is limited by the quantity and scope of its training data. Even with a reasonably large corpus, we will not have examples that cover everything we want to translate. This problem is especially severe in Arabic due to its rich morphology. Arabic words are formed by combining a stem with affixes that represent information such as number, person, or gender. Due to data sparseness, only the most common words will appear in the corpus in all of their possible inflected forms.

Although the training data may lack an exact phrasal match, one can still derive the proper translation of many phrases by combining information from the corpus with an understanding of Arabic morphology. This talk will demonstrate a method that exploits the regular nature of Arabic morphology to increase the quality and coverage of machine translation. This method consists of two main parts: generalization and filtering.

First, the system generalizes each Arabic word by clustering words with similar meanings. As the stem usually contains the meaning of an Arabic word, this could be done simply by removing all affixes. However, Arabic text is usually written without vowel markings, making the stem of a word difficult to determine without the semantic context. Therefore, the clustering is done by analyzing all possible stems of a word in conjunction with how frequently each stem occurs in the Arabic treebank.
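
The generalization step can be illustrated with a minimal sketch. It assumes a rule the abstract does not spell out: each word is assigned to whichever of its candidate stems is most frequent. The tables CANDIDATE_STEMS and STEM_COUNTS, the transliterated strings, and both function names are hypothetical toy stand-ins for a real morphological analyzer and for treebank counts.

```python
from collections import defaultdict

# Hypothetical candidate analyses: surface word -> possible stems.
# Toy transliterated strings, not real analyzer output.
CANDIDATE_STEMS = {
    "wktb": ["ktb", "wktb"],
    "ktbt": ["ktb"],
    "yktbwn": ["ktb"],
    "wld": ["wld", "ld"],
}

# Hypothetical stem counts standing in for treebank frequencies.
STEM_COUNTS = {"ktb": 120, "wktb": 0, "wld": 45, "ld": 2}


def pick_stem(word, candidates, counts):
    """Return the candidate stem with the highest count (assumed selection rule)."""
    stems = candidates.get(word, [word])  # unanalyzed words fall back to themselves
    return max(stems, key=lambda stem: counts.get(stem, 0))


def cluster_words(words, candidates, counts):
    """Group surface words under the stem chosen for each of them."""
    clusters = defaultdict(list)
    for word in words:
        clusters[pick_stem(word, candidates, counts)].append(word)
    return dict(clusters)


clusters = cluster_words(["wktb", "ktbt", "yktbwn", "wld"], CANDIDATE_STEMS, STEM_COUNTS)
```

In this toy setup the three forms that share the stem "ktb" fall into one cluster, so a corpus example containing any one of them becomes a candidate match for the others.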

At runtime, the EBMT engine searches for Arabic-English phrases using the clusters described above. When a matching Arabic phrase is found, the morphological features of the source text are compared with those of the match. If the stems or morphological features are incompatible, the match is rejected. The features do not have to be identical to be compatible, as some features do not affect the English translation. Differences that do alter the English translation are also allowed if the required change is easily defined; in that case, the system dynamically alters the English text to reflect it.
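
The filtering step might be sketched as follows. This is only an illustration under stated assumptions: the choice of gender as a feature that leaves the English unchanged, the choice of number as a feature with an easily defined rewrite, the naive pluralization rule, and the names IGNORABLE, REWRITES, and adapt_match are all hypothetical and not taken from the system itself.

```python
# Assumption: features listed here never change the English output in this toy setup.
IGNORABLE = {"gender"}

# Assumption: (feature, value in the match, value in the source) -> English rewrite.
# The pluralization rules are deliberately naive.
REWRITES = {
    ("number", "sg", "pl"): lambda english: english + "s",
    ("number", "pl", "sg"): lambda english: english[:-1] if english.endswith("s") else english,
}


def adapt_match(source, match, english):
    """Return the (possibly rewritten) English text, or None if the match is rejected."""
    if source["stem"] != match["stem"]:
        return None  # incompatible stems
    features = (set(source) | set(match)) - {"stem"}
    for feature in sorted(features):
        wanted, found = source.get(feature), match.get(feature)
        if wanted == found or feature in IGNORABLE:
            continue  # identical, or assumed not to affect the English
        rewrite = REWRITES.get((feature, found, wanted))
        if rewrite is None:
            return None  # no easily defined change: reject the match
        english = rewrite(english)
    return english


source = {"stem": "mdrs", "number": "pl", "gender": "m"}
match = {"stem": "mdrs", "number": "sg", "gender": "f"}
adapted = adapt_match(source, match, "teacher")
```

Here the gender mismatch is ignored, the number mismatch triggers the plural rewrite, and any other mismatch (a different stem, or a feature with no defined rewrite) causes the match to be rejected.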

As a result of this work, the EBMT system can now translate Arabic phrases it has never seen before by drawing on morphologically similar phrases. This effectively extends the coverage of the training corpus, which in turn increases translation accuracy. Preliminary experiments have shown an increase in BLEU scores even when using a large training corpus (~1.4 million sentence pairs).

[1] Phillips, Aaron B. and Violetta Cavalli-Sforza. "Arabic-to-English Example Based Machine Translation Using Context-Insensitive Morphological Analysis." Journées d'Etudes sur le Traitement Automatique de la Langue Arabe (JETALA), Rabat, Morocco, June 2006.