Improving Example Based Machine Translation through Morphological Analysis
Example Based Machine Translation (EBMT) is limited by the quantity and scope of
its training data. Even with a reasonably large corpus, we will not have
examples that cover everything we want to translate. This problem is especially
severe in Arabic due to its rich morphology. Arabic words are formed by
combining a stem with affixes that represent information such as number, person,
or gender. Due to data sparseness, the corpus will contain all possible
conjugations of only the most common words.
Although the training data may lack an exact phrasal match, one can still derive
the proper translation of many phrases by combining information from the corpus
with an understanding of Arabic morphology. This talk will demonstrate a method
that exploits the regular nature of Arabic morphology to increase the quality
and coverage of machine translation. This method consists of two main parts:
generalization and filtering.
First, the system generalizes each Arabic word by clustering words with similar
meanings. As the stem usually contains the meaning of an Arabic word, this could
be done simply by removing all affixes. However, Arabic text is usually written
without vowel markings, making the stem of a word difficult to determine without
the semantic context. Therefore, the clustering is done by analyzing all
possible stems of a word in conjunction with how frequently each stem occurs in
the Arabic treebank.
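The generalization step can be illustrated with a minimal Python sketch. It assumes each word's candidate stems come from some morphological analyzer and that stem counts come from a treebank; the transliterated words, the candidate lists, the counts, and the function names below are all made up for illustration and are not taken from the system described here. The sketch simply assigns each word to its most frequent candidate stem and groups words that share that stem.

```python
from collections import defaultdict

# Hypothetical analyzer output: every stem each surface form could have.
CANDIDATE_STEMS = {
    "wktb": ["ktb", "wktb"],
    "ktbt": ["ktb", "ktbt"],
    "yktbwn": ["ktb"],
    "drst": ["drs"],
}

# Hypothetical stem counts standing in for treebank frequencies.
STEM_COUNTS = {"ktb": 120, "wktb": 1, "ktbt": 2, "drs": 40}


def pick_stem(word, candidates, counts):
    """Return the candidate stem with the highest count.

    Words with no known analysis fall back to the surface form itself.
    """
    stems = candidates.get(word, [])
    if not stems:
        return word
    return max(stems, key=lambda stem: counts.get(stem, 0))


def cluster_words(words, candidates, counts):
    """Group words by their most frequent candidate stem."""
    clusters = defaultdict(list)
    for word in words:
        clusters[pick_stem(word, candidates, counts)].append(word)
    return dict(clusters)


clusters = cluster_words(["wktb", "ktbt", "yktbwn", "drst"],
                         CANDIDATE_STEMS, STEM_COUNTS)
```

Because the choice depends only on corpus-wide counts, the same word always lands in the same cluster regardless of the sentence it appears in.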
At runtime, the EBMT engine searches for Arabic-English phrases using the
clusters described above. When a matching Arabic phrase is found, the
morphological features of the source text are compared to the match. If the stem
or morphological features are incompatible, we reject the match. The
morphological features do not have to be identical in order to be compatible as
some features will not affect the English translation. Additionally, a
difference that does alter the English translation is allowed if the
corresponding change is easily defined. In that case, the system dynamically
alters the English text to reflect the change.
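The filtering step can be sketched in Python as follows. This is an illustrative sketch only: the feature names, the choice of gender as a feature that does not affect the English, the choice of number as an easily defined change, and the naive "add or remove a final s" rule are all assumptions made for the example, not details of the system described here.

```python
# Assumed for illustration: features whose mismatch leaves the English unchanged.
IGNORABLE = {"gender"}
# Assumed for illustration: features whose mismatch has a simple English rewrite.
REWRITABLE = {"number"}


def compare_features(source, match):
    """Return (compatible, changes) for a source word and a corpus match.

    Each change is a (feature, value_in_match, value_in_source) tuple.
    """
    if source["stem"] != match["stem"]:
        return False, []
    changes = []
    for feat in sorted((set(source) | set(match)) - {"stem"}):
        src_val, match_val = source.get(feat), match.get(feat)
        if src_val == match_val or feat in IGNORABLE:
            continue
        if feat not in REWRITABLE:
            return False, []  # incompatible difference: reject the match
        changes.append((feat, match_val, src_val))
    return True, changes


def adapt_english(english, changes):
    """Apply a deliberately naive rewrite of the English side."""
    for feat, old, new in changes:
        if feat == "number" and (old, new) == ("sg", "pl"):
            english += "s"
        elif feat == "number" and (old, new) == ("pl", "sg") and english.endswith("s"):
            english = english[:-1]
    return english


def translate_with_match(source, match, match_english):
    """Return the adapted English text, or None if the match is rejected."""
    ok, changes = compare_features(source, match)
    return adapt_english(match_english, changes) if ok else None


match = {"stem": "ktAb", "number": "sg", "gender": "m"}
plural_source = {"stem": "ktAb", "number": "pl", "gender": "m"}
result = translate_with_match(plural_source, match, "book")
```

Here the plural source word reuses the singular corpus match and the English side is pluralized, while a mismatch on any feature outside the two assumed sets causes the match to be rejected.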
As a result of this work, the EBMT system is now able to effectively translate
Arabic phrases it has never seen before based on morphologically similar
phrases. This, in effect, extends the coverage of the training corpus, which
also increases translation accuracy. Preliminary experiments have shown an
increase in BLEU scores even when using a large training corpus (~1.4 million
 Phillips, Aaron B. and Violetta Cavalli-Sforza. "Arabic-to-English Example
Based Machine Translation Using Context-Insensitive Morphological Analysis."
Journées d'Etudes sur le Traitement Automatique de la Langue Arabe (JETALA),
Rabat, Morocco, June 2006.