Language Technologies Institute
Student Research Symposium 2006

Translating Words You've Never Seen

Nguyen Bach with Bing Zhao, Stephan Vogel, and Ian Lane

Cross-lingual data-mining tasks, such as information retrieval, question answering, query translation, and web-document machine translation (e.g., Google translation), are becoming increasingly necessary in multilingual environments. However, current state-of-the-art statistical machine translation (SMT) systems are inadequate because they cannot translate named entities that did not appear during training. New named entities, including person, organization, and location names, continually emerge on the World Wide Web; to realize effective cross-lingual data-mining applications, handling these unknown named entities is crucial.

Named entities (NEs) can typically be translated by transliteration from the source to the target language. Transliteration maps symbols from one writing system to another: letters of the source language are typically mapped to letters of the target language with similar pronunciation. Transliteration between languages that share similar alphabets and sound systems is usually easy, since letters largely remain the same. The task is significantly more difficult, however, when the language pairs differ considerably, for example English and Arabic, English and Chinese, or English and Japanese. This work focuses on transliteration between Arabic and English.
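To make the core difficulty concrete, the following toy sketch (not the authors' system; the letter map and example word are illustrative) transliterates an Arabic word by mapping each letter to an English letter with similar pronunciation:

```python
# Toy letter-level transliteration: map each Arabic letter independently
# to a roughly similar-sounding English letter. This illustrates why the
# task is hard for Arabic-English: short vowels are unwritten in Arabic,
# so a naive letter map produces vowel-less output.
LETTER_MAP = {
    "م": "m",  # miim
    "ح": "h",  # haa
    "د": "d",  # daal
    "ل": "l",  # laam
    "ي": "i",  # yaa
}

def naive_transliterate(word: str) -> str:
    """Map each source letter independently; unknown letters pass through."""
    return "".join(LETTER_MAP.get(ch, ch) for ch in word)

print(naive_transliterate("محمد"))  # -> "mhmd", not "muhammad"
```

The missing vowels and many-to-many letter correspondences are exactly what a learned, context-sensitive alignment model must recover.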

In our proposed approach, we extend SMT-based approaches by incorporating novel alignment technologies specialized for letter-based transliteration at two levels. First, we propose a bi-stream HMM that incorporates letter clusters, to better model vowel and non-vowel transliteration, and position information, to improve letter-level alignment. Second, building on the letter alignment, we propose letter n-gram alignment models (blocks) that automatically learn mappings from source letter n-grams to target letter n-grams. We explore several informative features specific to transliteration and apply a log-linear model to combine them and learn block-level transliteration pairs from the training data. Finally, applying a spelling checker based on statistics returned by web search engines further improves transliteration accuracy.
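The log-linear combination step can be sketched as follows; the feature names, values, and weights below are purely illustrative assumptions, not the features or weights used in the actual system:

```python
import math

def loglinear_score(features: dict, weights: dict) -> float:
    """Log-linear model: score = sum_k lambda_k * h_k(pair).

    Each feature h_k scores one aspect of a candidate
    (source n-gram, target n-gram) block pair; the weights
    lambda_k would be tuned on held-out data.
    """
    return sum(weights[k] * features[k] for k in features)

# Hypothetical feature values for one candidate block pair.
features = {
    "log_p_tgt_given_src": math.log(0.4),  # forward translation probability
    "log_p_src_given_tgt": math.log(0.3),  # inverse translation probability
    "length_ratio":        1.0,            # source/target length similarity
}
weights = {
    "log_p_tgt_given_src": 1.0,
    "log_p_src_given_tgt": 0.5,
    "length_ratio":        0.2,
}

score = loglinear_score(features, weights)
```

Candidate block pairs would then be ranked by this score, and the highest-scoring pairs retained as learned transliteration blocks.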

The proposed transliteration framework obtains a significant improvement over a strong baseline transliteration approach. Our framework is general and can easily be configured for other language pairs. In experiments on a blind test set, the new approach achieves 52% accuracy for the 1-best hypothesis; accuracy rises to 66% in the 5-best case and 72.16% in the 10-best case.
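The n-best accuracy metric used above can be sketched as follows; the hypothesis lists and references here are made-up examples, not data from the actual evaluation:

```python
# n-best accuracy: a test item counts as correct if the reference
# transliteration appears anywhere in the system's top-n hypotheses.
def nbest_accuracy(hypotheses, references, n):
    """hypotheses: one ranked candidate list per item; references: gold strings."""
    correct = sum(ref in hyps[:n] for hyps, ref in zip(hypotheses, references))
    return correct / len(references)

# Illustrative ranked outputs for three test items.
hyps = [["muhammad", "mohamed", "muhamad"],
        ["aly", "ali", "alee"],
        ["kareem", "karim", "karem"]]
refs = ["muhammad", "ali", "karim"]

print(nbest_accuracy(hyps, refs, 1))  # 1-best: 1/3 correct
print(nbest_accuracy(hyps, refs, 3))  # 3-best: 3/3 correct
```

By construction the metric is monotone in n, which is why the reported 5-best and 10-best accuracies exceed the 1-best figure.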