Translating Words You've Never Seen
Nguyen Bach with Bing Zhao, Stephan Vogel, and Ian Lane
Cross-lingual data-mining tasks, such as information retrieval, question
answering, query translation, and web-document machine translation (e.g. Google
translation), are becoming increasingly necessary in multilingual
environments. However, current state-of-the-art statistical machine translation
(SMT) systems are inadequate because they cannot translate named entities that
did not appear during training. New named entities, including person,
organization, and location names, are continually emerging on the World Wide
Web; to realize effective cross-language data-mining applications, handling
these unknown named entities is crucial.
Named entities (NEs) can typically be translated by performing transliteration
from source to target language. Transliteration involves mapping symbols from
one writing system to another. Letters of the source language are typically
mapped to letters of the target language with similar pronunciation.
Transliteration between languages that share similar alphabets and sound
systems is usually easy, since letters generally remain the same. However, the
task is significantly more difficult when the two languages differ
considerably, for example English and Arabic, English and Chinese, or English
and Japanese. This work focuses on transliteration between Arabic and English.
Our approach extends SMT-based transliteration with novel alignment techniques
specialized for letter-based transliteration at two levels. First, we propose a
bi-stream HMM that incorporates letter clusters, to better model vowel and
non-vowel transliteration, and position information, to improve letter-level
alignment. Second, building on the letter alignment, we propose letter n-gram
alignment models (blocks) that automatically learn mappings from source letter
n-grams to target letter n-grams. We explore a few informative features
specific to transliteration and apply a log-linear model to combine them,
learning block-level transliteration pairs from training data. Finally,
applying a spelling checker based on statistics returned from web search
engines further improves transliteration accuracy.
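
As a rough illustration of how a log-linear model can combine several features
to rank candidate target n-grams for one source n-gram, consider the minimal
sketch below. The feature names, weights, probabilities, and candidates are
all hypothetical and chosen only for illustration; they are not the features or
values used in this work.

```python
import math

# Hypothetical feature weights; in practice these would be tuned on held-out data.
WEIGHTS = {
    "log_p_tgt_given_src": 1.0,   # log P(target n-gram | source n-gram)
    "log_p_src_given_tgt": 0.5,   # log P(source n-gram | target n-gram)
    "length_diff": -0.2,          # penalty on the length difference
}


def loglinear_score(features, weights=WEIGHTS):
    """Return the weighted sum of feature values (a log-linear model score)."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def rank_blocks(candidates, weights=WEIGHTS):
    """Sort (target_ngram, features) pairs from best to worst score."""
    return sorted(
        candidates,
        key=lambda cand: loglinear_score(cand[1], weights),
        reverse=True,
    )


# Hypothetical candidates for a single source letter n-gram (made-up numbers).
candidates = [
    ("s", {"log_p_tgt_given_src": math.log(0.1),
           "log_p_src_given_tgt": math.log(0.05),
           "length_diff": 0}),
    ("sh", {"log_p_tgt_given_src": math.log(0.6),
            "log_p_src_given_tgt": math.log(0.5),
            "length_diff": 1}),
    ("ch", {"log_p_tgt_given_src": math.log(0.3),
            "log_p_src_given_tgt": math.log(0.2),
            "length_diff": 1}),
]

ranked = [target for target, _ in rank_blocks(candidates)]
print(ranked)
```

Because the score is a weighted sum in the log domain, adding a further feature
only requires a new entry in the weight table and in each candidate's feature
dictionary.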
The proposed transliteration framework yields a significant improvement over a
strong baseline transliteration approach. The framework is general and can
easily be configured for other language pairs. In experiments on a blind test
set, the new approach achieves 52% accuracy for the 1-best hypothesis. In the
5-best and 10-best cases, the system reaches accuracies of 66% and 72.16%,
respectively.
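
The n-best accuracies above can be read as the fraction of test names whose
reference transliteration appears among the top n system hypotheses. The sketch
below assumes that definition (exact string match against a single reference);
the function name and the toy data are hypothetical and are not taken from the
actual test set.

```python
def n_best_accuracy(hypotheses, references, n):
    """Fraction of items whose reference is among the top-n hypotheses.

    hypotheses: list of ranked hypothesis lists, best first, one per test item.
    references: list of reference strings, one per test item.
    """
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        1 for hyps, ref in zip(hypotheses, references) if ref in hyps[:n]
    )
    return hits / len(references)


# Toy data, invented for illustration only.
toy_hypotheses = [
    ["mohammed", "muhammad"],
    ["ahmad", "ahmed"],
    ["khalid", "khaled"],
]
toy_references = ["muhammad", "ahmed", "halid"]

acc_1 = n_best_accuracy(toy_hypotheses, toy_references, 1)
acc_2 = n_best_accuracy(toy_hypotheses, toy_references, 2)
print(acc_1, acc_2)
```

Under this definition, accuracy can only stay the same or increase as n grows,
which matches the trend from the 1-best to the 10-best figures.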