Generalized Example-Based Machine Translation

9. Applications of the Component Technologies

Generalized EBMT is useful not only as a stand-alone system for translating text; the system (and in some cases one of its underlying components) is also useful for other applications. Two that have already been implemented are speech-to-speech translation and cross-language information retrieval.

Speech Translation

A previous page alluded to using the EBMT system to translate a conversation in real time, and that is precisely how EBMT (as part of a multi-engine translation system) is used by the DIPLOMAT project. DIPLOMAT has used EBMT for translation between English and Croatian, Haitian Creole, Korean, and Spanish.

In the DIPLOMAT system, the Sphinx-II continuous speech recognizer is used to transcribe the user's spoken utterance into text. This text, after an opportunity to correct recognition errors, is translated and then synthesized in the other language using the Phonebox concatenative speech synthesizer developed at CMU.

DIPLOMAT is a bi-directional system for translating a conversation, so it uses two copies of the translator software, one for each direction. This not only permits the translation of each conversant's speech, but also lets us provide a back-translation, in which the system's output is translated back into the original language. If the back-translation correctly conveys the meaning of the original input, we have much greater confidence that the system translated the input correctly.

Cross-Language Retrieval

One increasingly important area of research in recent years has been cross-language information retrieval, where a query in one language is used to find documents in another language. Perhaps the most common way of crossing the language barrier is to translate the query (translating the entire document collection is usually impractical). But since queries tend to consist of isolated words with an occasional short phrase, rather than a full sentence or paragraph, full-blown machine translation isn't applicable to translating queries.

A common method of translating the query is to look up each word in a bilingual dictionary and replace it with every possible translation listed for that word. This produces a new query in the other language, which can then be used with standard monolingual retrieval systems. Statistically-generated dictionaries can be used with this method simply by replacing the general-purpose dictionary with the statistical one, but it is possible to do even better. In addition to being attuned to the actual usage of words in the training corpus (and thus sometimes listing translations which would not appear in a general-purpose dictionary), a statistically-generated dictionary also contains frequency information -- which we can use to give greater importance to the more common translations.
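
The word-by-word translation with frequency weighting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's actual implementation: the name `translate_query`, the toy English-Spanish entries and their counts, the normalization of counts into per-word weights, and the decision to pass unknown words through unchanged are all hypothetical.

```python
from collections import defaultdict

# Hypothetical statistical dictionary: source word -> {translation: count}.
# The entries and counts are invented for illustration.
STAT_DICT = {
    "bank": {"banco": 6, "orilla": 2},
    "account": {"cuenta": 4},
}

def translate_query(query, dictionary):
    """Replace each query word with every listed translation,
    weighting each translation by its relative frequency."""
    weighted = defaultdict(float)
    for word in query.lower().split():
        translations = dictionary.get(word)
        if not translations:
            # Assumption: words missing from the dictionary (e.g. names)
            # are kept as-is with full weight.
            weighted[word] += 1.0
            continue
        total = sum(translations.values())
        for target, count in translations.items():
            weighted[target] += count / total
    return dict(weighted)

result = translate_query("bank account rates", STAT_DICT)
for term, weight in sorted(result.items()):
    print(f"{term}\t{weight:.2f}")
```

The weighted terms would then be handed to a monolingual retrieval system that accepts per-term weights; with a general-purpose dictionary, which has no counts, every translation would simply receive the same weight.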

Now, one drawback of a statistically-generated dictionary is that it contains a lot of erroneous "translations", but it turns out that these "errors" are actually beneficial for information retrieval. Since the process of generating the dictionary picks words in the other language that are highly correlated with the source-language word, the incorrect translations will be terms that provide a useful expansion of the query (for example, the dictionary may translate "Hillary" as both "Hillary" and "Clinton").
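
To see why correlation-based extraction yields such expansion terms, consider the toy sketch below. It is not the dictionary-building method used by the EBMT system; it assumes sentence-level co-occurrence counts scored with the Dice coefficient as a stand-in correlation measure, and the corpus, the name `build_dictionary`, and the threshold of 0.8 are invented for illustration.

```python
from collections import Counter
from itertools import product

# Invented sentence-aligned toy corpus (source, target); illustration only.
PAIRS = [
    ("hillary visits", "hillary clinton visita"),
    ("hillary travels", "hillary clinton viaja"),
    ("the senator travels", "el senador viaja"),
]

def build_dictionary(pairs, threshold=0.8):
    """Keep every target word whose Dice score with a source word
    reaches the threshold (a stand-in for 'highly correlated')."""
    src_freq, tgt_freq, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)   # sentences containing each source word
        tgt_freq.update(tgt_words)   # sentences containing each target word
        cooc.update(product(src_words, tgt_words))  # joint occurrences
    dictionary = {}
    for (s, t), n in cooc.items():
        dice = 2 * n / (src_freq[s] + tgt_freq[t])
        if dice >= threshold:
            dictionary.setdefault(s, {})[t] = dice
    return dictionary

entries = build_dictionary(PAIRS)
print(entries["hillary"])
```

In this corpus "clinton" appears in exactly the sentences where "hillary" does, so it is listed as a "translation" of "hillary" alongside the correct one. It is wrong as a translation, but it is a relevant term to add to a query.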

For full details on applying statistical dictionaries to Cross-Language Retrieval, see the paper "Automatically-Extracted Thesauri for Cross-Language IR".

(Last updated 04-Aug-99)