
Machine Translation

Again because of the requirement for rapid development, we preferred data-driven approaches. We therefore used a Multi-Engine MT (MEMT) system [7] whose primary engines were an Example-Based MT (EBMT) engine [4] and a bilingual dictionary/glossary. Carnegie Mellon's EBMT system takes a "shallower" approach than many other EBMT systems: the examples to be used are selected by string matching together with inflectional and other heuristics, with no deep structural analysis. The MEMT architecture uses a trigram language model of the output language to select among competing partial translations produced by several engines. In this system, that selection is applied primarily to competing (and possibly overlapping) EBMT translation hypotheses.
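As a rough illustration of language-model-driven selection among overlapping partial translations, the following Python sketch enumerates every sequence of hypotheses that tiles the source sentence and keeps the output that a trigram scorer rates highest. It is not the MEMT decoder itself: the exhaustive search, the names `best_cover` and `toy_logprob`, the hypothesis format, and the toy scores are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical format: (start, end, target_words) is a partial translation
# covering source word positions [start, end).
Hypothesis = Tuple[int, int, Tuple[str, ...]]
# Scorer returns log P(w | u, v) for the trigram (u, v, w).
Scorer = Callable[[str, str, str], float]


def sentence_logprob(words: List[str], logprob: Scorer) -> float:
    """Sum trigram log-probabilities over a padded output sentence."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    return sum(
        logprob(padded[i - 2], padded[i - 1], padded[i])
        for i in range(2, len(padded))
    )


def best_cover(n: int, hyps: List[Hypothesis], logprob: Scorer):
    """Try every left-to-right tiling of [0, n) and return (score, words)."""
    best = (float("-inf"), [])

    def extend(pos: int, words: List[str]) -> None:
        nonlocal best
        if pos == n:
            score = sentence_logprob(words, logprob)
            if score > best[0]:
                best = (score, words)
            return
        for start, end, target in hyps:
            # Only hypotheses that begin exactly where coverage stops.
            if start == pos and end > pos:
                extend(end, words + list(target))

    extend(0, [])
    return best


# Toy scorer (an assumption): trigrams in GOOD are likely, all others unlikely.
GOOD = {
    ("<s>", "<s>", "the"), ("<s>", "the", "chaplain"),
    ("the", "chaplain", "is"), ("chaplain", "is", "here"),
    ("is", "here", "</s>"),
}


def toy_logprob(u: str, v: str, w: str) -> float:
    return -1.0 if (u, v, w) in GOOD else -5.0


# Competing, overlapping hypotheses for a three-word source sentence.
HYPS: List[Hypothesis] = [
    (0, 1, ("the",)), (0, 2, ("the", "chaplain")),
    (1, 2, ("priest",)), (1, 3, ("chaplain", "is", "here")),
    (2, 3, ("is", "here")), (2, 3, ("here", "is")),
]

score, words = best_cover(3, HYPS, toy_logprob)
print(score, " ".join(words))
```

Exhaustive enumeration is only practical for toy inputs; a real decoder would need dynamic programming or beam search over the same hypothesis lattice.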

[Figure: EBMT Architecture]

For translation into Croatian, we incorporated a finite-state word reordering mechanism, applied during the language model-driven selection of partial translations, to place clitics in a cluster in the appropriate location. (Croatian syntax requires a very specific ordering of clitics in a cluster in a specific position in the sentence.)

The training corpus for the EBMT engine consisted of the translated chaplain dialogs plus pre-existing parallel text from the DIPLOMAT project [6] and newly acquired parallel text from the web. The dictionary/glossary engine used both statistically extracted translations and manually created entries. The English trigram model already existed and had been generated from newswire and broadcast news transcripts. Finally, the Croatian trigram model was built from the Croatian half of the EBMT corpus, some Croatian text found on the web, and the full text of some sixty novels and other Croatian literary works (in total, approximately six million words).
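To make concrete what building a trigram model from monolingual text involves, here is a minimal Python sketch that counts trigrams in tokenized sentences and returns a smoothed log-probability function. The add-one smoothing, the function name `train_trigram`, and the two-sentence toy corpus are assumptions for illustration; the text does not say which estimation or smoothing method was actually used.

```python
import math
from collections import Counter
from typing import Callable, List


def train_trigram(sentences: List[List[str]]) -> Callable[[str, str, str], float]:
    """Estimate log P(w | u, v) from tokenized sentences (add-one smoothing)."""
    trigrams, contexts = Counter(), Counter()
    vocab = {"</s>"}  # the end marker is a predictable token too
    for sentence in sentences:
        padded = ["<s>", "<s>"] + sentence + ["</s>"]
        vocab.update(sentence)
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i - 2:i + 1])] += 1
            contexts[tuple(padded[i - 2:i])] += 1
    vocab_size = len(vocab)

    def logprob(u: str, v: str, w: str) -> float:
        # Add-one smoothing (an assumption) keeps unseen trigrams above zero.
        return math.log((trigrams[(u, v, w)] + 1) / (contexts[(u, v)] + vocab_size))

    return logprob


# Toy placeholder corpus; the real model used about six million words.
corpus = [["a", "b"], ["a", "c"]]
lp = train_trigram(corpus)
print(lp("<s>", "<s>", "a"), lp("<s>", "a", "b"), lp("<s>", "a", "x"))
```

The returned function has the `(u, v, w)` shape that a language-model-driven selection step could call to score candidate output.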


Alan W Black 2002-06-18