@make(article)
@style(fontfamily=timesroman, size=11, spacing=1.2)
@style(topmargin=1.5inch, bottommargin=1.5inch)
@modify(hd3, below=.6, above=1)
@modify(hd2, below=.6, above=1)
@modify(hdx, below=.6, above=1)
@modify(Itemize, rightmargin=0)
@pageheading(immediate)

@begin(center)
System Demonstration

@heading(THE PANGLOSS-LITE MACHINE TRANSLATION SYSTEM)

Robert E. Frederking and Ralf D. Brown
Center for Machine Translation
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3890 USA
@end(center)

@begin(format)
@tabclear()
@tabset(0.75inch)
@b[System builders and contacts:] Same; ref+@@cs.cmu.edu, ralf+@@cs.cmu.edu
@b[System category:] Research vehicle
@b[System characteristics:] 4-8 seconds per spoken sentence, multiple or unrestricted domains
@b[Resources:] See Figure 2-1.
@b[Hardware and software:] Sun Sparcstations w/SunOS; Intel PCs w/ Microsoft Windows NT or 95
@b[Functionality description:] Employed in DIPLOMAT rapid-deployment speech-to-speech MT system;
@\bidirectional in both Spanish/English and Serbo-Croatian/English; soon Korean/English
@b[System internals:] Based on multi-engine MT, EBMT, glossaries, and statistical language modelling
@end(format)

@section(Pangloss-Lite Overview)

The Pangloss-Lite (PanLite) machine translation system is a standalone C++ re-implementation of several major components from the Pangloss machine translation system [Nirenburg et al. 95]@foot[Pangloss was a joint project between three sites: the Computing Research Laboratory of New Mexico State University, the Information Sciences Institute of the University of Southern California, and the Center for Machine Translation of Carnegie Mellon University. It was funded by the U.S. Department of Defense.].
It incorporates the Pangloss Example-Based MT (EBMT) [Brown 96a] and Transfer-Based MT engines, and its statistical language modeller [Brown and Frederking 95], as well as a newly-implemented morphological analyzer, within the multi-engine MT architecture [Frederking and Nirenburg 94] developed during the course of the project. Due to improved design and the C++ implementation, PanLite runs very quickly. For example, the EBMT engine formerly required several minutes to translate a typical newswire sentence; it now requires about 15 seconds (and this with a much larger corpus). More details on performance are presented in section 2 below.

To allow its use in the widest variety of applications, PanLite has been designed to translate strings provided either on the standard input or via network sockets, and to produce as output either the best composite string or the full chart of scored translated segments. The latter is necessary, for example, when the output will be supplied to an external graphical user interface (GUI) for post-editing.

PanLite has already been included as the MT component of the prototype DIPLOMAT rapid-deployment speech-to-speech translation system (see section 3, below). A potential future application of PanLite is as a World Wide Web translation server.

@subsection(Multi-Engine Machine Translation)

The overall organization of PanLite is shown in Figure 1-1. PanLite employs a multi-engine MT architecture [Frederking and Nirenburg 94]: several MT engines, each employing a different MT technology, are applied in parallel to each input text. Each engine attempts to translate the entire input text, segmenting each sentence in whatever manner is most appropriate for its technology, and putting the resulting output segments into a shared chart data structure after giving each segment a score indicating the engine's internal assessment of the quality of the output segment.
The output segments are indexed in the chart based on the positions of the corresponding input segments. Since the scores produced by the engines are not very reliable, we use statistical language modelling techniques adapted from speech recognition research to select the best overall set of outputs [Brown and Frederking 95].

@begin(figure)
@begin(enumerate)
Text input via standard input or sockets

Morphological analysis

Translation: results of morphological analysis passed to each MT engine; scored outputs placed into chart

Language modeller selects ``best'' edges, and adds results to chart

Output: either text composed of ``best'' edges or entire chart
@end(enumerate)
@caption(Structure of PanLite)
@end(figure)

In PanLite, the translation engines used are:

@begin(itemize)
@b[EBMT:] EBMT [Brown 96a] uses a sentence-aligned corpus to produce translations. When such a corpus is available, fairly high-quality MT for a new domain is available essentially immediately. EBMT is basically a more sophisticated version of Translation Memory, in that sub-sentential chunks of words are matched, allowing much greater coverage. Sentences that match in full are translated exactly, but sub-sentential chunks are matched with a variety of heuristics, which are reflected in the scores assigned to them. The greatly increased speed of the PanLite C++ implementation has allowed the entirety of the largest available corpora to be indexed and used for EBMT, something that had not been feasible previously.

@b[Transfer-based MT:] This engine employs a very simple, very old technology: bilingual dictionaries and phrasal glossaries are used to translate pieces of source text. While this is a low-quality technique, the simplicity of the technique allows us to quickly and semi-automatically develop large databases, allowing an initial rapid-deployment of an MT system while more sophisticated KBMT engines are developed. Also, any available online bilingual dictionaries can be used immediately.
Scores are currently statically assigned on a per-glossary basis, based on our overall confidence in the particular glossary. An important development in PanLite is the merging of the code implementing glossaries and EBMT, significantly simplifying further software development and maintenance.

@b[Knowledge-Based MT:] Currently the PanLite system does not contain a Knowledge-Based MT (KBMT) engine, although a slot is already present to add one later. To be suitable for integration with the other engines, the KBMT system should preferably produce translations as quality-scored segments of sentences, as the Pangloss KBMT engine does, rather than only full sentences.
@end(itemize)

@subsection(Morphological analysis)

PanLite is designed to use morphological analysis as in its predecessor Pangloss system, to produce stem forms and feature taggings for all the words of the input, before they are passed to the different engines. KBMT requires such analysis, and EBMT and transfer-based MT can also clearly benefit from it. Our group is currently producing a C++ version of the Morphe morphological analyzer [Leavitt 94], iCelos, for use in this system. Pending its completion, its output is augmented using a file containing an indexed list of stems or roots for each source language.

@subsection(Language Modelling)

As mentioned above, we use statistical language modelling to combine the segments produced by different engines, a technique borrowed from speech recognition work. There, acoustic recognizers produce many hypotheses for each word, with scores that are not very accurate. Quality is improved by applying a statistical language model to such results. The model is produced by analyzing large amounts of English text to see what the most probable sequences of words are in English. The model is then used to find the set of choices that produces the sequence most likely to be an English sentence, taking into account the scores of the component words.
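The interaction between the shared chart and the language model can be illustrated with a minimal sketch. It is not the PanLite implementation: the `Edge` record, the `best_translation` function, the toy bigram table standing in for the language model, the score range (0, 1], and the log-linear combination of engine and model scores are all assumptions made for illustration. The search actually used is the one described in [Brown and Frederking 95].

```python
import math
from collections import namedtuple

# Hypothetical chart edge: one engine's output for input words [start, end).
# "score" is the engine's own quality estimate, assumed to lie in (0, 1].
Edge = namedtuple("Edge", "start end text engine score")


def bigram_logprob(prev, word, bigrams, floor=1e-4):
    """Toy language model: log P(word | prev), with a small floor if unseen."""
    return math.log(bigrams.get((prev, word), floor))


def best_translation(chart, n_words, bigrams, lm_weight=1.0):
    """Return the edge sequence covering [0, n_words) with the best total score.

    best[pos] maps the last output word to (score, edges) for the best
    partial translation that covers input words [0, pos) and ends in that word.
    Returns None when no sequence of edges covers the whole input.
    """
    best = [dict() for _ in range(n_words + 1)]
    best[0]["<s>"] = (0.0, [])
    for pos in range(n_words):
        for last, (total, path) in best[pos].items():
            for edge in chart:
                # Only extend with well-formed edges that start where we stopped.
                if edge.start != pos or not pos < edge.end <= n_words:
                    continue
                score = total + math.log(edge.score)
                prev = last
                for word in edge.text.split():
                    score += lm_weight * bigram_logprob(prev, word, bigrams)
                    prev = word
                known = best[edge.end].get(prev)
                if known is None or score > known[0]:
                    best[edge.end][prev] = (score, path + [edge])
    if not best[n_words]:
        return None
    return max(best[n_words].values(), key=lambda entry: entry[0])[1]


# Invented example: the glossary edge has the higher engine score,
# but the language model prefers the word order of the EBMT edge.
chart = [
    Edge(0, 1, "the", "glossary", 0.8),
    Edge(1, 3, "house white", "glossary", 0.8),
    Edge(1, 3, "white house", "ebmt", 0.6),
]
bigrams = {
    ("<s>", "the"): 0.5,
    ("the", "white"): 0.2,
    ("white", "house"): 0.5,
    ("the", "house"): 0.3,
    ("house", "white"): 0.001,
}
result = best_translation(chart, 3, bigrams)
print(" ".join(edge.text for edge in result))  # prints "the white house"
```

Keeping one best entry per input position and final output word is sufficient here only because the toy model looks back a single word; a longer history would need a correspondingly larger state.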
We use a trigram model of the target language, with backoff to bigrams and unigrams. That is to say, we use the probabilities of word triples when we have these available. When the trigram probability is unavailable, we use the probabilities of word pairs or single words. Because of the extremely large number of combinations of segment hypotheses, search becomes necessary, as described in [Brown and Frederking 95].

@section(PanLite System Details)

Currently, versions of PanLite exist for translating unrestricted Spanish to English, Serbo-Croatian to English, English to Spanish, and English to Serbo-Croatian. The code is the same for each version, with just databases and configuration files changing. The sizes of the code and the various databases are presented in Figure 2-1. The FramepaC library [Brown 96b] provides frame-based and Lisp-like data structure capabilities. PanLite currently runs on Sun Sparcstations under SunOS and on Intel processors under Microsoft Windows NT or Windows 95, and the runtime databases are binary-compatible between platforms.

Performance figures for the EBMT system on a Sun Sparcstation LX are illustrative: a sample Spanish newswire text of 15 sentences totalling 414 words and punctuation marks can be translated in just under four minutes (see also Figure 2-2). 20 texts averaging 450 words each, drawn from the ARPA MT evaluations, can be completely processed in about three hours, including dictionary lookups and statistical modeling (that is, all processing except the glossaries).

Indexing the entire 280M Spanish-English EBMT corpus requires approximately 45 minutes on a Sparcstation LX when all files are located on local disks, and another 30 minutes to pack the index (not required, but improves speed at run time). Incremental addition of new data to the corpus proceeds at a rate of roughly six megabytes per minute.
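As a rough consistency check on these figures, assuming that ``280M'' denotes megabytes of text (the unit is not stated explicitly), the quoted times imply:

```latex
\begin{align*}
\text{indexing rate} &\approx \frac{280\ \text{MB}}{45\ \text{min}} \approx 6.2\ \text{MB/min},\\
\text{EBMT time per sentence} &< \frac{4 \times 60\ \text{s}}{15\ \text{sentences}} = 16\ \text{s}.
\end{align*}
```

Both values agree with the other figures reported: incremental indexing at roughly six megabytes per minute, and about 15 seconds per newswire sentence for the EBMT engine.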
The bilingual Spanish-English corpus consists of 726,406 sentence pairs drawn primarily from the UN Multilingual Corpus [Graff and Finch 94], with a small admixture of texts from the Pan-American Health Organization and the ARPA MT evaluations (10,250 sentence pairs stem from the PAHO corpus and 552 pairs from evaluations). The Serbo-Croatian/English corpus is currently much smaller at only 34,000 pairs, drawn from online parallel texts, scanned-in bilingual newspapers, and the glossaries.

@begin(figure)
@begin(format)
@b[Code:]
PanLite main program: 4,500 lines of code
EBMT/glossary: 12,300 lines of code
LM: 9,700 lines of code
FramepaC: 50,600 lines of code (used by all three programs)
Total object code size: about 1200K for SunOS and 900K for Windows NT.

@b[Data:]
PanLite:
  39,800-word Serbo-Croatian stem list
  12,300-word English root list
  41,300-word Spanish root list
EBMT:
  280M Spanish-English corpus
  280M English-Spanish corpus (inverse of S-E)
  2.3M SerboCroatian-English corpus
  2.3M English-SerboCroatian corpus (inverse of SC-E)
  19,700-word English root/synonym list
  56,900-word Spanish-Eng association dictionary
  21,300-word Eng-SCro association dict
  51,100-word SCro-Eng association dict
Glossaries:
  193,000-entry Spanish-English glossary
  85,000-entry SerboCroatian-English glossary
  129,000-entry English-SerboCroatian glossary
  (SC-E and E-SC glossaries contain an MRD)
Language Modeller:
  13M Serbo-Croatian model (from about 12M text)
  60M English model (from about 450M text)
  41M Spanish model (from about 135M text)
@end(format)
@caption(Code and database sizes)
@end(figure)

@begin{figure}
@tabclear()
@tabset(0.5in,2.0in)
@begin{format}
Croatian-English/English-Croatian:
@\Sparcstation LX:@\10-15 seconds
@\Windows NT/95
@\@ (Pentium-90):@\4-8 seconds
Spanish-English/English-Spanish:
@\Sparcstation LX:@\15-25 seconds
@end{format}
@tag{speed}
@caption{Times to Translate Typical Sentences}
@end{figure}

@section{Rapid Deployment MT}

The PanLite system is the translation
component for the DIPLOMAT rapid-deployment, wearable speech-to-speech translation project. One of DIPLOMAT's goals is ``rapid-deployment'': being able to perform initial translations of a new language in a matter of days or weeks. The initial version of the DIPLOMAT bidirectional Serbo-Croatian/English prototype system, which we will be demonstrating on Toshiba laptops, was developed from scratch in less than three weeks.

The language-pair-independence of the software has been further demonstrated recently: English-to-Spanish translation was brought up on July 29, 1996 in less than seven hours. During these seven hours, a single person using a single Sun Sparcstation inverted the existing Spanish-to-English corpus, dictionary, and glossaries; created new configuration files; created a Spanish language model; and indexed the EBMT corpus, the dictionary, and the glossaries. While the initial translations are of lower quality than the Spanish-to-English translations (due primarily to the poor quality of the inverted dictionary), they can be improved incrementally with some additional effort.

Of course, this exercise finessed a number of difficult issues that the full project is addressing, especially the rapid development of the knowledgebases for a completely new language. But it does demonstrate the generality of the software, and that knowledgebase development @i[is] the primary remaining MT challenge.

@section(References)

[Brown 96a] Brown, R.D. 1996. ``Example-Based Machine Translation in the Pangloss System.'' In @i{Proceedings of the 16th International Conference on Computational Linguistics} (COLING-96).

[Brown 96b] Brown, R.D. 1996. ``FramepaC User's Manual,'' Carnegie Mellon University Center for Machine Translation technical memorandum (in preparation). Current draft available as http://www.cs.cmu.edu/afs/cs.cmu.edu/user/ralf/pub/WWW/papers.html.

[Brown and Frederking 95] Brown, R. and Frederking, R. 1995.
``Applying Statistical English Language Modeling to Symbolic Machine Translation.'' In @i{Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation} (TMI-95), pp. 221-239.

[Frederking and Nirenburg 94] Frederking, R. and Nirenburg, S. 1994. ``Three Heads are Better than One.'' In @i{Proceedings of the Fourth Conference on Applied Natural Language Processing} (ANLP-94), Stuttgart, Germany.

[Graff and Finch 94] Graff, D. and Finch, R. 1994. ``Multilingual Text Resources at the Linguistic Data Consortium.'' In @i{Proceedings of the 1994 ARPA Human Language Technology Workshop}. Morgan Kaufmann.

[Leavitt 94] Leavitt, J. 1994. ``Morphe: A Morphological Rule Compiler,'' Version 2.0a, CMU-CMT-94-MEMO. Carnegie Mellon University Center for Machine Translation technical memorandum.

[Nirenburg et al. 95] Nirenburg, S. (ed.). 1995. ``The Pangloss Mark III Machine Translation System.'' Joint Technical Report, Computing Research Laboratory (New Mexico State University), Center for Machine Translation (Carnegie Mellon University), Information Sciences Institute (University of Southern California). Issued as CMU technical report CMU-CMT-95-145.