TRANSFER SYSTEM WORK FOR MT EVALUATION

Lexicon Development

Transfer lexicon development used two main resources: the LDC Chinese-English dictionary and the British National Corpus (BNC) word frequency list. For the small data track I used the top 10K words of the LDC dictionary, as sorted by word frequency. For the large data track this was increased to 50K words. For both tracks the dictionary was augmented with entries from the statistically generated Chinese-English glossary. Given the uneven quality of the statistical glossary, only entries that were not already in the LDC dictionary were used.

Even though part of speech (POS) information for Chinese words is available, it was not used for the small or large data tracks, though apparently it could have been used for the large track. A further experiment using this additional data would be enlightening.

Since no Chinese POS information was allowed and the transfer rules depended on POS, transfer lexicon entries were created by extrapolating back from the POS of the English glosses of the Chinese words. This information came from the BNC word list. For a Chinese word, all of its English translations were looked up in the BNC list. The POS of each English word was then used to add an entry to the transfer lexicon. Since the transfer engine originally would only use the first entry for any given Chinese word and part of speech, only one such entry was added to the lexicon for any word/POS combination. With the later language-model-enhanced version of transfer, this would likely need to be changed to add all possibilities. English words or phrases not found in the BNC word list were added as UNK, or unknown part of speech.

This method led to over-generation of entries, since many Chinese words do not share the range of POS of their English translations. For example, a Chinese word that acts only as a verb might have an English translation that can be either a noun or a verb. Both POS would be added to the lexicon.
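The lexicon construction described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code actually used: the function name, the dictionary layouts, the POS tag names and the toy data (including the placeholder source word ZH_VERB) are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

UNK = "UNK"  # tag for English glosses that are missing from the POS word list


def build_transfer_lexicon(
    glosses: Dict[str, List[str]],
    english_pos: Dict[str, Set[str]],
) -> Dict[Tuple[str, str], str]:
    """Map (source word, POS) to a single English translation.

    glosses: source word -> English translations, in dictionary order.
    english_pos: English word -> POS tags (hypothetical stand-in for the BNC list).
    """
    lexicon: Dict[Tuple[str, str], str] = {}
    for source_word, translations in glosses.items():
        for english in translations:
            # Glosses that are not in the word list get the unknown POS.
            pos_tags = english_pos.get(english) or {UNK}
            for pos in sorted(pos_tags):
                # Keep only the first translation per word/POS combination.
                lexicon.setdefault((source_word, pos), english)
    return lexicon


# Toy data: a source word that is only a verb, glossed by an English
# word that can be a noun or a verb, plus a phrase missing from the list.
toy_glosses = {"ZH_VERB": ["walk", "stroll about", "step"]}
toy_pos = {"walk": {"N", "V"}, "step": {"N", "V"}}

for key, english in build_transfer_lexicon(toy_glosses, toy_pos).items():
    print(key, english)
```

In the toy run the verb-only source word receives both a noun and a verb entry, which is the over-generation noted above, and "step" adds nothing because "walk" already fills both word/POS slots.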
In addition to this automatic lexicon generation, I also added lexicon entries for some closed-class words and other items such as determiners, punctuation, and dates.

Chinese does not use inflectional morphology on its words. Nouns are the same for singular and plural, and verbs are not marked for tense or agreement. While not marked on the words themselves, these features can often be derived from other words or grammatical structures in the sentence. Unfortunately, the LDC glossary usually has only the base form of each English word, so even if enough information is available to decide whether a noun should be singular or plural, the required noun form is not available in the transfer lexicon. The transfer engine also does not yet have access to a morphological generation component for English.

Rule Development

Rule development concentrated on the structures that were most easily identified and where word order differs between English and Mandarin. Experiments showed that, due to the overgeneralizations in the transfer lexicon and the current simple nature of the rules, transfer rules concentrating on local reordering worked better than rules for sentence-level reordering.

A few rules were added to deal with domain-specific constructions in the source texts. These included special rules to handle datelines found in newswire texts and to switch name and title combinations from Chinese to English order. Most rules, however, were not domain-specific.

Although I concentrated on noun phrase rules, I also added some rules for other phrases. For example, in Mandarin the word for negation precedes auxiliary verbs, but follows them in English. I added a rule to switch the order. I also added a rule to translate verbs followed by certain aspect particles signalling completion using the past tense.

For noun phrases, I added rules to remove the classifiers that follow numbers and determiners in Mandarin.
This contextual removal should be better than a blanket removal of all classifiers, some of which have non-classifier meanings. I also had rules to remove the modifier particle used when attaching modifiers such as adjectives to nouns in Mandarin. Other rules switched prepositional phrases modifying nouns,

Engine Work

The evaluation necessitated many changes to the engine that were not specific to Chinese. Before the evaluation the engine had only been used on sentences where all the words were in the lexicon and that would parse perfectly. This of course was not workable for a system designed for real-life input. In response I added the ability to produce partial translations by stringing together translations of individual constituents while trying to maximize the length of the individual translations. Words not in the transfer lexicon now go through a series of checks to see if they could be numbers or other entities that can be translated algorithmically. As a last resort, the word is not translated but passed through as is.

To work as an input to the multi-engine system, I added a method to produce the lattice expected by MEMT. In the week before the evaluation deadline, faculty suggestions led to a hybrid system where the transfer engine output all possible translations for the source constituents and let the MEMT language model pick the optimal full translation. This improved scores.

I added to the transfer engine a flexible means of handling language-specific operations such as number, money and date identification and translation. A base virtual class is available with generic methods to identify and translate these types of words. For specific languages this class can then be sub-classed and the methods supplemented with language-specific code.
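The subclassing scheme and the pass-through fallback might look roughly like the sketch below. It is written in Python purely for illustration and is not the engine's code: the class names, method names and the digit-by-digit numeral conversion are all assumptions, and only numbers are handled (money and dates are omitted).

```python
from typing import Optional


class EntityHandler:
    """Generic checks shared by all languages (hypothetical base class)."""

    def translate_number(self, token: str) -> Optional[str]:
        # Generic rule: strings of ASCII digits are already usable output.
        if token.isascii() and token.isdigit():
            return token
        return None

    def translate(self, token: str) -> str:
        number = self.translate_number(token)
        # Last resort: pass the word through untranslated.
        return number if number is not None else token


class ChineseEntityHandler(EntityHandler):
    """Supplements the generic check with Chinese digit characters."""

    DIGITS = {
        "〇": "0", "零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
        "五": "5", "六": "6", "七": "7", "八": "8", "九": "9",
    }

    def translate_number(self, token: str) -> Optional[str]:
        generic = super().translate_number(token)
        if generic is not None:
            return generic
        # Simplification: only digit-by-digit strings such as years are
        # handled; multiplier characters such as 十 or 百 are not.
        if token and all(ch in self.DIGITS for ch in token):
            return "".join(self.DIGITS[ch] for ch in token)
        return None


handler = ChineseEntityHandler()
for token in ["二〇〇二", "42", "北京"]:
    print(token, "->", handler.translate(token))
```

Keeping the generic checks in the base class means a new language only has to override the methods where its conventions differ, which seems to be the intent of the design described above.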
Final Statistics:

Total time on Chinese-specific rule and lexicon development: 71 hours

Coverage on Small Track: 74.06% (TOT 24360 LEX 16391 GRA 7969 UNK 6319 COV 0.740599)
Coverage on Large Track: 76.99% (TOT 24360 LEX 15773 GRA 8587 UNK 5606 COV 0.769869)

Results:
Xfer+LM Small: 4.8404
EBMT+Xfer Small: 5.2170
Xfer+LM Large: 5.5203

Diachronic Results

Development efforts are reflected in the various columns of this table.

Num: The old segmentation did not preprocess numbers. I added number handling to fill this gap. When the new preprocessed segmentation was introduced this was no longer necessary.

StatDct: This indicates the inclusion of part of the statistical dictionary in the transfer lexicon. Only words that weren't already in the LDC glossary were included.

Cvrge: Token translation coverage, the percentage of words that could be found in the transfer lexicon or handled algorithmically (e.g. numbers). When using the Language Model method I didn't have access to these numbers.

new seg: Using the latest segmentation devised by Joy.

LM: Signifies passing translations of individual source constituents to the MEMT system (just the Xfer engine) and letting the MEMT language model select the best whole translation.

Ch POS: Signifies using the Chinese parts of speech obtained from the Chinese treebank to produce a cleaner transfer lexicon. Otherwise, the transfer lexicon is entirely dependent upon the English definitions for possible parts of speech.

Full Lex: Initial experiments just used the first English entry for any given part of speech to construct the transfer lexicon. Full Lex signifies using all possible English translations for a given Chinese word.

Date Small Large Num StatDct NIST v8 Bleu Cvrge new seg LM Ch POS Full Lex
5/?/2002 x 3.9497 0.0527
5/30/2002 x 4.2889 0.0592
6/3/2002 x 4.4762 0.0798
6/3/2002 x 5.1024 0.0931
6/4/2002 x x 4.5755 0.085
6/6/2002 x x 4.7486 0.0916
6/6/2002 x x 5.3867 0.1057 77.04
6/7/2002 x x 4.7782 0.0925 73.26
6/10/2002 x x x 4.9096 0.0949 74.78
6/10/2002 x x x 5.4265 0.1069 77.45
6/12/2002 x x 5.249 0.0841 x x
6/12/2002 x x 5.1775 0.089 74 x
6/12/2002 x x 5.8299 0.1036 76.46 x
6/12/2002 x x 5.903 0.0995 x x
6/20/2002 x x 5.2864 0.0939 68.23 x x
6/20/2002 x x 5.9545 0.1089 70.96 x x
6/21/2002 x x 5.2257 0.0842 x x x
6/27/2002 x x 5.8813 0.0997 x x x
6/27/2002 x x 5.3472 0.087 x x x x
6/27/2002 x x 6.0129 0.1018 x x x x
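
The coverage figures quoted under Final Statistics appear to be consistent with coverage = (TOT - UNK) / TOT. The short check below reproduces both values. Reading TOT as the total token count and UNK as the number of untranslated tokens is an assumption based on the field names, and the function name is made up for the example.

```python
def coverage(total_tokens: int, unknown_tokens: int) -> float:
    """Fraction of tokens that were translated (assumed definition)."""
    return (total_tokens - unknown_tokens) / total_tokens


small = coverage(24360, 6319)  # small track: TOT 24360, UNK 6319
large = coverage(24360, 5606)  # large track: TOT 24360, UNK 5606
print(f"small {small:.6f}  large {large:.6f}")
# prints: small 0.740599  large 0.769869
```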