MLIM: Chapter 4

[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter4.html .]

[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]

Chapter 4

Machine Translation

Editor: Bente Maegaard

Contributors:

Nuria Bel

Bonnie Dorr

Eduard Hovy

Kevin Knight

Hitoshi Iida

Christian Boitet

Bente Maegaard

Yorick Wilks

Abstract

Machine translation is probably the oldest application of natural language processing. Its 50 years of history have seen the development of several major approaches and, recently, of a new enabling paradigm of statistical processing. Still, today, there is no dominant approach. Despite the commercial success of many MT systems, tools, and other products, the main problem remains unsolved, and the various ways of combining approaches and paradigms are only beginning to be explored.

4.1 Definition of MT

The term machine translation (MT) is normally taken in its restricted and precise meaning of fully automatic translation. However, in this chapter we consider the whole range of tools that may support translation and document production in general, which is especially important when considering the integration of other language processing techniques and resources with MT. We therefore define Machine Translation to include any computer-based process that transforms (or helps a user to transform) written text from one human language into another. We define Fully Automated Machine Translation (FAMT) to be MT performed without the intervention of a human being during the process. Human-Assisted Machine Translation (HAMT) is the style of translation in which a computer system does most of the translation, appealing in case of difficulty to a (mono- or bilingual) human for help. Machine-Aided Translation (MAT) is the style of translation in which a human does most of the work but uses one or more computer systems as assistants, mainly as resources such as dictionaries and spelling checkers.

Traditionally, two very different classes of MT have been identified. Assimilation refers to the class of translation in which an individual or organization wants to gather material written by others in a variety of languages and convert it all into his or her own language. Dissemination refers to the class in which an individual or organization wants to broadcast his or her own material, written in one language, in a variety of languages to the world. A third class of translation has also recently become evident. Communication refers to the class in which two or more individuals are in more or less immediate interaction, typically via email or otherwise online, with an MT system mediating between them. Each class of translation has very different features, is best supported by different underlying technology, and is to be evaluated according to somewhat different criteria.

4.2 Where We Were Five Years Ago

Machine Translation was the first computer-based application related to natural language, starting after World War II, when Warren Weaver suggested using ideas from cryptography and information theory. The first large-scale project was funded by the US Government to translate Russian Air Force manuals into English. After a decade of initial optimism, funding for MT research became harder to obtain in the US. However, MT research continued to flourish in Europe and then, during the 1970s, in Japan. Today, over 50 companies worldwide produce and sell translations by computer, whether as translation services to outsiders, as in-house translation bureaux, or as providers of online multilingual chat rooms. By some estimates, MT expenditure in 1989 was over $20 million worldwide, involving 200 to 300 million pages per year (Wilks, 1992).

Ten years ago, the typical users of machine translation were large organizations such as the European Commission, the US Government, the Pan American Health Organization, Xerox, Fujitsu, etc. Fewer small companies or freelance translators used MT, although translation tools such as online dictionaries were becoming more popular. However, ongoing commercial successes in Europe, Asia, and North America continued to illustrate that, despite imperfect levels of achievement, the levels of quality being produced by FAMT and HAMT systems did address some users’ real needs. Systems were being produced and sold by companies such as Fujitsu, NEC, Hitachi, and others in Japan, Siemens and others in Europe, and Systran, Globalink, and Logos in North America (not to mention the unprecedented growth of cheap, rather simple MT assistant tools such as PowerTranslator).

In response, the European Commission funded the Europe-wide MT research project Eurotra, which involved representatives from most of the European languages, to develop a large multilingual MT system (Johnson, et al., 1985). Eurotra, which ended in the early 1990s, had the important effect of establishing Computational Linguistics groups in several countries where none had existed before. Following this effort, and responding to the promise of statistics-based techniques (as introduced into Computational Linguistics by the IBM group with their MT system CANDIDE), the US Government funded a four-year effort, pitting three theoretical approaches against each other in a frequently evaluated research program. The CANDIDE system (Brown et al., 1990), taking a purely-statistical approach, stood in contrast to the Pangloss system (Frederking et al., 1994), which initially was formulated as a HAMT system using a symbolic-linguistic approach involving an interlingua; complementing these two was the LingStat system (Yamron et al., 1994), which sought to combine statistical and symbolic/linguistic approaches. As we reach the end of the decade, the only large-scale multi-year research project on MT worldwide is Verbmobil in Germany (Niemann et al., 1997), which focuses on speech-to-speech translation of dialogues in the rather narrow domain of scheduling meetings.

4.3 Where We Are Today

Thanks to ongoing commercial growth and the influence of new research, the situation is different today from ten years ago. There has been a trend toward embedding MT as part of linguistic services, which may be as diverse as email across nations, foreign-language web searches, traditional document translation, and portable speech translators with very limited lexicons (for travelers, soldiers, etc.; see Chapter 7).

In organizations such as the European Commission, large integrated environments have been built around MT systems; cf. the European Commission Translation Service’s Euramis (Theologitis, 1997).

The use of tools for translation by freelancers and smaller organizations is developing quickly. Cheap translation assistants, often little more than bilingual lexicons with rudimentary morphological analysis and some text processing capability, are making their way to market to help small companies and individuals write foreign letters, email, and business reports. Even the older, more established systems such as Globalink, Logos, and Systran, offer pared-down PC-based systems for under $500 per language pair. The Machine Translation Compendium available from the International Association of MT (Hutchins, 1999) lists over 77 pages of commercial MT systems for over 30 languages, including Zulu, Ukrainian, Dutch, Swahili, and Norwegian.

MT services are offered via the Internet, often free for shorter texts; see the websites of Systran and Lernout and Hauspie. In addition, MT is increasingly being bundled with other web services; see the website of Altavista, which is linked to Systran.

4.3.1 Capabilities Now

General purpose vs. Domain-specific: Most (commercial) systems are meant to be general purpose. Although their performance is not always very good, the systems are used anyway. However, if the systems were better, MT would be used far more widely: given the explosion of information in the world, the demand for translation is booming, and the only possible answer to this demand is MT (in all its forms).

Domain-specific systems deliver better performance, as they can be tailor-made to specific text types. TAUM-METEO, for example, contains a lexicon of only 220 words, and produces translations of weather reports at 98% accuracy; PaTrans (Maegaard and Hansen, 1995) translates abstracts of chemical reports at high quality. However, domain-specific systems exhibit two drawbacks: they are only cost-effective in large-volume domains, and maintaining many domain-specific systems may not be manageable; cf. Section 4.3.3 below.

4.3.2 Major Methods, Techniques and Approaches

Statistical vs. Linguistic MT

One of the most pressing questions of MT results from the recent introduction of a new paradigm into Computational Linguistics. It had always been thought that MT, which combines the complexities of two languages (at least), requires highly sophisticated theories of linguistics in order to produce reasonable quality output.

As described above, the CANDIDE system (Brown et al., 1990) challenged that view. The DARPA MT Evaluation series of four MT evaluations, the last of which was held in 1994, compared the performance of three research systems, more than 5 commercial systems, and two human translators (White et al., 1992–94). The series forever changed the face of MT, showing that MT systems using statistical techniques to gather their rules of cross-language correspondence were feasible competitors to traditional, purely hand-built ones. However, CANDIDE did not convince the community that the statistics-only approach was the optimal path; in developments since 1994, it has included steadily more knowledge derived from linguistics. This left the burning question: which aspects of MT systems are best approached by statistical methods, and which by traditional, linguistic ones?

Since 1994, a new generation of research MT systems is investigating various hybridizations of statistical and symbolic techniques (Knight et al., 1995; Brown and Frederking, 1995; Dorr, 1997; Nirenburg et al., 1992; Wahlster, 1993; Kay et al., 1994). While it is clear by now that some modules are best approached under one paradigm or the other, it is a relatively safe bet that others are genuinely hermaphroditic, and that their best design and deployment will be determined by the eventual use of the system in the world. Given the large variety of phenomena inherent in language, it is highly unlikely that there exists a single method to handle all the phenomena--both in the data/rule collection stage and in the data/rule application (translation) stage--optimally. Thus one can expect all future non-toy MT systems to be hybrids. Methods of statistics and probability combination will predominate where robustness and wide coverage are at issue, while generalizations of linguistic phenomena, symbol manipulation, and structure creation and transformation will predominate where fine nuances (i.e., translation quality) are important. Just as we today have limousines, trucks, passenger cars, trolley buses, and bulldozers, just so we will have different kinds of MT systems that use different translation engines and concentrate on different functions.

One way to summarize the essential variations is as follows:

Feature                 Symbolic    Statistical
robustness/coverage:    lower       higher
quality/fluency:        higher      lower
representation:         deeper      shallower

How exactly to combine modules into systems, however, remains a challenging puzzle. As argued in (Church and Hovy, 1993), one can use MT function to identify productive areas for guiding research. The ‘niches of functionality’ provide clearly identifiable MT goals. Major applications include:

assimilation tasks: lower quality, broad domains – statistical techniques predominate

dissemination tasks: higher quality, limited domains – symbolic techniques predominate

communication tasks: medium quality, medium domain – mixed techniques predominate

Ideally, systems will employ statistical techniques to augment linguistic insights, allowing the system builder, a computational linguist, to specify the knowledge in the form most convenient to him or her, and have the system perform the tedious work of data collection, generalization, and rule creation. Such collaboration will capitalize on the (complementary) strengths of linguist and computer, and result in much more rapid construction of MT systems for new languages, with greater coverage and higher quality. Still, how exactly to achieve this optimal collaboration is far from clear. Chapter 6 discusses this tradeoff in more detail.

Rule-based vs. Example-based MT

Most production systems are rule-based. That is, they consist of grammar rules, lexical rules, etc. More rules lead to more sophistication and more complexity, and may in the end develop into systems that are quite difficult to maintain. (Typical commercial MT systems contain between a quarter and a half million words and 500 to 1000 grammar rules for each of the more complex languages.) Consequently, alternative methods have been sought.

Translation by analogy, usually called memory-based or example-based translation (EBMT), see (Nagao, 1984), is one answer to this problem. An analogy-based translation system has pairs of bilingual expressions stored in an example database. The source language input expression is matched against the source language examples in the database, and the best match is chosen. The system then returns the target language equivalent of this example as output. In other words, the best match is determined only by the source side of the database; different translations of the source are not taken into account. Like translation memories, analogy-based translation builds on approved translations, so the quality of the output is expected to be high.
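The matching step described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the example database is invented, the similarity measure is a generic word-overlap ratio from Python's standard difflib module, and the names EXAMPLES, similarity, and translate_by_analogy are hypothetical. Real EBMT systems use richer matching and recombination of fragments.

```python
from difflib import SequenceMatcher

# Toy database of approved (source, target) pairs, invented for illustration.
EXAMPLES = [
    ("where is the station", "où est la gare"),
    ("where is the hotel", "où est l'hôtel"),
    ("i would like a coffee", "je voudrais un café"),
]


def similarity(a, b):
    """Word-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def translate_by_analogy(source, examples=EXAMPLES):
    """Return (target, matched_source, score) for the closest stored example.

    Only the source side of each pair is compared with the input; the
    target side is returned as stored.
    """
    scored = [(similarity(source, src), src, tgt) for src, tgt in examples]
    score, matched_source, target = max(scored, key=lambda item: item[0])
    return target, matched_source, score
```

Note that an input such as "i would like a tea" is matched to the coffee example and returns its translation unchanged, which shows why a pure lookup needs either a very large database or some rule-based adaptation of the retrieved example.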

Unfortunately, however, purely analogy-based systems have problems with scalability: the database becomes too large and unmanageable for systems with a realistic coverage. Consequently, a combination of the rule-based approach and the analogy-based approach is the solution. We are seeing many proposals for such hybrid solutions and this is certainly one of the areas that will bring practical MT further.

Transfer vs. Interlingual MT

Current rule-based MT uses either the Transfer architecture or the Interlingua architecture. These approaches can be diagrammed as:

Interlingua approach:

Source text --[analysis]--> Interlingua --[synthesis]--> Target text

Transfer approach:

Source text --[analysis]--> IntermediateStructure(source) --[transfer]-->

IntermediateStructure(target) --[synthesis]--> Target text

The IntermediateStructure is a (usually grammatical) analysis of the text, one sentence at a time. The Interlingua is a (putatively) language-neutral analysis of the text. The theoretical advantage of the Interlingua approach is that one can add new languages at relatively low cost, by creating only rules mapping from the new language into the Interlingua and back again. In contrast, the Transfer approach requires one to build mapping rules from the new language to and from each other language in the system.
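To make the cost argument concrete, the following sketch counts mapping rule sets under each architecture. It is a simplified illustration rather than a cost model from the literature: the function names are hypothetical, every rule set is treated as equally expensive, and the analysis and synthesis grammars that a transfer system also needs are left out of the count.

```python
def interlingua_rule_sets(n_languages):
    """One mapping into the interlingua and one back out, per language."""
    return 2 * n_languages


def transfer_rule_sets(n_languages):
    """One transfer mapping per ordered (source, target) language pair."""
    return n_languages * (n_languages - 1)


def added_by_new_language(count_fn, n_existing):
    """Extra rule sets needed when one language joins n_existing languages."""
    return count_fn(n_existing + 1) - count_fn(n_existing)
```

Under these assumptions, adding an eleventh language to a ten-language system requires 2 new rule sets in the Interlingua design and 20 in the Transfer design.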

The Transfer approach involves a comparison between just the two languages involved. The transfer phase exactly compares lexical units and syntactic structures across the language gap and uses mapping rules to convert the source IntermediateStructure into the target IntermediateStructure representation (Tsujii, 1990). These rules, plus any additional semantic or other information, are stored in dictionaries or knowledge bases. In the transfer approach, nothing is decided a priori about the depth of analysis, i.e., the depth of analysis can depend on the closeness of the languages involved--the closer the languages, the shallower the analysis.

However, for high quality translations, syntactic analysis or shallow semantic analysis is often not enough. Effective translation may require the system to ‘understand’ the actual meaning of the sentence. For example, "I am small" is expressed in many languages using the verb "to be", but "I am hungry" is often expressed using the verb "to have", as in "I have hunger". For a translation system to handle such cases (and their more complex variants), it needs to have information about hunger and so on. Often, this kind of information is represented in so-called case frames, small collections of attributes and their values. The translation system then requires an additional analysis module, usually called the semantic analyzer, additional (semantic) transfer rules, and additional rules for the realizer. The semantic analyzer produces a case frame from the syntax tree, and the transfer module converts the case frame derived from the source language sentence into the case frame format of the target language.
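The "I am hungry" case can be illustrated with a toy transfer rule over case frames. The sketch below is an assumption-laden illustration, not the representation of any particular system: the attribute names (predicate, subject, attribute, object) and the small table STATE_TO_NOUN are invented for the example.

```python
# Hypothetical table of states that the target language expresses as
# possession of a noun ("I am hungry" -> "I have hunger").
STATE_TO_NOUN = {"hungry": "hunger", "thirsty": "thirst"}


def transfer_case_frame(frame):
    """Convert a source-language case frame into a target-language one.

    Frames whose attribute is listed in STATE_TO_NOUN are rewritten with
    the predicate "have"; all other frames are copied unchanged.
    """
    if frame.get("predicate") == "be" and frame.get("attribute") in STATE_TO_NOUN:
        return {
            "predicate": "have",
            "subject": frame["subject"],
            "object": STATE_TO_NOUN[frame["attribute"]],
        }
    return dict(frame)
```

A frame for "I am small" passes through untouched, while the frame for "I am hungry" is restructured before the realizer sees it, which is the division of labor between semantic transfer and synthesis described above.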

Going to the limit, the Interlingual approach requires a full analysis leading to an abstract representation that is independent of the source language, so that the synthesis of the target sentence can be made without any knowledge of what the source language was. This step may require adding a considerable amount of information, even some that is not present in the input text explicitly. For example, since in Arabic paired entities are pluralized differently from other multiples, the system must be told whether a multiple entity in a sentence is (likely to be) a pair: "her eyes flashed" and "all eyes were on the speaker" differ in this regard. Such additional information improves output quality, but at what price? The addition of information, in particular semantic and extra-linguistic information, can be complex and time-consuming. Semantic knowledge is generally stored in a knowledge base or an ontology or concept lexicon (see Chapter 1). In the system KBMT-89 (Nirenburg et al., 1992) such knowledge is used to obtain an unambiguous interlingual representation, but in fact a knowledge base of this type can also be used to augment transfer systems. Generally, the interlingual representation is reached via a number of steps. KBMT-89 first performs syntactic analysis using a Lexical Functional Grammar, translates lexical entries into their interlingual counterparts using the concept dictionary, performs structural changes from the LFG structures into interlingual structures, and finally executes sentence planning and synthesis in the target language.

As mentioned above, the Interlingua approach requires less work to add a new language than the Transfer approach. However, no convincing large-scale Interlingua notation has yet been built. All interlingual MT systems to date have operated at the scale of demonstration (a few hundred lexical items) or prototype (a few thousand). Though a great deal has been written about interlinguas, no clear methodology exists for determining exactly how one should build a true language-neutral meaning representation, if such a thing is possible at all (Whorf, 1956; Nirenburg et al., 1992; Hovy and Nirenburg, 1992; Dorr, 1994).

In practical systems, the transfer approach is often chosen simply because it is the simplest and scales up the best. This is an important virtue in the development of production systems. However, researchers will continue to pursue the Interlingual approach for a variety of reasons. Not only does it hold the promise of decreasing the cost of adding a new language, but it also encourages the inclusion of deeper, more abstract levels of representation, including discourse structure and interpersonal pragmatics, than are included in transfer structures.

Multi-Engine MT

In recent years, several different methods of performing MT–transfer, example-based, simple dictionary lookup, etc.–have all shown their worth in the appropriate circumstances. A promising recent development has been the attempt to integrate various approaches into a single multi-engine MT system. The idea is very simple: pass the sentence(s) to be translated through several MT engines in parallel, and at the end combine their output, selecting the best fragment(s) and recomposing them into the target sentence(s).

This approach makes admirable use of the strengths of each type of MT. For example, since Example-Based Translation is very effective in handling a wide variety of natural speech expressions and incomplete sentences, it is best employed when phrases or fixed subphrases are translated. However, for fully formed, complex grammatical sentences, the analysis stages typically included in the Transfer and Interlingual approaches are still required. The ATR Cooperative Integrated Translation project has constructed a multi-engine mechanism by an analytical method via a bottom-up chart parser mechanism (Maegaard and Hansen, 1995). Using this mechanism the project has realized a prototype system for multilingual translation by preparing language patterns of source language expression examples and translation examples for each language pair. The system, characterized as ‘chat translation’, performs two kinds of two-way translation, namely Japanese-English and Japanese-Korean, and moreover, one-way Japanese to German. It outputs the synthesized speech in these four languages. It has been designed for translating travel arrangement dialogues between an information service and tourists.

Another example of multi-engine MT is Pangloss (Frederking et al., 1994). This MT system combined a multilingual dictionary, an Example-Based engine, and a full-fledged KBMT-style Interlingua system into one framework. During translation, each engine assigned a score to each fragment of its output. After normalizing these scores, the post-translation integrator module placed all output fragments in a chart, in parallel. In early versions, the system then employed a dynamic programming algorithm to traverse the chart, select the best-scoring set of fragments (of whatever size), and combine the fragments into the resulting output sentences. Later versions employed statistical language modeling, as used in speech recognition, to combine the scores with the a priori likelihood of the resulting sequence of words (Brown and Frederking, 1995).
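The chart traversal described for the early versions can be sketched as a small dynamic program. This is a hedged illustration, not the Pangloss implementation: the fragment format (start, end, text, score), the function name best_cover, and the scores are all assumptions, and the sketch simply maximizes the sum of scores over a sequence of adjacent fragments that covers the whole input.

```python
def best_cover(n_words, fragments):
    """Pick the best-scoring sequence of fragments covering words 0..n_words.

    fragments: list of (start, end, text, score), where the fragment
    translates source words start..end-1 and score is assumed to be
    normalized across engines already.
    Returns (total_score, [texts]) or None if no full cover exists.
    """
    # best[i] holds the best (score, texts) for covering words 0..i-1.
    best = [None] * (n_words + 1)
    best[0] = (0.0, [])
    for end in range(1, n_words + 1):
        for start, frag_end, text, score in fragments:
            if frag_end != end or best[start] is None:
                continue
            candidate = (best[start][0] + score, best[start][1] + [text])
            if best[end] is None or candidate[0] > best[end][0]:
                best[end] = candidate
    return best[n_words]


# Invented chart for a three-word input; each engine contributes fragments.
CHART = [
    (0, 1, "the", 1.0),            # dictionary engine
    (1, 2, "house", 0.9),          # dictionary engine
    (2, 3, "white", 0.8),          # dictionary engine
    (1, 3, "white house", 2.4),    # example-based engine
    (0, 3, "the house white", 2.5),
]
```

With this chart, best_cover(3, CHART) selects "the" followed by "white house". Replacing the plain sum with a language-model score over the resulting word sequence corresponds to the later versions mentioned above.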

Speech-to-Speech Translation

Current commercially available technology already makes speech-to-speech translation possible and usable. The Verbmobil project (Niemann et al., 1997) and others are discussed in Chapter 7.

4.3.3 Major Bottlenecks and Problems

Some bottlenecks have already been mentioned above, especially in Section 4.3.2.

A rather depressing (for researchers) fact that we do know today can be stated as follows: generally, the older a system, the better its performance, regardless of the modernity of its internal operations. Why is this?

MT, like all NLP applications, deals with language. Language requires a lot of information: lexical, grammatical, translation equivalences, etc. Whatever way this knowledge is used (and the differences constitute the basis for the different approaches to MT), this knowledge must be instantiated (in dictionaries, in rules, in repositories of examples, in collections of parallel texts) and processed. These two factors--knowledge collection and effective knowledge use--form the major bottlenecks faced nowadays not only for MT but for all NLP systems.

One approach is to talk of performant MT and NLP, rather than of MT (and NLP) in the abstract. Although systems are sometimes designed to cope with general, unrestricted language, in the end it usually turns out that in order to make them performant some customization is required. This gives rise to problems of coverage, because it seems unlikely that either linguistic or statistical approaches alone can actually cope with all the possibilities of a given language. Where statistical systems can collect, sort, and classify large volumes of data, and can perhaps filter out uncommon or strange usage, linguistic insights are required to guide the statistical processing in order to operate at effective levels of abstraction. One does not, for example, build a statistical NLP system to consider all the words of four letters, or all the words beginning with the letter t. The ways one limits the statistics, and the linguistic levels at which one chooses to operate, both circumscribe the coverage of the system, and ultimately determine where (and if) it will be performant in practice.

Much more experience is needed in the question of statistics-based MT before it will be clear where the performance limits lie. It is clear that statistical techniques can be used effectively to overcome some of the knowledge acquisition bottlenecks (to collect words and phrases, for example). But can they be used to find concepts, those Interlingual units of meaning that are essential for high-quality translation? It is also clear that statistical methods help with some of the basic processes of MT (word segmentation, part of speech tagging, etc.). But can they help with analysis, that process of sentence decomposition without which non-trivial MT is impossible?

A second bottleneck is partially addressed by the multi-engine approach. One can quite confidently assume that no single MT technique is going to provide the best answer in all cases, for all language styles. Furthermore, for applications such as text scanning, a rough translation is quite sufficient, and so a more detailed, but slower and more expensive, translation is not required. How can one best combine various MT engines, weaving together their outputs into the highest quality sentences? How can one combine experimental research systems (that may produce output only in some cases, but then do very well) with tried and true commercial systems (that always produce something, though it might be of low quality)? These questions are not necessarily deep, but they are pressing, and they should be investigated if we are to achieve true general-purpose FAMT.

For speech-to-speech translation, evaluation (Carter et al., 1997) shows that fundamental research is still badly needed to improve overall quality and increase usability, in particular on:

Context processing: how to transmit and use possible Centers:

    in analysis, for anaphora or elision,

    in generation, for controlling lexical selection and producing ellipses and elisions to improve naturalness and coherence.

Prosody processing: how to generate prosodic marks (to be used by the text to speech components) from pragmatic, semantic and syntactic features.

Integration between heterogeneous components (speech recognition and MT):

    richer interface data structures (such as tree lattices),

    use of common primary linguistic resources (lexical and grammatical data bases),

    system architecture (pipeline, agents, blackboard, whiteboard).

Current research focuses on almost fully automatic systems, leading to extremely specific, task-dependent systems. While they can be useful, we should not repeat the errors of the 1970s. We should focus on computerized assistance for interpreters (to help with conversations conducted partially in some common language or indirectly through some imperfect spoken translation system) and for active listeners wanting to better understand speech in a foreign language (conversation, radio, TV).

4.3.4 Breakthroughs

Several applications have proven to be able to work effectively using only subsets of the knowledge required for MT. It is possible now to evaluate different tasks, to measure the information involved in solving them, and to identify the most efficient techniques for a given task. Thus, we must face the decomposition of monolithic systems, and start talking about hybridization, engineering, architectural changes, shared modules, etc. It is important when identifying tasks to evaluate linguistic information in terms of what is generalizable, and thus a good candidate for traditional parsing techniques (argument structure of a transitive verb in active voice?), and what is idiosyncratic (what about collocations?). Besides, one cannot discard the power of efficient techniques that yield better results than older approaches, as illustrated clearly by part of speech disambiguation, which has proved to be better solved using Hidden Markov Models than traditional parsers. On the other hand, it has been proven that good theoretically motivated and linguistically driven tagging label sets improve the accuracy of statistical systems. Hence we must be ready to separate the knowledge we want to represent from the techniques/formalisms that have to process it.
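For readers unfamiliar with Hidden Markov Model tagging, the sketch below shows the standard Viterbi decoding step on a toy model. Everything numeric here is an assumption: the three-tag label set, the probabilities in START, TRANS, and EMIT, and the floor value UNSEEN are invented for illustration and are not taken from any trained tagger. The input is assumed to be a non-empty list of words.

```python
STATES = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
EMIT = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"can": 0.4, "rusts": 0.1, "dog": 0.5},
    "VERB": {"can": 0.5, "rusts": 0.4, "runs": 0.1},
}
UNSEEN = 1e-6  # floor probability for words a tag was never seen with


def viterbi(words):
    """Return the most probable tag sequence for a non-empty word list."""
    # Each trellis column maps a tag to (best path probability, previous tag).
    trellis = [{s: (START[s] * EMIT[s].get(words[0], UNSEEN), None) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: trellis[-1][p][0] * TRANS[p][s])
            prob = trellis[-1][prev][0] * TRANS[prev][s] * EMIT[s].get(word, UNSEEN)
            column[s] = (prob, prev)
        trellis.append(column)
    # Follow the back-pointers from the most probable final tag.
    state = max(STATES, key=lambda s: trellis[-1][s][0])
    tags = [state]
    for column in reversed(trellis[1:]):
        state = column[state][1]
        tags.append(state)
    return list(reversed(tags))
```

In this toy model the word "can" is slightly more likely as a verb in isolation, yet viterbi(["the", "can", "rusts"]) tags it as a noun because of the preceding determiner. The choice of tag set is exactly where the linguistically motivated label sets mentioned above enter a statistical system.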

In order to cope with hybrid architectures, the role of the lexicon is fundamental. As discussed in Chapter 1, the lexicon(s) must supply all the modules with the relevant information, and, in order to detect when and where to supply information to one or another module, all the information must be interrelated and structured. Exhaustive information about both idiosyncratic and general issues must be encoded in an application independent way. Only then can we start talking about reusability of resources. In addition, the lexicon must incorporate generative components to overcome redundancy and to foresee productivity. However, as mentioned, exhaustivity creates problems of data overkill, requiring (for example) sophisticated word sense disambiguation techniques. One could also try to reduce the complexity of MT by organizing information under multilingual or cross-lingual generalizations, in the way it was tried in the Eurotra research program (Johnson et al., 1985). In summary, we should be concerned with identifying what techniques can lead to better results under separation of phenomena: transfer vs. interlingua (including ontologies), grammar-based vs. example-based techniques, and so on. We should be willing to view alternatives not as competing approaches but as complementary techniques, the key point being to identify how to structure and to control the combination of all of them.

4.4 Where We Will Be in Five Years

4.4.1 Expected Capabilities

One important trend, of which the first instances can be seen already, is the availability of MT for casual, one-off use via the Internet. Such services can either be standalone MT (as is the case for Lernout and Hauspie and Systran) or bundled with some other application, such as web access (as is the case with the website of Altavista, which is linked to Systran), multilingual information retrieval in general (see Chapter 2), text summarization (see Chapter 3), and so on.

A second trend can also be recognized: the availability of low-quality portable speech-to-speech MT systems. An experimental system constructed at Carnegie Mellon University in the USA was built for use in Bosnia. Verbmobil handles meeting scheduling in spoken German, French, and English. It is expected that these domains will increase in size and complexity as speech recognition becomes more robust; see Chapter 5 and Chapter 7.

As analysis and generation theory and practice become more standardized and established, the focus of research will increasingly turn to methods of constructing low-quality yet adequate MT systems (semi-)automatically. Methods of automatically building multilingual lexicons and wordlists involve bitext alignment and word correspondence discovery; see (Melamed, 1998; Wu, 1995; Fung and Wu, 1995; Chapter 1).
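As a rough illustration of word correspondence discovery, the sketch below scores word pairs in a sentence-aligned bitext by how often they co-occur, using the Dice coefficient. This is a generic textbook heuristic chosen for brevity, not the method of any of the works cited above; the tiny bitext and the names dice_scores and best_match are invented for the example.

```python
from collections import Counter
from itertools import product


def dice_scores(bitext):
    """Score (source_word, target_word) pairs over aligned sentence pairs.

    Dice = 2 * (sentence pairs containing both words)
             / (pairs containing the source word + pairs containing the target word)
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in bitext:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in pair_count.items()
    }


def best_match(scores, source_word):
    """Return the target word with the highest score for source_word."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Invented sentence-aligned bitext.
BITEXT = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]
```

On this toy data, best_match(dice_scores(BITEXT), "house") returns "maison". With realistic corpora such raw co-occurrence scores are noisy, which is why the alignment methods cited above add further constraints.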

4.4.2 Expected Methods and Techniques

It is clear from the discussion above that future developments will include highly integrated approaches to translation (integration of translation memory and MT, hybrid statistical-linguistic translation, multi-engine translation systems, and the like). We are likely to witness the development of statistical techniques to address problems that defy easy formalization and obvious rule-based behavior, such as sound transliteration (Knight and Graehl, 1997), word equivalence across languages (Wu, 1995), wordsense disambiguation (Yarowsky, 1995), etc. The interplay between statistical and symbolic techniques is discussed in Chapter 6.

Two other ongoing developments do not draw much on empirical linguistics. The first is the continuing integration of low-level MT techniques with conventional word processing to provide a range of aids, tools, lexicons, etc., for both professional and occasional translators. This is now a real market, one that helps translators do their work, and it confirms the predictions Martin Kay made some twenty years ago (Kay, 1997; reprint) about the role of machine-aided human translation. Kay's remarks predated the more recent empirical upsurge and seemed to reflect a deep pessimism about the ability of any form of theoretical linguistics, or theoretically motivated computational linguistics, to deliver high-quality MT. The same attitudes underlie (Arnold et al., 1994), which was produced by a group long committed to a highly abstract approach to MT that failed in the Eurotra project; the book itself is effectively an introduction to MT as an advanced form of document processing.

The second continuing development, set apart from the statistical movement, is a continuing emphasis on large-scale handcrafted resources for MT. This emphasis implicitly rejects the assumption of the empirical movement that such resources could be partly or largely acquired automatically, e.g., by extraction of semantic structures from machine-readable dictionaries or of grammars from treebanks, or by machine learning methods. As described in Chapter 1, efforts continue in a number of EC projects, including PAROLE/SIMPLE and EuroWordNet (Vossen et al., 1999), as well as on the ontologies WordNet (Miller et al., 1995), SENSUS (Knight and Luk, 1994; Hovy, 1998), and Mikrokosmos (Nirenburg, 1998). This work exemplifies something of the same spirit expressed by Kay and Arnold et al., a spirit that has also been conspicuous in parts of the Information Extraction community (see Chapter 3): the use of very simple heuristic methods, while retaining the option to use full-scale theoretical methods (in this case knowledge-based MT).

4.4.3 Expected Bottlenecks

One step in the integration of MT in broader systems is to determine how different modules can be integrated using common resources and common representation formats. A number of research projects are studying how to define the format in which information can be collected from different modules in order to have the right information at the right time. This will surely imply defining standard interchange formats, underspecification as a general philosophy, and highly structured lexicons in which all information (grammatical features as well as collocational and multiword unit patterns, frequency of use and contextual information, conceptual identification, multilingual equivalences, links to synonyms, hypernyms, etc.) is interrelated. The issues of large-coverage resources (collection, standardization, and coordination) are discussed in Chapter 1.
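
As a rough illustration of what such an interrelated lexicon entry might look like, the following Python sketch links entries in different languages through a shared concept identifier. It is a sketch under stated assumptions, not a proposed standard: the class names LexiconEntry and Lexicon, all field names, and the concept identifiers are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LexiconEntry:
    lemma: str
    language: str
    part_of_speech: str
    concept_id: str  # hypothetical language-neutral concept identifier
    features: dict = field(default_factory=dict)      # grammatical features
    collocations: list = field(default_factory=list)  # multiword patterns
    frequency: float = 0.0                            # relative frequency of use
    hypernyms: list = field(default_factory=list)     # concept ids of hypernyms


class Lexicon:
    """Indexes entries by lemma and by concept so that they stay interrelated."""

    def __init__(self):
        self._by_concept = {}
        self._by_lemma = {}

    def add(self, entry):
        self._by_concept.setdefault(entry.concept_id, []).append(entry)
        key = (entry.language, entry.lemma)
        self._by_lemma.setdefault(key, []).append(entry)

    def equivalents(self, lemma, source_lang, target_lang):
        """Target-language lemmas that share a concept with the source lemma."""
        found = []
        for entry in self._by_lemma.get((source_lang, lemma), []):
            for other in self._by_concept.get(entry.concept_id, []):
                if other.language == target_lang and other.lemma not in found:
                    found.append(other.lemma)
        return found


# Hypothetical entries: two senses of English "bank" and their French equivalents.
lexicon_db = Lexicon()
lexicon_db.add(LexiconEntry("bank", "en", "noun", "FINANCIAL-INSTITUTION"))
lexicon_db.add(LexiconEntry("bank", "en", "noun", "RIVER-SIDE"))
lexicon_db.add(LexiconEntry("banque", "fr", "noun", "FINANCIAL-INSTITUTION"))
lexicon_db.add(LexiconEntry("rive", "fr", "noun", "RIVER-SIDE"))
translations = lexicon_db.equivalents("bank", "en", "fr")
```

Routing multilingual equivalences through concept identifiers, rather than storing direct word-to-word links, is one possible design choice; it keeps the sense distinction explicit, so that a disambiguation module could select between the two French candidates.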

Second, the problem of semantics is perennial. Without some level of semantic representation, MT systems will never be able to achieve high quality, because they will never be able to differentiate between cases that are lexically and syntactically ambiguous. The above-mentioned work on semantics in lexicons and ontologies will benefit MT (as it will other applications such as Summarization and Information Extraction).

Third, an increasingly pressing bottleneck is the fact that essentially all MT systems operate at the single-sentence level. Except for a few experimental attempts, no systems have explored translating beyond the sentence boundary, using discourse representations. Typically, their only cross-sentence record is a list of referents for pronominalization. Yet many phenomena in language span sentence boundaries:

Erroneous quotation scoping: In a direct quote in Japanese, the reporting verb of the sentence (the main clause) follows the quote itself (the dependent clause), while in English it normally precedes the quote. Inverting the main and dependent clauses is manageable when the quote is a single sentence, but when it spans multiple sentences, the system currently has no way to determine at which sentence the quote began, and is hence incapable of placing the main clause correctly. As a result, quoted multi-sentence text is translated very oddly by J-E systems.

Inadequate pronominalization: The system cannot know what personal pronoun ("he", "she", or "it") to use when its referent lies in an earlier sentence. This problem occurs especially often in J-E translation since Japanese frequently omits sentence subjects; when the system attempts to create and insert a pronoun in the English it has no knowledge of previously introduced referents and hence has no alternative but to guess a pronoun.

Inappropriate comma insertion: Most synthesis modules contain a set of rules that govern the insertion of commas into the final English text. These rules seldom operate adequately. One reason is that comma placement in English is partially prosodic, based on the rhythm and balance of clauses in the text; without knowing the length and internal structure of the paragraph, comma insertion rules have no way of determining appropriate placement points.

Incorrect relative pronoun selection: The choice of relative pronoun ("that", "in which", "which", "to whom", etc.) is not always trivial, and the behavior of the current synthesis rules in the system reflects that fact. Since relative pronouns refer to entities outside of the relative clause, rules for proper pronoun usage must be able to locate and inspect the appropriate referent.
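
The common thread in these phenomena is that some state must be carried from one sentence to the next. The following Python sketch shows one minimal form such state could take: a list of previously introduced referents, used to choose a pronoun for an omitted subject, and a record of the sentence at which an open quotation began. It is an illustration under simplifying assumptions, not the design of any existing system: the names Referent and DiscourseState are hypothetical, the gender of each referent is assumed to be supplied by a lexicon, the most recent referent is naively taken as the antecedent, and the schematic input strings stand in for real Japanese sentences.

```python
from dataclasses import dataclass

PRONOUNS = {"m": "he", "f": "she", "n": "it"}
OPEN_QUOTE, CLOSE_QUOTE = "「", "」"  # Japanese quotation brackets


@dataclass
class Referent:
    text: str
    gender: str  # "m", "f", or "n"; assumed to come from a lexicon


class DiscourseState:
    """State carried from one sentence to the next during translation."""

    def __init__(self):
        self.referents = []      # referents seen so far, most recent last
        self.quote_start = None  # index of the sentence where an open quote began

    def note_referent(self, referent):
        self.referents.append(referent)

    def pronoun_for_omitted_subject(self):
        """Choose a pronoun by naive recency; None if no referent is known."""
        if not self.referents:
            return None  # a purely sentence-level system is always in this case
        return PRONOUNS[self.referents[-1].gender]

    def note_sentence(self, index, text):
        """Track quotation marks in one sentence.

        Returns the index of the sentence where a quote closed by this
        sentence began, or None if no quote is closed here.
        """
        closed_from = None
        for ch in text:
            if ch == OPEN_QUOTE and self.quote_start is None:
                self.quote_start = index
            elif ch == CLOSE_QUOTE and self.quote_start is not None:
                closed_from = self.quote_start
                self.quote_start = None
        return closed_from


state = DiscourseState()
state.note_referent(Referent("Ms. Tanaka", "f"))
pronoun = state.pronoun_for_omitted_subject()

# Schematic three-sentence quote followed by the reporting verb.
sentences = [
    "「 first quoted sentence",
    "second quoted sentence",
    "third quoted sentence 」 REPORTING-VERB",
]
quote_starts = [state.note_sentence(i, s) for i, s in enumerate(sentences)]
```

When the third sentence closes the quote, the state reports that the quote began at sentence 0, which is the information needed to place the English reporting clause before the first quoted sentence.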

Fortunately, recent developments in Text Linguistics, Discourse Study, and computational text planning have led to theories and techniques that are potentially of great importance for MT. Using a paragraph structure, one can represent and manipulate cross-sentence groupings of clauses and sentences. Marcu (1997) describes a method of automatically producing trees representing paragraph structure. Two studies report the effects on output quality of using a very simple paragraph structure tree to treat multi-sentence quotes (Hovy and Gerber, 1997) and to break up overlong sentences (Gerber and Hovy, 1998).

Fourth, the treatment of so-called low-diffusion languages requires additional attention. Not all languages are equally well covered by MT; even the languages of some of the most populous nations in the world are not yet represented in the commercial (or even research) MT sphere: Indonesian, various languages of India, and others. The so-called major languages are reasonably well covered at present, and will certainly be well covered in the future, but users of less widely spoken languages need MT and other tools just as much as, or even more than, users of English, Spanish, French and Japanese. For some languages the market is not sufficiently large, which means that users of those languages will lack the tools that are otherwise available. This lack of tools will have an obvious economic effect, but also a cultural effect, by excluding some languages from participating in an otherwise flourishing multilinguality.

4.5 Juxtaposition of this Area with Other Areas

It is probably safe to say that Machine Translation is a central area in the emerging world of multifunctional language processing. While not everyone will use more than one language, many people will have occasion to call on MT at least a few times in their lives. The language processing tasks most closely linked to MT include cross-language Information Retrieval (Chapter 2), Speech Translation (Chapter 7), and multilingual Text Summarization (Chapter 3).

4.6 Conclusion

The future of MT is rosy. Thanks largely to the Internet and the growth of international commerce, casual (one-off) and repeated MT is growing at a very fast pace. Correspondingly, MT products are coming to market as well. The Machine Translation Compendium (Hutchins, 1999) lists commercial products in over 30 languages (including Zulu, Ukrainian, Dutch, Swahili, and Norwegian) in 83 language pairs. Comparative studies of MT systems, including the OVUM Report (OVUM, 1995) and the ABS Study (ABS, 1999), continue to become available, although they tend to cost upward of US$1,000.

In tandem with this growth, it is imperative to ensure that research in MT begins again. At this time, neither the EU nor the North American funding agencies support coordinated, or even large separate, research projects in MT. Without research, however, the difficult problems endemic to MT will not be solved; MT companies do not have enough financial leeway or in many cases the technical expertise required to make theoretical breakthroughs. Since market forces alone cannot solve the problem, governments and funding agencies have to take an active role in the protection and reinforcement of MT.

4.7 References

ABS Study. 1999. Allied Business Intelligence, Inc. Oyster Bay, NY. See http://www.infoshop-japan.com/study/ab3365_languagetranslation_toc.html.

Arnold, D.J. et al. 1994. An Introduction to Machine Translation. Oxford: Blackwell.

Brown, P.F., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, P. Roossin. 1990. A Statistical Approach to Machine Translation. Computational Linguistics 16(2) (79-85).

Brown, R., and R. Frederking. 1995. Applying Statistical English Language Modeling to Symbolic Machine Translation. Proceedings of the Conference on Theoretical and Methodological Issues in MT (TMI-95), (221-239).

Carter, D., R. Becket, M. Rayner, R. Eklund, C. MacDermid, M. Wirén, S. Kirchmeier-Andersen, and C. Philp. 1997. Translation Methodology in the Spoken Language Translator: An Evaluation. Proceedings of the Spoken Language Translation Meeting, (73-81). ACL/ELSNET, Madrid.

Church, K.W. and E.H. Hovy. 1993. Good Applications for Crummy Machine Translation. Journal of Machine Translation 8 (239-258).

Dorr, B.J. 1994. Machine Translation Divergences: A Formal Description and Proposed Solution. Computational Linguistics 20(4) (597-634).

Dorr, B. 1997. Large-Scale Acquisition of LCS-Based Lexicons for Foreign Language Tutoring. Proceedings of the Fifth ACL Conference on Applied NLP (ANLP), (139-146). Washington, DC.

Frederking, R., S. Nirenburg, D. Farwell, S. Helmreich, E. Hovy, K. Knight, S. Beale, C. Domanshnev, D. Attardo, D. Grannes, R. Brown. 1994. Integrating Translations from Multiple Sources within the Pangloss Mark III Machine Translation System. Proceedings of the First AMTA Conference, Columbia, MD (73-80).

Fung, P. and D. Wu. 1995. Coerced Markov Models for Cross-Lingual Lexical-Tag Relations. Proceedings of the Conference on Theoretical and Methodological Issues in MT (TMI-95), (240-255).

Gerber, L. and E.H. Hovy. 1998. Improving Translation Quality by Manipulating Sentence Length. In D. Farwell, L. Gerber, and E.H. Hovy (eds), Machine Translation and the Information Soup: Proceedings of the Third AMTA Conference, Philadelphia, PA. Heidelberg: Springer (448-460).

Hovy, E.H. and S. Nirenburg. 1992. Approximating an Interlingua in a Principled Way. Proceedings of the DARPA Speech and Natural Language Workshop. Arden House, NY.

Hovy, E.H. and L. Gerber. 1997. MT at the Paragraph Level: Improving English Synthesis in SYSTRAN. Proceedings of the Conference on Theoretical and Methodological Issues in MT (TMI-97).

Hutchins, J. 1999. Compendium of Machine Translation Software. Available from the International Association of Machine Translation (IAMT).

Johnson, R.L., M. King, and L. Des Tombe. 1985. EUROTRA: A Multi-Lingual System under Development. Computational Linguistics 11, (155-169).

Kay, M., J.M. Gawron, and P. Norvig. 1994. Verbmobil: A Translation System for Face-to-Face Dialog. CSLI Lecture Notes No. 33, Stanford University.

Kay, M. 1997. The proper place of men and machines in translation. Machine Translation 23.

Knight, K., I. Chander, M. Haines, V. Hatzivassiloglou, E.H. Hovy, M. Iida, S.K. Luk, R.A. Whitney, and K. Yamada. 1995. Filling Knowledge Gaps in a Broad-Coverage MT System. Proceedings of the 14th IJCAI Conference. Montreal, Canada.

Knight, K. and J. Graehl. 1997. Machine Transliteration. Proceedings of the 35th ACL Conference (ACL-97). Madrid, Spain, (128-135).

Maegaard, B. and V. Hansen. 1995. PaTrans, Machine Translation of Patent Texts, From Research to Practical Application. Proceedings of the Second Language Engineering Convention, (1-8). London: Convention Digest.

Marcu, D. 1997. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. Ph.D. dissertation, University of Toronto.

Melamed, I.D. 1998. Empirical Methods for Exploiting Parallel Texts. Ph.D. dissertation, University of Pennsylvania.

Nagao, M. 1984. A Framework of a Machine Translation between Japanese and English by Analogy principle, (173-180). In Elithorn and Banerji (eds.), Artificial and Human Intelligence, North Holland.

Niemann, H., E. Noeth, A. Kiessling, R. Kompe and A. Batliner. 1997. Prosodic Processing and its Use in Verbmobil. Proceedings of ICASSP-97, (75-78). Munich, Germany.

Nirenburg, S., J.C. Carbonell, M. Tomita, and K. Goodman. 1992. Machine Translation: A Knowledge-Based Approach. San Mateo: Morgan Kaufmann.

Nirenburg, S. 1998. Project Boas: "A Linguist in the Box" as a Multi-Purpose Language Resource. Proceedings of the First International Conference on Language Resources and Evaluation (LREC), (739-745). Granada, Spain.

OVUM 1995. Mason, J. and A. Rinsche. Translation Technology Products. OVUM Ltd., London.

Theologitis, D. 1997. Integrating Advanced Translation Technology. In the 1997 LISA Tools Workshop Guidebook, (1/1-1/35). Geneva.

Tsujii, Y. 1990. Multi-Language Translation System using Interlingua for Asian Languages. Proceedings of International Conference organized by IPSJ for its 30th Anniversary.

Vossen, P., et al. 1999. EuroWordNet. Computers and the Humanities, special issue (in press).

White, J. and T. O’Connell. 1992-94. ARPA Workshops on Machine Translation. Series of 4 workshops on comparative evaluation. PRC Inc., McLean, VA.

Whorf, B.L. 1956. Language, Thought, and Reality: Selected Writings of Benjamin Lee Whorf, J.B. Carroll (ed). Cambridge: MIT Press.

Wilks, Y. 1992. MT Contrasts between the US and Europe. In J. Carbonell et al. (eds), JTEC Panel Report commissioned by DARPA and Japanese Technology Evaluation Center, Loyola College, Baltimore, MD.

Wu, D. 1995. Grammarless Extraction of Phrasal Translation Examples from Parallel Texts. Proceedings of the Conference on Theoretical and Methodological Issues in MT (TMI-95), (354-372).

Yamron, J., J. Cant, A. Demedts, T. Dietzel, Y. Ito. 1994. The Automatic Component of the LINGSTAT Machine-Aided Translation System. In Proceedings of the ARPA Conference on Human Language Technology, Princeton, NJ (158-164).

Yarowsky, D. 1995. Three Machine Learning Algorithms for Lexical Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania, Department of Computer and Information Sciences.