[This report is available as http://www.cs.cmu.edu/~ref/mlim/index.html .]
[It has now also been published, in Linguistica Computazionale, Volume XIV-XV, by the Istituti Editoriali e Poligrafici Internazionali, Pisa, Italy, ISSN 0392-6907.]
[It is also available as a single [very large] page: http://www.cs.cmu.edu/~ref/mlim/index.shtml .]
[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]
Multilingual Information Management:
Current Levels and Future Abilities
A report
Commissioned by the US National Science Foundation
and also delivered to
the European Commission's Language Engineering Office
and the US Defense Advanced Research Projects Agency
April 1999
Editors
Eduard Hovy, USC Information Sciences Institute (co-chair)
Nancy Ide, Vassar College (co-chair)
Robert Frederking, Carnegie Mellon University
Joseph Mariani, LIMSI-CNRS
Antonio Zampolli, University of Pisa
Foreword
Gary W. Strong, DARPA and NSF, USA
The Internet is rapidly bringing to the foreground the need for people to be able to access and manage information in many different languages. Even in cases where people have been lucky enough to learn several languages, they will still need help in effectively participating in the global information society. There are simply too many different languages, and all of them are important to somebody.
While machine translation has a long (over 50 year) history, computer technology now appears ready for the next great push in technology for multilingual information access and management, particularly over the World Wide Web. The European Commission and several US agencies are taking bold steps to encourage research and development in multilingual information technologies. The EC and the US National Science Foundation, for example, have recently issued a joint call for Multilingual Information Access and Management research. The US Defense Advanced Research Projects Agency is supporting a new effort in Translingual Information Detection, Extraction, and Summarization research. Both of these efforts are direct results of international planning efforts, and this Granada effort in particular.
No one was more surprised than the Granada workshop participants were at the rapid uptake in interest in Multilingual Information Management research. Attendees of the workshop in Granada, Spain hardly had their bags unpacked when the results were requested to be presented in Washington DC at a National Academy of Sciences workshop on international research cooperation. The US White House expressed interest in the topic as a groundbreaking effort for a new US-EU Science Cooperation Agreement. Now, DARPA has decided to invest in a multi-year, large-scale effort to push the envelope on rapid development of multilingual capability in new language pairs.
The World is surely shrinking as communication and computation advances proceed at a breath-taking pace. On the other hand, there is no doubt that people will continue to hold on to the values and beliefs of their native cultures. This includes holding on to the language of their families and ancestors. This is a treasure, a cultural knowledge base that must not be weakened even as pressures to be able to speak common languages increase. Therefore, efforts in multilingual technology not only allow us to share knowledge and resources of the World, they also allow us to preserve our individual human qualities that have allowed us to progress and solve problems that we all share.
I thank all whose efforts have gone into this workshop report and the resource that it represents for future efforts in the field. Those who proceed to carry on the needed research and development being called for from around the world will surely find this report to be of great value.
Introduction: The Goals of the Report
Over the past 50 years, a variety of language-related capabilities has been developed in machine translation, information retrieval, speech recognition, text summarization, and so on. These applications rest upon a set of core techniques such as language modeling, information extraction, parsing, generation, and multimedia planning and integration; and they involve methods using statistics, rules, grammars, lexicons, ontologies, training techniques, and so on.
It is a puzzling fact that although all of this work deals with language in some form or other, the major applications have each developed a separate research field. For example, there is no reason why speech recognition techniques involving n-grams and hidden Markov models could not have been used in machine translation 15 years earlier than they were, or why some of the lexical and semantic insights from the subarea called Computational Linguistics are still not used in information retrieval.
This picture will rapidly change. The twin challenges of massive information overload via the web and ubiquitous computers present us with an unavoidable task: developing techniques to handle multilingual and multi-modal information robustly and efficiently, with as high quality performance as possible.
The most effective way for us to address such a mammoth task, and to ensure that our various techniques and applications fit together, is to start talking across the artificial research boundaries. Extending the current technologies will require integrating the various capabilities into multi-functional and multi-lingual natural language systems.
However, at this time there is no clear vision of how these technologies could or should be assembled into a coherent framework. What would be involved in connecting a speech recognition system to an information retrieval engine, and then using machine translation and summarization software to process the retrieved text? How can traditional parsing and generation be enhanced with statistical techniques? What would be the effect of carefully crafted lexicons on traditional information retrieval? At which points should machine translation be interleaved within information retrieval systems to enable multilingual processing?
The purpose of this study is to address these questions, in an attempt to identify the most effective future directions of computational linguistics research and in particular, how to address the problems of handling multilingual and multi-modal information. To gather information, a workshop was held in Granada, Spain, immediately following the First International Conference on Linguistic Resources and Evaluation (LREC) at the end of May, 1998. Experts in various subfields from Europe, Asia, and North America were invited to present their views regarding the following fundamental questions:
The experts were invited to represent the following areas:
In a series of ten sessions, one session per topic, the experts explained their perspectives and participated in panel discussions that attempted to structure the material and hypothesize about where we can expect to be in a few years' time. Their presentations, comments, and notes were collected and synthesized into ten chapters by a group of chapter editors.
A second workshop, this one open to the general computational linguistics public, was held immediately after the COLING-ACL conference in Montreal in August, 1998. This workshop provided a forum for public discussion and critique of the material gathered at the first meeting. Subsequently, the chapter editors updated and refined the ten chapters.
This report is formed out of the presentations and discussions of a wide range of experts in computational linguistics research, at the workshops and later. We are proud and happy to present it to representatives and funders of the US and European Governments and other relevant associations and agencies.
We hope that this study will be useful to anyone interested in assessing the future of multilingual language processing.
We would like to thank the US National Science Foundation and the Language Engineering division of the European Commission for their generous support of this study.
Eduard Hovy and Nancy Ide, Editorial Board Co-chairs
Contributors
Nuria Bel, GILCUB, Spain
Christian Boitet, GETA, France
Nicoletta Calzolari, ILC-CNR, Italy
George Carayannis, ILSP, Greece
Lynn Carlson, Department of Defense, USA
Jean-Pierre Chanod, XEROX-Europe, France
Khalid Choukri, ELRA, France
Ron Cole, Colorado State University, USA
Bonnie Dorr, University of Maryland, USA
Christiane Fellbaum, Princeton University, USA
Christian Fluhr, CEA, France
Robert Frederking, Carnegie Mellon University, USA
Ralph Grishman, New York University, USA
Lynette Hirschman, MITRE Corporation, USA
Jerry Hobbs, SRI International, USA
Eduard Hovy, USC Information Sciences Institute, USA
Nancy Ide, Vassar College, USA
Hitoshi Iida, ATR, Japan
Kai Ishikawa, NEC, Japan
Frederick Jelinek, Johns Hopkins University, USA
Judith Klavans, Columbia University, USA
Kevin Knight, USC Information Sciences Institute, USA
Kamran Kordi, Entropic, England
Gianni Lazzari, ITC, Italy
Bente Maegaard, Center for Sprogteknologi, Denmark
Joseph Mariani, LIMSI-CNRS, France
Alvin Martin, NIST, USA
Mark Maybury, MITRE Corporation, USA
Giorgio Micca, CSELT, Italy
Wolfgang Minker, LIMSI-CNRS, France
Doug Oard, University of Maryland, USA
Akitoshi Okumura, NEC, Japan
Martha Palmer, University of Pennsylvania, USA
Patrick Paroubek, CIRIL, France
Martin Rajman, EPFL, Switzerland
Roni Rosenfeld, Carnegie Mellon University, USA
Antonio Sanfilippo, Anite Systems, Luxembourg
Kenji Satoh, NEC, Japan
Oliviero Stock, IRST, Italy
Gary Strong, National Science Foundation, USA
Beth Sundheim, SPAWAR/NCCOSC, USA
Nino Varile, European Commission, Luxembourg
Charles Wayne, Department of Defense, USA
John White, Litton PRC, USA
Yorick Wilks, University of Sheffield, England
Antonio Zampolli, University of Pisa, Italy
Table of Contents
Chapter 1. Multilingual Resources (lexicons, ontologies, corpora, etc.)
Editor: Martha Palmer
Chapter 2. Cross-lingual and Cross-modal Information Retrieval
Editors: Judith Klavans and Eduard Hovy
Chapter 3. Automated Cross-lingual Information Extraction and Summarization
Editor: Eduard Hovy
Chapter 4. Machine Translation
Editor: Bente Maegaard
Chapter 5. Multilingual Speech Processing
Editor: Joseph Mariani
Chapter 6. Methods and Techniques of Processing
Editor: Nancy Ide
Chapter 7. Speaker/Language Identification, Speech Translation
Editor: Gianni Lazzari
Chapter 8. Evaluation and Assessment Techniques
Editor: John White
Chapter 9. Multimedia Communication, in Conjunction with Text
Editors: Mark Maybury and Oliviero Stock
Chapter 10. Government: Policies and Funding
Editors: Antonio Zampolli and Eduard Hovy
[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter1.html .]
Chapter 1
Multilingual Resources
Editor: Martha Palmer
Contributors:
Nicoletta Calzolari
Khalid Choukri
Christiane Fellbaum
Eduard Hovy
Nancy Ide
Abstract
A searing lesson learned in the last five years is the enormous amount of knowledge required to enable broad-scale language processing. Whether this knowledge is acquired by traditional manual means or by semi-automated, statistically oriented methods, the need for international standards, evaluation/validation procedures, and ongoing maintenance and updating of resources that can be made available through central distribution centers is now greater than ever. We can no longer afford to (re)develop new grammars, lexicons, and ontologies for each new application, or to collect new corpora when corpus preparation is a nontrivial task. This chapter describes the current state of affairs for each type of resource (corpora, grammars, lexicons, and ontologies) and outlines what is required in the near future.
1.1 Introduction
Over the last decade, researchers and developers of Natural Language Processing technology have created basic tools that are impacting daily life. Speech recognition saves the telephone company millions of dollars. Text to speech synthesis aids the blind. Massive resources for training and analysis are available in the form of annotated and analyzed corpora for spoken and written language. The explosion in applications has largely been due to new algorithms that harness statistical techniques to achieve maximal leverage of linguistic insights, as well as to the huge increase in power per dollar in computing machinery.
Yet the ultimate goals of the various branches of Natural Language Processing, namely accurate Information Extraction and Text Summarization (Chapter 3), focused multilingual Information Retrieval (Chapter 2), fluent Machine Translation (Chapter 4), and robust Speech Recognition (Chapter 5), still remain tantalizingly out of reach. The principal difficulty lies in dealing with meaning. However well systems perform their basic steps, they are still not able to perform at high enough levels for real-world domains, because they are unable to sufficiently understand what the user is trying to say or do. The difficulty of building adequate semantic representations, both in design and scale, has limited the fields of Natural Language Processing in two ways: either to applications that can be circumscribed within well-defined subdomains, as in Information Extraction and Text Summarization (Chapter 3); or to applications that operate at a less-than-ideal level of performance, as in Speech Recognition (Chapter 5) or Information Retrieval (Chapter 2).
The two major causes of these limitations are related. First, large-scale, all-encompassing resources (lexicons, grammars, etc.) upon which systems can be built are rare or nonexistent. Second, theories that enable the adequately accurate representation of semantics (meaning) for a wide variety of specific aspects (time, space, causality, interpersonal effects, emotions, etc.) do not exist, or are so formalized as to be too constraining for practical implementation. At this time, we have no way of constructing a wide-coverage lexicon with adequately formalized semantic knowledge, for example.
On the other hand, we do have many individual resources, built up over almost five decades of projects in Language Processing, ranging from individual lexicons or grammars of a few thousand items to the results of large multi-project collaborations such as ACQUILEX. We also have access to the work on semantics in Philosophy, NLP, Artificial Intelligence (AI), and Cognitive Science, and in particular to the efforts of large AI projects such as CYC on the construction of semantic knowledge bases (see Section 1.3.4 below). Thus one of our major challenges consists of collecting and reusing what exists, rather than in starting yet again.
The value of standards has long been recognized as a way to ensure that resources are not abandoned when their projects end, but that subsequent projects can build upon what came before. Both in Europe and the US, various more or less coordinated standards efforts have existed for various resources. In the US, these issues with respect to lexicons have been taken up in a series of recent workshops under the auspices of the ACL Special Interest Group on the Lexicon, SIGLEX. Word sense disambiguation (WSD) was a central topic of discussion at the workshop on Semantic Tagging at the ANLP 1997 conference in Washington, chaired by Marc Light (Kilgarriff, 1997), which featured several working groups on polysemy and computational lexicons. This meeting led to the organization of a follow-on series, SIGLEX98-SENSEVAL and subsequent workshops (SIGLEX99), which address WSD even more directly by including evaluations of word sense disambiguation systems and in-depth discussions of the suitability of traditional dictionary entries as entries in computational lexicons. In Europe, the EAGLES standardization initiative has begun an important movement towards common formats for lexicon standardization and towards coordinated efforts to standardize other resources. Such standardization is especially critical in Europe, where multilinguality adds another dimension of complexity to natural language processing issues. The EAGLES report can be found at http://www.ilc.pi.cnr.it/EAGLES96/rep2/rep2.html. Recently, renewed interest in the semi-automated acquisition of resource information (words for lexicons, rules for grammars) has led to a new urgency for the clear and simple formulation of such standards.
Though the problem is a long way from being finally solved, the issues are being more clearly defined. In particular there is a growing awareness, especially among younger researchers, that the object is not to prove the truth or correctness of any particular theoretical approach, but rather to agree on a common format that can allow us to merge multiple sources of information. Otherwise we doom ourselves to expending vast amounts of effort in the pursuit of nothing more than duplication. The quickest way to arrive at a common format may be to make very coarse distinctions initially, and then refine the results later, an approach that was anathema some years ago. A common format is required not only so that we can share resources and communicate information between languages, but also to enable a common protocol for communicating information between different modalities.
Acknowledging two facts is the key to successful future multilingual information processing:
In principle, we consider all of the major types of information used in Language Processing, which include morphology, parts of speech, syntax, collocations, frequency of occurrence, semantics, discourse, and interpersonal and situational communicative pragmatics. Since there is no way to determine a priori which aspect plays the primary role in a given instance, all of these levels of representation could be equally relevant to a task. In this chapter, however, we focus only on the resources currently most critical for continued progress:
Naturally, in parallel to the specification of the nature of elements of each of these entities, the development of semi-automated techniques for acquisition, involving statistical modeling and efficient, novel algorithms, is crucial. These techniques are discussed in Chapter 6.
1.2 Development of Language Resources: Past, Present and Future
In this section we discuss the role of language resources in a multilingual setting, focusing on the four essentials of Language Resources (distribution, development, evaluation, and maintenance) that pertain equally to all of them. At the beginning of the 1990s, the US was at the vanguard of the production of language resources, having transformed the conclusions of the first workshop on Evaluation of Natural Language Processing Systems into DARPA's MUC, TREC, and later MT and SUMMAC evaluations. Under DARPA and other (primarily military) agency funding, the Language Processing standardization and dissemination activities included:
Since the formation of the European Language Resources Association (ELRA) in 1995, however, the leadership role has passed to Europe, which is now well ahead of the US in the recognition of the need for standardization, lexical semantics, and multilinguality. Recognizing the strategic role of Language Resources, the CEC launched a large number of projects in Europe in the last decade, many of them in the recent Language Engineering program. In this vision, the language resource activities essential for a coordinated development of the field included the development, evaluation, and distribution of core Language Resources for all EU languages that conformed to agreed-upon standards and formats. The Language Engineering projects that coherently implemented (or started to work towards the implementation of) these types of activity include:
The ever-spreading tentacles of the Internet have revived US interest in multi-lingual information processing, with a corresponding renewed interest in relevant language resources. At this point the community will be well served by a coordinated international effort that merges what has been achieved in North America, especially in the areas of evaluation, with what has been achieved in Europe, especially with respect to development and maintenance.
Development
Efficient and effective development in an area as complex as Language Processing requires close cooperation between various different research groups, as well as frequent integration of diverse components. This makes a shared platform of large-coverage language resources and basic components an absolute necessity as a common infrastructure, to ensure:
Though we address the particular needs of individual resources below, they all have an essential need for international collaborations that specifically collect existing resources and evaluation methods, integrate them into unified practical and not-too-complex frameworks, and deliver them to such bodies as LDC and ELRA. This work will not only facilitate Language Processing projects but will prove invaluable in pinpointing gaps and theoretical shortcomings in the coverage and applicability of the resources and evaluation methods.
Evaluation
The importance of evaluations to assess the current state of the art and measure progress of technologies, as discussed in Chapter 8, is evident. There is a need for an independent player to construct and manage both the data and the evaluation campaigns. However, performing evaluations has proven to be a rather difficult enterprise, and not only for technical reasons. Evaluations with high inherent overheads are often perceived as an unrewarding and possibly disruptive activity. However, in every endeavor in which an appropriate and systematic program of evaluations has evolved, marked progress has been achieved in practical language engineering terms. This phenomenon is discussed further in Chapter 6.
Despite this fact, many key players (customers and developers) have historically shown little interest in performing substantial evaluations, since they simply cannot afford the sizeable investments required. Unfortunately, the consumer reports appearing in various computer magazines lack the necessary accuracy and methodological criteria to be considered objective, valid evaluations. A further limitation is the lack of access to laboratory prototypes, so that only systems that have already been fielded are available for testing by the customer community. Furthermore, developers prefer to spend their time on development instead of on assessment, particularly if the evaluation is to be public.
As a result, the only remaining players with the requisite financial resources, infrastructure, and social clout are the funding agencies. When they are potential users of the technology they can perform in-house evaluations; examples include the Service de Traduction (translation services) of the CEC and the US Department of Defense evaluations of information retrieval, text summarization, and information extraction (TREC, SUMMAC, and MUC; see Chapters 2 and 3). They can also include evaluations as a necessary component of systems whose development they are funding, as a method of determining follow-on funding. In such a case, however, it is critical to ensure community consensus on the evaluation criteria lest the issues become clouded by the need for funds.
Developing evaluation measures for resources is even more complex than evaluating applications such as summarization and machine translation. With applications, the task to be achieved can be specified and performance measured against the desired outcome. With resources such as lexicons, however, the evaluation has to determine, in some way, how well the resource supports the functioning of the application. This can only be done if the contribution of the lexicon can be teased apart from the contribution of the other components of the application system and the performance of the system as a whole. Therefore, evaluations of resources are by necessity secondary or indirect, making them especially difficult to perform. An unfortunate result of this has been the proliferation of unused grammars, lexicons, resource acquisition tools, and word taxonomies that, with appropriate revision, could have provided valuable community resources. Constructive evaluations of resources are fundamental to their reusability.
However, there is an inherent danger in tying funding too directly to short-term evaluation schemes: it can have the unfortunate result of stifling innovation and slowing down progress. It is critical for evaluations to measure fundamental improvements in technology and not simply reward the system that has been geared (hacked) most successfully to a particular evaluation scheme. The SIGLEX workshops mentioned above provide an example of a grassroots movement to define more clearly the role of syntax, lexical semantics, and lexical co-occurrence in word sense disambiguation; as such, the movement is examining not just system performance but the very nature of word sense distinctions. The next five years should see a major shift in evaluations away from purely task-oriented evaluations and towards a hybrid evaluation approach that will further our understanding of the task while at the same time focusing on measurable results.
Distribution
As with the LDC in the US, the role of ELRA in Europe as an intermediary between producers and users of language resources greatly simplifies the distribution process by avoiding a great many unnecessary contractual arrangements and easing sales across borders. MLCC, ELRA's multilingual corpus, for example, consists of data from 6 different newspapers in 6 different languages. ELRA has signed contracts with each provider, and the user who wishes to acquire the set of databases only has to sign a single contract with ELRA. Care is taken to ensure that the language resources are clear of intellectual property rights (IPR) restrictions and are available for commercial and research licenses, with a list of key applications associated with them. (The alternative is a bureaucratic nightmare, in which each user has to sign 6 different contracts and negotiate IPR for each one, with 6 different producers, in 6 different countries, under 6 different legal systems. Having a few major distribution sites is clearly the only sane method of making this data available.)
In addition to distributing corpora, both raw and annotated, the next few years should see the addition of grammars, lexicons and ontologies as resources that could be made available through such distribution sites.
Maintenance
Many of the resources mentioned above have just been created or are still in the process of being created. Therefore the issue of maintenance has not really been addressed in either the US or in Europe, although it did provide the topic for a panel discussion at the First International Language Resources conference LREC-98. A question that has already arisen has to do with EuroWordNet, which is linked to WordNet 1.5 (because this was the version when EuroWordNet was begun), although version 1.5 has since been replaced by WordNet 1.6. How can EuroWordNet best be updated to reflect the new version?
Anyone having even the briefest acquaintance with software product cycles will expect that the maintenance of language resources will shortly become a central issue.
1.3 Types of Language Resources
1.3.1 Corpora
Before corpora are suitable for natural language processing work, it is necessary for them to be created and prepared (or "annotated"). The term "annotation" is very broadly construed at present, involving everything from identifying paragraph breaks to the addition of information that is not in any way present in the original, such as part of speech tags. In general, one can divide what is now lumped together under the term "corpus annotation" into three broad categories:
In order to enable more efficient and effective creation of corpora for NLP work, it is essential to understand the nature of each of these phases and establish mechanisms and means to accomplish each. Step (1) can be nearly fully automated, but steps (2) and (3) require more processing overhead as well as significant human intervention. In particular, we need to develop algorithms and methods for automating these two steps. This is especially true for step (2), which has received only marginal attention except in efforts such as the TREC name identification task; this work will require funding. Step (3) has received more attention, since the development of algorithms for identifying complex linguistic elements has typically been viewed as a more legitimate area of research. However, as discussed above, appropriate markups for lexical semantic information are at a very rudimentary stage of development. One of the most important directions for corpus annotation is determining a richer level of annotation that includes word senses, predicate argument structure, noun-phrase semantic categories, and coreference.
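As a rough illustration of what such a richer level of annotation might look like, the following sketch stores annotations as labeled spans over token positions, kept apart from the raw text, so that part-of-speech tags, word senses, predicate argument structure, and coreference links can each be added as an independent layer. The class names, layer names, labels, and sense identifiers are all hypothetical and do not follow any particular annotation standard.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Span:
    """A labeled annotation over tokens[start:end] in one named layer."""
    layer: str   # hypothetical layer names: "pos", "sense", "predarg", "coref"
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str

@dataclass
class AnnotatedText:
    tokens: list
    spans: list = field(default_factory=list)

    def annotate(self, layer, start, end, label):
        # Reject spans that do not lie inside the token sequence.
        if not (0 <= start < end <= len(self.tokens)):
            raise ValueError("span outside the token sequence")
        self.spans.append(Span(layer, start, end, label))

    def layer(self, name):
        """Return all spans of one layer, in text order."""
        return sorted((s for s in self.spans if s.layer == name),
                      key=lambda s: (s.start, s.end))

    def text_of(self, span):
        return " ".join(self.tokens[span.start:span.end])

# Invented example sentence with four independent annotation layers.
doc = AnnotatedText("The bank raised its rates".split())
doc.annotate("pos", 1, 2, "NN")
doc.annotate("sense", 1, 2, "bank%financial")   # hypothetical sense identifier
doc.annotate("predarg", 0, 2, "raise:ARG0")
doc.annotate("predarg", 3, 5, "raise:ARG1")
doc.annotate("coref", 0, 2, "entity-1")
doc.annotate("coref", 3, 4, "entity-1")         # "its" refers back to "The bank"
```

Because each layer is queried separately, a tool that only understands part-of-speech tags can ignore the sense and coreference layers, which is one way of keeping an annotated corpus reusable across applications.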
It is also critical that we devise means to include information about elements in a text in a way that makes the resulting texts maximally processable and reusable. In particular, it is important to ensure that the markup used to identify text elements is:
Therefore, in order to create corpora that are both maximally usable and reusable, it will be necessary to specify clearly the ways in which the corpora will be used and the capabilities of the tools that will process them. This in turn demands that effort be put into the development of annotation software, and above all, that this development be undertaken in full collaboration with developers of the software that will process this data and the users who will access it. In other words, as outlined in (Ide, 1998) there are two major requirements for advancing the creation and use of corpora in NLP:
1.3.2 Grammars
The development of powerful and accurate grammars was seen as a primary necessity for Language Processing in the 1960s and early 1970s. However, the near impossibility of building a complete grammar for any language has gradually been recognized, as has the tremendous amount of essential lexically-specific information required, such as modifier preferences and idiosyncratic expressive details. This has led to a shift in emphasis away from traditional rule-based grammars for broad-coverage applications. The systems developed for the MUC series of Information Extraction tasks (see Chapter 3) generally employed short-range Finite State matchers that provided eventual semantic-like output more quickly and reliably than purely syntax-based parsers. However, they did not produce a rich enough syntactic structure to support discourse processing such as co-reference, which imposed a limit on their overall performance. The goal being sought today is a combination of linguistic and statistical approaches that will robustly provide rich linguistic annotation of raw text.
The issues involved in developing more traditional rule-based grammar resources were thoroughly addressed in a 1996 report commissioned by the National Science Foundation (see http://www.cse.ogi.edu/CSLU/HLTsurvey/HLTsurvey.html), whose Chapter 3 covers grammars specifically. In addition, advances during the last two years have resulted in significant, measurable progress in broad-coverage parsing accuracy. Statistical learning techniques have led to the development of a new generation of accurate and robust parsers that provide very useful analyses of newspaper-style documents, and noisier but still usable analyses in other, similar domains (Charniak, 1995; Collins, 1997; Magerman and Ratnaparkhi, 1997; Srinivas, 1997). Such parsers are trained on a set of (sentence, tree) pairs, and will then output the most likely parse for a new sentence.
One advantage of statistical methods is their ability to learn the grammar of the language automatically from training examples. Thus the emphasis on human effort shifts from handcrafting a grammar to annotating a corpus of training examples. Human annotation can immediately provide coverage for phenomena outside the range of most handcrafted grammars, and the resulting corpus is a re-usable resource which can be employed in the training of increasingly accurate generations of parsers as its annotations are enriched and technology progresses. The handcrafted grammars can play an important role in the bootstrapping of appropriate grammatical structure, as illustrated by the role Fidditch (Hindle, 1983) played in the development of the Penn TreeBank (Marcus, 1993), and by the success of the Supertagger (Joshi and Srinivas, 1994; Srinivas, 1997), developed from corpora to which XTAG parses had been assigned (XTAG, 1995).
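The idea of training on (sentence, tree) pairs and then choosing the most likely parse can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the method of any of the parsers cited above: it estimates a plain, unlexicalized probabilistic context-free grammar by relative frequency from a tiny invented treebank, and then picks the more probable of two hand-supplied candidate trees rather than searching the space of parses. The tree encoding, function names, and toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import prod

def rules(tree):
    """Yield (lhs, rhs) for every node of a tree.

    A tree is a tuple (label, child, ...); a child is a subtree or a word (str).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def train(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for _, tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {r: n / lhs_counts[r[0]] for r, n in rule_counts.items()}

def tree_prob(tree, model):
    """Probability of a tree: product of its rule probabilities (0 if unseen)."""
    return prod(model.get(r, 0.0) for r in rules(tree))

def best_parse(candidates, model):
    """Return the candidate tree with the highest probability under the model."""
    return max(candidates, key=lambda t: tree_prob(t, model))

# Invented toy treebank of (sentence, tree) pairs.
SAW_STARS = ("VP", ("V", "saw"), ("NP", "stars"))

def pp(noun):
    return ("PP", ("P", "with"), ("NP", noun))

TREEBANK = [
    ("she saw stars", ("S", ("NP", "she"), SAW_STARS)),
    ("she saw stars with telescopes",
     ("S", ("NP", "she"), ("VP", SAW_STARS, pp("telescopes")))),
    ("he saw stars with rings",
     ("S", ("NP", "he"),
      ("VP", ("V", "saw"), ("NP", ("NP", "stars"), pp("rings"))))),
]

# Two candidate analyses of the new sentence "he saw stars with telescopes".
VP_ATTACH = ("S", ("NP", "he"), ("VP", SAW_STARS, pp("telescopes")))
NP_ATTACH = ("S", ("NP", "he"),
             ("VP", ("V", "saw"), ("NP", ("NP", "stars"), pp("telescopes"))))

model = train(TREEBANK)
chosen = best_parse([NP_ATTACH, VP_ATTACH], model)
```

On this toy data the two candidates differ only in whether the prepositional phrase attaches to the verb phrase or to the noun phrase, so the choice comes down to the estimated probabilities of those two rules. A tree containing any rule or word never seen in training receives probability zero, which is one reason practical systems need smoothing and much larger annotated corpora.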
The next major advance has to come from a closer integration of syntax and lexical semantics, namely, the ability to train these parsers to recognize not just syntactic structures, but structures that are rich with semantic content as well (Hermjakob and Mooney, 1997). In the same way that the existence of the Penn TreeBank enabled the development of extremely powerful new syntactic analysis methods, moving to the stage of lexical semantics will require a correspondingly richer level of annotation that includes word senses, predicate argument structure, noun-phrase semantic categories and coreference.
In order both to produce such a resource and, perhaps more importantly, to utilize it effectively, we need to team our parsing technologies more closely with lexical resources. This is an important part of the motivation behind lexicalized grammars such as TAG (Joshi, Levy and Takahashi, 1975; Joshi, 1985) and CCG (Steedman, 1996). Tightly interwoven syntactic and semantic processing can provide the levels of accuracy that are required to support discourse analysis and inference and reasoning, which form the foundation of any natural language processing application. This has important implications for the future directions of both corpora and lexicons as resources, as well as ontologies.
1.3.3 Lexicons
Lexicons are the heart of any natural language processing system. They include the vocabulary that the system can handle, both individual lexical items and multi-word phrases, with associated morphological, syntactic, semantic and pragmatic information. In the case of spoken language systems, they also include pronunciation and phonological information. In machine translation systems, the bilingual and multilingual lexicons provide the basis for mapping from the source language to the target language. The EAGLES report on monolingual lexicons in several languages, http://www.ilc.pi.cnr.it/EAGLES96/rep2/rep2.html, gives a comprehensive description of how morphological and syntactic information should be encoded. Available on-line lexicons for English such as Comlex (Grishman, et al, 1994) and XTAG to a large degree satisfy these guidelines, as do the SIMPLE lexicons being built in Europe for the other languages. The EAGLES working group on Lexical Semantics is preparing guidelines for encoding of semantic information.
However, to date these guidelines have not addressed the issue of making sense distinctions. How does the lexicon creator decide whether to make one, two or more separate entries for the same lexeme? An issue of major concern is the current proliferation of different English lexicons in the computational linguistics community. Several on-line lexical resources in use make sense distinctions, including Longman's, Oxford University Press (OUP), Cambridge University Press (CUP), Webster's, and WordNet, to name just a few, and they each use very different approaches. In SENSEVAL, the training data and test data were prepared using a set of OUP senses. In order to allow systems using WordNet to compete as well, a mapping from the OUP senses to the WordNet senses was made. The WordNet system builders commented that "OUP and WordNet carve up the world in different ways. It's possible that WordNet is more fine-grained in some instances, but in the map for the words in SENSEVAL, the OUP grain was generally finer (about 240 WN entries for the SENSEVAL words and about 420 OUP entries.) More than anything, the grain is not necessarily uniform -- not within WordNet, not within OUP." This is true of dictionaries in general. They make different decisions about how to structure entries for the same words, decisions which are all equally valid, but simply not compatible.
Considerable concern was expressed, both at the workshop and afterwards, that this makes it impossible to create performance-preserving mappings between dictionaries.
This is an incompatibility with consequences that are far more widespread than the comparison of word sense disambiguation systems. Sense inventories, or lexicons, as the core of an information processing application, are critical as well as being one of the most labor-intensive components. Many existing natural language processing applications are described as domain-specific, and this primarily describes the lexicon being used, which contains the domain-specific senses for the vocabulary that is relevant to that application. Because of this incompatibility, it is very unlikely that lexicons from two different applications could be readily merged to create a new system with greater range and flexibility. The task of merging the lexicons could be just as labor-intensive as the task of building them in the first place. Even more sweeping is the impact on multilingual information processing. All multilingual tasks require bilingual lexicons that make the mapping from English to French or German or Japanese. Many of these bilingual lexicons are currently being built, but they all map to different English lexicons which are themselves incompatible. The problem of merging two different domain-specific English to French bilingual lexicons is an order of magnitude larger than the problem of merging two English domain-specific lexicons. Integrating a bilingual lexicon involving a third language, such as Korean, that was mapped to yet another incompatible English lexicon then requires that the work be done all over again. The sooner we can regularize our representation of English computational lexicons, the less work we will have to do in the future.
Regularizing the English computational lexicon is not a trivial task. Creating a consensus on grammatical structure for the TreeBank required posting guidelines that described literally hundreds of distinct grammatical structures. Where lexical entries are concerned the numbers are in the hundreds of thousands. The first step is simply agreeing on criteria for deciding when two different usages should be considered separate senses and when they should not, and on whether that decision should be allowed to change depending on the context. Once these general principles have been determined, the business of revising one of the existing on-line lexicons, preferably WordNet since it is the most widely used, can begin. Only when the criteria for sense distinctions have been agreed upon can we create reliable sense-tagged corpora for machine learning purposes, and move our information processing systems on to the next critical stage.
Lexicon Development
There is increased recognition of the vital role played by lexicons (word lists with associated information) when fine-tuning general systems to particular domains.
Due to the extremely fluid and ever-changing nature of language, lexicon development poses an especially difficult challenge. No static resource can ever be adequate. In addition, even once large-scale generic lexicons with different layers of encoded information (morphological, syntactic, semantic, etc.) are created, they will still need to be fine-tuned for use in specific applications.
Generic and domain-specific lexicons are mutually interdependent. This makes it vital, for any sound lexicon development strategy, to accompany core static lexicons with dynamic means for enriching and integrating them, possibly on the fly, with many types of information. This global view eliminates the apparent dichotomy between static and dynamically built (or incremental) resources, encompassing the two approaches in a more comprehensive perspective that sees them as complementary and equally necessary facets of the same problem. In the past few years, steps towards this objective have been taken by a substantial number of groups all over the world, with many varied research and development efforts aimed at acquiring linguistic and, more specifically, lexical information from corpora. Among the EC projects working in this direction we mention LE SPARKLE (combining shallow parsing and lexical acquisition techniques capable of learning aspects of word knowledge needed for LE applications) and LE ECRAN.
Gaps in Static Lexicons
As Gross stated clearly as early as the 1970s (Gross 1984), most existing lexicons contain simple words, while actually occurring texts such as newspapers are composed predominantly of multi-word phrases. The phrasal nature of the lexicon has nevertheless still not been addressed properly, and this is a major limitation of available resources. Correcting this will require corpora to play a major role, but also methodologies of extraction and linguistic methods of classification.
As mentioned above, resources for evaluation and the evaluation of resources remain major open problems in lexicon development, validation, and reuse.
While large morphosyntactically annotated corpora exist for many European languages (built, for example, in MULTEXT and, for all the EU languages, in PAROLE), and the production of large syntactically annotated corpora has started for some EU languages, semantically tagged corpora do not yet exist. Such corpora are rapidly becoming a major requirement for developing application-specific tools.
Critical Priorities in Lexicon Development
Computational lexicons, like human dictionaries, often represent a sort of stereotypical/theoretical language. Carefully constructed or selected large corpora are essential sources of linguistic knowledge for the extensive description of the concrete use of the language in real text. To be habitable and practical, a computational lexicon has to faithfully represent the apparently irregular facts (evidenced by corpus analysis), and the divergences of actual usage from what is potentially, or in theory, acceptable. We need to clearly represent, and separate, what is allowed but only very rarely instantiated from what is both allowed and actually used. To this end, more robust and flexible tools are needed for (semi-)automatic induction of linguistic knowledge from texts. This usually implies a bootstrapping method, because extraction presupposes some capability of automatically analyzing the raw text in various ways, which first requires a lexicon. The induction phase must however be followed by a linguistic analysis and classification phase, if the induced data is to be used and merged together with already available resources. Therefore:
The EC-funded projects provide an excellent framework for facilitating these types of interactions, by providing the necessary funding for combining the efforts of different and complementary groups. This complementarity of existing competence should continue to be sought and carefully planned.
1.3.4 Ontologies
Background
As described in Chapters 2, 3, and 4, semantic information is central in improving the performance of Language Processing systems. Lexical semantic information such as semantic class constraints, thematic roles, and lexical classifications needs to be closely coupled to the semantic frameworks used for language processing. Increasingly, such information is represented and stored in so-called ontologies.
An ontology can be viewed as an inventory of concepts, organized under some internal structuring principle. Ontologies go back to Aristotle; more recently (in 1852), Peter Mark Roget published his Thesaurus of English Words and Phrases Classified and Arranged so as to Facilitate the Expression of Ideas and Assist in Literary Composition. The organization of the words in a thesaurus follows the organization of the concepts that the words express and not vice versa, as in a dictionary; a thesaurus can therefore be considered to be an ontology. Roget's thesaurus has been revised (Chapman, 1977), but not significantly altered. However, computational purposes require a consistently structured ontology suitable for automatic processing, which goes above and beyond what Roget provides.
The "set of concepts" definition, however, raises the notoriously difficult question: What is a concept? Over the past decade, two principal schools of thought have emerged on this question. Researchers in Language Processing circles have typically simplified the answer by equating "concept" with "lexicalized concept", i.e., a concept that is expressed by one or more words of a language. (The assumption that more than one word may refer to the same concept reflects the familiar phenomenon of synonymy.) Under this view, an ontology is the inventory of word senses of a language, that is, its semantic lexicon. This definition has the advantage that it contains only those concepts that are shared by a linguistic community. It excludes possible concepts like "my third cousin's black cat", which are idiosyncratic to a given speaker and of no interest to psychologists, philosophers, linguists, etc. Relating one's ontology to the lexicon also excludes potential concepts expressible by ad-hoc compounds like "paper clip container", which can be generated on the fly but are not part of the core inventory, as their absence from dictionaries shows. Moreover, we avoid the need to define words by limiting our inventory to those strings found in standard lexical reference works. Thus the Language Processing ontologies that have been built resemble Roget's thesaurus in that they express the relationships among concepts at the granularity of words. Within Artificial Intelligence (AI), in contrast, a concept has roughly been identified with some abstract notion that facilitates reasoning (ideally, by a system and not just by the ontology builder), and the ontologies that have been built have also been called Domain Models or Knowledge Bases. To differentiate the two styles, the former are often referred to as terminological ontologies (or even just term taxonomies), while the latter are sometimes called conceptual or axiomatized ontologies.
The purpose of terminological ontologies is to support Language Processing. Typically, the content of these ontologies is relatively meager, with only a handful of relationships on average between any given concept and all the others. Neither the concepts nor the inter-concept relationships are formally defined; they are typically differentiated only by name and possibly by a textual definition. The core structuring relationship is usually called is-a and expresses the rough notion of "a kind of" or conceptual generalization. Very often, to support the wide range of language, terminological ontologies contain over 100,000 entities, and tend to be linked to lexicons of one or more languages that provide the words expressing the concepts. The best-known example of a terminological ontology is WordNet (Miller, 1990; Fellbaum, 1998), which as an on-line reference resource has had a major impact on the ability of researchers to conceive of different semantic processing techniques. However, before the collection of truly representative large-scale sets of semantic senses can begin, the field has to develop a clear consensus on guidelines for computational lexicons. Attempts are indeed being made, including (Melcuk, 1988; Pustejovsky, 1995; Nirenburg et al., 1992; Copestake and Sanfilippo, 1993; Lowe et al., 1997; Dorr, 1997; Palmer, 1998). Other terminological ontologies are Mikrokosmos (Viegas et al., 1996), used for machine translation, and SENSUS (Knight and Luk, 1994; Hovy, 1998), used for machine translation of several languages, text summarization, and text generation.
In contrast, the conceptual ontologies of AI are built to support logic-based inference, and often include substantial amounts of world knowledge in addition to lexical knowledge. Thus the content of each concept is usually richer, involving some dozens or even more axioms relating a concept to others (for example, a car has-part wheels, the usual-number of wheels being 4, the wheels enabling motion, and so on). Often, conceptual ontologies contain candidates for concepts for which no word exists, such as PartiallyTemporalAndPartiallySpatialThing. Recent conceptual ontologies reflect growing understanding that two core structuring relationships are necessary to express logical differences in generalization, and that concepts exhibit various facets (structural, functional, meronymic, material, social, and so on). Thus a glass, under the material facet, is a lot of glass matter; under the meronymic facet, it is a configuration of stem, foot, and bowl; under the functional facet, it is a container from which one can drink and through which one can see; under one social facet, it is the object that the bridegroom crushes at a wedding; see (Guarino, 1997). Given the complex analysis required to build such models, and the interrelationships among concepts, conceptual ontologies tend to number between 2,000 and 5,000 entities. The largest conceptual ontology, CYC (Lenat and Guha, 1995) contains approx. 40,000 concepts; every other conceptual ontology is an order of magnitude smaller. (In contrast, as mentioned above, WordNet has roughly 100,000 concepts.) Unfortunately, given the complexity of these ontologies, internal logical consistency is an ongoing and serious problem.
Ontologies contain the semantic information that enables Language Processing systems to deliver higher quality performance. They help with a large variety of tasks, including word sense disambiguation (in "he picked up the bench", "bench" cannot stand for judiciary because it is an abstraction), phrase attachment (in "he saw the man with the telescope", it is more likely that the telescope was used to see the man than that it is something uniquely associated with the man, because it is an instrument for looking with), and machine translation (as an inventory of the symbols via which words in different languages can be associated). The obvious need for ontologies, coupled with the current lack of many large examples, leads to the vexing question of exactly how to build useful multi-purpose ontologies.
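The word sense disambiguation example above can be sketched in a few lines of Python. The miniature is-a taxonomy, the sense labels, and the selectional constraint on the object of "pick up" are all hypothetical, invented for illustration only; they are not drawn from WordNet or any other resource named in this chapter.

```python
# Hypothetical miniature is-a taxonomy: child concept -> parent concept.
IS_A = {
    "bench_furniture": "furniture",
    "furniture": "physical_object",
    "bench_judiciary": "institution",
    "institution": "abstraction",
    "physical_object": "entity",
    "abstraction": "entity",
}

# Hypothetical sense inventory and selectional constraint (assumptions).
SENSES = {"bench": ["bench_furniture", "bench_judiciary"]}
OBJECT_CONSTRAINT = {"pick_up": "physical_object"}

def ancestors(concept):
    """Return the concept followed by all of its is-a ancestors."""
    chain = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def compatible_senses(verb, noun):
    """Keep the noun senses that fall under the class the verb requires of its object."""
    required = OBJECT_CONSTRAINT[verb]
    return [sense for sense in SENSES[noun] if required in ancestors(sense)]
```

Here `compatible_senses("pick_up", "bench")` keeps only the furniture sense, because the judiciary sense is classified under the abstraction branch of the taxonomy. A realistic system would treat such constraints as preferences rather than hard filters.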
Unfortunately, ontologies are difficult and expensive to build. To be useful, they have to be large and comprehensive. Therefore, the more an ontology can be shared by multiple applications, the more useful it is. However, it is not so much a matter of designing the right ontology (an almost meaningless statement, given our current lack of understanding of semantics), but of having a reasonable one that can serve compelling purposes, and on which some consensus between different groups can be reached. In this light, creating a consensus ontology becomes a worthwhile enterprise; indeed, this is precisely the goal of the ANSI group on Ontology Standards (Hovy, 1998), and is a critical task for the EAGLES Lexicon/Semantics Working Group. Initiatives of this kind must converge and act in synergy to be fruitful for the Language Processing community.
Open Questions in Language Processing Ontologies
WordNet (Miller, 1995; Fellbaum, 1998) is a lexical database organized around lexicalized concepts or synonym sets. Unlike Roget's largely intuitive design, WordNet was originally motivated by psycholinguistic models of human semantic memory and knowledge representation. Supported by data from word association norms, WordNet links together its synonym sets (lexicalized concepts) by means of a small number of conceptual-semantic and lexical relations. The most important ones are hyponymy (the subclass-superclass relation) and meronymy (the part-whole relation) for concepts expressible by nouns, antonymy for adjectives, and several entailment relations for verbs (Miller, 1990; Fellbaum, 1998). Whereas WordNet is entirely hand-constructed, (Amsler, 1980) and (Chodorow et al., 1985) were among those who tried to extract hyponymically related words automatically from machine-readable dictionaries by exploiting their implicit structure. (Hearst, 1998) proposed to find semantically related words by finding specific phrase patterns in texts.
SENSUS (Knight and Luk, 1994; Hovy, 1998) is a derivative of WordNet that seeks to make it more amenable to the tasks of machine translation and text generation. The necessary alterations required the inclusion of a whole new top level of approx. 300 high-level abstractions of English syntax called the Upper Model (Bateman et al., 1989), as well as a concomitant retaxonomization of WordNet (separated into approx. 100 parts) under this top level. To enable machine translation, SENSUS concepts act as pivots between the words of different languages; its concepts are linked to lexicons of Japanese, Spanish, Arabic, and English.
Mikrokosmos (Viegas et al., 1996; Mahesh 1995) is an ontology of approx. 5,000 high-level abstractions, out of which lexical items are defined for a variety of languages, also in the task of machine translation.
The efforts to design and build these and other ontologies have all encountered the same major difficulties. First among these is the identification of the concepts. The top-level concepts in particular remain a source of controversy, because these very abstract notions are not always well lexicalized and can often be referred to only by phrases such as "causal agent" and "physical object". Moreover, concepts fall into distinct classes, expressible by different parts of speech: entities are referred to by nouns; functions, activities, events, and states tend to be expressed by verbs; and attributes and properties are lexicalized by adjectives. But some concepts do not follow this neat classification. Phrases and chunks such as "won't hear of it" and "the X-er the Y-er" (Fillmore, 1988; Jackendoff, 1995) arguably express specific concepts, but they cannot always be categorized either lexically or in terms of high-level concepts.
Second, the internal structure of proposed ontologies is controversial. WordNet relates all synonym sets by means of about a dozen semantic relations; (Melcuk, 1988) proposes over fifty. There is little solid evidence for the set of all and only useful relations, and intuition invariably comes into play. Moreover, it is difficult to avoid the inherent polysemy of semantic relations. For example, (Chaffin et al., 1988) analyzed the many different kinds of meronymy, and similar studies could be undertaken for hyponymy and antonymy (Cruse, 1986). Another problem is the fact that semantic relations like antonymy, entailment, meronymy, and class inclusion are themselves concepts, raising the question of circularity.
Two major approaches currently exist concerning the structure of ontologies. One approach identifies all elemental concepts as factors, and then uses concept lattices to represent all factor combinations under which concepts can be taxonomized (Wille, 1992). The more common approach is to taxonomize concepts using the concept generalization relation as structural principle. While the debate concerning the relative merits of both approaches continues, only the taxonomic approach has been empirically validated with the construction of ontologies containing over 10,000 concepts.
Third, a recurrent problem relates to the question of multiple inheritance. For example, a dog can be both an animal and a pet. How should this dual relation be represented in an ontology? WordNet and SENSUS treat both dog and pet as kinds of animals and ignore the type-role distinction, because it seems impossible to construct full hierarchies from role or function concepts such as pet. But clearly, there is a difference between these two kinds of concepts. Casting this problem as one of conceptual facets, Pustejovsky (1995) proposed a solution by creating a lexicon with underspecified entries such as newspaper together with structured semantic information about the underlying concept. Depending on the context in which the word occurs, some of its semantic aspects are foregrounded whereas others are not needed for interpreting the context, e.g., the building vs. the institution aspects of newspaper. Guarino (1997) takes this approach a step further, and identifies at least 8 so-called Identity Criteria that each express a different facet of conceptual identity. Such approaches may well offer a satisfactory solution for the representation of the meaning of complex concepts.
Despite their quasi-semantic nature, ontologies based on lexicons do not map readily across languages. It is usually necessary to find shared concepts underlying the lexical classifications in order to facilitate multilingual mappings. Currently, the EuroWordNet project is building lexical databases in eight European languages patterned after WordNet but with several important enhancements (Vossen, et al., 1999). EuroWordNet reveals crosslinguistic lexicalization patterns of concepts. Its interlingual index is the union of all lexicalized concepts in the eight languages, and permits one to examine which concepts are expressed in all languages and which ones are matched with a word in only a subset of the languages, an important perspective for ontology theoreticians. Multilingual applications are always a good test of one's theories. EuroWordNet is testing the validity of the original WordNet and the way it structures the concepts lexicalized in English. Crosslinguistic matching reveals lexical gaps in individual languages, as well as concepts that are particular to one language only. Eventually, an inspection of the lexicalized concepts shared by all eight member languages should be of interest, as well as the union of the concepts of all languages. Similar data should be available from the Mikrokosmos project. To yield a clear picture, ontologies from as wide a variety of languages as possible should be compared, and the coverage should be comparable for all languages.
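The idea of an interlingual index as a union of lexicalized concepts, with shared concepts and lexical gaps falling out as set operations, can be illustrated with a small Python sketch. The three-language toy lexicons and the concept identifiers are simplified assumptions made for this illustration; they do not reproduce EuroWordNet's actual data or data structures.

```python
# Simplified toy data (an assumption): language -> {concept identifier: word}.
LEXICONS = {
    "English": {"DOG": "dog", "FINGER": "finger", "TOE": "toe"},
    "Dutch": {"DOG": "hond", "FINGER": "vinger", "TOE": "teen"},
    "Spanish": {"DOG": "perro", "FINGER": "dedo"},
}

def interlingual_index(lexicons):
    """Union of all concepts lexicalized in at least one language."""
    return set().union(*(lex.keys() for lex in lexicons.values()))

def shared_concepts(lexicons):
    """Concepts lexicalized in every language."""
    return set.intersection(*(set(lex) for lex in lexicons.values()))

def lexical_gaps(lexicons):
    """For each language, the indexed concepts it has no word for."""
    index = interlingual_index(lexicons)
    return {lang: index - set(lex) for lang, lex in lexicons.items()}
```

With this toy data the index contains three concepts, two of which are shared by all three languages, and `lexical_gaps(LEXICONS)["Spanish"]` reports `{"TOE"}` as a gap.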
The Special Challenge of Verbs
It is not surprising that WordNet's noun classification has been used more successfully than the verb classification, or that the majority of the entries in the Generative Lexicon are nouns. By their very nature verbs involve multiple relationships among many participants which can themselves be complex predicates. Classifying and comparing such rich representations is especially difficult.
An encouraging recent development in linguistics provides verb classifications that have a more semantic orientation (Levin, 1993; Rappaport Hovav and Levin, 1998). These classes, and refinements on them (Dang et al., 1998; Dorr, 1997), provide the key to making generalizations about regular extensions of verb meanings, which is critical to building the bridge between syntax and semantics. Based on these results, a distributional analysis of properly disambiguated syntactic frames should provide critical information regarding a verb's semantic classification, as is currently being explored by (Stevenson & Merlo, 1997). This could make it possible to use the syntactic frames occurring with particular lexical items in large parsed corpora to automatically form clusters that are both semantically and syntactically coherent. This is our doorway, not just to richer computational lexicons, but to a methodology for building ontologies. The more we can rely on semi-automated and automated methods for building classifications, even those tailored to specific domains, the more objective these classifications will be, and the more reliably they will port to other languages.
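A distributional analysis of syntactic frames can be sketched as follows: each verb is represented by the counts of the frames it occurs with, and verbs are compared by cosine similarity. The frame labels and all counts below are invented for illustration, and cosine similarity with a nearest-neighbour lookup is a stand-in for a real clustering method; it is not the procedure of any work cited here.

```python
import math

# Hypothetical frame counts per verb, as might be read off a parsed corpus.
FRAME_COUNTS = {
    "break": {"NP V NP": 40, "NP V": 35, "NP V NP PP-with": 25},
    "shatter": {"NP V NP": 30, "NP V": 30, "NP V NP PP-with": 10},
    "give": {"NP V NP NP": 50, "NP V NP PP-to": 45, "NP V NP": 5},
    "send": {"NP V NP NP": 30, "NP V NP PP-to": 60, "NP V NP": 10},
}

def cosine(a, b):
    """Cosine similarity between two sparse frame-count vectors."""
    dot = sum(count * b.get(frame, 0) for frame, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(verb, counts):
    """Return the other verb whose frame distribution is most similar."""
    others = (v for v in counts if v != verb)
    return max(others, key=lambda v: cosine(counts[verb], counts[v]))
```

On this toy data the verbs showing the transitive/intransitive alternation pair up with each other, as do the verbs showing the double-object and prepositional dative frames, which is the kind of syntactically and semantically coherent grouping the paragraph above envisages.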
Recent encouraging results in the application of statistical techniques to lexical semantics lend credence to this notion. There have been surprising breakthroughs in the use of lexical resources for semantic analysis in the areas of homonym disambiguation (Yarowsky, 1995) and prepositional phrase attachment (Stetina and Nagao, 1997). There are also new clustering algorithms that create word classes that correspond to linguistic concepts or that aid in language modeling tasks (Resnik, 1993; Lee et al., 1997). New research projects exploring the application of linguistic theories to verb representation, notably FRAMENET (Lowe, et al., 1997) and VERBNET (Dang et al., 1998), promise to advance our understanding of computational lexicons. The next few years should bring dramatic changes to our ability to use and represent lexical semantics.
The Future
Ontologies are no longer of interest to philosophers only, but also to linguists, computer scientists, and people working in information and library sciences. Creating an ontology is an attempt to represent human knowledge in a structured way. As more and more knowledge, expressed by words and documents, is available to larger numbers of people, it needs to be made accessible easily and quickly. Ontologies permit one to efficiently store and retrieve great amounts of data by imposing a classification and structure on the knowledge in these data.
There is only one way in which progress in this difficult and important question can be made effectively. Instead of re-building ontologies anew for each domain and each application, the existing ontologies must be pooled, converted to the same notation, and cross-indexed, and one or more common, standardized, and maximally extensive ontologies should be created. The semi-automated cross-ontology alignment work reported in (Knight and Luk, 1994; Agirre et al., 1994; Rigau and Agirre, 1995; Hovy, 1996; Hovy, 1998) illustrates the extent to which techniques can be developed to exploit the ontology structure, concept names, and concept definitions.
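To make concrete how concept names and concept definitions might be exploited for alignment, the following Python sketch scores candidate concept pairs by name identity and definition word overlap. The two toy ontologies, the equal weighting, and the threshold are assumptions made for illustration; this is not the algorithm of any of the works cited above, and it ignores the ontology structure entirely.

```python
# Hypothetical toy ontologies: concept id -> name and textual definition.
ONTO_A = {
    "A1": {"name": "bank", "definition": "a financial institution that accepts deposits"},
    "A2": {"name": "bank", "definition": "sloping land beside a river"},
}
ONTO_B = {
    "B1": {"name": "bank", "definition": "land alongside a river or lake"},
    "B2": {"name": "bank", "definition": "an institution that accepts deposits and lends money"},
}

def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())

def score(concept_a, concept_b):
    """Equal-weight mix of name identity and Jaccard overlap of definitions."""
    name_score = 1.0 if concept_a["name"].lower() == concept_b["name"].lower() else 0.0
    words_a = tokens(concept_a["definition"])
    words_b = tokens(concept_b["definition"])
    union = words_a | words_b
    overlap = len(words_a & words_b) / len(union) if union else 0.0
    return 0.5 * name_score + 0.5 * overlap

def align(onto_a, onto_b, threshold=0.3):
    """Pair each concept in onto_a with its best-scoring concept in onto_b."""
    pairs = []
    for id_a, concept_a in onto_a.items():
        best = max(onto_b, key=lambda id_b: score(concept_a, onto_b[id_b]))
        if score(concept_a, onto_b[best]) >= threshold:
            pairs.append((id_a, best))
    return pairs
```

Because all four toy concepts share the name "bank", the name score alone cannot separate them; the definition overlap is what pairs the financial senses with each other and the riverside senses with each other.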
If this goal, shared by a number of enterprises, including the ANSI Ad Hoc Committee on Ontology Standardization (Hovy, 1998), can indeed be realized, it will constitute a significant advance for semantic-based Language Processing.
1.4 Conclusion
The questions raised here are likely to continue to challenge us in the near future; after all, ontologies have occupied people's minds for over 2,500 years. Progress and understanding are likely to come not from mere speculation and theorizing, but from the construction of realistically sized models such as WordNet, CYC (Lenat and Guha, 1990), Mikrokosmos (Viegas et al., 1996; Mahesh, 1996), and SENSUS, the ISI multilingual ontology (Knight and Luk, 1994; Hovy, 1998).
The next major technical advance is almost certain to come from a closer integration of syntax and lexical semantics, most probably via the ability to train statistical parsers to recognize not just syntactic structures, but structures that are rich with semantic content as well. In the same way that the existence of the Penn TreeBank enabled the development of extremely powerful new syntactic analysis methods, the development of a large resource of lexical semantics (either in the form of an ontology or a semantic lexicon) will facilitate a whole new level of processing. Construction of such a semantic resource requires corpora with a correspondingly richer level of annotation. These annotations must include word senses, predicate argument structure, noun-phrase semantic categories and coreference, and must be supported by multilingual lexicons rich in semantic structure that are coupled to multilingual ontologies. Tightly interwoven syntactic and semantic processing can provide the levels of accuracy that are required to support discourse analysis and inference and reasoning, the foundation of any natural language processing application.
The thesis of this chapter is the recognition of the essential role that language resources play in the infrastructure of Language Processing, as the necessary common platform on which new technologies and applications must be based. In order to avoid massive and wasteful duplication of effort, public funding, at least partial, of language resource development is critical to ensure public availability (although not necessarily at no cost). A prerequisite to such a publicly funded effort is careful consideration of the needs of the community, in particular the needs of industry. In a multilingual setting such as today's global economy, the need for standards is even stronger. In addition to the other motivations for designing common guidelines, there is the need for common specifications so that compatible and harmonized resources for different languages can be built. Finally, clearly defined and agreed-upon standards and evaluations will encourage the widespread adoption of resources, and the more they are used the greater the possibility that the user community will be willing to contribute to further maintenance and development.
1.5 References
Agirre, E., X. Arregi, X. Artola, A. Diaz de Ilarazza, K. Sarasola. 1994. Conceptual Distance and Automatic Spelling Correction. Proceedings of the Workshop on Computational Linguistics for Speech and Handwriting Recognition. Leeds, England.
Amsler, R.A. 1980. The Structure of the Merriam-Webster Pocket Dictionary. Ph.D. dissertation in Computer Science, University of Texas, Austin, TX.
Bateman, J.A., R.T. Kasper, J.D. Moore, and R.A. Whitney. 1989. A General Organization of Knowledge for Natural Language Processing: The Penman Upper Model. Unpublished research report, USC/Information Sciences Institute, Marina del Rey, CA.
Chaffin, R., D.J. Herrmann, and M. Winston. 1988. A taxonomy of part-whole relations: Effects of part-whole relation type on relation naming and relations identification. Cognition and Language 3 (1-32).
Chapman, R. 1977. Roget's International Thesaurus, Fourth Edition. New York: Harper and Row.
Charniak, E. 1995. Parsing with Context-Free Grammars and Word Statistics. Technical Report: CS-95-28, Brown University.
Chodorow, M., R. Byrd, and G. Heidorn. 1985. Extracting semantic hierarchies from a large on-line dictionary. Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics (299-304).
Collins, M. 1997. Three generative, lexicalised models for statistical parsing. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. Madrid, Spain.
Copestake, A. and A. Sanfilippo. 1993. Multilingual lexical representation. Proceedings of the AAAI Spring Symposium: Building Lexicons for Machine Translation. Stanford University, California.
Cruse, D.A. 1986. Lexical Semantics. Cambridge: Cambridge University Press.
Dang, H., K. Kipper, M. Palmer, and J. Rosenzweig. 1998. Investigating regular sense extensions based on intersective Levin classes. Proceedings of ACL98. Montreal, Canada.
Dorr, B. 1997. Large-scale dictionary construction for foreign language tutoring and interlingual machine translation. Machine Translation 12 (1-55).
Fellbaum, C. 1998. (ed.) WordNet: An On-Line Lexical Database and Some of its Applications. Cambridge, MA: MIT Press.
Fillmore, C., P. Kay, and C. O'Connor. 1988. Regularity and idiomaticity in grammatical construction. Language 64 (501-568).
Grishman, R., Macleod C., and Meyers, A. 1994. Comlex Syntax: Building a Computational Lexicon, Proc. 15th Int'l Conf. Computational Linguistics (COLING 94), Kyoto, Japan, August.
Gross, M. 1984. Lexicon-Grammar and the Syntactic Analysis of French. Proceedings of the 10th International Conference on Computational Linguistics (COLING'84), Stanford, California.
Guarino, N. 1997. Some Organizing Principles for a Unified Top-Level Ontology. New version of paper presented at AAAI Spring Symposium on Ontological Engineering, Stanford University, March 1997.
Hearst, M. 1998. Automatic Discovery of WordNet Relations. In C. Fellbaum (ed), WordNet: An On-Line Lexical Database and Some of its Applications (131-151). Cambridge, MA: MIT Press.
Hermjakob, U. and R.J. Mooney. 1997. Learning Parse and Translation Decisions from Examples with Rich Context. Proceedings of the ACL/EACL Conference. Madrid, Spain (482-487).
Hindle, D. 1983. User manual for Fidditch. Technical memorandum 7590-142, Naval Research Laboratory.
Hovy, E.H. 1996. Semi-Automated Alignment of Top Regions of SENSUS and CYC. Presented to ANSI Ad Hoc Committee on Ontology Standardization. Stanford University, Palo Alto, September 1996.
Hovy, E.H. 1998. Combining and Standardizing Large-Scale, Practical Ontologies for Machine Translation and Other Uses. Proceedings of the First International Conference on Language Resources and Evaluation (LREC). Granada, Spain.
Jackendoff, R. 1995. The Boundaries of the Lexicon. In M. Everaert, E.J. van den Linden, A. Schenk, and R. Schreuder, (eds), Idioms: Structural and Psychological Perspectives. Hillsdale, NJ: Erlbaum Associates.
Joshi, A.K. 1985. Tree Adjoining Grammars: How much context Sensitivity is required to provide a reasonable structural description. In D. Dowty, L. Karttunen, and A. Zwicky (eds), Natural Language Parsing (206-250). Cambridge: Cambridge University Press.
Joshi, A., L. Levy, and M. Takahashi. 1975. Tree Adjunct Grammars. Journal of Computer and System Sciences.
Joshi, A.K. and B. Srinivas. 1994. Disambiguation of Super Parts of Speech (or Supertags): Almost Parsing, Proceedings of the 17th International Conference on Computational Linguistics (COLING-94). Kyoto, Japan.
Kilgarriff, A. 1997. Evaluating word sense disambiguation programs: Progress report. Proceedings of the SALT Workshop on Evaluation in Speech and Language Technology. Sheffield, U.K.
Knight, K. and S.K. Luk. 1994. Building a Large-Scale Knowledge Base for Machine Translation. In Proceedings of the AAAI Conference.
Lenat, D.B. and R.V. Guha. 1990. Building Large Knowledge-Based Systems. Reading: Addison-Wesley.
Lowe, J.B., C.F. Baker, and C.J. Fillmore. 1997. A frame-semantic approach to semantic annotation. Proceedings 1997 Siglex Workshop, ANLP97. Washington, D.C.
Mahesh, K. 1996. Ontology Development for Machine Translation: Ideology and Methodology. New Mexico State University CRL report MCCS-96-292.
Marcus, M., B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics Journal, Vol. 19.
Melcuk, I. 1988. Semantic description of lexical units in an explanatory combinatorial dictionary: Basic principles and heuristic criteria. International Journal of Lexicography (165-188).
Miller, G.A. 1990. (ed.). WordNet: An on-line lexical database. International Journal of Lexicography 3(4) (235-312).
Nirenburg, S., J. Carbonell, M. Tomita, and K. Goodman. 1992. Machine Translation: A Knowledge-Based Approach. San Mateo: Morgan Kaufmann.
Lee, L., Dagan, I. and Pereira, F. 1997. Similarity-based methods for word sense disambiguation. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. Madrid, Spain.
Palmer, M. 1998. Are WordNet sense distinctions appropriate for computational lexicons? Proceedings of Senseval, Siglex98. Brighton, England.
Pustejovsky, J. 1995. The Generative Lexicon. MIT Press.
Rappaport Hovav, M. and B. Levin. 1998. Building Verb Meanings. In M. Butt and W. Geuder (eds.) The Projection of Arguments. Stanford, CA: CSLI Publications.
Ratnaparkhi, A. 1997. A Linear Observed Time Statistical Parser Based on Maximum Entropy Models. Proceedings of the Second Conference on Empirical Methods in Natural Language Processing.
Resnik, P. 1993. Selection and Information: A Class-Based Approach to Lexical Relationships. Ph.D. thesis, University of Pennsylvania, Department of Computer and Information Sciences, 1993.
Rigau, G. and E. Agirre. 1995. Disambiguating Bilingual Nominal Entries against WordNet. Proceedings of the 7th ESSLI Symposium. Barcelona, Spain.
Srinivas, B. 1997. Performance Evaluation of Supertagging for Partial Parsing. Proceedings of Fifth International Workshop on Parsing Technology, Boston.
Steedman, M. 1996. Surface Structure and Interpretation. Cambridge, MA: MIT Press.
Stetina, J. and M. Nagao. 1997. Corpus Based PP Attachment Ambiguity Resolution with a Semantic Dictionary. Proceedings of the Fifth Workshop on Very Large Corpora (66-80). Beijing and Hong Kong.
Stevenson, S. and P. Merlo. 1997. Lexical structure and parsing complexity. Language and Cognitive Processes 12(2/3) (349-399).
Viegas, E., K. Mahesh, and S. Nirenburg. 1996. Semantics in Action. Proceedings of the Workshop on Predicative Forms in Natural Language and in Knowledge Bases (108-115). Toulouse, France.
Vossen, P., et al. 1999. EuroWordNet. Computers and the Humanities, special issue (in press).
Wille, R. 1992. Concept lattices and conceptual knowledge systems. Computers and Mathematics with Applications 23 (493-515).
The XTAG-Group. 1995. A Lexicalized Tree Adjoining Grammar for English. Technical Report IRCS 95-03, University of Pennsylvania. Updated version available at http://www.cis.upenn.edu/xtag/tr/tech-report.html.
Yarowsky, D. 1995. Three Machine Learning Algorithms for Lexical Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania, Department of Computer and Information Sciences.
[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter2.html .]
Chapter 2
Multilingual (or Cross-lingual) Information Retrieval
Editors: Judith Klavans and Eduard Hovy
Contributors:
Christian Fluhr
Robert E. Frederking
Doug Oard
Akitoshi Okumura, Kai Ishikawa, and Kenji Satoh
Abstract
Multilingual Information Retrieval (MLIR) is the study of systems that accept queries for information in various languages and return objects (text and other media) in various languages, translated into the user's language. The rapid growth and online availability of information in many languages has made this a highly relevant field of research within the broad umbrella of language processing research. We ignore here issues pertaining to Machine Translation (Chapter 4) and Multimedia (Chapter 9), and focus on the extensions required of traditional Information Retrieval (IR) to handle more than one language.
2.1 Multilingual Information Retrieval
2.1.1 Definition and Terms
Multilingual Information Retrieval (MLIR) refers to the ability to process a query for information in any language, search a collection of objects, including text, images, sound files, etc., and return the most relevant objects, translated if necessary into the user's language. The explosion in recent years of freely-distributed unstructured information in all media, most notably on the World Wide Web, has opened the traditional field of Information Retrieval (IR) up to include image, video, speech, and other media, and has extended out to include access across multiple languages. Being new, MLIR will probably also include the historically excluded access mechanisms typical of libraries involving structured data, such as MARC catalogue records.
The general field of MLIR has expanded in several directions, focusing on different issues; what exactly is within its purview remains open to discussion. It is generally agreed, however, that Machine Translation proper (see Chapter 4) and Multimedia processing (see Chapter 9) are not included. Nonetheless, several new terms have arisen around the new IR, each with a slight variation in emphasis, inclusiveness, or historical association with related fields. For example, recent research in multilingual information retrieval, such as (Fluhr et al., 1998) in (Grefenstette, 1998), includes descriptive catalogue data from libraries as well as unstructured data. Hull and Grefenstette (1996) list five uses of the term MLIR. In addition to MLIR, four related terms have been used:
1. Multilingual Information Access (MLIA). The broadest possible term to use is Multilingual Information Access, which refers to query, retrieval, and presentation of information in any language. The term MLIA is used in the NSF-EU working groups (Klavans and Schäuble, 1998). In general, the use of "information access" rather than "retrieval" implies a more general set of access functions, including those that have been part of the traditional library, as well as other modalities of access to other media. Access could refer to the use of speech input for video output, where the language component could consist of closed-captioned text or text from speech recognition, or to catalogue querying of metadata. The term "information access" came into use recently as a way to broaden the historically narrower use of "information retrieval".
2. Multilingual Information Retrieval (MLIR). This term refers to the ability to process a query in any language and return objects, such as text, images, sound files, etc., relevant to the user query in any language. Historically, however, Information Retrieval (IR) as a field involved a group of researchers from the unstructured text data base community who employed statistical methods to match query and document (Salton, 1988). In general, this work was English dominated, given the amount of digital information made available to the research community in the early years in English, and excluded access mechanisms typical of libraries involving structured data, such as MARC catalogue records. Thus MLIR as used in this chapter denotes a significantly wider field of interest than that of traditional IR.
3. Cross-lingual Information Access. The use of the term cross-lingual refers (in this context) to bridging two languages, rather than the ability to access information in any language starting with input in any language. Systems with cross-lingual capability can accept a query in language L1 or L2, for example English and French, and are capable of returning documents in either L1 or L2. (In other meetings, the term cross-lingual (or translingual) has been used to distinguish systems that cross a language barrier, as opposed to multiple monolingual systems as in TREC.) This term logically includes access via catalogue record and other structured indexing, as for MLIA.
4. Cross-lingual Information Retrieval (CLIR). CLIR generally implies a relationship to IR, with all the implications that apply to MLIR. At the 1997 Cross-language Information Retrieval Spring Symposium of the American Association for Artificial Intelligence (Oard et al., 1997), CLIR was defined with the following research challenge: Given a query in any medium and any language, select relevant items from a multilingual multimedia collection which can be in any medium and any language, and present them in the style or order most likely to be useful to the user, with identical or near-identical objects in different media or languages appropriately identified.
This definition of the requirements of a system gives full recognition to the query, retrieval, and presentation requirements of a working system from a user perspective, and encapsulates succinctly the full set of capabilities to be included. However, its breadth makes it fit well with a definition of MLIA, the most general term, rather than CLIR, a more precise term.
2.1.2 MLIR: Linking and Hybridizing IR and MT
Multilingual Information Retrieval is a hybrid subject area, interacting with or encompassing several other fields. Section 2.5 discusses related fields.
How MLIR Relates to Information Retrieval
MLIR is an application of information retrieval. In many respects, as discussed above, the two fields share exactly the same goals; as such, well-known IR techniques such as vector space indexing, latent semantic indexing (LSI), similarity functions for matching documents, and query processing procedures are equally useful in MLIR. However, MLIR differs from IR in several significant ways. Most importantly, IR involves no translation component, since only one language is involved. The related but not identical problems of translating queries and documents are discussed below. Subsidiary problems, such as keeping track of translations across several languages, are also not part of the standard monolingual information retrieval process.
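As a rough illustration of the similarity functions mentioned above, the following minimal sketch computes a cosine similarity over raw term counts. The function name and the toy query and documents are assumptions made for this example, not part of any system described in this chapter; real vector space systems typically use weighted terms rather than raw counts.

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    # Only terms that occur in both vectors contribute to the dot product.
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Toy data (hypothetical): a French query against a French and an English document.
query = ["avion"]
doc_fr = ["avion", "billet"]
doc_en = ["plane", "ticket"]

print(cosine_similarity(query, doc_fr))  # positive: the query term is shared
print(cosine_similarity(query, doc_en))  # zero: no shared surface terms
```

The second call shows why a purely monolingual matcher is not enough: without some translation step, a query and a relevant document in another language share no terms at all.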
How MLIR Relates to and Uses Machine Translation
The goal in machine translation (MT; see Chapter 4) is to convert a text, written in language L1, into a coherent and accurate translation in language L2. To do so, most MT systems convert the input text, usually sentence by sentence, into a series of progressively more abstract internal representations, in which sentence-internal relationships are determined and the intended meaning of each word is identified. Armed with this information, the appropriate conversions are made to support the output language, upon which output realization, usually also sentence by sentence, is performed. MT requires that the meaning of each individual word be known (as does accurate IR); without this knowledge, homographs (for example "plane", which can refer to an airplane, carpentry tool, geometric surface, the action of skimming over water, and several other meanings) cannot be translated into their intended foreign words. Without word translation, no output is possible.
Can MLIR be Achieved by Coupling IR and MT?
Unfortunately, while at first blush it may seem that MLIR is simply a matter of coupling IR and MT engines, the special nature of MLIR places constraints on the input to MT that make a straightforward coupling infeasible. At one extreme, some recent MLIR research has explored extending IR-based indexing techniques to directly bridge language gaps with no explicit translation step at all; see Sections 2.2.2 and 2.3.1 below. Arguments regarding the special nature of MLIR, contained in the NSF-EU MLIA Working Group White Paper (Klavans and Schäuble, 1998), are summarized here.
Differences between the two types of input submitted by MLIR for translation (queries and documents) necessitate two different types of Machine Translation. In the case of queries, the input to MT is a set of disconnected words, or possibly multi-word phrases. There is no call for MT to parse the input, since no syntactic sentence structure can be found. More seriously, the MT system cannot apply traditional methods of word sense disambiguation, since the input is not a semantically coherent text. It will have to employ other (possibly IR-like) methods to determine the sense of each polysemous word in order to furnish accurate translations. On the other hand, there is no need to produce a linear, coherent output, and in fact multiple (correct) translations of a query term can provide a form of query expansion, which can improve IR performance. Finally, the processes of sentence planning and sentence realization are irrelevant when the input is a string of isolated query words. Without accurate queries, IR accuracy falls dramatically (results of recent studies are given later in this chapter).
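The observation that multiple translations of a query term act as query expansion can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary EN_FR, its entries, and the function translate_query are hypothetical and chosen only to mirror the "plane" example above; untranslatable terms are simply passed through unchanged.

```python
# Hypothetical toy English-to-French transfer dictionary (illustrative only).
EN_FR = {
    "plane": ["avion", "rabot", "plan"],  # airplane, carpentry tool, surface
    "ticket": ["billet"],
}

def translate_query(terms, dictionary):
    """Replace each query term by all of its known translations.

    No parsing or sense disambiguation is attempted: every alternative is
    kept, so the translated query is also an expanded query. Terms missing
    from the dictionary (for example proper names) are kept as they are.
    """
    expanded = []
    for term in terms:
        for translation in dictionary.get(term.lower(), [term]):
            if translation not in expanded:
                expanded.append(translation)
    return expanded

print(translate_query(["plane", "ticket", "Granada"], EN_FR))
# prints ['avion', 'rabot', 'plan', 'billet', 'Granada']
```

Keeping "rabot" and "plan" adds noise to the query, which is the trade-off between expansion and accuracy that the paragraph above describes.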
For the stage of IR after retrieval (that is, in the case of retrieved documents), in contrast, documents can be translated back into the user's language using the normal methods of MT. However, also for this part of MLIR, partial translation, or keyword extraction and translation, is often adequate for the user's needs. In particular, given the computational expense of MT, it may be inefficient to translate a full document that the user later determines is not exactly what was desired. In addition, fully general purpose MT (especially between a wide variety of languages) is a very difficult problem. Translating a few keywords or a summary (see Chapter 3) is often a wise policy.
Several additional differences between monolingual IR and MLIR arise if the user is familiar with more than one language. In particular, the user interface must provide differential display capabilities to reflect differing language proficiency levels of users. When more than one user receives the results, translation into several languages may have to be provided. Furthermore, depending on the user's level of sophistication, translation of different elements at different stages can be provided to users for a range of information access needs, including keyword translation, term translation, title translation, abstract translation, specific paragraph translation, caption translation, full document translation, etc. Finally, monolingual IR users can also take advantage of the results of MLIR. Simply the knowledge that a particular query will access a certain number of documents in other languages could, in itself, be valuable information, even if translations are not required.
Thus for MLIR much of the typical MT machinery is irrelevant, or at best only partially relevant. The differences with traditional MT mean that MLIR cannot simply employ MT engines as front-end query translators and back-end document translators.
Rather, efficient ways of coupling together the internal processes of IR and MT engines are required, allowing each to employ the other's intermediate results. It is inevitable that second-generation MLIR systems will exhibit some more-than-surface integration of MT and IR modules.
2.1.3 Key Technical Issues for MLIR
We discuss three different positions on what are the key problems in MLIR. Grefenstette (1998) focuses on term choice and filtering. Oard (1998) presents user-centered challenges. Finally, Klavans (1999) outlines a two-part view that accommodates system-directed and user-directed research issues.
Grefenstette (1998) outlines three problems involving the processing of query terms for MLIR:
The first problem requires knowing how terms map between languages. Since little or no contextual text is present in the query to help with term disambiguation, this involves knowing the full range of choices of translations, not just one possible translation, coupled with an understanding of how different domains affect translation possibilities.
The second problem deals with determining how to filter, from all possible choices, which ones should be retained in the current application. Unlike MT, an MLIR system can retain a wider set of possibilities that can later be automatically filtered, depending on the kinds of variants that are permitted. Thus the MLIR system has to balance the number of inaccurate translations (noise) that degrade results against the amount of processing performed to disambiguate the terms and ensure accuracy.
Given that it is advisable to retain a set of well-chosen possible terms for the best retrieval performance, a third problem, new to MLIR, arises: how to weight the retained alternatives. The possibility of assigning alternate weights to different translations permits more accurate term choice. For example, in a compound term such as "morphological change", the first word is quite narrow in translation possibilities (e.g., in French, only one translation, la morphologie) while the second is more general ("change" could be changement or monnaie). In such cases, more weight could be given to the first word's translation than to the second. This problem is compounded by the fact that some multi-word terms do not decompose, but should be treated as a collocation. Thus, mechanisms for weighting alternatives must consider individual word translation weights as well as multi-word term translation weights.
Grefenstette points out that the first two problems are also found in machine translation, and still require research for fully effective solutions. The third problem is one that clearly distinguishes MLIR from both MT and IR.
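One simple way to picture the weighting problem is to split a unit of weight evenly among the alternatives for each query word, so that a word with a single translation outweighs each alternative of an ambiguous word, and to look up two-word collocations before single words. The sketch below does exactly that; the even-split scheme, the function name, and both toy dictionaries are assumptions for illustration only, not Grefenstette's method.

```python
# Hypothetical toy dictionaries; the entries are illustrative only.
WORD_DICT = {
    "morphological": ["morphologie"],      # single translation, as in the text
    "change": ["changement", "monnaie"],   # ambiguous
}
PHRASE_DICT = {
    ("hot", "dog"): ["hot-dog"],           # collocation translated as one unit
}

def weighted_translations(terms, word_dict, phrase_dict):
    """Return {translation: weight}, giving each source unit a total weight of 1."""
    weights = {}
    i = 0
    while i < len(terms):
        pair = tuple(terms[i:i + 2])
        if pair in phrase_dict:
            # A known two-word collocation is not decomposed into its words.
            alternatives, step = phrase_dict[pair], 2
        else:
            alternatives, step = word_dict.get(terms[i], [terms[i]]), 1
        share = 1.0 / len(alternatives)
        for alt in alternatives:
            weights[alt] = weights.get(alt, 0.0) + share
        i += step
    return weights

print(weighted_translations(["morphological", "change"], WORD_DICT, PHRASE_DICT))
# prints {'morphologie': 1.0, 'changement': 0.5, 'monnaie': 0.5}
```

A system built along these lines would more plausibly estimate the weights from corpus or domain statistics; the even split is used here only because it needs no additional data.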
Oard (1998), in presentations during the Workshops on MLIR, outlined a historical view of CLIR that is user-centered in nature. He views the overall problem of CLIR as a series of processes, including query formulation and document selection, involving feedback from system to user and from user to system. The system-internal processes of indexing, document processing, and matching are treated as components supporting direct user interaction. He presents three points of historical perspective:
Oard's five challenges for the next five years are given in Section 2.4 below.
Klavans (1999) approaches the central problems in a somewhat different way, focusing on two sets of issues. One set involves three questions relating to the parts of the query-retrieval process, and the other set relates to user needs.
System issues
Usability Issues. IR systems present two main interface challenges: first, how to permit a user to input a query in a natural and intuitive way, and second, how to enable the user to interpret the returned results. A component of the latter encompasses ways to permit a user to comment and provide feedback on results and to iteratively improve and refine results. MLIR brings an added complexity to the standard IR task. Users can have different abilities for different languages, affecting their ability to form queries and interpret results. For example, a user might be proficient in understanding documents in French, but could not produce a query in French. In this case, the user will need to formulate a query in his native language, but will want documents returned only in French, not translated. At the same time, this user may have spotty knowledge of German. In this case, he might request a set of key terms translated to his native language, and not want to view source documents in German at all. Or he may simply want a numerical count, in order to know that for a given query, there are a certain number in German, a certain number in French, a certain number in Vietnamese, and so on. In addition, knowing the specific sources of relevant information may also be very valuable.
Since research and applications in MLIR are so new, a full understanding of user needs has yet to be developed and tested. However, these needs differ from simple MT needs, given the user query production and refinement stages.
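The usability scenario above (a user who reads French, has spotty German, and only wants counts for other languages) can be sketched as a per-language presentation policy. All names, proficiency levels, and hit counts below are hypothetical and serve only to make the scenario concrete.

```python
from enum import Enum

class Proficiency(Enum):
    NONE = 0     # cannot read the language
    PARTIAL = 1  # spotty reading knowledge
    READ = 2     # reads documents, but may be unable to write queries

def presentation_mode(proficiency):
    """Choose how results in a given language are shown to the user."""
    if proficiency is Proficiency.READ:
        return "original document"
    if proficiency is Proficiency.PARTIAL:
        return "translated key terms"
    return "hit count only"

def plan_display(profile, hits):
    """Map each result language to (number of hits, presentation mode)."""
    return {
        lang: (count, presentation_mode(profile.get(lang, Proficiency.NONE)))
        for lang, count in hits.items()
    }

# Hypothetical user profile and per-language hit counts.
profile = {"fr": Proficiency.READ, "de": Proficiency.PARTIAL}
hits = {"fr": 12, "de": 7, "vi": 3}
print(plan_display(profile, hits))
```

Languages absent from the profile default to a bare count, which matches the case where the user only wants to know how many relevant documents exist in each language.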
2.1.4 Summary of Technical Challenges
MLIR involves at least the following four technical challenges:
2.2 Where We Were Five Years Ago
2.2.1 Capabilities Then
The lure of cross language information retrieval attracted experimentation by the IR community early on. Already in 1971, Salton showed that a transfer dictionary for English and French (a bilingual wordlist with predefined mappings between terms) could be used to translate a query from one language to another (Salton, 1971). This experiment, although ignoring the realistic and challenging problem of ambiguity, nonetheless served the information retrieval community well in providing a model for a viable approach to cross language IR. However, at the same time, the experiment also illustrated some of the exceedingly difficult problems in the language translation and mapping component of a system, namely one-to-many mappings, gaps in term translations, and ambiguity. Similarly, in a manual test with a small corpus, Pevzner (1972) showed for English and Russian that a controlled thesaurus can be used effectively for query term translation.
For nearly twenty years, the areas of IR and MT remained separate, leaving MLIR somewhat dormant. Apart from a few forays into refining these early techniques, all significant advances in MLIR have been made in the past five years. This is not surprising, given that increased amounts of information are becoming available in electronic format, and the economy is globalizing.
2.2.2 Major Methods, Techniques, and Approaches Five Years Ago
We discuss the problem within the framework outlined above.
System issues include the following.
Usability issues include the following. Early experiments were performed at a small scale and were more in the nature of proofs of concept than full-fledged large-scale systems. User feedback and user needs were simply not part of what was tested.
2.2.3 Major Bottlenecks and Problems Five Years Ago
The three major bottlenecks of the early part of this decade still persist. They are: limited resources for building domain and language models; limited new technologies for coping with size of collections; and limited understanding of the myriad of user needs.
2.3 Where We Are Today
The burgeoning of the MLIR field is clearly in evidence, as can be seen in the bibliography of the first major review article on the topic (Oard and Dorr, 1996). Papers cited include related work on machine translation, including some research translated from Russian. There are 16 citations prior to 1980, 10 from 1980-89, and 52 from 1990 to early 1996. The first major book to be published on the topic (Grefenstette, 1998) reflects the same temporal bias. This work is slanted towards IR rather than toward MT. It contains 11 citations prior to 1980, 25 from 1980-89, and 101 from 1990 to very early 1998.
2.3.1 Major Methods, Techniques, and Approaches Now
Following the format above, we divide the methods into system-centered and user-centered concerns, although each provides feedback to the other.
System issues include the following:
Usability issues include the following. The development of effective MLIR technology will have no impact if the user's needs and operation patterns are not considered. Since MLIR is a growing field, and since applications are just emerging, formative studies of usability are essential. Currently, there are a limited number of systems in early operation which are providing important data (e.g., EuroSpider, the translate function of AltaVista, multilingual catalogue access). The incorporation of users in the relevance feedback loop is particularly important, since user needs vary greatly. A full review of user needs is found in (Klavans and Schäuble, 1998).
2.3.2 Major Bottlenecks and Problems
Since this is a new field, the bottlenecks listed in Section 2.2.3, evident in earlier years, persist.
2.4 Where We Will Be in Five Years
The growing number of multilingual corpora is providing a valuable and as yet untapped resource for MLIR. Such corpora are essential to building successful dynamic term and phrase translation thesauri, which are, in turn, key to effective indexing and matching. One of the key challenges is in devising efficient yet linguistically informed methods of tapping these resources, methods which combine the best of what is known about fast statistical techniques with more knowledge-based symbolic methods. Even promising new techniques, such as translingual LSI (Landauer et al., 1998) and related techniques (Carbonell et al., 1997), will most probably still rely on parallel corpora. Such corpora are often difficult to find, and very expensive to prepare. This has been the motivation for the work on comparable corpora. However, more and more are being created electronically, especially to conform to legal requirements for the European Union. The issues surrounding corpora are extensively discussed in Chapter 1.
An important class of techniques involves machine learning, as applied to the cross-language term mapping problem. Since term translation, loosely defined, is at the core of query processing, document processing, and matching, it is an important process to do thoroughly and accurately. Even if multiple translations are retained in the MLIR process, obtaining a sensible set of domain linked terms is an important and central task. One way to obtain these ter