[This report is available as http://www.cs.cmu.edu/~ref/mlim/index.html .]
[It has now also been published, as Linguistica Computazionale, Volume XIV-XV, "Multilingual Information Management: Current Levels and Future Abilities", Eduard Hovy, Nancy Ide, Robert Frederking, Joseph Mariani, and Antonio Zampolli (editors). Publisher: Istituti Editoriali e Poligrafici Internazionali, Pisa, Italy, 2001. ISSN 0392-6907.]
[It is also available as a single [very large] page: http://www.cs.cmu.edu/~ref/mlim/index.shtml .]
[Please send any comments to Robert Frederking (firstname.lastname@example.org, Web document maintainer) or Ed Hovy or Nancy Ide.]
Multilingual Information Management:
Current Levels and Future Abilities
Commissioned by the US National Science Foundation
and also delivered to
the European Commission's Language Engineering Office
and the US Defense Advanced Research Projects Agency
Eduard Hovy, USC Information Sciences Institute (co-chair)
Nancy Ide, Vassar College (co-chair)
Robert Frederking, Carnegie Mellon University
Joseph Mariani, LIMSI-CNRS
Antonio Zampolli, University of Pisa
Gary W. Strong, DARPA and NSF, USA
The Internet is rapidly bringing to the foreground the need for people to be able to access and manage information in many different languages. Even in cases where people have been lucky enough to learn several languages, they will still need help in effectively participating in the global information society. There are simply too many different languages, and all of them are important to somebody.
While machine translation has a long history (over 50 years), computer technology now appears ready for the next great push in multilingual information access and management, particularly over the World Wide Web. The European Commission and several US agencies are taking bold steps to encourage research and development in multilingual information technologies. The EC and the US National Science Foundation, for example, have recently issued a joint call for Multilingual Information Access and Management research. The US Defense Advanced Research Projects Agency is supporting a new effort in Translingual Information Detection, Extraction, and Summarization research. Both of these efforts are direct results of international planning efforts, and of this Granada effort in particular.
No one was more surprised than the Granada workshop participants at how quickly interest in Multilingual Information Management research grew. Attendees of the workshop in Granada, Spain, had hardly unpacked their bags when they were asked to present the results in Washington, DC, at a National Academy of Sciences workshop on international research cooperation. The US White House expressed interest in the topic as a groundbreaking effort for a new US-EU Science Cooperation Agreement. Now, DARPA has decided to invest in a multi-year, large-scale effort to push the envelope on rapid development of multilingual capability in new language pairs.
The world is surely shrinking as advances in communication and computation proceed at a breathtaking pace. On the other hand, there is no doubt that people will continue to hold on to the values and beliefs of their native cultures. This includes holding on to the language of their families and ancestors. This is a treasure, a cultural knowledge base that must not be weakened even as pressures to be able to speak common languages increase. Therefore, efforts in multilingual technology not only allow us to share the knowledge and resources of the world, they also allow us to preserve the individual human qualities that have allowed us to progress and solve problems that we all share.
I thank all whose efforts have gone into this workshop report and the resource that it represents for future efforts in the field. Those who proceed to carry on the needed research and development being called for from around the world will surely find this report to be of great value.
Introduction: The Goals of the Report
Over the past 50 years, a variety of language-related capabilities has been developed in machine translation, information retrieval, speech recognition, text summarization, and so on. These applications rest upon a set of core techniques such as language modeling, information extraction, parsing, generation, and multimedia planning and integration; and they involve methods using statistics, rules, grammars, lexicons, ontologies, training techniques, and so on.
It is a puzzling fact that although all of this work deals with language in some form or other, the major applications have each developed a separate research field. For example, there is no reason why speech recognition techniques involving n-grams and hidden Markov models could not have been used in machine translation 15 years earlier than they were, or why some of the lexical and semantic insights from the subarea called Computational Linguistics are still not used in information retrieval.
This picture will rapidly change. The twin challenges of massive information overload via the web and ubiquitous computers present us with an unavoidable task: developing techniques to handle multilingual and multi-modal information robustly and efficiently, with the highest quality of performance possible.
The most effective way for us to address such a mammoth task, and to ensure that our various techniques and applications fit together, is to start talking across the artificial research boundaries. Extending the current technologies will require integrating the various capabilities into multi-functional and multi-lingual natural language systems.
However, at this time there is no clear vision of how these technologies could or should be assembled into a coherent framework. What would be involved in connecting a speech recognition system to an information retrieval engine, and then using machine translation and summarization software to process the retrieved text? How can traditional parsing and generation be enhanced with statistical techniques? What would be the effect of carefully crafted lexicons on traditional information retrieval? At which points should machine translation be interleaved within information retrieval systems to enable multilingual processing?
The purpose of this study is to address these questions, in an attempt to identify the most effective future directions of computational linguistics research and, in particular, how to address the problems of handling multilingual and multi-modal information. To gather information, a workshop was held in Granada, Spain, immediately following the First International Conference on Linguistic Resources and Evaluation (LREC) at the end of May 1998. Experts in various subfields from Europe, Asia, and North America were invited to present their views regarding the following fundamental questions:
The experts were invited to represent the following areas:
In a series of ten sessions, one session per topic, the experts explained their perspectives and participated in panel discussions that attempted to structure the material and hypothesize about where we can expect to be in a few years' time. Their presentations, comments, and notes were collected and synthesized into ten chapters by a group of chapter editors.
A second workshop, this one open to the general computational linguistics public, was held immediately after the COLING-ACL conference in Montreal in August, 1998. This workshop provided a forum for public discussion and critique of the material gathered at the first meeting. Subsequently, the chapter editors updated and refined the ten chapters.
This report is formed out of the presentations and discussions of a wide range of experts in computational linguistics research, at the workshops and later. We are proud and happy to present it to representatives and funders of the US and European Governments and other relevant associations and agencies.
We hope that this study will be useful to anyone interested in assessing the future of multilingual language processing.
We would like to thank the US National Science Foundation and the Language Engineering division of the European Commission for their generous support of this study.
Eduard Hovy and Nancy Ide, Editorial Board Co-chairs
Nuria Bel, GILCUB, Spain
Christian Boitet, GETA, France
Nicoletta Calzolari, ILC-CNR, Italy
George Carayannis, ILSP, Greece
Lynn Carlson, Department of Defense, USA
Jean-Pierre Chanod, XEROX-Europe, France
Khalid Choukri, ELRA, France
Ron Cole, Colorado State University, USA
Bonnie Dorr, University of Maryland, USA
Christiane Fellbaum, Princeton University, USA
Christian Fluhr, CEA, France
Robert Frederking, Carnegie Mellon University, USA
Ralph Grishman, New York University, USA
Lynette Hirschman, MITRE Corporation, USA
Jerry Hobbs, SRI International, USA
Eduard Hovy, USC Information Sciences Institute, USA
Nancy Ide, Vassar College, USA
Hitoshi Iida, ATR, Japan
Kai Ishikawa, NEC, Japan
Frederick Jelinek, Johns Hopkins University, USA
Judith Klavans, Columbia University, USA
Kevin Knight, USC Information Sciences Institute, USA
Kamran Kordi, Entropic, England
Gianni Lazzari, ITC, Italy
Bente Maegaard, Center for Sprogteknologi, Denmark
Joseph Mariani, LIMSI-CNRS, France
Alvin Martin, NIST, USA
Mark Maybury, MITRE Corporation, USA
Giorgio Micca, CSELT, Italy
Wolfgang Minker, LIMSI-CNRS, France
Doug Oard, University of Maryland, USA
Akitoshi Okumura, NEC, Japan
Martha Palmer, University of Pennsylvania, USA
Patrick Paroubek, CIRIL, France
Martin Rajman, EPFL, Switzerland
Roni Rosenfeld, Carnegie Mellon University, USA
Antonio Sanfilippo, Anite Systems, Luxembourg
Kenji Satoh, NEC, Japan
Oliviero Stock, IRST, Italy
Gary Strong, National Science Foundation, USA
Beth Sundheim, SPAWAR/NCCOSC, USA
Nino Varile, European Commission, Luxembourg
Charles Wayne, Department of Defense, USA
John White, Litton PRC, USA
Yorick Wilks, University of Sheffield, England
Antonio Zampolli, University of Pisa, Italy
Table of Contents
Chapter 1. Multilingual Resources (lexicons, ontologies, corpora, etc.)
Editor: Martha Palmer
Chapter 2. Cross-lingual and Cross-modal Information Retrieval
Editors: Judith Klavans and Eduard Hovy
Chapter 3. Automated Cross-lingual Information Extraction and Summarization
Editor: Eduard Hovy
Chapter 4. Machine Translation
Editor: Bente Maegaard
Chapter 5. Multilingual Speech Processing
Editor: Joseph Mariani
Chapter 6. Methods and Techniques of Processing
Editor: Nancy Ide
Chapter 7. Speaker/Language Identification, Speech Translation
Editor: Gianni Lazzari
Chapter 8. Evaluation and Assessment Techniques
Editor: John White
Chapter 9. Multimedia Communication, in Conjunction with Text
Editors: Mark Maybury and Oliviero Stock
Chapter 10. Government: Policies and Funding
Editors: Antonio Zampolli and Eduard Hovy