MLIM: Chapter 10

[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter10.html .]

[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]

Chapter 10

Government: Policies and Funding

Editors: Antonio Zampolli and Eduard Hovy

Contributors:

Nino Varile

Gary Strong

Charles Wayne

Lynn Carlson

Khalid Choukri

Joseph Mariani

Nicoletta Calzolari

Antonio Zampolli

Abstract

Language Technology has made great strides since its inception fifty years ago. Still, however, few people can participate in the growing global human-centered Information Society, partly because of impediments imposed by language barriers. One of the principal tasks of Language Technology is to overcome these barriers. It can best do so through international and, increasingly, intercontinental collaboration of research and development efforts. This chapter outlines the areas of Language Technology in urgent need of collaboration and highlights some of the potential benefits of wise funding policy in this regard.

10.1 General Context: Transatlantic Cooperation

Multilingual Language Processing has two obvious aspects. Citizens must be able to access the services of the Information Society in their own language, and they must be able to communicate and use information and services across language barriers.

The various fields of Language Processing have coexisted on the continents of Asia, Europe, and North America since the early 1960s. Despite several international organizations, including SIGIR for Information Retrieval, IAMT for Machine Translation, ACL for general Computational Linguistics, ICASSP for speech processing, and a host of smaller associations, relatively little formal cross-continent research and development has taken place. But as the fields mature, and as technology is increasingly commercialized, international cooperation for research is increasingly important. Cooperation advances the state of the art by combining most effectively the strengths and the excellence developed in different regions. Cooperation also facilitates integration of language technology across languages, which is surely one of the key aspects that make this field relevant to society at large.

In light of such arguments, the US government and the European Commission have recently signed an agreement for scientific and technological cooperation with regard to Language Technology.

This chapter addresses the issues that influence thinking by the Funding Agencies in the two continents. It draws upon the findings of the preceding chapters of this report, each chapter dedicated to a major sector of Language Technology. The goal of this chapter is to discuss and identify issues for which transatlantic cooperation is primarily needed and promises to be particularly fruitful. We indicate activities and concrete suggestions for which cooperation is likely to be effective, providing material for anyone interested in defining policy regarding intercontinental (and even local) R&D directions for Language Technology.

The interest of national and international Funding Agencies in the potential social, economic, industrial, and strategic impact of human language technology has decisively contributed to the evolution of our field. This interest is bound to grow in the current context of the global multilingual society, in which information and communication technologies are increasingly interpenetrated. Language Technology involves not only R&D issues, but also cultural and political aspects: languages and cultures are deeply interconnected, and the availability of adequate Language Technology products and services is an essential component of the networked Information Society.

A recent survey shows that R&D support of Language Technology is extremely uneven across various countries. Thus the strategy that national and international Funding Agencies adopt with regard to Language Technology will play a key role in shaping the future of the global human-centered Information Society. Language Technology is the key that can open the door to a true multilingual society.

10.2 Potential Areas of Cooperation

In this section we discuss five core areas in which intercontinental cooperation can have the most beneficial effect.

10.2.1 Standards (de facto, best practice)

Standards for language resources are seen as essential by all the panelists and discussants at the workshops that gave rise to this report. Standards for applications serve multiple purposes: they help eliminate redundancy of effort, ensure multilingual interoperability, consolidate current technical achievements and practices, allow convergence and coordination of distributed efforts, promote the development of a common software infrastructure, enable the integration of components and tools in workflows, and promote the adoption of best practices.

This holds especially for the development of multilingual applications, since each new language that is addressed can be most quickly developed to an acceptable level and incorporated if it can employ existing technology.

Various unified standardization efforts have been supported in the past, but in a somewhat piecemeal fashion. The Text Encoding Initiative (Sperberg-McQueen and Burnard, 1994) provides guidelines for electronic text encoding. Recommendations of the EU-sponsored EAGLES study for corpus encoding, lexicon representation, spoken language, and the evaluation of machine translation and speech recognition (see the three EAGLES websites http://www.ilc.pi.cnr.it/EAGLES/home.html; http://coral.lili.uni-bielefeld.de/EAGLES; http://www.cst.ku.dk/projects/eagles2.html; King et al., 1996) have already been adopted in several countries. In the US, the TREC series of information retrieval contests (Voorhees and Harman, 1998; see Chapter 2) and MUC information extraction contests (Grishman and Sundheim, 1996; see Chapter 3) have helped put in place standards and methods of evaluation.

To capitalize on the existing momentum, it has been proposed that researchers in the U.S. join EAGLES as soon as possible. Future cooperation of European and American participants should be initiated in all the fields currently covered by EAGLES Working Groups.

10.2.2 Language Resources and Related Tools

As discussed in Chapter 1 and referred to in almost every other chapter of this report, language resources, whether monolingual, multilingual, or multifunctional (shared by different language technologies), are a central issue for efficient future development. Resources include text and speech collections, lexicons, and grammars, as well as related research and development of methods and tools for acquisition, annotation, maintenance, development, customization, etc.

Language resources are an essential component of any Language Technology activity: research, system development, and training and evaluation, in both mono- and multilingual contexts. The integration of different technologies and languages, a major focus of this report, requires as a key enabling condition that language resources be shared among the different sectors and applications:

Computational lexicons (mono- and multilingual, general and domain-specific) are essential components of any Language Technology application, both written and spoken. Their utility increases with the technological complexity of the application.

Monolingual and multilingual corpora (general and task/domain-specific), especially national corpora developed in close coordination with the countries involved, provide comparable data across the various languages as well as parallel data.

Semantic knowledge (semantic annotations of corpora, semantic information in lexical resources) is the single most urgent need for advancing research on quality improvement. Increasingly, it is a requirement for significant applications, which have to become content-based to have a real impact on the market. In addition, semantic knowledge is essential for the addition of a multilingual layer to the lexicons. In this area, coordination among standards design, ongoing development activities, and research is crucial.

Common methods and tools must be developed for editing, maintaining, annotating, and, in particular, for the dynamic and (semi-)automatic acquisition and adaptation of language resources.

The issue of Intellectual Property Rights requires attention.

A still understudied aspect is the methodology and standards for validating language resources. To date, the relative scarcity of resources has meant that validation has not been a pressing issue. However, as resources become more common and overlap more, it is increasingly necessary to be able to quantify the quality, coverage, extensibility, ease of use, and a host of other aspects of resources.

The need to ensure reusability, integration, global planning, and coordinated international cooperation in the field of Language Resources has been stressed. This can only be achieved if there are projects explicitly dedicated to their development and maintenance. Although language resources must be tested against concrete applications, they should be developed to be multifunctional, i.e., to serve multiple different applications, and must thus be built outside of specific applications.

The cycle of language resource production includes the following phases: research, specification, manual creation and/or automatic acquisition, timing, validation, exploitation, maintenance, and subsequent identification of the next generation of language resources in correlation with user needs. It is important to plan the research and production process, technically and financially, in particular for multilingual language resources, and to create a suitable infrastructure. The production of real language resources demands time; for example, a speech database may be collected in six months to a year, a time-scale impossible for a large computational lexicon or a multimodal/multimedia database.

The value of language resources suggests that this aspect be allocated a research and development area to itself, with the production of language resources fully financed. As illustrated in the US with the Linguistic Data Consortium (LDC), distribution should be supported until it is self-sufficient. A distributed networked infrastructure should be established, and cooperation between ELRA and LDC should be promoted.

The richness of the multilingual capabilities associated with a language depends on the number of languages for which language resources exist. It is in the common interest that language resources be developed for as many languages as possible. A balance between market forces and political and social issues should be found. International cooperation in the construction of language resources is the key that can open the door to a true multilingual society.

10.2.3 Core Technologies

Given the complexity of language and real-world Language Technology applications, no application consists of a single module running independently. Many basic techniques and even functional components are used in a variety of applications. By core technology we mean the general technology that serves as a basis for many innovative applications, in both the spoken and written areas. It includes both methods and techniques (such as vector space distance metrics and HMM technology) and processes (such as parsing and document indexing). It is worth noting that several basic natural language and speech processing tasks coincide with the requirements of language resource tools, including word sense disambiguation; dynamic acquisition of linguistic knowledge from textual data; shallow parsing; transfer of technology among applications, domains, and languages; and customization.
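As one illustration of how small and reusable such a core technique can be, the following is a minimal sketch of a vector space distance metric (cosine distance over term-frequency vectors). It is not taken from any system discussed in this report; the function names term_vector, cosine_similarity, and cosine_distance are hypothetical, and the whitespace tokenization is a simplifying assumption that would not suit many languages.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse term-frequency vector (assumes whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance form of the metric: 0 for identical direction, 1 for no shared terms."""
    return 1.0 - cosine_similarity(a, b)
```

The same function could serve document retrieval, clustering, or topic spotting, which is the sense in which the text calls such techniques a basis for many applications.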

While innovative research must continue to be fostered somewhat independently of applications, the more mature, well-delimited, and robust functionalities and techniques can be selected for general re-use. When new technologies are proven, they are still often fragmented and need advancement and integration. Often, they can be enhanced not only by good software engineering practice but also by including methods for acquiring linguistic/lexical information from corpora dynamically, at run-time.

The Language Technology community should foster the development of plug-and-play modules that can be easily integrated into larger systems and thereby support rapid prototyping and software application development. The recent appearance of such low-level text processing tools as part-of-speech taggers and proper name recognizers has had a beneficial effect on many research projects. This effect can be magnified by the development of more such tools, as well as one or more architectures or platforms upon which innovative applications/systems can be built. Existing examples of such platforms, such as GATE (Cunningham et al., 1996) and ALEMBIC (Aberdeen et al., 1996), illustrate how integrated platforms can support further research on specific targeted problems. Effort should be devoted to the creation of widely multilingual platforms.

10.2.4 Evaluation

As discussed in Chapter 8, evaluation has on several occasions provided huge benefits for Language Technology, to researchers, commercial developers, and Funding Agencies.

Much has been written about evaluation; we do not repeat the arguments here, beyond noting that it is often only through such evaluations as TREC and MUC that research areas find a common focus and make easily quantifiable progress. In this light, evaluation should cover functionality, methods, components, and application systems, and the perspectives of both the developers and the users should be considered.

The EAGLES Evaluation Working Group offers a good basis for cooperation that supports complementarity between the American experience (competitive evaluation) and the European experience (standards, general methodology developed in various projects, and the user and usability perspectives). Intercontinental cooperation on a common evaluation effort should:

focus on (core) technologies;

include multilingual tasks, as well as the integration of speech and natural language processing;

concern applications which are of major interest to the citizens;

find the right balance between difficulty (to be relevant) and accessibility (to stimulate participation);

be oriented to specific problem solving;

establish links with language resources production and distribution, in order to reuse language resources or promote the creation of language resources as needed;

adopt standards for resource annotation that are developed in the standards cooperation, and provide feedback on their utility.

10.2.5 Vertical Sectoral Application Domains

The development of innovative precompetitive systems has functioned and should continue to function as a testbed for the different language resources, technologies, components, and evaluation methodologies. The appropriate balance should be found, in selecting areas, between the interests of the citizen, the integration of different sectors, the cultural and social impact, the needs of the administrations, the commercial potential, the strategic value, the industrial requirements, and the stimulus of long-term challenges.

The following types of application development have been mentioned:

Education

Tourism

Access to cultural heritage/resources

Language learning

Digital libraries

E-commerce

The international cooperation framework gives priority to multilingual/translingual applications.

10.3 Proposals for Cooperative Projects

Standards

American researchers can immediately join EAGLES and help to plan future EAGLES-like follow-up and development efforts.

Language Resources

Corpora, lexicons, tools development, and related research issues: cooperation in lexical projects (FrameNet with PAROLE/SIMPLE; WordNet with EuroWordNet; BNC-ANC-PAROLE); multilingual parallel corpora; multilingual lexical layer design and development and related research; cross-membership and networking of data production, validation, and distribution centers (e.g., PAROLE-ELRA-LDC); development of common tools for language resources; identification of priority topics; research on innovative/new types of language resources (semantic, dialogue, multimodal, etc.).

Core Technologies

Architecture for research and development; automatic learning methods (SPARKLE and ECRAN); integration of symbolic and statistical methods; word sense disambiguation; robust analyzers (shallow parsers, e.g., SPARKLE); customization of language resources; transfer of language resources and technologies to different domains, languages, applications. The core technology could be accompanied by related evaluation (e.g., as in SENSEVAL and Romanseval for word sense disambiguation).

Evaluation

Exchange of best practices in areas such as:

User-centered evaluation.

Task-based technology evaluation.

Monitoring and assessment of Language Technology programs through the evaluation of scientific/technological progress and of the impact of Language Technology and its applications.

Integrating the US evaluation scheme and the European EAGLES approach.

Practical examples include topic spotting for broadcast news and multilingual TREC with European participation.

Applications

See Section 10.2.5.

10.4 Conclusion

Language Technology has made great strides since its inception fifty years ago. Still, however, few people can participate in the growing global human-centered Information Society, partly because of impediments imposed by language barriers. One of the principal tasks of Language Technology is to overcome these barriers. To do so, this chapter highlights the need to promote the convergence and integration of different technologies and know-how, for example by designing an integrated multilingual information access service that includes voice, written, and image modalities, machine translation, information retrieval, information extraction, summarization, and browsing and display functionalities.

The wise funding of intercontinental collaboration of research and development, specifically in the areas outlined in this chapter, will allow society to reap the benefits inherent in multilingual Language Technology.

10.5 References

Aberdeen, J., J. Burger, D. Day, L. Hirschman, D. Palmer, P. Robinson, and M. Vilain. 1996. Description of the ALEMBIC System as used in MET. Proceedings of the TIPSTER Workshop, Vienna, Virginia (461–462).

Cunningham, H., K. Humphreys, R. Gaizauskas, and Y. Wilks. 1996. TIPSTER-Compatible Projects at Sheffield. Proceedings of the TIPSTER Workshop, Vienna, Virginia (121–125).

Grishman, R. and B. Sundheim (eds). 1996. Message Understanding Conference 6 (MUC-6): A Brief History. Proceedings of the COLING-96 Conference. Copenhagen, Denmark (466–471).

King, M. et al. 1996. EAGLES Evaluation of Natural Language Processing Systems: Final Report. EAGLES Document EAG-EWG-PR.2, Center for Sprogteknologi, Copenhagen.

Sperberg-McQueen, C.M. and L. Burnard (eds). 1994. Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative. Chicago and Oxford.

Voorhees, E. and D. Harman. 1998. (TREC series) Overview of the Sixth Text Retrieval Conference (TREC-6). In Proceedings of the Sixth Text Retrieval Conference (TREC-6), in press. See also http://www.TREC.nist.gov.