MLIM: Chapter 9

[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter9.html .]

[Please send any comments to Robert Frederking (ref+@cs.cmu.edu, Web document maintainer) or Ed Hovy or Nancy Ide.]

Chapter 9

Multimedia Communication, including Text

Editors: Mark Maybury and Oliviero Stock

Contributors:

George Carayannis

Eduard Hovy

Mark Maybury

Oliviero Stock

Abstract

Multimedia communication is a part of everyday life and its appearance in computer applications is increasing in frequency and diversity. This article defines the area, outlines fundamental research questions, summarizes the history of this technology, identifies current challenges and concludes by predicting future breakthroughs and discussing multilinguality. We conclude describing several new research issues that systems of systems raise.

9.1 Definition of Multimedia Communication

We define communication as the interaction between human-human, human-system, and human-information. This includes interfaces to people, interfaces to applications, and interfaces to information. Following Maybury and Wahlster (1998), we define:

Multimedia: physical means via which information is input, output and/or stored (e.g., interactive devices such as keyboard, mouse, displays; storage devices such as disk or CD-ROM).

Multimodal : human perceptual processes such as vision, audition, taction.

Multicodal: representations used to encode atomic, elements, syntax, semantics, pragmatics and related data structures (e.g., lexicons, grammars) associated with media and modalities.

The majority of computational efforts have focused on multimedia human computer interfaces. There exists a large literature and associated techniques to develop learnable, usable, transparent interfaces in general (e.g., Baecker et al., 1995). In particular, we focus here on intelligent and multimedia user interfaces (Maybury, 1993) which, from the user perspective, assist in tasks, are context sensitive, adapt appropriately (when, where, how), and may:

analyze synchronous and asynchronous multimedia or multimodal input (e.g., spoken and written text, gesture, drawings) which might be imprecise, ambiguous, and/or partial;

generate (design, realize) coordinated, cohesive, and coherent multimedia/modal presentations; and

manage the interaction (e.g., training, error recovery, task completion, tailoring interaction) by representing, reasoning, and exploiting models of the domain, task, user, media/mode, discourse, and environment.

From the developer’s perspective, there is also interest in decreasing the time, expense, and level of expertise necessary to construct successful systems.

Finally, in interactions with information spaces, the area of media content analysis (Maybury, 1997), which includes retrieval of text, audio, imagery and/or combinations thereof, plays an important role.

9.2 Fundamental Questions

The fundamental questions mirror the above definitions:

Analysis: How do we build systems to deal with synchronous and asynchronous, imprecise, ambiguous, and/or partial multimedia and multimodal input?

Generation: How do we design, realize, and tailor coordinated, cohesive, and coherent multimedia and multimodal presentations?

Management: How do we ensure efficient, effective and natural interaction (e.g., training, error recovery, task completion, tailoring interaction styles)? How do we represent, reason, and exploit models of the domain, task, user, media/mode, and context (discourse, environment)?

Methods: What kinds of representations and reasoning are required to enable the above? What kinds of multimedia corpora are required? What kinds of evaluation measures, metrics and methods will move this area forward?

9.3 Timeline

Computer supported multimedia communication has been studied for the past three decades. We briefly characterize the major problems addressed, developments, and influence on related areas in each decade.

Late 1950s

Input/Output: First integrated graphics/pointing system (SAGE) developed and deployed (Roth et al., 1990). Natural language (NL) interfaces is a topic at the first meeting to discuss the possibility of Artificial Intelligence at Dartmouth University.

1960s

Input/Output: Initial interest in NL interfaces. Small laboratory investigations of Virtual Reality. Pilot work on NL parsing and generation, separately.

General: First conference on Computational Linguistics (1962).

1970s

Input: Many applications of NL interfaces studied, especially in relation to database query systems (Hendricks et al., 1970). Early phonology-based speech to text systems replaced by statistical methods in the mid-1970s (see Chapters 5 and 6). Some work on graphics interfaces.

Output: Template-based sentence generation systems developed. Little work on graphics generation.

Interface management: Early user models (Rich, 1979).

1980s

Input: Gradual commercialization of interfaces using NL, especially speech recognition. First integration of speech and gesture (e.g., "Put that there" (Bolt, 80)). Pilot systems that integrate various media and modes, including CUBRICON (Neal, 90), II (Arens et al., 88).

Output: Development of several domain-independent, distributed sentence generation systems (Mann and Matthiessen, 1985; Elhadad, 1992; Meteer et al., 1987). Creation of techniques to plan domain independent, rhetorically structured coherent text; e.g., rhetorical schemas (McKeown, 1985), communicative plans (Hovy, 1988). First multilingual generation systems. Early automated graphics design (Mackinlay, 1986).

Interface management: Early modeling of users and of discourse history (Moore, 1989). Model-based interfaces.

General: International workshops on user modeling (UM), text generation (INLGW), multimodal interaction (VENACO); government programs (DARPA IUI). Industrial visions of intelligent multimodal, multilingual interaction, such as Apple’s "Phil".

1990s

Input: Increasing commercial presence of spoken language applications (Dragon Systems, IBM, Apple, Kurzweil). More sophisticated prototypes that handle difficult phenomena, such as partial, synchronous, and ambiguous input. Great advances in non-language interface methods (Brooks et al., 1990).

Output: Prototypes demonstrating coordinated multimodal generation, e.g., WIP (Wahlster et al., 1992), COMET (Feiner and McKeown, 1990). Standard reference model for presentation systems.

Management: Prototypes that conduct longer user interactions, with deeper understanding and generation of input and output; e.g., HIPP (Biermann et al., 1990). User-adapted systems. Agents begin to appear in commercial software. Deeper understanding of characteristics of data and systems, as needed to plan displays intelligently (Roth and Matthis, 1990; Arens and Hovy, 1995; Faconti and Duke, 1996; Bruffaerts et al., 1996).

General: DARPA and EC I3 programs. First international conference on intelligent user interfaces (IUI), general Readings in IUI (Maybury and Wahlster, 1988), etc. Emergence of media content analysis for new applications, e.g., news understanding, video mail and/or VTC indexing and retrieval.

9.4 Examples of Multimedia Information Access

Significant progress has been made in multimedia interfaces, integrating language, speech, and gesture. For example, Figure 1 shows the CUBRICON system architecture (Neal, 1990). CUBRICON enables a user to interact using spoken or typed natural language and gesture, displaying results using combinations of language, maps, and graphics. Interaction management is effected via models of the user and the ongoing discourse, which not only influence the generated responses but also manage window layout, based on user focus of attention.

Figure 1. CUBRICON Multimedia Interface Architecture.

As another example, the AlFresco system (Stock et al., 1993; Stock, et al., 1997) provides multimedia information access, integrating language, speech, and image processing, together with more traditional techniques such as hypertext. AlFresco is a system for accessing cultural heritage information that integrates in a coherent exploration dialogue language based acts with implicit and explicit reference to what has been said and shown, and hypermedia navigation.

The generation system, part of the output presentation system, is influenced by a model of the user’s interests, developed in the course of the multimodal interaction. Another aspect developed in this system is a cross-model feedback (Zancanaro et al., 1997). The user is provided fast graphical feedback of the interpretation of discourse references, profitably exploiting the large bandwidth of communication that exists in a multimodal system.

In the related area of media understanding, systems are beginning to emerge that process synchronous speech, text, and images (Maybury, 1997). For example, Figure 2 shows the results of a multimedia news analysis system that exploits redundancy across speech, language (closed caption text) and video to mitigate the weaknesses of individual channel analyzers (e.g., low level image analysis and errorful speech transcription). After digitizing, segmenting (into stories and commercials), extracting named entities (Aberdeen et al., 1995), and summarizing into key frames and key sentences, MITRE’s BNN (Merlino, Morey, and Maybury, 1997) enables a user is able to browse and search broadcast news and/or visualizations thereof. A range of on-line customizable views of news summaries by time, topic, or named entity enable the user to quickly retrieve segments of relevant content.

Figure 2. Detailed Video Story Display.

In terms of multilingual information access, one problem is that machine translation systems often provide only gist quality translations. Nonetheless, these can be useful to aid users judge relevance to their tasks. Figure 3 illustrates a page retrieved from the web by searching for German chemical companies using the German words "chemie" and "gmbh". After locating a German-language web site, a web based machine translation engine (Systran) was used to obtain a gist-quality translation of the chemical products (Figure 4). Note how the HTML document structure enhances the intelligibility of the resultant translation.

Figure 3. Original Foreign Language Internet Page.

Figure 4. Translated Language Internet Page.

9.5 Major Current Bottlenecks and Problems

Well before the 1990s, researchers identified the need for medium-independent representations and the ability to convert them automatically to medium-specific representations (Mackinlay, 1986; Roth et al., 1990; Arens and Hovy, 1995). As multimedia interfaces become more sophisticated, this need keeps expanding to include additional phenomena, including ‘lexicons’ of hand gestures, body postures, and facial expressions, and information about the non-lexical text-based and intonation-based pragmatic cues that signal speaker attitude and involvement.

As discussed in Chapter 1, this area also has a strong need for resources of all kinds to support research on various topics, including multimedia content (e.g., Web, News, VTC), multimedia interaction (need for instrumentation), and multiparty interaction (e.g., CSCW).

A third issue arises from the unprecedented increase in the development of media; almost monthly, it sometimes seems, new inventions are announced. This poses a problem for system builders, who are faced with a bewildering array of possible hardware and software combinations to choose from, but who have no way to evaluate and compare them. As a result, he or she may waste a lot of time and may end up with an inferior system, and never even know it. One way to alleviate the problem is to develop a set of standards, or at least a common framework, under which different media (both hardware devices and software applications or interfaces) can be brought together and related, using a common set of terms. In order to determine what the framework should be like, however, it is important first to understand the problems. In this paper we outline three basic problems apparent today and then describe an approach that, we believe, will help in solving them, using a construct that Hovy and Arens (1996) call Virtual Devices. A Virtual Device embodies an abstract specification of all the parameters involved in media drivers, specifically with regard to the characteristics of information they handle and the kid of display or interaction they support. These parameters include hardware and software requirements, media functionality, user task, human ergonomics, humans’ cognitive features, etc. Using some such construct, as long as it adheres to a recognized set of standards, facilitates the organization of current media devices and functionalities, the evolution of the best new ones, and cooperative research on these issues.

Finally, the questions of intellectual property (ownership and distribution of media and knowledge resources) remain a perennial problem.

9.6 Major Breakthroughs in the Near Term

Given advances in corpus based techniques for information retrieval and information extraction (e.g., commercial tools for proper name extraction with 90% performance), coupled with the current transfer of these techniques to multilingual information retrieval and extraction, we can expect their application to multilingual understanding. We also believe there is an equivalent opportunity for multimedia generation for other languages. This presents the following challenges:

Integration of language processing and hypermedia;

Integration of multimodal processing mechanisms, e.g., image and language processing;

Transfer of HCI evaluation techniques (e.g., wizard of oz studies, cognitive walkthrough tests, task-based evaluation) to multimodal communication research.

Transfer of techniques from related areas will be an important concern. For example, researchers are beginning to take statistical and corpus based techniques formerly applied to single media (e.g., speech, text processing) and apply these to multimedia (e.g., VTC, TV, CSCW).

This work will enable new application areas for relatively unexplored tasks, including:

multimodal/lingual information access.

multimodal/lingual presentation generation (summarization).

multimodal/lingual collaboration environments.

9.7 Role of Multiple Languages

As indicated above, multimodal interaction resides in the integration of multiple subfields. When extending techniques and methods to multiple languages, we have the benefit of drawing upon previous monolingual techniques. For example, language generation techniques and components (e.g., content selection, media allocation, and presentation design), built initially for monolingual generation, can often be reused across languages. Analogously, interaction management components (e.g., user and discourse models) can be reused.

Of course, many language specific phenomena remain to be addressed. For example, in generation of multilingual and multimedia presentations, lexical length affects the layout of material both in space and in time. For instance, in laying out a multilingual electronic yellow pages, space may be strictly limited given a standard format and so variability in linguistic realization across languages may pose challenges. In a multimedia context, one might need to not only generate language specific expressions, but also culturally appropriate media.

Making further progress in this area, researchers may take advantage of some unique resources to help develop systems perform multimedia information access, including dubbed movies, multilingual broadcast news, that might help accelerate the development of, for example, multilingual video corpora.

9.7.1 An Example: Computer Assisted Language Learning Products

Foreign Language Learning constitutes an example in the field of Multimedia Communication. In this field it is widely accepted that a communicative approach combining dialogues based on real life situations in the form of video and textual information could prove to be very profitable for foreign language learners. In addition, the modular design of this type of software can enable multilingual support by as many languages as required without further significant effort for localization.

Learning more than one foreign language is a political and cultural choice in Europe, a policy which aims towards preserving the cultural heritage, part of which are European languages.

Thus, in Europe it is important to be able to translate readily, especially from less-spoken to widely spoken languages or vice versa. It is equally important that young people as well as adults learn other foreign languages either for business or for cultural purposes.

The current situation with respect to the level of capabilities of the Computer Assisted Foreign Language learning products can be summarized as follows:

A few static language resources available in these products

Lack of connectivity/links between textual and multimedia information

Lack of tools for correction of the learner’s mistakes

The above-mentioned disadvantages can now be faced on the basis of current language technologies available, which are in a position to provide attractive solutions to facilitate language acquisition on the one hand and motivate people to learn foreign languages on the other.

Future products could integrate resources offering the student the possibility to have access not only to some specific language phenomenon related a particular situation in a static way, but go a step beyond and handle dynamically all language resources.

To make things more explicit, we provide a possible scenario.

Supposing that Greek is the foreign language and that the learner’s mother tongue is French. The learner could be able at any time to open her/his French to Greek dictionary with a click on a French word, see the Greek equivalent in written form, see how this word is pronounced by means of the International Phonetic Alphabet, and hear the Greek word using a high quality text-to-speech synthesis system. Furthermore, the learner could see how a word is used in context, by having access to that part of the video where the word is actually being used. Otherwise, the learner can have access to the examples included in the dictionary and can also hear them via synthetic speech. All the above functions could apply to a Greek-French dictionary, as well. Both dictionaries are considered useful, as they respond to different needs.

Numerous other language-based tools are useful for foreign-language instruction, including:

morphological dictionaries, to help with learning inflectional systems;

tools able to visualize correct stress position;

tools to visualize the pronunciation effort of a student ;

tools assisting the correct writing of a foreign language;

tools accommodating the existence of information in parallel and aligned texts;

multilingual spelling and grammar checkers;

more advanced tools of speech and text understanding, including automatic translation systems.

To sum up, multimedia communication in foreign language learning situations requires the integration of many language processing tools in order to facilitate the learner to correctly learn a new language. It is a new and attractive technology that should be developed very soon.

9.8 Systems Research

Multimedia communication systems, which incorporate multiple subsystems for analysis, generation and interaction management, raise new research questions beyond the well known challenges which occur in component technologies (e.g., learnability, portability, scalability, performance, speed within a language processing system). These include inter-system error propagation, inter-system control and invocation order, and human system interaction.

9.8.1 Evaluation and Error Propagation

As systems increasingly integrate multiple interactive components, there is an opportunity to integrate and/or apply software in parallel or in sequence. The order of application of software modules is a new and nontrivial research issue. For example, in an application where the user may retrieve, extract, translate, or summarize information, one may influence the utility of the output just by sequencing systems according to their inherent performance properties (e.g., accuracy or speed). For example, one might use language processing to enhance post-retrieval analysis (extract common terms across documents, re-rank documents provide translated summaries) to focus on relevant documents. These documents might then cue the user with effective keywords to search for foreign language sources, whose relevance is assessed using a fast but low quality web-based translation engine. In contrast to this order, placing the translation step initially would have been costly, slow, and ineffective. An analogous situation arises in the search of multimedia repositories. Old and new evaluation measures, metrics, and methods will be required in this multifaceted environment.

9.8.2 Multilingual and Multimodal Sources

New research opportunities are raised by processing multilingual and multimodal sources, including the challenge of summarizing across these. For example, what is the optimal presentation of content and in which media or mix of media? See (Merlino and Maybury, 1999). Or consider that in broadcast news spoken language transcription, the best word error rates are currently around 10% for anchor speech. What is the cascaded effect of subsequently extracting entities, summarizing, or translating the text? This also extends to the nature of the interface with the user. For example, applying a low quality speech-to-text transcriber followed by a high quality summarizer may actually result in poorer task performance than providing the user with rapid auditory preview and skimming of the multimedia source.

9.8.3 User Involvement in Process

How should users interact with these language-enabled machines? Users of AltaVista are now shown foreign web sites matching their queries, with offers to translate them. When invoking the translator, however, the user must pick the source and target language, but what if the character sets and language are unrecognizable by the user? What kind of assistance should the user provide the machine, and vice versa? Should this extend to providing feedback to enable machine learning? Would this scale up to a broad set of web users? An in terms of multimedia interaction, who do we develop models of interaction that adequately address issues such as uni- and multi-modal (co)reference, ambiguity, and incompleteness?

9.8.4 Resource Inconsistencies

Finally, with the emergence of multiple language tools, users will be faced with systems that use different language resources and models. This can readily result in incoherence across language applications, an obvious case being when the language analysis module interprets a user query containing a given word, but the language generation module employs a different word in the output (because the original is not in its vocabulary). This may result in undesired implicatures by the user. For example, if a user queries a multilingual database for documents on "chemical manufacturers", and this is translated into a query for "chemical companies", many documents on marketing and distribution companies would also be included. If these were then translated and summarized, a user might erroneously infer that most chemical enterprises were not manufacturers. This situation can worsen when the system’s user and discourse models are inconsistent across problem domains.

9.9 Conclusion

We have outlined the history, developments and future of systems and research in multimedia communication. If successfully developed and employed, these systems promise:

More efficient interaction: enabling more rapid task completion with less work.

More effective interaction: doing the right thing at the right time, tailoring the content and form of the interaction to the context of the user, task, dialogue.

More natural interaction: supporting spoken, written, and gestural interaction, ideally as if interacting with a human interlocutor.

Because of the multidimensional nature of multimedia communication, interdisciplinary teams will be necessary and new areas of science may need to be invented (e.g., moving beyond psycholinguistic research to "psychomedia" research). New, careful theoretical and empirical investigations as well as standards to ensure cross system synergy will be required to ensure the resultant systems will enhance and not detract from the cognitive ability of end users.

9.10 References

Aberdeen, J., Burger, J., Day, D., Hirschman, L., Robinson, P., and Vilain, M. 1995. Description of the Alembic System Used for MUC-6. In Proceedings of the Sixth Message Understanding Conference (MUC-VI). Advanced Research Projects Agency Information Technology Office, Columbia, MD, November 1995.

Arens, Y., L. Miller, S.C. Shapiro, and N.K. Sondheimer. 1988. Automatic Construction of User-Interface Displays. In Proceedings of the 7th AAAI Conference, St. Paul, MN, 808—813. Also available as USC/Information Sciences Institute Research Report RR-88—218.

Arens, Y. and E.H. Hovy. 1995. The Design of a Model-Based Multimedia Interaction Manager. AI Review 9(3) Special Issue on Natural Language and Vision.

Baecker, R., J. Grudin, W. Buxton, and S. Greenberg. 1995. Readings in Human-Computer Interaction: Toward the Year 2000 (2^nd ed). San Francisco: Morgan Kaufmann.

Bolt, R.A. 1980. "Put-That-There": Voice and Gesture at the Graphics Interface. In Proceedings of the ACM Conference on Computer Graphics, New York, 262—270.

Brooks, F.P., M. Ouh-young, J.J. Batter, and P.J. Kilpatrick. 1990. Project GROPE--Haptic Displays for Scientific Visualization. Computer Graphics 24(4), 235—270.

Bruffaerts, A., J. Donald, J. Grimson, D. Gritsis, K. Hansen, A. Martinez, H. Williams, and M. Wilson. 1996. Heterogeneous Database Access and Multimedia Information Presentation: The Final Report of the MIPS Project. Council for the Central Laboratory of the Research Councils Technical Report RAL-TR-96-016.

Elhadad, M. 1992. Using Argumentation to Control Lexical Choice: A Functional Unification-Based Approach. Ph.D. dissertation, Columbia University.

Faconti, G.P. and D.J. Duke. 1996. Device Models. In Proceedings of DSV-IS’96.

Feiner, S. and K.R. McKeown. 1990. Coordinating Text and Graphics in Explanation Generation. In Proceedings of the 8th AAAI Conference, 442—449.

Hendricks, G. et al. 1970. NL Menus.

Hovy, E.H. 1988. Planning Coherent Multisentential Text. In Proceedings of the 26th Annual Meeting of the Association for Computational Linguistics, Buffalo, NY.

Hovy, E.H. and Y. Arens. 1996. Virtual Devices: An Approach to Standardizing Multimedia System Components . Proceedings of the Workshop on Multimedia Issues, Conference of the European Association of Artificial intelligence (ECAI). Budapest, Hungary.

Mackinlay, J. 1986. Automatic Design of Graphical Presentations. Ph.D. dissertation, Stanford University.

Mann, W.C. and C.M.I.M. Matthiessen. 1985. Nigel: A Systemic Grammar for Text Generation. In Systemic Perspectives on Discourse: Selected Papers from the 9th International Systemics Workshop, R. Benson and J. Greaves (eds), Ablex: London, England. Also available as USC/ISI Research Report RR-83-105.

Maybury, M.T. editor. 1993. Intelligent Multimedia Interfaces. AAAI/MIT Press. ISBN 0-262-63150-4. http://www.aaai.org:80/Press/Books/Maybury1/maybury.html.

Maybury, M.T. editor. 1997. Intelligent Multimedia Information Retrieval. AAAI/MIT Press. http://www.aaai.org:80/Press/Books/Maybury2.

Maybury, M.T. and W. Wahlster. editors. 1998. Readings in Intelligent User Interfaces. San Francisco: Morgan Kaufmann. ISBN 1-55860-444-8.

McKeown, K.R. 1985. Text generation: Using Discourse Strategies and Focus Constraints to Generate Natural Language Text. Cambridge: Cambridge University Press.

Merlino, A., D. Morey, and M.T. Maybury. 1997. Broadcast News Navigation using Story Segments. In Proceedings of the ACM International Multimedia Conference, 381—391. Seattle, WA, November 1997.

Merlino, A. and M.T. Maybury. 1999. An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News. In I. Mani and M.T. Maybury (eds) Automated Text Summarization.

Meteer, M.W., D.D. McDonald, S. Anderson, D. Forster, L. Gay, A. Huettner, and P. Sibun. 1987. MUMBLE-86: Design and Implementation. COINS Technical Report 87-87, University of Massachusetts (Amherst).

Moore, J.D. 1989. A Reactive Approach to Explanation in Expert and Advice-Giving Systems. Ph.D. dissertation, University of California at Los Angeles.

Neal, J.G. 1990. Intelligent Multi-Media Integrated Interface Project. SUNY Buffalo. RADC Technical Report TR-90-128.

Rich, E. 1979. User Modeling via Stereotypes. Cognitive Science 3 (329—354).

Roth, S.F. and J. Mattis. 1990. Data Characterization for Intelligent Graphics Presentation. In Proceedings of the CHI’90 Conference, 193—200.

Roth, S.F., J.A. Mattis, and X.A. Mesnard. 1990. Graphics and Natural Language as Components of Automatic Explanation. In J. Sullivan and S. Tyler (eds), Architectures for Intelligent Interfaces: Elements and Prototypes. Reading: Addison-Wesley.

Stock, O. and the NLP Group. 1993. AlFresco: Enjoying the Combination of NLP and Hypermedia for Information Exploration. In M. Maybury (ed.), Intelligent Multimedia Interfaces. Menlo Park: AAAI Press.

Stock, O., C. Strappavera, and M. Zancanaro. 1997. Explorations in an Environment for Natural Language Multimodal Information Access. In M. Maybury (ed), Intelligent Multimodal Information Retrieval. Menlo Park: AAAI Press.

Wahlster, W., E. André, S. Bandyopadhyay, W. Graf, T. Rist. 1992. WIP: The Coordinated Generation of Multimodal Presentations from a Common Representation. In A. Ortony, J. Slack, and O. Stock (eds), Computational Theories of Communication and their Applications. Berlin: Springer Verlag.

Zancanaro, M., O. Stock, and C. Strappavera. 1997. Multimodal Interaction for Information Access: Exploiting Cohesion. Computational Intelligence 13(4).