[This chapter is available as http://www.cs.cmu.edu/~ref/mlim/chapter5.html .]
[Please send any comments to Robert Frederking (firstname.lastname@example.org, Web document maintainer) or Ed Hovy or Nancy Ide.]
Multilingual Speech Processing (Recognition and Synthesis)
Editor: Joseph Mariani
Speech processing involves recognition, synthesis, language identification, speaker recognition, and a host of subsidiary problems regarding variations in speaker and speaking conditions. Notwithstanding the difficulty of the problems, and the fact that speech processing spans two major areas, acoustic engineering and computational linguistics, great progress has been made in the past fifteen years, to the point that commercial speech recognizers are increasingly available in the late 1990s. Still, problems remain both at the sound level, especially dealing with noise and variation, and at the dialogue and conceptual level, where speech blends with natural language analysis and generation.
5.1 Definition of Area
Speech processing comprises several areas, primarily speech recognition and speech synthesis, but also speaker recognition, language recognition, speech understanding and vocal dialog, speech coding, enhancement, and transmission. A panorama of techniques and methods may be found in Cole et al. (1998) or Juang et al. (1998).
Speech recognition is the conversion of acoustic information into linguistic information, which may result in a written transcription or may have to be understood. Speech synthesis is the conversion of linguistic information into acoustic information for human auditory consumption. The starting point may be a text or a concept that has to be expressed.
5.2 The Past: Where We Come From
5.2.1 Major Problems in Speech Processing
Many of the problems that have been addressed in the history of speech processing concern variability:
Noise and channel distortions are difficult to handle, especially when there is no a priori knowledge of the noise or of the distortion. These phenomena directly affect the acoustics of the signal, but may also indirectly modify the voice at the source. One such indirect modification is the Lombard effect, in which noise changes the way words are uttered (people tend to speak louder). The voice may also change because of the psychological awareness of speaking to a machine.
The fact that, contrary to written texts, speech is continuous and has no silence to separate words, adds extra difficulty. But continuous speech is also difficult to handle because linguistic phenomena of various kinds may occur at the junctions between words, or within words which are often used, and which are usually short and therefore much affected by coarticulation.
5.2.2 History of Major Methods, Techniques, and Approaches
Regarding speech synthesis, the origins may be placed very early in time. The first result in the field dates from 1791, when W. von Kempelen demonstrated his speaking machine, a mechanical apparatus mimicking the human vocal apparatus. The next major successful attempt came at the New York World's Fair in 1939, when H. Dudley presented the Voder, based on electrical devices. In this case the approach was one of analysis-synthesis: the sounds were first analyzed and then replayed. In both cases, it was necessary to learn how to play those very special musical instruments (one week in the case of the Voder), and the human demonstrating the systems probably used the now well-known trick of announcing to the audience what they would hear, and thus inducing the understanding of the corresponding sentence. Since then, major progress has been made in the field, with two basic approaches that still reflect the von Kempelen/Dudley dichotomy between "Knowledge-Based" and "Template-Based" approaches. The first approach is based on the functioning of the vocal tract, and often goes together with formant synthesis (the formants are the resonances of the vocal tract). The second is based on the synthesis of pre-analyzed signals, which leads to diphone synthesizers, and more generally to signal segment concatenation. A speech synthesizer for American English was designed at MIT using the first approach (Klatt, 1980), and resulted in the best synthesizer available at that time. Several works may also be reported in the field of articulatory synthesis, which aims at mimicking more closely the functioning of the vocal apparatus. However, the best quality is presently obtained by diphone-based approaches or the like, simply using PCM-encoded signals, as illustrated in particular by the PSOLA system designed at CNET (Moulines and Charpentier, 1990).
In addition to the phoneme-to-sound levels, Text-to-Speech synthesis systems also contain a Grapheme-to-Phoneme conversion level. This operation initially used a large set of rules, including morpho-syntactic tagging and even syntactic parsing to solve some difficult cases. Several attempts to perform this operation by automatic training on large amounts of text or directly on the lexicon, using stochastic approaches or even Neural Nets, produced encouraging results, and even claims that machines "able to learn reading" had been invented. However, rule-based approaches still produce the best results. Specific attention has recently been devoted to the grapheme-to-phoneme conversion of proper names, including acronyms. Prosodic markers are generated from the texts using rules and partial parsing.
Regarding speech recognition, various techniques were used in the 60s and 70s. Here too, researchers found their way between knowledge-based approaches for "analytic recognition" and template-matching approaches for "global recognition". In the first case, the phonemes were first recognized, and then linguistic knowledge and AI techniques helped reconstruct the utterance and understand the sentence, despite the phoneme recognition errors. An expert systems methodology was specifically used for phoneme decoding in that approach. In the second approach, the units to be recognized were the words. Template matching systems include a training phase, in which each word of the vocabulary is pronounced by the user and the corresponding acoustic signal is stored in memory. During the recognition phase, the same speaker pronounces a word of the vocabulary and the corresponding signal is compared with all the signals that are stored in memory. This comparison employs a pattern matching technique called Dynamic Time Warping (DTW), which accommodates differences between the signals for two pronunciations of the same word (even the same speaker never pronounces a word exactly the same way twice: the durations of the phonemes, the energy, and the timbre all differ). This approach was first successfully used for speaker-dependent isolated word recognition for small vocabularies (up to 100 words). It was then extended to connected speech, to speaker-independent isolated words, and to larger vocabularies, but independently along each of those three dimensions, by improving the basic technique.
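The template-matching scheme described above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of any historical system: the function names `dtw_distance` and `recognize` are hypothetical, each frame is reduced to a single number (real systems compare acoustic vectors), and the local distance and step pattern are the simplest possible choices.

```python
from math import inf


def dtw_distance(a, b):
    """Cumulative cost of the best time alignment between sequences a and b."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i frames of a
    # with the first j frames of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance (scalar frames)
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of b is stretched over several frames of a
                cost[i][j - 1],      # frame of a is stretched over several frames of b
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


def recognize(signal, templates):
    """Return the vocabulary word whose stored template is closest to the signal."""
    return min(templates, key=lambda word: dtw_distance(signal, templates[word]))


# Training phase: one stored template per word (toy values).
templates = {"yes": [1, 3, 4, 3, 1], "no": [5, 5, 2, 2]}
# Recognition phase: a slower pronunciation of "yes" still matches its template.
print(recognize([1, 1, 3, 4, 4, 3, 1], templates))
```

Because the warping path may repeat frames of either sequence, the slower utterance aligns with the "yes" template at zero cost, which is exactly the tolerance to duration differences that the paragraph attributes to DTW.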
The next major progress came with the introduction of a statistical approach called Hidden Markov Models (HMMs) by researchers at CMU and IBM (Baker, 1975; Jelinek, 1976). In this case, instead of storing in memory the signal corresponding to a word, the system stores an abstract model of the units to be recognized, which are represented as finite state automata, made up of states and links between states. The parameters of the model are the probability of traversing a link between two states, and the probability of observing a speech spectrum (acoustic vector) while traversing that link. Algorithms were proposed in the late 60s that find those parameters (that is, train the model) (Baum, 1972), and that match a model with a signal in an optimal way (Viterbi, 1967), similarly to DTW. One interesting feature of this approach is that a given model can include parameters which represent different ways of pronouncing a word, for different speaking styles of the same speaker or for different speakers, and different pronunciations of the words, with different probabilities. Even more interestingly, it is possible to train phoneme models instead of word models. The recognition process may then be expressed as finding the word sequence that is most probable given the signal. By Bayes' rule, this amounts to maximizing the product of the probability that the signal was produced by the word sequence (Acoustic Model) and the probability of the word sequence (Language Model). This latter probability can be obtained by computing the frequency of the succession of two (bigrams) or three (trigrams) words in texts or speech transcriptions corresponding to the kind of utterances which will be considered in the application. It is also possible to consider the probabilities of grammatical category sequences (biclass and triclass models).
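The decomposition into Acoustic Model and Language Model can be written out as follows. The symbols are assumptions introduced here for illustration: W stands for a candidate word sequence w_1 ... w_n, A for the observed sequence of acoustic vectors, and c(.) for a count in the training texts. The last line shows the bigram case with plain relative-frequency estimates.

```latex
\begin{align*}
\hat{W} &= \arg\max_{W} P(W \mid A)
         = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
         = \arg\max_{W} \underbrace{P(A \mid W)}_{\text{Acoustic Model}}
                        \; \underbrace{P(W)}_{\text{Language Model}} \\
P(W) &\approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}),
\qquad
P(w_i \mid w_{i-1}) \approx \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}
\end{align*}
```

The denominator P(A) can be dropped because it does not depend on W, so it has no influence on which word sequence attains the maximum.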
The HMM approach requires very large amounts of data for training, both in terms of signal and in terms of textual data, and the availability of such data is crucial for developing technologies and applications, and evaluating systems.
Various techniques have been proposed for the decoding process (depth-first, breadth-first, beam search, A* algorithm, stack algorithm, Tree Trellis, etc.). This process is very time consuming, and one research goal is to accelerate the process without losing quality.
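As one illustration of the trade-off between speed and quality, the sketch below combines time-synchronous Viterbi decoding with beam pruning: at each frame, hypotheses whose score falls too far below the current best are abandoned. The function name `viterbi_beam`, the toy two-state model, and the `beam` width are all assumptions made for this example; probabilities are attached to states rather than links for brevity, and all of them are assumed to be non-zero so that logarithms are defined.

```python
import math


def viterbi_beam(obs, states, log_start, log_trans, log_emit, beam=10.0):
    """Return (log score, state path) of the best surviving hypothesis."""
    # active maps each state to (log score, best path ending in that state)
    active = {s: (log_start[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = max(score for score, _ in active.values())
        # Beam pruning: drop hypotheses more than `beam` below the current best.
        active = {s: v for s, v in active.items() if v[0] >= best - beam}
        extended = {}
        for s in states:
            # Pick the best surviving predecessor of state s.
            prev, (score, path) = max(
                active.items(), key=lambda kv: kv[1][0] + log_trans[kv[0]][s]
            )
            extended[s] = (score + log_trans[prev][s] + log_emit[s][o], path + [s])
        active = extended
    return max(active.values(), key=lambda v: v[0])


def to_log(table):
    """Convert a dictionary of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Toy model (assumed values): two states emitting symbols "x" and "y".
states = ["a", "b"]
log_start = to_log({"a": 0.6, "b": 0.4})
log_trans = {"a": to_log({"a": 0.7, "b": 0.3}), "b": to_log({"a": 0.4, "b": 0.6})}
log_emit = {"a": to_log({"x": 0.9, "y": 0.1}), "b": to_log({"x": 0.2, "y": 0.8})}

score, path = viterbi_beam(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
print(path, math.exp(score))
```

A wide beam reproduces the exact Viterbi result; a narrow beam explores fewer hypotheses per frame and is faster, at the risk of pruning the path that would eventually have won.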
This statistical approach was proposed in the early 70s. It was developed throughout the early 80s in parallel with other approaches, as there was no quantitative way of comparing approaches on a given task. The US Department of Defense DARPA Human Language Technology program, which started in 1984, fostered an evaluation-driven comparative research paradigm, which clearly demonstrated the advantages of the statistical approach (DARPA, 1989-98). Gradually, the HMM approach became more popular, both in the US and abroad.
In parallel, the connectionist, or neural network (NN), approach was tried in various fields, including speech processing. This approach is also based on training, but is considered to be more discriminative than the HMM approach. However, it is less well suited than HMMs to modeling temporal information. Hybrid systems that combine HMMs and NNs have therefore been proposed. Though they provide interesting results, and in some limited cases even surpass the pure HMM approach, they have not proven their superiority.
This history illustrates how problems were attacked and in some cases partly solved by different techniques: acoustic variability through template matching using DTW in the 70s, followed by stochastic modeling in the 80s; speaker and speaking variability through clustering techniques, followed by stochastic modeling, differential features and more data in the 80s; linguistic variability through N-grams and more data, in the 70s and 80s. It is an example of the classic paradigm development and hybridization for Language Processing, as discussed in Chapter 6. Currently, the largest efforts are devoted to improved language modeling; to phonetic pronunciation variability; to noise and channel distortion, through signal processing techniques and more data; to multilinguality, through more data and better standards; and to multimodality, through multimodal data, integration and better standards and platforms (see also Chapter 9).
5.3 The Present: Major Bottlenecks and Problems
In speech recognition, basic research is still needed in the statistical modeling approach. Some basic assumptions are still very crude, such as considering the speech signal to be stationary, or the acoustic vectors to be uncorrelated. How can HMM capabilities be pushed? Progress continues, using HMMs with more training data, or considering different aspects of the data for different uses, such as understanding or dialog handling, through the use of corpora containing semantically labeled words or phrases. At the same time, large quantities of data are not always available for a given application, and the adaptation of a system to a new application is often very costly. Techniques have been proposed, such as tied mixtures for building acoustic models or backing-off techniques for building language models, but progress is still required. It is therefore important to develop methods that enable easy application adaptation, even if little or no data is available beforehand.
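To make the idea of backing off concrete, the following sketch estimates a bigram model that falls back on unigram frequencies whenever a word pair was never observed. It is only one possible variant, chosen for brevity: the function name `train_backoff_bigram`, the fixed `discount` of 0.5, and the use of absolute discounting to reserve probability mass are assumptions of this example, not the specific technique the text refers to.

```python
from collections import Counter


def train_backoff_bigram(tokens, discount=0.5):
    """Return prob(w, v) estimating P(w | v) with back-off to unigrams."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])  # how often each word is followed by another
    total = len(tokens)
    followers = {}
    for v, w in bigrams:
        followers.setdefault(v, set()).add(w)

    def prob(w, v):
        p_unigram = unigrams[w] / total
        if history[v] == 0:
            return p_unigram  # unknown history: nothing better than the unigram
        if bigrams[(v, w)] > 0:
            # Seen pair: relative frequency, minus a small discount.
            return (bigrams[(v, w)] - discount) / history[v]
        if p_unigram == 0:
            return 0.0  # word never seen at all in the training data
        # Unseen pair: share the mass freed by discounting among the unseen
        # words, in proportion to their unigram frequency.
        seen = followers[v]
        reserved = discount * len(seen) / history[v]
        unseen_mass = 1.0 - sum(unigrams[x] for x in seen) / total
        return reserved * p_unigram / unseen_mass

    return prob


tokens = "the cat sat on the mat the cat ate".split()
prob = train_backoff_bigram(tokens)
print(prob("cat", "the"))  # pair observed in the data
print(prob("sat", "the"))  # pair never observed: backed-off estimate
```

With this construction the probabilities of all vocabulary words after a given history still sum to one, so even a small corpus yields a usable, non-zero estimate for unseen word pairs.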
Using prosody in recognition is still an open issue. Even today, very few operational systems consider prosodic information, as there is no clear evidence that taking prosody into account results in better performance, given the nature of the applications being addressed at present. It appears, however, that some positive results have been obtained for German within the Verbmobil program (Niemann et al., 1997).
Addressing spontaneous speech is still an open problem, and difficult tasks such as DARPA's Switchboard and CallHome projects still achieve poor results, despite the efforts devoted to the development of systems in this area.
Recognizing voice in noisy conditions is also important. Two approaches are pursued in parallel, using either noise-robust front-ends or a model-based approach. The second will probably provide the best results in the long run.
Systems are now becoming more speaker-independent, but commercial systems are still "speaker adaptive": they may recognize a new user with low performance, and improve during additional spoken interaction with the user. Speaker adaptation will remain a research topic for the future, with the goal of making it more natural and invisible. The systems will thus become more speaker-independent, but will still have a speaker adaptation component. This adaptation can also be necessary for the same speaker, if his or her voice changes due to illness, for example.
In speech synthesis, the quality of text-to-speech synthesis is better, but still not good enough to replace "canned speech" (constructed by concatenating phrases and words). The widespread use of Text-to-Speech synthesis for applications such as reading email messages aloud will however probably help make this imperfect voice familiar and acceptable. Further improvement should therefore be obtained on phoneme synthesis itself, but attention should be placed on improving the naturalness of the voice. This involves prosody, as it is very difficult to generate a natural and acceptable prosody from the text, and it may be somewhat easier to do so in the speech generation module of an oral dialogue system. This also involves voice quality, allowing the TTS synthesis system to change its voice to convey the right meaning of a sentence. Voice conversion (allowing a TTS synthesis system to speak with the voice of the user, after analysis of this voice) is another area of R&D interest (Abe et al., 1990).
Generally speaking, the research program for the next years should be "to put back Language into Language Modeling", as proposed by F. Jelinek during the MLIM workshop. It requires taking into account that the data which has to be modeled is language, not just sounds, and that it therefore has some specifics, including an internal structure which involves more than a window of two or three words. This would suggest going beyond Bigrams and Trigrams, to consider parsing complete sentences.
In the same way, as suggested by R. Rosenfeld during the MLIM workshop, it may be proposed "to put Speech back in Speech Recognition", since the data to be modeled is speech, with its own specifics, such as having been produced by a human brain through the vocal apparatus. In that direction, it may be mentioned that the signal processing techniques for signal acquisition were mostly based on MFCC (Mel Frequency Cepstral Coefficients) in the 80s (Davis and Mermelstein, 1980), and are getting closer to perceptual findings with PLP (Perceptual Linear Prediction) in the 90s (Hermansky, 1990).
Several application areas are now developing, including consumer electronics (mobile phones, hand-held organizers), desktop applications (dictation, OS navigation, computer games, language learning), and telecommunications (auto-attendant, home banking, call centers). These applications require several technological advances, including consistent accuracy, speaker-independence and quick adaptation, consistent handling of Out-Of-Vocabulary words, easy addition of new words and names, automatic updating of vocabularies, robustness to noise and channel, barge-in (allowing a human to speak over the system's voice and interrupt it), and also standard software and hardware compatibility and low cost.
5.4 The Future: Major Breakthroughs Expected
Breakthroughs will probably continue to be obtained through sustained incremental improvements based on the use of statistical techniques on ever larger amounts of data and differently annotated data. Every year since the mid-80s we can identify progress and better performance on more difficult tasks. Significantly, results obtained within DARPA's ATIS task (Dahl et al., 1994) showed that the understanding performance obtained on written data transcribed from speech was achieved on actual speech data only one year later.
Better pronunciation modeling will probably enlarge the population that can get acceptable results on a recognition system, and therefore strengthen the acceptability of the system.
Better language models are presently a major issue, and could be obtained by looking beyond N-Grams. This could be achieved by identifying useful linguistic information, and incorporating more Information Theory in Spoken Language Processing systems.
In five years, we will probably have considerably more robust speech recognition for well-defined applications, more memory-efficient and faster recognizers to support integration with multimedia applications, speech I/O embedded in client-server architectures, distributed recognition to allow mass telephony applications, efficient and stable multilingual applications, better integration of NLP in well-defined areas, and much more extensible modular toolkits to shorten the application development lifecycle. While speech is nowadays considered a means of communication, research progress will lead it to be considered a material comparable to text, which can easily be indexed, accessed randomly, sorted, summarized, translated, and retrieved. This view will drastically change our relationship with the vocal medium.
Multimodality is an important area for the future, as discussed in Chapter 9. It can apply to the processing of a single medium, such as speech recognition using both the audio signal and the visual signal of the lips, which results in improved accuracy, especially in noisy conditions. But it can also address different media, such as integrating speech, vision and gesture in multimodal multimedia communication, which includes the open issue of sharing a common reference for the human and the machine. Multimodal training is another dimension, based on the assumption that humans learn to use one modality by getting simultaneous stimuli coming from different modalities. In the long run, modeling speech will have to be considered in tandem with other modalities.
Transmodality is another area of interest. It addresses the problem of providing information through different media, depending on which medium is more appropriate to the context in which the user is placed when requesting the information: for example, sitting in front of a computer, in which case text and graphics output may be appropriate, or driving a car, in which case speech output of a summarized version of the textual information may be more appropriate.
5.5 Juxtaposition of this Area with Other Areas
Over the years, speech processing has been moving closer to natural language processing, as speech recognition shifts to speech understanding and dialogue, and as speech synthesis becomes increasingly natural and approaches language generation from concepts in dialogue systems. Speech recognition would benefit from better language parsing, and speech synthesis would benefit from better morpho-syntactic tagging and language parsing.
Speech recognition and speech synthesis are used in Machine Translation (Chapter 4) for spoken language translation (Chapter 7).
Speech processing meets Natural Language Processing, but also computer vision, computer graphics, and gestural communication in multimodal communication systems, with open research issues on the relationship between image, language and gesture, for example (see Chapter 9).
Even imperfect speech recognition meets Information Retrieval (Chapter 2) in order to allow for multimedia document indexing through speech, and retrieval of multimedia documents (such as in the US Informedia (Wactlar et al., 1999) and the EU Thistle or Olive projects). This information retrieval may even be multilingual, extending the capability of the system to index and retrieve the requested information, whatever the language spoken by the user, or present in the data. Information Extraction (Chapter 3) from spoken material is a similar area of interest, and work has already been initiated in that domain within DARPA's Topic Detection and Tracking program. Here also, it will benefit from cooperation between speech and NL specialists and from a multilingual approach, as data is available from multiple sources in multiple languages worldwide.
Speech recognition, speech synthesis, speech understanding and speech generation meet in order to allow for oral dialogue. Vocal dialogue will get closer to research in the area of dialogue modeling (indirect speech acts, beliefs, planning, user models, etc.). Adding a multilingual dimension empowers individuals and gives them universal access to the information world.
5.6 The Treatment of Multiple Languages in Speech Processing
Addressing multilinguality is important in speech processing. A system that handles several languages is much easier to put on the market than a system that can only address one language. In terms of research, the structural differences across languages are interesting for studying any one of them. Rapid deployment of a system to a large market, which necessitates the handling of several languages, is challenging, and several companies offer speech recognition or speech synthesis systems that handle different languages in their different versions, less frequently different languages within a single version. Addressing multilinguality not only includes getting knowledge on the structures and elements of a different language, but also requires accommodating speakers who speak that language with accents that may differ and who use words and sentence structures that may be far away from the canonical rules of the language.
As discussed in Chapter 7, language identification is part of multilingual speech processing. Detecting the language spoken enables selecting the right Acoustic and Language Models. An alternative could be to use language-independent Acoustic Models (and less probably even language-independent Language Models). However, present systems will get into trouble if someone shifts from one language to another within one sentence, or one discourse, as humans sometimes do.
Similarly, a speech synthesis system will have to be able to identify the language of the text in order to pronounce it correctly, and systems aimed at reading email aloud will most often have to shift between the user's language and English, which is used for many international exchanges. Here also, some sentences may contain foreign words or phrases that must be pronounced correctly. Large efforts may be required to gather enough expertise and knowledge on the pronunciation of proper names in various countries speaking different languages, as in the European project Onomastica (Schmidt et al., 1993). Also, successful attempts to quickly train a speech synthesis system by using a large enough speech corpus in that language have been reported (Black and Campbell, 1995). In this framework, the synthesis is achieved by finding in the speech corpus the longest speech units corresponding to parts of the input sentence. This approach requires no extended understanding of the language to be synthesized. Another aspect of multilingual speech synthesis is the possibility of using voice conversion in spoken language translation. In this case, the goal is to translate the speech uttered by the speaker into the target language and to synthesize the corresponding sentence for the listener, using the voice that the speaker would have if he or she were speaking that language. Such attempts were conducted in the Interpreting Telephony project at ATR (Abe et al., 1990).
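The corpus-based idea of covering the input with the longest units found in recorded speech can be illustrated with a greedy search. This is a deliberately simplified sketch and not the selection procedure of the cited work: the function names `find_unit` and `select_units` are hypothetical, letters stand in for phoneme symbols, and the search ignores everything except the symbol sequence (no acoustic or prosodic matching).

```python
def find_unit(chunk, corpus):
    """Return (utterance index, start offset) of the first match of chunk, or None."""
    n = len(chunk)
    for u, utterance in enumerate(corpus):
        for start in range(len(utterance) - n + 1):
            if utterance[start:start + n] == chunk:
                return u, start
    return None


def select_units(target, corpus):
    """Cover target with corpus segments, always taking the longest match first.

    Returns a list of (utterance index, start offset, length) triples.
    """
    units = []
    i = 0
    while i < len(target):
        # Try the longest remaining chunk first, then progressively shorter ones.
        for length in range(len(target) - i, 0, -1):
            hit = find_unit(target[i:i + length], corpus)
            if hit is not None:
                units.append((hit[0], hit[1], length))
                i += length
                break
        else:
            raise ValueError(f"symbol {target[i]!r} does not occur in the corpus")
    return units


# Toy corpus: two "recorded utterances", one symbol per letter.
corpus = [list("hello"), list("world")]
print(select_units(list("low"), corpus))
```

The greedy choice keeps the number of concatenation points low without requiring any linguistic knowledge of the language, which matches the point made in the text; it does not guarantee the fewest possible units in every case.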
Complete multilingual systems therefore require language identification, multilingual speech recognition and speech synthesis, and machine translation for written and spoken language. Some spoken translation systems already exist and work in laboratory conditions for well-defined tasks, including conference registration (Morimoto et al., 1993) and meeting appointment scheduling (Wahlster, 1993).
With respect to multilinguality, there are two important questions. First, can data be shared across languages? (If a system is able to recognize one language, will it be necessary to repeat the same effort to address another one? Or is it possible to reuse, for example, the acoustic models of the phonemes that are similar in two different languages?) Second, can knowledge be shared across languages? (Could the scientific results obtained in studying one language be used for studying another language? As the semantic meaning of a sentence remains the same when it is pronounced in two different languages, it should be possible to model language-independent knowledge independently of the languages used.)
Notwithstanding the difficulty of the problems facing speech processing, and despite the fact that speech processing spans two major areas, acoustic engineering and computational linguistics, great progress has been made in the past fifteen years. Commercial speech recognizers are increasingly available today, complementing machine translation and information retrieval systems in a trio of Language Processing applications. Still, problems remain both at the sound level, especially dealing with noise and variations in speaker and speaking condition, and at the dialogue and conceptual level, where speech blends with natural language analysis and generation.
Abe, M., S. Nakamura, K. Shikano, and H. Kuwabara. 1990. Voice conversion through vector quantization. Journal of the Acoustical Society of Japan, E-11 (71-76).
Baker, J.K. 1975. Stochastic Modeling for Automatic Speech Understanding. In R. Reddy (ed), Speech Recognition (521-542). Academic Press.
Baum, L.E. 1972. An Inequality and Associated Maximization Technique in Statistical Estimation of Probabilistic Functions of Markov Processes. Inequalities 3 (1-8).
Black, A.W. and N. Campbell. 1995. Optimising selection of units from speech databases for concatenative synthesis. Proceedings of the fourth European Conference on Speech Communication and Technology (581-584). Madrid, Spain.
Cole, R., J. Mariani, H. Uszkoreit, N. Varile, A. Zaenen, A. Zampolli, and V. Zue. 1998. Survey of the State of the Art in Human Language Technology. Cambridge: Cambridge University Press (or see http://www.cse.ogi.edu/CSLU/HLTsurvey/HLTsurvey.html).
Dahl, D.A., M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. 1994. Expanding the Scope of the ATIS Task: the ATIS-3 Corpus. Proceedings of the DARPA Conference on Human Language Technology (43-49). San Francisco: Morgan Kaufmann.
DARPA. 1989-1998. Proceedings of conference series initially called Workshops on Speech and Natural Language and later Conferences on Human Language Technology. San Francisco: Morgan Kaufmann.
Davis, S.B. and P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-28 (357-366).
Hermansky, H. 1990. Perceptual linear predictive (PLP) analysis for speech. Journal of the Acoustical Society of America, 87(4) (1738-1752).
Jelinek, F. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE, 64 (532-556).
Juang, B.H., D. Childers, R.V. Cox, R. De Mori, S. Furui, J. Mariani, P. Price, S. Sagayama, M.M. Sondhi, and R. Weischedel. 1998. Speech Processing: Past, Present and Outlook. IEEE Signal Processing Magazine, May 1998.
Klatt, D.H. 1980. Software for a cascade/parallel formant synthesizer. Journal of the Acoustical Society of America, 67 (971-995).
Morimoto, T., T. Takezawa, F. Yato, S. Sagayama, T. Tashiro, M. Nagata, and A. Kurematsu. 1993. ATR's speech translation system: ASURA. Proceedings of the third European Conference on Speech Communication and Technology (1295-1298). Berlin, Germany.
Moulines, E. and F. Charpentier. 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication, 9 (453-467).
Niemann, H., E. Noeth, A. Kiessling, R. Kompe, and A. Batliner. 1997. Prosodic Processing and its Use in Verbmobil. Proceedings of ICASSP-97 (75-78). Munich, Germany.
Schmidt, M.S., S. Fitt, C. Scott, and M.A. Jack. 1993. Phonetic transcription standards for European names (ONOMASTICA). Proceedings of the third European Conference on Speech Communication and Technology (279-282). Berlin, Germany.
Viterbi, A.J. 1967. Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm. IEEE Transactions on Information Theory, IT-13(2) (260-269).
Wactlar, H.D., M.G. Christel, Y. Gong, and A.G. Hauptmann. 1999. Lessons learned from building a Terabyte Digital Video Library. IEEE Computer, 32(2) (66-73).
Wahlster, W. 1993. Verbmobil, translation of face-to-face dialogs. Proceedings of the Fourth Machine Translation Summit (127-135). Kobe, Japan.