CONTENTS


  1. Multilingual and Adaptive Speech Synthesis: Issues Overview
  2. Project: Bootstrapping a Language-Independent Speech Synthesizer
  3. Project: Dynamic Adaptation for Language and Dialect in a Speech Synthesis System
  4. Multilingual Synthesis Framework
  5. Full Example: Korean Synthesizer
  6. Contact Details


MULTILINGUAL AND ADAPTIVE SPEECH SYNTHESIS: ISSUES OVERVIEW
[return]

I am a member of the Adaptive Speech Interfaces group, a recently formed research team in the Department of Computer Science, University College Dublin, and at MediaLab Europe, working in coordination with the Cognitive Machines group at the MIT Media Laboratory. The Adaptive Speech Interfaces group seeks to extend the range of speech technology to easily and rapidly include new languages, especially from developing countries. Speech interfaces which support spoken output will allow access to IT by a vast new user community. The interfaces we design will also be capable of learning and adaptation, to accommodate new user groups and dialects.

While the multilingual capability of a speech system can be seen as a consequence of its modularity and generalizability, we are more concerned with the concept of multilingual flexibility. I find it useful to define the goal of multilinguality in terms of "native performance". That is, a successful multilingual synthetic voice is one with the capability of speaking natively in a multitude of languages.

Of course, this raises a major question: What is it to sound "native" in a given language? To speak "natively" is not itself the same as speaking "fluently". One can speak English perfectly well with a French accent. Nor is it necessarily related to one's education, the size of one's vocabulary in a language, or even the hesitancy with which one speaks.

My work addresses the concept of "nativeness" specifically in terms of the phonetic inventory of a specific language/speaker and the set of rules and lexical entries used to generate a pronunciation of (or for) a desired utterance (itself composed from the phonetic units of a language).

Concatenative speech synthesis primarily involves the recording of a large number of spoken utterances of a given speaker, which are then labelled, chopped up, and re-pasted together, according to a generated pronunciation string, to produce a new utterance. The speaker herself obviously has her own particular accent and dialect [I'm intentionally going to conflate "accent" and "dialect" for the time being, but I'll separate them later when necessary...], and thus the synthesized speech will necessarily share the accent which is evident in the segmented units of speech.
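
To make the mechanics concrete, here is a minimal sketch of that chop-and-re-paste step, assuming a hypothetical unit_db that maps each phone label to a list of recorded unit waveforms; a real system would select among candidate units by context and join cost rather than taking the first one.

    import numpy as np

    def synthesize(pronunciation, unit_db, crossfade=64):
        """Join one recorded unit per phone in the pronunciation string."""
        out = np.zeros(0, dtype=np.float32)
        for phone in pronunciation:
            candidates = unit_db[phone]   # all recorded examples of this phone
            unit = candidates[0]          # a real system selects by context and join cost
            if out.size >= crossfade and unit.size > crossfade:
                # simple linear crossfade to soften the join between units
                fade = np.linspace(0.0, 1.0, crossfade, dtype=np.float32)
                out[-crossfade:] = out[-crossfade:] * (1.0 - fade) + unit[:crossfade] * fade
                unit = unit[crossfade:]
            out = np.concatenate([out, unit])
        return out

    # e.g. synthesize(["h", "eh", "l", "ow"], unit_db) would return a waveform for "hello"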

Suppose one has a custom-built English synthesizer, and a French one, both created as concatenative text-to-speech systems from a set of speech recordings of a native speaker of each respective language. Now suppose we wish to use these synthesizers to create, for instance, a portable tourism support system for use by travellers to Paris. What do we do?

The English synthesizer, unless specifically trained for the purpose, will likely pronounce the names of French places completely incorrectly. We could do the obvious thing: use the English synthesizer for English words, and the French one for proper names. This will work, but it will sound completely artificial: with the two systems based on recordings of different speakers, the travel information will correspondingly jump back and forth not just between languages, but between different speakers.

We could come up with a set of pronunciation rules for the English synthesizer which approximate a proper pronunciation of French words. Of course, this will still sound like a foreigner speaking French: some sounds, for instance the French /r/, are simply not captured by the English voice we have recorded. Conversely (and probably most aptly), we can teach our French synthesizer English, or rebuild a full English synthesizer using our French recordings. The English will, of course, sound heavily accented, but this will be completely believable for the application: we'd expect the person showing us around Paris to be a native speaker of French.
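
As an illustration of what such approximation rules amount to, here is a toy sketch that maps target (French) phones onto the nearest available source (English) phones; the mapping table is invented for the example and is nowhere near a complete rule set.

    # Toy mapping from French phones to the nearest English stand-ins.
    FR_TO_EN = {
        "R": "r",    # the French uvular /R/ has no English equivalent; use the closest match
        "y": "uw",   # French /y/ (as in "tu") approximated by English /uw/
        "~": "",     # nasalization marker simply dropped
    }

    def approximate(french_phones):
        """Replace each French phone with its nearest English stand-in."""
        mapped = [FR_TO_EN.get(p, p) for p in french_phones]
        return [p for p in mapped if p]   # drop anything we cannot render at all

    # approximate(["b", "o", "~", "R"]) -> ["b", "o", "r"]: intelligible, but clearly non-native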

My work attempts to raise the question: how can we get around the necessity of designing and reconfiguring synthesizers for all of these specific purposes? Can we take a speech database we have recorded from an English speaker, and, with just a small example of French speech, use those same speech segments we have already perfected to generate French speech without a nagging foreign accent? Can we, in fact, come up with a "generic" set of units or recorded database which can be used to produce any desired language?

One might point directly to the IPA as a supposed solution to the problem. The problem with the IPA is that (1) it is tied to semantic rather than purely phonetic or phonological differentiation of sounds, and (2) the pronunciation of any given phoneme is influenced not only by its definition but also by its context.

The two research projects described below divide these goals into two distinct tasks. The first project, Bootstrapping a Language-Independent Synthesizer, looks at ways in which a generic, fully multilingually-capable framework can be constructed which allows a synthetic voice of reasonable quality to be built for any language or dialect, without the need for (or availability of) specific linguistic or phonetic information about the given language; in fact, nothing except a set of recordings and a native-orthography transcription. This toolkit, a set of utilities, definition files, and its documentation based on the Festival Speech Synthesis System and FestVox project, will be made fully available to the community.

My dissertation project, Dynamic Adaptation for Language and Dialect in a Speech Synthesis System, looks directly at how effort can be conserved through techniques which allow one to "adapt" a fully-trained speech synthesizer, using only a small amount of recorded "evidence", to speak a new accent, dialect, or language. Specifically, the system listens to the set of target recordings, compares them to its own output to determine the phonetic composition and pronunciation differences between the two utterances, and then modifies itself accordingly to produce speech more closely resembling that of the target.


BBC Indian languages article

MediaLab Asia Indian languages article

Hindi: DHVANI, the text-to-speech software for the Simputer project....




BOOTSTRAPPING A LANGUAGE-INDEPENDENT SPEECH SYNTHESIZER
[PowerPoint file] [return]

Project Abstract: The building of a language-dependent speech synthesizer is a process which typically requires a great deal of linguistic knowledge of the tongue in question. Examples of this include detailed information about the phonetic inventory and pronunciation rules; syntactic and semantic parsing procedures and usage lexicons for part-of-speech information, as well as well-segmented and labelled recordings of a native speaker.

For many common, commercially viable languages, this is no problem. But for minority languages, and a number of non-standard dialects, resources such as computer-usable pronunciation lexicons, text corpora, and carefully recorded spoken-language databases are not available. Unfortunately, the regions where these languages are spoken include many areas with high illiteracy rates, where spoken-language computer interfaces may in fact provide the greatest benefit.

This project aims to set up a generic language-neutral (or conversely, fully multilingual-capable) framework from which a reasonable speech synthesizer can, without the availability of such information, be produced in a relatively automatic and self-sufficient manner. (Of course, if any of this material is available, the framework will also support its use.) Such a framework must, at the same time, provide a number of automated learning operations to enable native-level synthesis, while carefully avoiding any pre-existing assumptions which would prevent it from encompassing a particular language type (such as a tonal language, or one with a non-roman orthography).

Our first system developed using an early version of this framework is a basic Korean-language synthesizer.



DYNAMIC ADAPTATION FOR LANGUAGE AND DIALECT IN A SPEECH SYNTHESIS SYSTEM
[PowerPoint file] [full written proposal - PDF]


Project Abstract: The personalization of a speech synthesis system for a particular use or market can provide much benefit to a deployed system. Recent articles have suggested, in fact, that human listeners connect better with a speaker and voice who sound like them: they find it easier to listen to and understand what is said, and also find it more natural to assign emotional state and to judge such factors as authority, honesty, and even intelligibility.

Given a speech synthesizer targeted for a specific language and dialect, such as US English, we would imagine that much of the training and knowledge incorporated into the system would be highly useful for an alternate accent of the same language, such as London-accented English. The same influence and reusability might be expected between different languages with similar influences or historical ties, such as the set of Romance languages.

This process of modifying a fully trained speech synthesizer for a related use can be considered a synthesis analogue of adaptation processes for speech recognition systems. Adaptation in speech recognition is a procedure in which the acoustic model (or, in limited cases, the language model) of the recognition system, after being fully trained, is provided with additional speech data. Based upon this data, the parameters, nodes, weights, or other coefficients representing the acoustic model are shifted "towards" the new information, such that the system should exhibit improved performance on data resembling the new training data even though such data was not included in its initial training procedure. As a result, the system is transformed from a "speaker-independent" one into a "speaker-dependent" one.
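
For readers unfamiliar with the recognition-side procedure, the sketch below shows the general idea of shifting a model parameter "towards" adaptation data, in the style of MAP adaptation of a Gaussian mean; it illustrates the standard recognition technique referred to here, not this project's synthesis method.

    import numpy as np

    def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
        """
        prior_mean        : mean vector from the speaker-independent model
        adaptation_frames : (N, D) array of feature frames aligned to this state
        tau               : prior weight; larger values trust the original model more
        """
        n = len(adaptation_frames)
        if n == 0:
            return prior_mean
        data_mean = adaptation_frames.mean(axis=0)
        # Weighted interpolation: with little data we stay near the prior,
        # with lots of data we move towards the new speaker's statistics.
        return (tau * prior_mean + n * data_mean) / (tau + n)

    # e.g. adapted = map_adapt_mean(model_mean_for_state, frames_for_state)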

Of course, the nature of concatenative speech synthesis dictates that techniques from speaker adaptation in recognition systems cannot be directly carried over. We envision a system which adapts dynamically from a source accent or language towards a target for which a small set of transcribed recordings is provided. The system compares characteristics of its current output to those in the target recordings, and attempts to discern how the target speaker differs from itself, specifically in terms of the phonology (phone set) and the set of rules and exceptions for generating pronunciations from native-orthography text. The system then attempts to modify its structure and rules to more closely mimic the target-language speech.

The techniques of speaker adaptation in speech recognition attempt, using a minimum of voice data, to customize a general speaker-independent system into a targeted speaker-dependent one. We believe that these same techniques can be applied, with some modification, to the adaptation of speech synthesis systems.

Synthesis adds an additional problem beyond recognition adaptation: the database of recorded segments is itself used for concatenation. This means that we cannot simply merge the entire set of recorded data together; there would be noticeable discrepancies between concatenative units taken from each individual speaker. On the other hand, if we just use the new set of segments, we aren't adapting; we're just building a new synthesizer. For this study, we take the new target data to be a small data set, not enough to be a good set of units for synthesis on its own.

We are thus required to use existing (source) units for synthesis. However, these source recordings and their associated existing synthetic voice have a specific accent/dialect, with a pre-defined phone set. Even with a proper dictionary and proper letter-to-sound rules providing us with a "proper" pronunciation that takes into account pronunciation variation for our target accent, stringing the "best match" units together likely won't sound like a native speaker of that accent. The vowel quality might be vastly different, or phones might be missing in the source language (e.g., a French /r/). We need to compensate for this. Overall, we want to sound native in the target accent/dialect/language, using units recorded from a speaker of a different one.


For the first phase of this project, we are examining adaptation between English accents/dialects utilizing the IViE Corpus, a set of recordings demonstrating intonational and dialectal variation of English in the British Isles.

Stage two will investigate pairs with a higher degree of difference: namely, adaptation between Scottish Gaelic and Irish (which are, to some degree, mutually intelligible) and then on to Welsh (which, although historically related, isn't).

Finally, we will examine adaptation between a [still-to-be-selected] set of Indic languages.


Here's the theoretical behavior of the system.

Say we're starting with a "well-trained" unit-selection concatenative synthesizer for US-accented English. In the creation of this synthesizer, we had a number of choices to make, such as the phone set to use (assuming it wasn't fully dictated by the availability of a lexicon). We could choose something closely designed for US English (such as the ARPAbet and CMUdict), something aimed at being more language-independent (such as the IPA or SAMPA), or conversely something more speaker-dependent, such as phones derived from some sort of recursive clustering and separation of acoustic centroids of units in the actual speech signal of our source speaker. Whatever set of phones we're using, we also have a means of generating pronunciations (a list of desired phones) for our desired synthetic speech from text. We then cluster the recorded units for each phone according to "predictable" acoustic features so that we can later predict desired features and choose an appropriate unit for synthesis.
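
As a rough illustration of that clustering-and-selection step, the sketch below clusters a phone's units by their predictable features and then picks the unit nearest a predicted feature vector; the use of k-means and the choice of features are assumptions for the example, not the actual build recipe.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_units(unit_features, n_clusters=8):
        """unit_features: (N, D) array, one row of predictable features per unit."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        labels = km.fit_predict(unit_features)
        return km, labels

    def select_unit(km, labels, unit_features, predicted_features):
        """Pick the recorded unit closest to the features predicted from text."""
        cluster = km.predict(predicted_features.reshape(1, -1))[0]
        members = np.flatnonzero(labels == cluster)
        dists = np.linalg.norm(unit_features[members] - predicted_features, axis=1)
        return members[np.argmin(dists)]    # index of the chosen unit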

We next have a few recordings of a target speaker of, say, UK-accented English. Assume we also have the equivalent of a basic phone recognizer to go along with our US-English voice. Using our US synthesizer, we generate our expected phone sequences for the spoken text. We then compare these to the target-accented speech, and (ideally) we'd find significant differences in certain phones which are indicative of the difference in accents. Using some sort of distance or confidence measure, we'd choose to modify our synthesizer's pronunciation to get closer to that of the target.
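
Here is a minimal sketch of that comparison step, assuming the predicted and recognized phone sequences are simply lists of phone labels; difflib alignment stands in for a proper phonetic alignment, and the recurring substitution pairs it tallies are the "significant differences" we would then score.

    from collections import Counter
    from difflib import SequenceMatcher

    def phone_differences(predicted, observed):
        """Return a Counter of (source_phone, target_phone) difference pairs."""
        diffs = Counter()
        for op, i1, i2, j1, j2 in SequenceMatcher(None, predicted, observed).get_opcodes():
            if op == "replace":
                # zip truncates uneven replacements; good enough for a sketch
                for src, tgt in zip(predicted[i1:i2], observed[j1:j2]):
                    diffs[(src, tgt)] += 1
            elif op == "delete":
                for src in predicted[i1:i2]:
                    diffs[(src, None)] += 1    # phone we say that the target omits
            elif op == "insert":
                for tgt in observed[j1:j2]:
                    diffs[(None, tgt)] += 1    # phone the target adds
        return diffs

    # Pairs that recur across many utterances (e.g. ("ae", "aa")) are candidates
    # for a systematic accent difference rather than a recognition error.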

How do we modify our synthesizer? There are two separate issues: determining the new pronunciation, and generating it.

Once we believe we have found a pronunciation difference, we need to make a decision: are we replacing, or moving, a phoneme? That is, is a different phone being pronounced in the same context? Is it a variety of a phone we already know, or something completely different? Is it a replacement, a deletion, or an addition? Basically, two decisions need to be made: (1) is this an exception (i.e., change it only in the lexicon), a regularity (i.e., change it in the letter-to-sound rules), or do we simply keep the pronunciation we have and "redefine" in some way the phone being pronounced; and (2) what change should be made? Are we redefining a phone, or adding a different one?
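
One could imagine a toy decision rule along the following lines, where the thresholds and counts are purely illustrative assumptions: a substitution seen consistently across many different words is treated as a letter-to-sound regularity, one confined to a few entries as a lexicon exception, and anything weaker is left alone.

    def classify_change(word_counts, min_words=20, min_consistency=0.8):
        """
        word_counts : for one candidate substitution, a dict mapping each word in
                      which the source phone appears to (times_seen, times_substituted)
        """
        words_seen = [w for w, (seen, _) in word_counts.items() if seen > 0]
        if not words_seen:
            return "keep existing pronunciation"
        consistent = [w for w, (seen, subbed) in word_counts.items()
                      if seen > 0 and subbed / seen >= min_consistency]
        if len(consistent) >= min_words and len(consistent) / len(words_seen) >= min_consistency:
            return "letter-to-sound rule"      # regular across the vocabulary
        if consistent:
            return "lexicon exception"         # limited to particular entries
        return "keep existing pronunciation"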

The next issue is how to modify our selection of synthesis units to produce the new pronunciation. There seem to be three basic techniques, with a lot of possible variations. The first is to simply merge the units from the target voice into the whole selection set. This will likely sound the worst, because we're selecting units from two separate speakers. Secondly, we can use the units from the target speech as "centroids" or models for clustering the source-speaker units, without actually including them in the database. This seems the most promising: mimicry studies suggest that one can do a reasonably good job (at least, an acoustically discernible job) of approximating a foreign accent through basic modifications such as duration, stress, etc. This would involve modifying our prediction and unit-selection models to use data trained from target-dialect speech, but selecting among source-accent units. The third option is creating an actual signal-processed "morph" of the units of the two voices and selecting from that. This tends to sound rather ugly.
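
A minimal sketch of that second, centroid-style option, assuming per-unit acoustic feature vectors are already extracted: the target speaker's few examples of each phone supply only a mean feature vector, which then steers selection among the source speaker's units.

    import numpy as np

    def target_centroids(target_units_by_phone):
        """Mean feature vector of the target speaker's examples of each phone."""
        return {phone: np.mean(feats, axis=0)
                for phone, feats in target_units_by_phone.items()}

    def select_source_unit(phone, source_units_by_phone, centroids):
        """Choose the source-speaker unit closest to the target-accent centroid."""
        feats = source_units_by_phone[phone]   # (N, D) features of the source units
        target = centroids.get(phone)
        if target is None:                     # no target evidence: fall back to default
            return 0
        return int(np.argmin(np.linalg.norm(feats - target, axis=1)))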




MULTILINGUAL SYNTHESIS FRAMEWORK
[return]

The Multilingual Synthesis Framework, as described in the "Bootstrapping a Language-Independent Speech Synthesizer" project, is a set of definition files, scripts, and utilities which, in cooperation with the Festival Speech Synthesis System and the FestVox Project, help to automate the creation of reasonable speech synthesis voices for an arbitrary language without the need for linguistic or language-specific information.

[Specifically, the Festival System is used for waveform synthesis and some early processing, while other sections of the phonetic processing and build process have been externalized into locally developed tools.]

Use of the Framework requires a set of recordings (around 100 sentences or more should provide a reasonable level of phonetic coverage), along with a transcript in native or romanized orthography; the reading of a few dozen newspaper articles should provide such coverage. If available, a computer-readable pronunciation lexicon can also be utilized.
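
As a rough, purely illustrative stand-in for "phonetic coverage" when no phonetic information is available, one can watch how the inventory of characters and character pairs in the native-orthography transcript grows as sentences are added; this is not part of the framework's own tooling.

    from collections import Counter

    def orthographic_coverage(sentences):
        """Count distinct characters and character pairs in a transcript."""
        chars, bigrams = Counter(), Counter()
        for sent in sentences:
            text = sent.strip()
            chars.update(text)
            bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
        return len(chars), len(bigrams)

    # A transcript whose character and bigram counts stop growing as sentences
    # are added is, very roughly, approaching the coverage it will ever provide.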

More information is available from the framework's README.txt file [Word file], or by browsing the directory of files.
 
FULL EXAMPLE: KOREAN SYNTHESIZER
[sound sample]

This Korean synthesizer is a preliminary example of a synthesized voice developed using an early version of the Multilingual Framework. It is based upon a set of __ recordings of Korean sentences provided by the ____ group.

As described above, one major component of the Framework's operation is the manner in which a relationship is established between the orthography of the language and the pronunciations generated by the system. The unique hangul writing system used by Korean provides an interesting case for examination.

Hangul is in one sense a syllabic writing system -- each character represents a syllable of speech, but at the same time it can be broken down into sub-characters called jamo, each of which represents a specific consonant or vowel sound.

One "stealthy" way in which we utilized pre-organized linguistic information is inherent in the design of the Unicode codeset, which is designed to be able to represent an superset inclusive of the characters in [the majority of] writing systems on earth. In this case, the encoding of korean hangul characters implicitly provides non-ambigous "instructions" on their decomposition into this jamo forms. (Similiarly, the Unicode representations of accented roman characters, such as a', provide implicit instructions on decomposing such characters into the base vowel plus the accent form). To generate a more direct and learnable relationship between orthographic form and phonetic units, we first pre-process the source text with all indicated decompositions.

We are currently in search of a suitable recording set to use in testing the toolkit with an Indic language.



Craig Olinsky [CV - Word file]
colinsky@mle.media.mit.edu
Research Fellow [Medialab Europe];
Ph.D. Candidate [University College Dublin]

MediaLab Europe
Sugar House Lane
Crane Street
Dublin 8
Republic of Ireland

+353 (1) 474-2837