Newsgroups: comp.speech
Path: lyra.csx.cam.ac.uk!warwick!slxsys!pipex!sunic!trane.uninett.no!eunet.no!nuug!EU.net!uunet!news.iij.ad.jp!wnoc-tyo-news!aist-nara!wnoc-kyo-news!atrwide!atr-la!awb
From: awb@itl.atr.co.jp (Alan W Black)
Subject: Re: TTS Technology
In-Reply-To: blueb@netcom.com's message of Tue, 6 Sep 1994 07:12:24 GMT
Message-ID: <AWB.94Sep9165611@as53.itl.atr.co.jp>
Sender: news@itl.atr.co.jp (USENET News System)
Nntp-Posting-Host: as53
Organization: ATR Interpreting Telecommunications Research Labs.,Japan
References: <bluebCvp5Co.3K2@netcom.com>
Date: Fri, 9 Sep 1994 07:56:11 GMT
Lines: 121


Let me try to answer some of these questions ...

In article <bluebCvp5Co.3K2@netcom.com> blueb@netcom.com (Tim Kusumi) writes:

> I am looking to find out what is the current state of the art
> in TTS technology.

> 1) Does anyone know of a recent comparision between the various commercial
> TTS products which are out there (DECTALK, BeST, SpeechPlus, L&H etc)? 

I don't know much about the commercial end, though I have access to a
few of the systems; in general they are reasonable.

> 2) What research is going to improve the quality of computer generated 
> speech? Who is doing it?  I hear about a lot of research on Speech 
> Recognition, but not too much on TTS -- Is much going on in this area?

First I'd like to split the area of TTS into three parts: high, middle
and low level:

High: that is the initial text processing, finding the sentences
(or whatever sized chunks you wish), dealing with punctuation, fonts,
section titles etc.

Mid: translating words to phonemes, assigning duration, intonation tunes
and prosodic phrasing

Low: synthesizing the waveform itself.
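To make the split concrete, here is a toy sketch of the three levels
as a pipeline.  Every function is an invented stand-in (splitting on
full stops, letters for phonemes, sample counts for waveforms), not a
real synthesizer; it only shows how the levels hand data to each
other.

```python
# Minimal sketch of the high/mid/low split; all details are
# hypothetical stand-ins for illustration only.

def high_level(raw_text):
    # High level: text processing -- chunk raw text into
    # sentence-sized pieces (here, naively on full stops).
    return [s.strip() for s in raw_text.split(".") if s.strip()]

def mid_level(sentence):
    # Mid level: words -> "phonemes" (letters here), each with a
    # default duration in ms and a flat pitch target in Hz.
    return [(ch, 80, 120.0) for word in sentence.split() for ch in word]

def low_level(segments):
    # Low level: waveform generation -- here we just report how many
    # samples a 16 kHz synthesizer would produce for these durations.
    return sum(int(dur / 1000.0 * 16000) for (_, dur, _) in segments)

def text_to_speech(raw_text):
    return [low_level(mid_level(s)) for s in high_level(raw_text)]
```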

I know little about the first, which is a shame, as it is a
non-trivial problem that often gets ignored in speech synthesis but
makes the difference between an experimental system and a real system
(at least a real TTS system).  Although text-to-speech is one
application of speech synthesis, you can expect more use of
message-to-speech systems---that is, where synthesis is made from
structured data rather than raw text.

The middle level has a very active following; papers on letter-to-sound
rules, intonation and phrasing appear in many of the major
computational linguistics conferences and journals.  Basically it seems
relatively easy to assign reasonable defaults, but as soon as something
out of the ordinary comes along a synthesizer will quickly sound very
bad (and non-ordinary events happen all the time in speech).
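The "reasonable defaults break on the out-of-the-ordinary" point can
be seen in even a toy letter-to-sound converter.  The rules, phoneme
symbols and exception word below are all invented for illustration;
real rule sets are far larger.

```python
# Toy letter-to-sound sketch: an exception dictionary consulted
# first, backed by longest-match default rules.

EXCEPTIONS = {"one": "w ah n"}        # irregular words, looked up first

RULES = [("ph", "f"), ("th", "th"), ("ee", "iy"),
         ("o", "aa"), ("n", "n"), ("e", "eh"), ("t", "t")]

def letter_to_sound(word):
    if word in EXCEPTIONS:            # out-of-the-ordinary cases
        return EXCEPTIONS[word]
    phones, i = [], 0
    while i < len(word):
        for letters, phone in RULES:  # try each default rule in turn
            if word.startswith(letters, i):
                phones.append(phone)
                i += len(letters)
                break
        else:
            i += 1                    # no rule matched: skip letter
    return " ".join(phones)
```

Note how the defaults fail exactly as described: "phone" comes out as
"f aa n eh" (the final silent e is wrongly voiced) because no rule
covers that case, while the listed exception "one" is handled.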

As for the low level, there are a number of different methods; the
ones I know of are:

Formant synthesis (as in the Klatt synthesizer, which became DECTalk)
is quite popular, but it's difficult to really get it right.  With the
proper specification of parameters it is possible to make it
indistinguishable from human speech, but that cannot yet be done fully
automatically (though there are people trying).
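The basic idea is simple even if getting the parameters right is not:
a voice source passed through resonators tuned to formant
frequencies.  The sketch below uses a standard two-pole digital
resonator; the formant frequencies, bandwidths and pitch are invented
values, not taken from DECTalk or any real system.

```python
import math

# Formant-style sketch: an impulse train (the "voice source") fed
# through two resonators.  All parameter values are illustrative.

def resonator(signal, freq, bandwidth, rate=16000):
    # Two-pole digital resonator: y[n] = a*x[n] + b1*y[n-1] + b2*y[n-2]
    r = math.exp(-math.pi * bandwidth / rate)
    b1 = 2 * r * math.cos(2 * math.pi * freq / rate)
    b2 = -r * r
    a = 1 - b1 - b2                  # unity gain at DC
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b1 * y1 + b2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def vowel(f0=120, formants=((700, 80), (1200, 90)), n=1600, rate=16000):
    period = rate // f0              # glottal pulse spacing in samples
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq, bw in formants:        # filter through each formant
        source = resonator(source, freq, bw, rate)
    return source
```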

Concatenative synthesis, where sections of natural speech are
concatenated to form utterances.  At first single phonemes were used,
but it is too difficult to join them seamlessly.  Diphones are still
the most popular (and perhaps the easiest to do): you concatenate
pre-recorded two-phoneme units, joining them at the centres of phones
rather than at the edges.  Longer units may also be used, but the more
units used the more varied they become, and more criteria are required
in order to select appropriate units.  Once units are selected they
need to be modified, at least in duration and pitch; there are a
number of signal processing algorithms to do this, but as usual each
has its advantages and disadvantages.
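The diphone scheme above can be sketched in a few lines.  Each
"recording" here is just a short list of numbers standing in for
samples; the point is only that each stored unit spans from the centre
of one phone to the centre of the next, so the joins fall in the
stable middles rather than at the edges.

```python
# Sketch of diphone concatenation; "sil" marks silence at the
# utterance edges.  The sample values are invented placeholders.

DIPHONES = {
    ("sil", "k"): [0, 1],
    ("k", "ae"):  [2, 3],
    ("ae", "t"):  [4, 5],
    ("t", "sil"): [6, 7],
}

def synthesize(phonemes):
    # Pad with silence, then look up the diphone recording for each
    # adjacent pair of phonemes and concatenate them.
    seq = ["sil"] + phonemes + ["sil"]
    wave = []
    for left, right in zip(seq, seq[1:]):
        wave.extend(DIPHONES[(left, right)])
    return wave
```

A real system would additionally stretch each unit to the target
duration and shift its pitch before concatenating, as described above.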

Articulatory synthesis: that is trying to model the human vocal tract.
In the long term this might be the best way but it is a difficult
problem requiring a lot of computing time.  I have yet to hear
articulatory synthesis within a full TTS system but that might happen
sometime.

> 3) What universities are still focusing on TTS?

Well, the majority of research seems to be at the middle and low
levels.  Speech synthesis as a whole is a valid subject in
universities throughout the world.  MIT, CMU and UCLA all have strong
centres; other universities have smaller groups but also contribute to
the field.  Also, most telecom companies do research in speech
synthesis; AT&T has a substantial group working on it.  We at ATR,
here in Japan, are interested in telephone translation, so we do
research in synthesis for multiple languages.  Often synthesis is
tagged onto the back of a speech recognition group.

> 4) Are there any research projects going on which have developed 
> "better" algorithms which are not yet be practical (i.e they take several 
> Cray's to say a simple sentense)?

Well, in my experience (somewhat limited) I feel that we are not as
resource bound as we were a few years ago; real-time synthesis has
only really been possible in the last five years.  Although, say, unit
selection and some low-level signal processing algorithms can be
computationally intensive, there is still room for other algorithms to
make it sound better (if we knew what those algorithms were).
However, articulatory synthesis may be more viable on faster machines.


> 5) What appear to the be the limiting factors in improving this technology?

Well, again note I've only recently come to this field, but one
problem noticed by many is that it is difficult to measure the quality
of the resulting speech.  Unlike recognition, where we can give
numbers saying how much was recognised (though there, there is a
problem in defining how difficult the task is), in synthesis it is
difficult to say whether one synthesizer is better than another, or,
as is more likely, whether changing a module in your own synthesizer
makes it better or worse.  This lack of a measure of success makes it
difficult to try out all the existing techniques (and vary their
parameters) to maximise quality.  Often we have to use perceptual
tests (i.e. have humans judge), but that brings in a whole host of
other problems.
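For what it's worth, such perceptual tests often reduce to something
like averaging listener ratings per system and comparing the means.
The rating scale and numbers below are invented for illustration; a
real test also has to worry about listener variability, material
selection and significance, which is exactly the "host of other
problems".

```python
# Sketch of comparing two synthesizers by averaged listener ratings
# (e.g. on a 1-5 scale); all data here is hypothetical.

def mean_rating(ratings):
    return sum(ratings) / len(ratings)

def compare(ratings_a, ratings_b):
    a, b = mean_rating(ratings_a), mean_rating(ratings_b)
    if a == b:
        return "no difference"
    return "A better" if a > b else "B better"
```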

Hope this isn't too random

Alan

* Alan W Black ---  ATR Interpreting Telecommunications Laboratories *
2-2 Hikaridai                         email: awb@itl.atr.co.jp
Seika-cho, Soraku-gun,                tel: (+81) 7749 5 1314
Kyoto 619-02, Japan                   fax: (+81) 7749 5 1308

