
Motivation

To some, it may seem old-fashioned to worry about the size and speed of a software application. With ever-increasing CPU speed, and with disk sizes growing continuously, many have forgotten what it is like to be restricted in memory and computational complexity. However, to those wishing to make speech applications ubiquitous, it quickly becomes clear that not all applications are deployed in resource-rich environments, with lots of CPU cycles to burn and large amounts of memory and storage. The ability to produce high quality synthetic speech is quickly followed by the demand for high quality speech synthesis on a range of small devices, which pose interesting challenges for modern synthesizers - especially those using concatenative synthesis methods.

With the development of the Festival Speech Synthesis System [3], it has become much easier for people to develop their own synthesis techniques. The FestVox project [1] specifically addresses the issues of building new voices, particularly within Festival. Elsewhere, as speech technology becomes more mainstream, the demand for good synthetic speech has risen dramatically, as have the specific requirements for these systems.

Some systems rely on large servers for rendering synthetic speech, and render one or more ports of synthesis per machine. While such servers can be the latest large machines with bleeding-edge bus speeds and massive amounts of memory, many applications do not fit well into that model. One may wish to run many ports on a single machine; the deployment may be on resource-limited handheld devices, such as portable telephones or PDAs; and furthermore, full-bandwidth speech input and output may be too demanding on the communication infrastructure, given the speed of relatively ubiquitous wireless solutions, including CDPD or GSM-based data modems.

A device that renders text locally as speech would allow speech output to be used in more places than it currently is. As noted in various projects at CMU and elsewhere that we have been involved with, a small footprint synthesizer for handhelds, wearables and ultimately cellular phones would be very readily received. But it appears that it is not just the small devices that could utilize small footprint synthesizers; large servers are also not always as large as one needs. The ability to run many clients on the one server would also be advantageous.

The size and computational requirements of many of the newer synthesis systems are much larger than those of their predecessors; this is mostly due to the benefits of concatenative synthesis, and the expansion in footprint is driven by a desire for more naturalistic speech in applications - which requires larger unit inventories, especially if one is to avoid the introduction of artifacts during modification of intonation or duration.

The mounting resource requirements have also arisen, in part, simply because they can now be met more easily; many of the synthesizers built before the 1990s were much smaller. Databases of formant parameters, and rules for their implementation and modification, are much smaller than their concatenative counterparts, even with current compression and coding techniques; even the earlier diphone synthesizers were leaner, because they had to be.

The Festival Speech Synthesis System [3] is a fine exemplar of a big system; it has been developed not only as a platform for research, but also as the basis for several commercial synthesis offerings. While we will discuss it here, as developers with great familiarity with both its machinery and its use, parts of the critique apply to varying degrees to a number of existing unit selection synthesis systems.

Festival was designed to address three types of user. The first was speech synthesis researchers, for whom it provides a workbench within which they can develop and test new synthesis theories. The second was speech technologists who wish to use speech synthesis as a component within their speech systems. This second group would not modify low-level aspects of the system but would want some control over voices, lexicons, etc. The third was black-box text-to-speech users: end users who just want speech from text and care little about the methods used to achieve that.

That these users are addressed by the same system is important. Having real users use the same system that we develop in, even with different module choices, has meant there has been a clear focus on which real issues need to be solved, and how to perform the process robustly. For example, our work on letter to sound rules [2] was a direct result of complaints about the pronunciation of unknown words. As Festival has matured, the second group, speech technologists and integrators, has become very important. Issues of interfaces and latency, for example, are very important to the usefulness of Festival in real dialog systems. Now issues of deployment, as well as the creation of new voices, are surfacing more.

The client/server model for Festival was developed primarily to make it easier to use Festival with low latency in real time speech applications. Although this has been successful, it is clear that Festival is still a relatively slow, heavyweight system for the applications that are appearing.

In a dialog system, there are many processes that must be executed before the synthesis can even start. Speech recognition consumes some of the cycles, and dialog management, although it may be small per se, often depends on database lookups, which can take a significant amount of time. Network latency and asynchrony in such systems can also be non-trivial; by the time the synthesizer gets to do its work, there has already been a significant delay, and a further delay is not helpful. Furthermore, slow response is often blamed upon the synthesis, regardless of where the bottleneck may be, apparently because "it took so long to speak."

Even if Festival can produce waveforms 20 times faster than it takes to say them, a 10 second utterance would still take 500 ms to render, which, at the end of a speech chain, is too long. Much work was done in Festival to make it as efficient as possible while still keeping its clear modular aspects intact. Its speed was partly sapped by the deliberate levels of indirection introduced in the internal structures so that modules could be changed without affecting others.
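The 500 ms figure follows directly from dividing the utterance length by the speed-up over real time. A minimal sketch of that arithmetic is given here, assuming the whole utterance is rendered before any audio is played (no streaming); the function name `render_latency_ms` is hypothetical and is not part of Festival or Flite.

```python
def render_latency_ms(utterance_seconds: float, speedup: float) -> float:
    """Milliseconds needed to synthesize a whole utterance.

    `speedup` is how many times faster than real time the synthesizer
    runs (20 in the example from the text). Assumption: playback starts
    only after the full waveform has been rendered.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Convert to milliseconds first, then scale by the speed-up factor.
    return utterance_seconds * 1000.0 / speedup


if __name__ == "__main__":
    # Added delay for a few utterance lengths at 20x real time.
    for seconds in (2.0, 5.0, 10.0):
        delay = render_latency_ms(seconds, 20.0)
        print(f"{seconds:4.1f} s utterance -> {delay:5.0f} ms to render")
```

Under these assumptions the delay grows linearly with utterance length, so long prompts at the end of a dialog chain are the ones most affected.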

Festival also contains many parts which are used only by a few users. In production use in a particular application, only a small amount of the system is brought into action. Given this fact, an initial investigation was done to see if a small subset of the system could be partitioned off that would provide a much smaller and still usable footprint. Although this is partly possible (certain modules can be easily removed, and others, with a little work, can also be ignored), the fundamental objects in the system and their related functions are still large. With version 1.4.1, a binary object file of less than 1.5 megabytes total size can be produced, excluding the voice and lexicon. This has been done on a Compaq iPaq (StrongARM platform), but only by carefully selecting modules and deleting irrelevant portions.

The large footprint of the objects in Festival, and their member functions, is partly due to speed optimizations, in classic space-time tradeoffs. Many of the low level access functions are made to compile in-line so that they are fast, but this has the consequence of making the footprint larger. When large unit selection databases of several hundreds of megabytes are used, the size of the core Festival system is pretty much irrelevant, but when we want to put the system on machines with less than 16 megabytes of memory and only 16 megabytes of local flash RAM - such as the iPaq - that overhead is prohibitive. On large servers, even if the database is large, we also do not want the per-utterance run-time synthesis RAM requirement to be as large as 10-20 megabytes, as it can be now in Festival.

Given these constraints, we decided to address the issue of a small footprint synthesizer, not by changing Festival itself, but by writing another system that includes re-implementations of the core Festival design. We call this new system Flite, a name chosen to reflect the desire for a Festival-lite system.

Alan W Black 2001-08-26