festvox: Building Voices in Festival

talk by Alan W Black including work by Kevin A Lenzo

In our continuing goal of making speech synthesis more accessible, I will describe our latest advances in documenting and automating the process of building new voices in Edinburgh University's Festival Speech Synthesis System.

The intention is to allow relatively unskilled users to build new synthetic voices in both currently supported and completely new languages. Although producing perfect quality synthesis is still a research issue, we now have examples of how basic diphone synthesizers in new languages can be created in a few months of work (sometimes more, sometimes less). I will discuss the generic techniques we provide for building text analysers, lexicons, letter-to-sound rules, data-driven prosodic models, autolabelling techniques, schema generation and recording aids.

I will also discuss some limited domain synthesis techniques that allow near-automatic construction of high quality, natural synthesis for specific tasks, using one of our unit selection techniques.

Most of the documents, scripts, tools and techniques discussed in the talk are collected together at http://www.festvox.org (which is continually being updated).

Sound samples

Various intro/language examples
- US English voice built using the described techniques
- European Spanish example
- Welsh example
- Female German example (OGI)
- Male German example (OGI)
Non-standard word, text analysis
From work done at the Johns Hopkins University Summer Workshop '99 on Normalization of Non-Standard Words
```
57 ST E/1st & 2nd Ave Huge
drmn 1 BR 750+ sf, lots of sun &
clsts. Sundeck & lndry facils. Askg
$187K, maint $868, utils
incld. Call Bkr Peter 914-428-9054.
```
- Raw standard Festival TTS analysis.
- Text analysis trained on similar labelled classified ad data
- Text analysis trained on similar unlabelled classified ad data
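To make the task concrete, here is a minimal Python sketch of non-standard word expansion on the classified ad above. It is not the workshop's method: the samples use models trained on ad data, whereas this sketch swaps in a plain hand-written lookup table. The table entries are our own guesses at the intended readings, and the names `ABBREVIATIONS`, `expand_token` and `expand_text` are hypothetical.

```python
import re

# Hypothetical hand-written expansions (our guesses at the intended readings).
# The trained systems in the samples learn such mappings from data instead.
ABBREVIATIONS = {
    "BR": "bedroom",
    "sf": "square feet",
    "lndry": "laundry",
    "facils": "facilities",
    "Askg": "asking",
    "maint": "maintenance",
    "utils": "utilities",
    "incld": "included",
    "Bkr": "broker",
}


def expand_token(token):
    """Expand a purely alphabetic token found in the table, keeping any
    trailing full stops or commas; leave every other token unchanged."""
    match = re.fullmatch(r"([A-Za-z]+)([.,]*)", token)
    if match and match.group(1) in ABBREVIATIONS:
        return ABBREVIATIONS[match.group(1)] + match.group(2)
    return token


def expand_text(text):
    """Expand every whitespace-separated token and rejoin with single spaces."""
    return " ".join(expand_token(token) for token in text.split())


if __name__ == "__main__":
    print(expand_text("Askg $187K, maint $868, utils incld."))
```

A lookup table like this cannot handle unseen abbreviations, or tokens such as `$187K` and `914-428-9054` that need to be classified before they can be read out, which is where trained text analysis becomes useful.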
Dialect independent Lexicons
from Edinburgh's Susan Fitt and Stephen Isard (1999). "Synthesis of Regional English using a Keyword Lexicon" in Eurospeech 99, pp. 823-826
- UK RP English: I say Pakistan
  Southern Irish English: I say Pakistan
- UK RP English: What a waste
  Southern Irish English: What a waste
- Southern Irish English: It's hotter in the city
  Standard US English: It's hotter in the city
Diphone database building
- Some example synthesized prompts (1348 for US English):
  kal_0001 ("b-aa" "aa-b") (t aa b aa b aa)
  kal_0002 ("p-aa" "aa-p") (t aa p aa p aa)
- Some spoken words (i.e. human) (1348 for US English):
  kal_0001 ("b-aa" "aa-b") (t aa b aa b aa)
  kal_0002 ("p-aa" "aa-p") (t aa p aa p aa)
- We can automatically label and extract the diphones
  Fully automatic
  with some hand correction
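The prompt entries listed above follow a regular pattern, which can be sketched in Python. This is only an illustration under assumptions: the carrier shape `t aa C V C aa` is inferred from the two entries shown, and the function name `diphone_prompt` and its parameters are hypothetical, not part of the festvox tools.

```python
def diphone_prompt(index, consonant, vowel, speaker="kal"):
    """Format one nonsense-word prompt entry for a consonant-vowel pair.

    Assumption: the carrier is "t aa C V C aa", as inferred from the two
    example entries; the full prompt list may use other carriers too.
    """
    # The two diphones this prompt is meant to supply.
    diphones = (f"{consonant}-{vowel}", f"{vowel}-{consonant}")
    # The phone string of the nonsense word to be recorded.
    phones = ["t", "aa", consonant, vowel, consonant, "aa"]
    names = " ".join(f'"{d}"' for d in diphones)
    return f"{speaker}_{index:04d} ({names}) ({' '.join(phones)})"


if __name__ == "__main__":
    for i, consonant in enumerate(["b", "p"], start=1):
        print(diphone_prompt(i, consonant, "aa"))
```

Run as a script, this reproduces the two entries shown for `kal_0001` and `kal_0002`.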
General unit selection
- Using clunits module in Festival for unit selection
  using BU FM Radio data (f2b)
Limited domain synthesis
From a domain of 24 spoken (time-specific) sentences, we can build a synthesizer that can say any time (fully automatically).
- 10:35am
  4:47pm
  7:54pm
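A limited domain works because every input maps onto a small, closed vocabulary that the recorded sentences can cover. The Python sketch below shows one way a time such as `10:35am` could be turned into a word sequence. The wording it produces is an assumption for illustration only: the phrasing of the 24 recorded sentences is not given here, and the names `number_words` and `time_words` are hypothetical.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def number_words(n):
    """Spell out an integer in the range 0..59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def time_words(text):
    """Turn a 12-hour time like '10:35am' into a word sequence (assumed wording)."""
    match = re.fullmatch(r"(\d{1,2}):(\d{2})(am|pm)", text)
    if not match:
        raise ValueError(f"not a time: {text!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    if not (1 <= hour <= 12 and minute < 60):
        raise ValueError(f"out of range: {text!r}")
    if minute == 0:
        middle = "o'clock"
    elif minute < 10:
        middle = "oh " + number_words(minute)
    else:
        middle = number_words(minute)
    suffix = "a m" if match.group(3) == "am" else "p m"
    return f"{number_words(hour)} {middle} {suffix}"


if __name__ == "__main__":
    for sample in ["10:35am", "4:47pm", "7:54pm"]:
        print(sample, "->", time_words(sample))
```

Under this wording, any valid time uses fewer than forty distinct words, which suggests why a couple of dozen well-chosen sentences can be enough for unit selection to cover the domain.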

This page is maintained by Alan W Black awb@cs.cmu.edu