`t2p`: Text-to-Phoneme Converter Builder

Kevin Lenzo, Carnegie Mellon University.

What is `t2p`?

t2p is a public-domain Perl package for building grapheme-to-phoneme rules from pronunciation dictionaries. In other words, it builds letter-to-sound rules for pronouncing words from a set of example pronunciations, such as those in the CMU Pronouncing Dictionary.

What would you use it for?

Because the rules generalize to words outside the training set, they can be used to predict pronunciations for words that the program has never seen. This is useful for a number of things, such as:

Pronunciations for Out-Of-Vocabulary words (OOVs) for Speech Recognition
Pronunciations for Speech Synthesis; for instance, see the article in The Perl Journal entitled "s/($text)/speech $1/eg;"
Retrieval by sound keys rather than exact spelling; SOUNDEX is a simplified method for this.
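
To make the "sound key" idea concrete, here is a minimal sketch of the classic American Soundex code in Perl. It is not part of t2p; the function name soundex_key is made up for this illustration. Words that sound alike tend to get the same four-character key, which is the retrieval trick mentioned above.

```perl
use strict;
use warnings;

# Consonant classes used by American Soundex. Vowels, Y, H and W
# have no code of their own.
my %code;
@code{qw(B F P V)}         = (1) x 4;
@code{qw(C G J K Q S X Z)} = (2) x 8;
@code{qw(D T)}             = (3) x 2;
$code{L}                   = 4;
@code{qw(M N)}             = (5) x 2;
$code{R}                   = 6;

# soundex_key is a hypothetical helper name, not a t2p function.
sub soundex_key {
    my ($word) = @_;
    my @letters = grep { /[A-Z]/ } split //, uc $word;
    return '' unless @letters;

    my $first = shift @letters;
    my $key   = $first;
    my $prev  = $code{$first} || 0;

    for my $ch (@letters) {
        next if $ch eq 'H' || $ch eq 'W';  # H and W do not separate equal codes
        my $c = $code{$ch} || 0;           # vowels and Y reset the previous code
        $key .= $c if $c && $c != $prev;   # skip repeats of the same class
        $prev = $c;
        last if length($key) == 4;
    }
    return substr($key . '000', 0, 4);     # pad short keys with zeros
}

print soundex_key($_), " $_\n" for qw(Lexus Lexis Robert Rupert);
# L220 Lexus
# L220 Lexis
# R163 Robert
# R163 Rupert
```

Soundex only looks at coarse consonant classes, which is why it is a simplification: it yields a lookup key, not a pronunciation.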

How does it do it?

t2p takes in a pronunciation dictionary, such as the CMU Pronouncing Dictionary, and builds Decision Trees that model the words.

Here's what the CMU dictionary looks like (from version 0.6d):

...
LEX  L EH1 K S
LEXICAL  L EH1 K S IH0 K AH0 L
LEXICOGRAPHER  L EH2 K S IH0 K AA1 G R AH0 F ER0
LEXICON  L EH1 K S IH0 K AA2 N
LEXIE  L EH1 K S IY0
LEXINE  L EH1 K S AY0 N
LEXINGTON  L EH1 K S IH0 NG T AH0 N
LEXIS  L EH1 K S IH2 S
LEXMARK  L EH1 K S M AA2 R K
LEXUS  L EH1 K S AH0 S
...
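
Entries in this format are easy to read from Perl. The following is only a sketch: the helper names parse_entry and strip_stress are made up for illustration and are not taken from the t2p code, and skipping lines that begin with ";;;" assumes the comment convention used in the distributed dictionary files. The trailing digits on vowels are stress marks (0, 1, or 2); whether t2p strips them is not stated here.

```perl
use strict;
use warnings;

# Parse one dictionary line into (word, [phonemes]).
# Returns an empty list for blank lines and for comment lines
# (assumed here to start with ";;;").
sub parse_entry {
    my ($line) = @_;
    $line =~ s/\s+\z//;                        # drop newline and trailing blanks
    return if $line eq '' || $line =~ /^;;;/;
    my ($word, @phones) = split ' ', $line;    # any run of whitespace separates fields
    return ($word, \@phones);
}

# Remove the stress digit from a vowel symbol: "EH1" becomes "EH".
sub strip_stress {
    my ($phone) = @_;
    (my $bare = $phone) =~ s/[0-2]\z//;
    return $bare;
}

my ($word, $phones) = parse_entry("LEXICON  L EH1 K S IH0 K AA2 N\n");
print "$word: @$phones\n";
# LEXICON: L EH1 K S IH0 K AA2 N
print join(' ', map { strip_stress($_) } @$phones), "\n";
# L EH K S IH K AA N
```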

The dictionary is a list of words and their associated phonemes, in order. t2p takes all the words together and finds the rule that makes the best predictive split of the data, then repeats this on each resulting subset until it has built a tree of decisions. The resulting Perl code looks something like

  if ($att{'L'} eq 'H') { 
    if ($att{'L1'} eq 'A') { 
      if ($att{'R1'} eq 'A') { 
        if ($att{'L3'} eq 'G') { 
          if ($att{'R3'} eq '-') { 
            return 'HH';
          } 
          return '_';
        } 
        if ($att{'L3'} eq 'H') { 
          return '_'; # unique at depth 4
        } 
        if ($att{'L3'} eq 'U') { 
          if ($att{'L2'} eq 'J') { 
            return 'AE'; # unique at depth 5
          } 
          return 'HH';
        }

where L is the letter itself, L1 is the first letter to the left, R1 is the first to the right, and so on. The return value is the output of the transducer for each letter, given its context. Thus, it's a context-sensitive rewrite system for a grammar of limited depth.
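
The attribute naming and the "best predictive split" step can be sketched in a few lines of Perl. This is an illustration under stated assumptions, not the t2p implementation: the helper names are invented, '-' is assumed to pad positions beyond the word boundary, the letter-to-phoneme alignment of the training examples is supplied by hand, and information gain on a single attribute/value test is assumed as the splitting criterion (the text above does not say which measure t2p uses).

```perl
use strict;
use warnings;

# Context attributes for the letter at position $i of $word:
# L is the letter, L1..Ln its left neighbours, R1..Rn its right ones.
# '-' as the padding symbol is an assumption.
sub context_attributes {
    my ($word, $i, $width) = @_;
    my @letters = split //, uc $word;
    my %att = (L => $letters[$i]);
    for my $d (1 .. $width) {
        my ($l, $r) = ($i - $d, $i + $d);
        $att{"L$d"} = $l >= 0         ? $letters[$l] : '-';
        $att{"R$d"} = $r <= $#letters ? $letters[$r] : '-';
    }
    return \%att;
}

# Shannon entropy, in bits, of the phoneme labels of some examples.
sub entropy {
    my @examples = @_;
    return 0 unless @examples;
    my %count;
    $count{ $_->{phone} }++ for @examples;
    my $h = 0;
    for my $n (values %count) {
        my $p = $n / scalar(@examples);
        $h -= $p * log($p) / log(2);
    }
    return $h;
}

# Return the (attribute, value) test with the largest information gain,
# or an empty list if no test improves on the unsplit data.
sub best_split {
    my @examples = @_;
    my $base = entropy(@examples);
    my ($best_gain, @best) = (0);
    my %seen;
    for my $ex (@examples) {
        for my $name (sort keys %{ $ex->{att} }) {
            my $value = $ex->{att}{$name};
            next if $seen{"$name=$value"}++;
            my @yes = grep { $_->{att}{$name} eq $value } @examples;
            my @no  = grep { $_->{att}{$name} ne $value } @examples;
            my $after = (scalar(@yes) * entropy(@yes)
                       + scalar(@no)  * entropy(@no)) / scalar(@examples);
            my $gain = $base - $after;
            ($best_gain, @best) = ($gain, $name, $value)
                if $gain > $best_gain + 1e-12;
        }
    }
    return @best;
}

# Hand-aligned toy examples for the letter H: [word, position, phoneme].
# '_' means the letter produces no phoneme of its own.
my @examples = map {
    my ($word, $i, $phone) = @$_;
    +{ att => context_attributes($word, $i, 2), phone => $phone };
} (['HAT', 0, 'HH'], ['HOT', 0, 'HH'],
   ['THE', 1, '_'],  ['SHE', 1, '_'], ['AH', 1, '_']);

my ($name, $value) = best_split(@examples);
print "if (\$att{'$name'} eq '$value') { ... }\n";
# if ($att{'L1'} eq '-') { ... }
```

On this toy data the chosen test is "nothing to the left of H", which separates the pronounced H from the silent one. Applying best_split again to each side of a split, until the examples on a side agree, gives a nested tree of the kind shown above.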

In collaboration with Alan Black and Vincent Pagel, we have made a number of these packages. Results of similar form are used in Festival, a free, source-available system from the University of Edinburgh, and in MBRDICO, for MBROLA dictionary compression. The MBRDICO code produces much smaller results in C; this package is a Perl implementation used to build the base.

OK, where do I get it?

The t2p Perl code is available as a tarred, gzipped file from http://www.cs.cmu.edu/~lenzo/t2p/code .

References and Links

s/($text)/speech $1/eg; in The Perl Journal number 12, Winter, 1998.
The Festival Speech Synthesis System, http://www.cstr.ed.ac.uk/projects/festival.html
phonebox text-to-speech synthesis, http://www.cs.cmu.edu/~lenzo/phonebox/
t2p text-to-phoneme converter builder, http://www.cs.cmu.edu/~lenzo/t2p/ .
CMU Dictionary, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Oxford Advanced Learner's Dictionary, http://www.speech.cs.cmu.edu/comp.speech/Section1/Lexical/cuvolad-dict.html
Moby Lexicon, http://www.dcs.shef.ac.uk/research/ilash/Moby/
NetTalk data, http://www.boltz.cs.cmu.edu/benchmarks/nettalk.html .
Black, Lenzo, and Pagel, "Issues in Building General Letter to Sound Rules," for the 1998 ESCA Speech Synthesis Workshop, Jenolan Caves, Blue Mountains, Australia. http://www.cs.cmu.edu/~lenzo/areas/papers/ESCA98/ESCA98_lts.ps
Pagel, Lenzo, and Black, "Letter to Sound Rules for Accented Lexicon Compression," for the 1998 International Conference on Spoken Language Processing, Sydney, Australia. http://www.cs.cmu.edu/~lenzo/areas/papers/ICSLP98/ICSLP98_lts.ps
Multilingual Text-to-Speech Synthesis, the Bell Labs Approach. Richard Sproat, editor; Kluwer Academic Publishers, 1998.
Mastering Regular Expressions. Jeffrey Friedl, O'Reilly and Associates, Inc., 1997.
"Parallel networks that learn to pronounce English text," Sejnowski, T.J., and Rosenberg, C.R. In Complex Systems, 1, 145-168. 1987.

Kevin Lenzo is a Ph.D. student in The Robotics Institute at Carnegie Mellon University.
Mail Kevin if you have any questions or comments; be sure to cite this page.