In building so many diphone databases, and particularly in repeating the process a number of times with the same speaker in the same dialect, we feel we have streamlined our build process so that it is much more reliable. In other systems, building a new voice is such an undertaking that it cannot easily be experimented with without significant additional work. We can record, label and hand-check a complete US English diphone set in a day, albeit a long day, and with only preliminary quality; but such a turnaround rate has allowed us to identify the specific problems that we have had to address.
Even when the speaker is an expert in phonetics and diphone synthesis, we know it is still very easy to make phonetic mistakes in recording. The vowel-vowel transitions are notably difficult to produce. They are relatively rare in normal speech but, since we are aiming for complete coverage, we need an instance of every one. We have also noted that the phones [AX] and [AH] are particularly difficult to produce reliably, even when we are keenly aware of the trouble spot.
The US English databases themselves are available from http://festvox.org/dbs/index.html, and the full documentation, with scripts, code and explicit walk-throughs of these techniques with examples, is available at http://festvox.org.