Fully automatic building of synthesizers for unresearched languages is a long way off; however, with the growing demand for support of minority languages, it is something that should be addressed.
Using acoustic information to find distinctions is implicitly what we have been doing in unit selection synthesis, so taking explicit advantage of it should not be a surprise.
Anecdotal evidence of this already shows up in other synthesizers built by us. When a US English phoneset and a US English lexicon are used with a Scottish English speaker, the lexical entries do not properly match the speaker's pronunciations. For example, the palatalized /uw/ found in British English, as in /t y uw z d ey/ (Tuesday), is defined as /t uw z d ey/ in the US English lexicon. When this labeling is applied to a Scottish English speaker, the /y uw/ segment is labeled as /uw/. Thus, when other words with similar contexts are synthesized, the palatalization is still generated: a word labeled as /s t uw d eh n t/ (student) may be synthesized, correctly for the dialect, with acoustics that could be labeled /s t y uw d eh n t/.
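This propagation effect can be illustrated with a toy sketch. The lexicon entries, the set of coronal consonants, and the dialect rule below are deliberate simplifications introduced for illustration, not part of any actual synthesizer: the point is that when units labeled plain /uw/ actually contain the speaker's palatalized /y uw/ acoustics, the palatalization re-emerges in any word synthesized with those labels.

```python
# Toy sketch (hypothetical phone strings) of how a dialect mismatch in the
# lexicon propagates: the Scottish speaker palatalizes /uw/ after coronals,
# so database units labeled with plain US /uw/ really hold /y uw/ acoustics.

US_LEXICON = {
    "tuesday": ["t", "uw", "z", "d", "ey"],            # US entry, no /y/
    "student": ["s", "t", "uw", "d", "eh", "n", "t"],  # US entry, no /y/
}

CORONALS = {"t", "d", "s", "n"}  # simplified set for this sketch

def scottish_realization(phones):
    """What the speaker actually says: insert /y/ before /uw/ when it
    follows a coronal consonant (a simplification of the dialect rule)."""
    out = []
    for i, p in enumerate(phones):
        if p == "uw" and i > 0 and phones[i - 1] in CORONALS:
            out.append("y")
        out.append(p)
    return out

def synthesize(word):
    """Unit selection over mismatched labels: we look up the US labels,
    but each selected /uw/ unit carries the palatalized acoustics."""
    return scottish_realization(US_LEXICON[word])

print(synthesize("student"))  # → ['s', 't', 'y', 'uw', 'd', 'eh', 'n', 't']
```

The synthesized "student" comes out palatalized even though its label string contains no /y/, which is correct for the dialect despite being wrong by the lexicon.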
It should be noted that it is rare for absolutely no phonetic knowledge to be available for a language; often at least some information (e.g. the vowel/consonant distinction) can be directly derived from the orthographic system. However, it is not unusual that no linguistically knowledgeable speakers of the language are available, and native speakers are often not explicitly conscious of the distinctions they are making. In a practical sense, a gross classification of phonemes can be reasonably specified, but fine distinctions are much harder.
It is worth comparing the complexity of mapping letters directly to acoustics with the more standard approach of using an intermediate finite phone set. As we are considering mapping without explicit lexicons, the best comparison is with the automatic letter to sound rule mappings described in ; in that case, letters are mapped to a predefined finite phone set. Importantly, letter to sound training sets are larger, because it is easier to collect text than speech. However, the difference in size is perhaps only one order of magnitude (5,000 vs. 50,000 words), and in the letter to acoustic case we have deliberately selected the data to obtain coverage.
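The letter-to-phone side of this comparison can be sketched in miniature. The toy aligned lexicon, the one-phone-per-letter alignment, and the single-letter back-off below are illustrative assumptions (real letter to sound training uses larger contexts, epsilon alignments, and decision trees); the sketch only shows the core idea of predicting a phone from a letter in context.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon, pre-aligned one phone per letter for simplicity.
ALIGNED = [
    ("cat",  ["k", "ae", "t"]),
    ("cent", ["s", "eh", "n", "t"]),
    ("cot",  ["k", "aa", "t"]),
    ("city", ["s", "ih", "t", "iy"]),
]

def train(aligned):
    """Count phones per (left, letter, right) context, with a
    context-free single-letter table as back-off."""
    counts = defaultdict(Counter)
    for word, phones in aligned:
        padded = "#" + word + "#"
        for i, phone in enumerate(phones):
            counts[(padded[i], padded[i + 1], padded[i + 2])][phone] += 1
            counts[padded[i + 1]][phone] += 1  # back-off: letter alone
    return counts

def predict(counts, word):
    """Predict the most frequent phone for each letter in context,
    backing off to the context-free letter table when needed."""
    padded = "#" + word + "#"
    out = []
    for i in range(len(word)):
        ctx = (padded[i], padded[i + 1], padded[i + 2])
        table = counts.get(ctx) or counts.get(padded[i + 1], Counter({"?": 1}))
        out.append(table.most_common(1)[0][0])
    return out

rules = train(ALIGNED)
print(predict(rules, "cit"))  # context picks soft /s/ for "c" before "i"
```

Even this trivial learner captures the "c before i is /s/" regularity; the point of the comparison in the text is that such models need only modest word lists, whereas mapping letters directly to acoustics must recover the same regularities from far less, carefully selected, data.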
Machine learning techniques could allow us to assume a hidden layer that explicitly represents a phone set, but we have not investigated that yet.
Another direction that may be worth investigating is to cluster the acoustics independently of any labeling and then match the types identified by the clusters to letters. Such techniques for acoustically derived units have been studied for speech recognition (e.g. ) but have not yet been investigated for unit selection synthesis.
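The two-stage idea can be sketched as follows. Everything here is a stand-in: the 1-D "frames" replace real spectral features such as MFCCs, the letters attached to each frame replace a real text-to-speech alignment, and plain k-means replaces whatever clustering a real system would use. The sketch only demonstrates the shape of the approach: cluster without labels, then associate each cluster with its majority letter.

```python
import random
from collections import Counter

random.seed(0)

# Toy (feature, aligned-letter) pairs; the clusterer never sees the letters.
DATA = ([(1.0 + random.gauss(0, 0.1), "a") for _ in range(20)] +
        [(5.0 + random.gauss(0, 0.1), "s") for _ in range(20)])

def kmeans2(points, iters=20):
    """Minimal 1-D k-means with k=2, initialized from the two endpoints."""
    centers = [points[0], points[-1]]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            j = min((0, 1), key=lambda c: abs(p - centers[c]))
            groups[j].append(p)
        centers = [sum(g) / len(g) for g in groups]
    return centers

feats = [f for f, _ in DATA]
centers = kmeans2(feats)

# Stage two: map each acoustically derived cluster to its majority letter.
votes = [Counter(), Counter()]
for f, letter in DATA:
    j = min((0, 1), key=lambda c: abs(f - centers[c]))
    votes[j][letter] += 1
labels = [v.most_common(1)[0][0] for v in votes]
print(sorted(labels))  # → ['a', 's']
```

With well-separated toy data the clusters recover the two letter types exactly; the open research question raised in the text is whether such acoustically derived units are stable and fine-grained enough to drive unit selection synthesis.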
It is clear that, depending on the language and the knowledge available, there is a scale from pure letter to acoustic models through to letter to phone plus phone to acoustic models. We would like to make that scale available to voice builders so that they may best take advantage of whatever information they currently have.
Another point we wish to make clear is that without native speakers' feedback for evaluation, the ultimate quality of a synthetic voice cannot be determined. As those who work in the field immediately notice, synthesis in a language you are not familiar with typically sounds better than synthesis in a language you know well. It requires fluent speakers to properly evaluate content. In our experience building synthesizers for minority languages we find, anecdotally, that listeners can be more extreme than those of more common languages. On the one hand, the mere existence of a synthesizer in their language can lead some native listeners to accept synthesis that is not the best possible. On the other hand, listeners of minority languages are likely to be unfamiliar with speech synthesis, and may even find high quality recorded speech difficult to understand.