To obtain both the flexibility and the naturalness of human speech in a synthesizer, it is clear we need to look more closely at how we build our voices. Recording everything is not sufficient, and we are already finding that recording very large databases is itself a significant challenge.
When coverage problems like this arise in other fields, the solution is to decompose the system so that each part can be covered separately. For example, we could consider separate spectral models, intonation models and duration models. This is in some sense what was done in earlier diphone systems, and we know that these do not have the naturalness of unit selection systems.
We have, however, seen unit selection techniques applied to separate streams of information. For example,  move towards unit selection techniques for selecting appropriate ToBI () labels from databases of intonationally labelled speech, and  select F0 contours from databases of speech in essentially the same way as (spectral) unit selection.
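The idea of selecting F0 contour units in the same style as spectral unit selection can be sketched as a dynamic-programming search over candidate contours, balancing a target cost against a join cost. This is a minimal illustrative sketch, not the cited authors' method: the cost functions, the toy contour representation (lists of F0 values in Hz), and the function name are all assumptions.

```python
# Hedged sketch: unit selection over F0 contour candidates via Viterbi search.
# target_cost and join_cost are deliberately simple placeholders.

def select_contours(targets, candidates, w_join=1.0):
    """For each predicted target mean F0, pick one candidate contour so that
    the total of target cost (distance of the contour's mean from the target)
    and join cost (F0 jump at each concatenation point) is minimized."""
    def target_cost(tgt, unit):
        return abs(sum(unit) / len(unit) - tgt)

    def join_cost(prev, unit):
        return abs(prev[-1] - unit[0])  # discontinuity at the join

    # best[i][j] = (cumulative cost, backpointer) for candidate j at step i
    n = len(targets)
    best = [[(target_cost(targets[0], u), -1) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            cost, back = min(
                (best[i - 1][k][0] + w_join * join_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1]))
            row.append((cost, back))
        best.append(row)
    # backtrace from the cheapest final state
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(j)
        j = best[i][j][1]
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]
```

Exactly the same search structure is used for spectral unit selection; only the unit representation and the two cost functions change.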
Decomposing the signal has a disadvantage, though: it introduces the problem of reconstructing the signal afterwards. The artifacts that such reconstruction introduces were one of the reasons unit selection with minimal smoothing became popular.
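One concrete source of such reconstruction artifacts is the step discontinuity where two independently selected units meet. A common remedy, shown here as an illustrative sketch rather than any particular system's implementation, is to interpolate a few frames across each join; the frame width and the linear interpolation scheme are assumptions.

```python
# Hedged sketch: remove step discontinuities at unit joins by linearly
# interpolating `width` frames on each side of every join point.

def smooth_joins(units, width=2):
    """Concatenate per-unit F0 frames, then replace the frames around each
    join with a linear ramp between the values at the window edges."""
    contour = [f for u in units for f in u]
    joins, pos = [], 0
    for u in units[:-1]:
        pos += len(u)
        joins.append(pos)  # index of the first frame after the join
    for j in joins:
        lo, hi = max(0, j - width), min(len(contour), j + width)
        start, end = contour[lo], contour[hi - 1]
        span = hi - lo - 1
        for i in range(lo, hi):
            t = (i - lo) / span if span else 0.0
            contour[i] = start + t * (end - start)
    return contour
```

With heavier decomposition more joins must be reconstructed, which is why the quality of the reconstruction step matters so much.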
But now that we are finding the limitations of conventional unit selection techniques, improving the decomposition and reconstruction of the signal, which would allow us to model each component separately, seems like the most direct way to improve the flexibility of synthetic voices.