We are now at the stage where we can craft high-quality voices that mostly sound human. However, unit selection technology alone will not be enough to satisfy customers. Even when a voice sounds human, it may still not be appropriate. We are already seeing people describe voices as too direct, not friendly, too friendly, overly polite, etc. That is, people are commenting on the underlying style rather than on naturalness. No matter how natural the voice may be, some people may simply not like that voice.
Emphasis, style, and voice quality can currently be controlled only by explicitly recording data that covers such variation. Current unit selection techniques typically do not model the speech itself in any sophisticated way, usually because doing so would degrade the signal.
If we are to please all of the people all of the time, we need to be able to control both voice quality and style. This means we need to model the speech signal better, probably using techniques that were developed for early formant synthesis. This is hard research, and it may take time before it reaches the reliability of current unit selection synthesizers, but it will give us the flexibility that we require.