Statistical Parametric Speech Synthesis (SPSS) has been successful at producing highly intelligible speech, but the result usually sounds buzzy, robotic, and somewhat unpleasant. One major reason is inadequate modeling of the human speech production system: speech is traditionally modeled with a source-filter framework that makes overly simplistic assumptions about the source function. In the first part of this proposal, I describe the results obtained when more sophisticated, synthesis-appropriate models of the source function are used, and I draft a plan for future directions of investigation in this research area.
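To make the source-filter framework concrete, the sketch below synthesizes a toy "speech" signal by driving an all-pole filter with a simple impulse-train source. This is a minimal illustration of the classical framework, not any model from the proposal; the filter coefficients and parameter values are made up for illustration.

```python
def impulse_train(f0, fs, n_samples):
    """Simplistic glottal source: one unit impulse per pitch period."""
    period = int(fs / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def all_pole_filter(source, a):
    """Direct-form all-pole (IIR) filter: y[n] = x[n] - sum_k a[k] * y[n-k].
    The all-pole filter plays the role of the vocal tract."""
    y = []
    for n, x in enumerate(source):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

fs = 16000  # sample rate in Hz
source = impulse_train(f0=100.0, fs=fs, n_samples=800)
# Illustrative vocal-tract coefficients (made up, not estimated from speech);
# poles have magnitude sqrt(0.8) < 1, so the filter is stable.
speech = all_pole_filter(source, a=[-1.3, 0.8])
```

The buzzy quality criticized above comes precisely from the impulse-train source: a real glottal flow waveform is much smoother, which is why more sophisticated source models matter.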
Complex source models alone, however, do not solve the problems of synthesis. While a variety of source-filter representations have been proposed for speech, few are suitable for use with modern statistical and machine learning techniques. One possible solution is to project these unsuitable models into a space where machine learning techniques can be applied. Preliminary experiments suggest that a deep learning approach may provide the key to this problem. In the second part of the proposal, I explain the reasons behind this choice of technique and provide details of the experiments.
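The projection idea can be sketched as an encode/decode pair: a nonlinear encoder maps a frame of source parameters into a low-dimensional latent code, and a decoder maps it back. Everything here is illustrative, the weights are fixed made-up numbers rather than learned, and the dimensions (3-D parameters, 2-D code) are arbitrary; a deep learning approach would learn these mappings from data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def encode(frame, w_enc):
    """Project a source-parameter frame into a lower-dimensional latent code."""
    return [sigmoid(sum(w * x for w, x in zip(row, frame))) for row in w_enc]

def decode(code, w_dec):
    """Map the latent code back to the parameter space."""
    return [sum(w * c for w, c in zip(row, code)) for row in w_dec]

# Illustrative weights (made up): 3-D source parameters -> 2-D latent code.
w_enc = [[0.5, -0.2, 0.1],
         [0.3, 0.4, -0.6]]
w_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [0.5, 0.5]]

frame = [0.8, 0.1, -0.3]   # a hypothetical source-parameter frame
code = encode(frame, w_enc)
recon = decode(code, w_dec)
```

Statistical modeling would then operate in the latent space (on `code`), which can be designed to be better behaved than the raw source-model parameters.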
Elaborate models and sophisticated machine learning techniques are useful only if objective metrics exist that can tell us how effective they are. In the third part of the proposal, I highlight the shortcomings of current objective metrics and sketch my ideas for an improved objective metric for Statistical Parametric Speech Synthesis.
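As a baseline for comparison, one widely used objective metric in SPSS is mel-cepstral distortion (MCD); the frame-level version is sketched below. It is exactly the kind of metric whose limitations are at issue here: it measures spectral distance but correlates only loosely with perceived naturalness.

```python
import math

def mel_cepstral_distortion(ref, syn):
    """Frame-level mel-cepstral distortion in dB:
    MCD = (10 / ln 10) * sqrt(2 * sum_d (c_ref[d] - c_syn[d])^2),
    with the 0th (energy) coefficient conventionally excluded."""
    sq = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * sq)

# Identical frames give zero distortion.
zero = mel_cepstral_distortion([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In practice MCD is averaged over dynamically time-aligned frames of natural and synthesized utterances; the cepstral vectors here are placeholders.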
Alan Black (Chair)
H. Timothy Bunnell (Nemours Biomedical Research)
staceyy [atsymbol] cs.cmu.edu