
Cluster unit selection

Before addressing the two shortcomings in data selection mentioned above, we describe the unit selection algorithm we have been using, because the method itself helps answer those questions.

This general unit selection method was first described in [3], which also incorporates some techniques from [6] and [10]. The general idea is to take all units of a particular type and define an acoustic distance between them. In the experiments presented here the types are phones, though they could be diphones or demisyllables. Using features such as phonetic, metrical and prosodic context, we find the feature that best splits the cluster so that the mean acoustic distance within the resulting sub-clusters is smaller. We then apply this splitting recursively until some threshold cluster size is reached. Thus, using CART techniques [4], we end up with a decision tree, indexed by features available at synthesis time, that identifies clusters of acoustically similar units.

More formally, we define the acoustic distance $D(U,V)$ between two units $U$ and $V$, where $\vert V\vert > \vert U\vert$, as


\begin{displaymath}
\displaystyle
P \frac{\vert U\vert}{\vert V\vert} \sum_{i=1}^{\vert U\vert} \sum_{j=1}^{n} \frac{W_{j}}{\sigma_{j}} {abs(F_{ij}(U)-F_{(i\frac{\vert V\vert}{\vert U\vert})j}(V))}
\end{displaymath}

where $P$ is a duration penalty, $\vert U\vert$ is the number of frames in $U$, $W_{j}$ is the weight for parameter $j$, $F_{xy}(U)$ is parameter $y$ of frame $x$ of unit $U$, $\sigma_{j}$ is the standard deviation of parameter $j$, and there are $n$ parameters. The term $F_{(i\frac{\vert V\vert}{\vert U\vert})j}(V)$ is $F_{xy}(V)$ with the frame index computed as $x = i \times \frac{\vert V\vert}{\vert U\vert}$ and $y = j$.
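
As an illustration, the distance can be sketched in Python. This is a minimal sketch, not our implementation. It assumes the distance is a duration-penalized sum, over the frames of the shorter unit and over the $n$ parameters, of weighted absolute differences scaled by $1/\sigma_{j}$, and it omits any further normalization. The names `acoustic_distance`, `weights`, `sigmas` and `duration_penalty` are hypothetical, and the 0-based integer frame mapping is one plausible reading of the index $i \times \frac{\vert V\vert}{\vert U\vert}$.

```python
def acoustic_distance(u, v, weights, sigmas, duration_penalty=1.0):
    """Duration-penalized weighted distance between two units.

    u, v    : units, each a list of frames; a frame is a list of n parameters
    weights : W_j, one weight per parameter (assumed given)
    sigmas  : sigma_j, one standard deviation per parameter (assumed given)
    """
    if not u or not v:
        raise ValueError("units must contain at least one frame")
    if len(u) > len(v):
        u, v = v, u  # make U the shorter unit, as in the definition
    len_u, len_v = len(u), len(v)
    total = 0.0
    for i in range(len_u):
        # Map frame i of U linearly onto a frame of V (0-based, clamped).
        k = min(len_v - 1, (i * len_v) // len_u)
        for j, (w, s) in enumerate(zip(weights, sigmas)):
            total += w * abs(u[i][j] - v[k][j]) / s
    return duration_penalty * (len_u / len_v) * total


# Two one-parameter units of 2 and 4 frames.
short_unit = [[0.0], [0.0]]
long_unit = [[1.0], [1.0], [1.0], [1.0]]
example_distance = acoustic_distance(short_unit, long_unit, [2.0], [4.0])
```

Swapping the arguments so that the shorter unit always plays the role of $U$ keeps the sketch symmetric, which the impurity measure defined next relies on when it sums over all ordered pairs.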

We can then define the impurity of a cluster as


\begin{displaymath}\displaystyle Impurity(C) = \frac{1}{\vert C\vert^2} \sum_{i=1}^{\vert C\vert}\sum_{j=1}^{\vert C\vert}D(C_{i},C_{j}) \end{displaymath}
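
The impurity is simply the mean distance over all ordered pairs of cluster members, including each unit paired with itself. A minimal Python sketch follows; the function name `impurity` and the toy scalar "units" with an absolute-difference distance are illustrative assumptions, standing in for real units and the acoustic distance $D$.

```python
def impurity(cluster, distance):
    """Mean of distance(C_i, C_j) over all |C|^2 ordered pairs."""
    size = len(cluster)
    if size == 0:
        return 0.0
    total = sum(distance(a, b) for a in cluster for b in cluster)
    return total / (size * size)


def toy_distance(a, b):
    """Stand-in for D(U, V): absolute difference between scalar 'units'."""
    return abs(a - b)


tight = impurity([10.0, 11.0], toy_distance)   # similar units
loose = impurity([10.0, 20.0], toy_distance)   # dissimilar units
```

A cluster of similar units gives a small value and a cluster of dissimilar units a large one, which is what the splitting criterion exploits.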

Then, using standard CART techniques, we greedily find the question that gives the best information gain, and split clusters to minimize the summed impurity of the sub-clusters.
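
The greedy splitting can be sketched as follows. This is a simplified illustration under stated assumptions, not the CART implementation we use: questions are hypothetical yes/no predicates over unit features, the stopping rule is a plain minimum cluster size, and the toy units carry a single acoustic value (`f0`) compared by absolute difference in place of $D$.

```python
def impurity(cluster, distance):
    """Mean pairwise distance over all ordered pairs in the cluster."""
    size = len(cluster)
    if size == 0:
        return 0.0
    return sum(distance(a, b) for a in cluster for b in cluster) / (size * size)


def best_split(cluster, questions, distance):
    """Return (summed impurity, question name, yes, no) for the best question."""
    best = None
    for name, ask in questions.items():
        yes = [unit for unit in cluster if ask(unit)]
        no = [unit for unit in cluster if not ask(unit)]
        if not yes or not no:
            continue  # this question does not actually split the cluster
        score = impurity(yes, distance) + impurity(no, distance)
        if best is None or score < best[0]:
            best = (score, name, yes, no)
    return best


def build_tree(cluster, questions, distance, min_size):
    """Recursively split until clusters reach min_size or cannot be split."""
    split = None
    if len(cluster) > min_size:
        split = best_split(cluster, questions, distance)
    if split is None:
        return {"leaf": cluster}
    _, name, yes, no = split
    return {
        "question": name,
        "yes": build_tree(yes, questions, distance, min_size),
        "no": build_tree(no, questions, distance, min_size),
    }


# Toy data: four units with two context features and one acoustic value.
units = [
    {"stressed": True, "prev_vowel": True, "f0": 10.0},
    {"stressed": True, "prev_vowel": False, "f0": 11.0},
    {"stressed": False, "prev_vowel": True, "f0": 20.0},
    {"stressed": False, "prev_vowel": False, "f0": 21.0},
]
questions = {
    "stressed": lambda unit: unit["stressed"],
    "prev_vowel": lambda unit: unit["prev_vowel"],
}
tree = build_tree(units, questions, lambda a, b: abs(a["f0"] - b["f0"]), 2)
```

In the toy data the `stressed` question separates the acoustically similar units, so it is chosen at the root, and both sub-clusters become leaves once they reach the size threshold.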

Run-time synthesis consists of selecting the appropriate cluster using the CART trees and then finding an optimal route through these candidate units using a Viterbi decoding algorithm.
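
The search through the candidate units can be sketched as a standard Viterbi pass. The sketch below is an illustration only: the join cost between consecutive units and the optional per-unit target cost are assumptions, since the costs used in the search are not specified here, and the names `viterbi_select`, `join_cost` and `target_cost` are hypothetical.

```python
def viterbi_select(candidates, join_cost, target_cost=lambda unit: 0.0):
    """Pick one unit per position so that the total path cost is minimal.

    candidates  : list of candidate lists, one per target position
                  (for example, the cluster chosen by the CART tree)
    join_cost   : cost of concatenating two consecutive units (assumed)
    target_cost : cost of using a unit at its position (assumed, default 0)
    """
    if not candidates or any(not column for column in candidates):
        raise ValueError("every position needs at least one candidate")
    # costs[k]: cheapest cost of any path ending in candidate k of this column
    costs = [target_cost(unit) for unit in candidates[0]]
    backpointers = []
    for t in range(1, len(candidates)):
        previous = candidates[t - 1]
        new_costs, pointers = [], []
        for unit in candidates[t]:
            k = min(range(len(previous)),
                    key=lambda p: costs[p] + join_cost(previous[p], unit))
            new_costs.append(costs[k] + join_cost(previous[k], unit)
                             + target_cost(unit))
            pointers.append(k)
        costs = new_costs
        backpointers.append(pointers)
    # Trace back from the cheapest final candidate.
    k = min(range(len(costs)), key=costs.__getitem__)
    total = costs[k]
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [candidates[t][k] for t, k in enumerate(path)], total


# Toy example: scalar "units", join cost is the absolute difference.
toy_candidates = [[1.0, 5.0], [4.0, 6.0], [2.0, 7.0]]
best_path, best_cost = viterbi_select(toy_candidates, lambda a, b: abs(a - b))
```

In the toy example the route 5.0, 6.0, 7.0 has the smallest summed join cost, so it is the one selected.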

Of course there are many degrees of freedom in such a system, including the definition of the acoustic distance, both the parameterization and the relative weights, which must correlate with human perception of speech. We have found pitch-synchronous Mel-frequency cepstral coefficients to be a useful representation, and we use mean weighted Mahalanobis distances over the frames in two units, with a duration penalty. [3] also included delta cepstral coefficients, but at least in our more recent databases we have not found them useful.

An important by-product of this particular unit selection algorithm is a classification of the acoustically distinct units in a database. Each cluster in the tree represents an acoustically distinct typical unit. Thus, given a large database with broad enough coverage, we can automatically find out what the acoustically distinct units are and what contexts they are likely to appear in.

Of course, with speech you can never be completely sure that your distinctions are fixed, but they are reasonable approximations. The distinctions found in this process are database-specific, as well as speaker- and style-specific, but if the seed database has reasonable coverage, they may be useful in defining selection criteria for a corpus with better coverage.


Alan W Black 2001-08-26