
Cluster based unit selection synthesis

The unit selection technique is that described in [6]. In this technique, units of the same type are collected together and an acoustic distance is calculated between each occurrence. A recursive splitting algorithm is used to find which high level questions can be used to split the data such that the mean acoustic distance between members of the partition is minimized. Thus clusters of acoustically similar units are indexed by trees of high level questions.

More formally, we define the acoustic distance $D(U,V)$ between two units $U$ and $V$, where $\vert V\vert > \vert U\vert$, as

$\displaystyle P \frac{\vert U\vert}{\vert V\vert} \sum_{i=1}^{\vert U\vert} \sum_{j=1}^{n} \frac{W_{j}}{\sigma_{j}\vert U\vert} abs(F_{ij}(U)-F_{(i\frac{\vert V\vert}{\vert U\vert})j}(V)) $

where $P$ is a duration penalty, $\vert U\vert$ is the number of frames in $U$, $W_{j}$ is the weight for parameter $j$, $F_{xy}(U)$ is parameter $y$ of frame $x$ of unit $U$, $\sigma_{j}$ is the standard deviation of parameter $j$, and there are $n$ parameters. The term $F_{(i\frac{\vert V\vert}{\vert U\vert})j}$ is $F_{xy}$ with the frame index computed as $x = i \times \frac{\vert V\vert}{\vert U\vert}$ and with $y = j$.
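As an illustration only, the distance can be sketched in a few lines of Python. The function name `acoustic_distance`, the representation of a unit as a list of frames, the truncating 0-based frame mapping, and the swap applied when the first unit is the longer one are all assumptions made for this sketch; the sums are taken over the $\vert U\vert$ frames and $n$ parameters, with each term normalized by $\sigma_{j}\vert U\vert$.

```python
def acoustic_distance(u, v, weights, sigmas, penalty=1.0):
    """Weighted, frame-aligned distance between two non-empty units.

    u, v    : lists of frames; each frame is a list of n parameter values
    weights : W_j, one weight per parameter
    sigmas  : sigma_j, one standard deviation per parameter
    penalty : P, the duration penalty
    """
    # The definition takes V to be the longer unit. Swapping the
    # arguments otherwise is an assumption of this sketch.
    if len(v) < len(u):
        u, v = v, u
    n_u, n_v = len(u), len(v)

    total = 0.0
    for i in range(n_u):
        # Frame i of U is compared with frame i * |V| / |U| of V.
        # Truncation and 0-based indexing are assumptions.
        k = min(n_v - 1, (i * n_v) // n_u)
        for j, (w, s) in enumerate(zip(weights, sigmas)):
            total += w * abs(u[i][j] - v[k][j]) / (s * n_u)

    # Scale by the duration penalty and the length ratio |U| / |V|.
    return penalty * (n_u / n_v) * total
```

With a single parameter, unit weights, and unit standard deviations, a two-frame unit `[[0.0], [2.0]]` compared against a four-frame unit of constant value `1.0` gives a summed term of 1.0, scaled by the length ratio 2/4.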

We can then define the impurity of a cluster as

$\displaystyle Impurity(C) = \frac{1}{\vert C\vert^2} \sum_{i=1}^{\vert C\vert}\sum_{j=1}^{\vert C\vert}D(C_{i},C_{j}) $
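The impurity is simply the mean of all pairwise distances within a cluster, including each unit paired with itself. A minimal sketch follows, in which the function name `impurity` and the convention of passing the distance as a callable are assumptions made for illustration.

```python
def impurity(cluster, distance):
    """Mean pairwise distance over all ordered pairs in a cluster.

    cluster  : list of units
    distance : callable taking two units and returning a number
    """
    size = len(cluster)
    if size == 0:
        # An empty cluster has no pairs; returning 0.0 is an assumption.
        return 0.0
    total = sum(distance(a, b) for a in cluster for b in cluster)
    return total / (size * size)
```

For example, with scalar "units" and absolute difference as the distance, the cluster `[0, 2]` has pairwise distances 0, 2, 2, 0, so its impurity is 4 / 4 = 1.0.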

Then, using standard CART techniques, we greedily find the question that gives the best information gain, and split clusters to minimize the summed impurity of the sub-clusters.
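One greedy step of this procedure can be sketched as follows. This is a simplified illustration, not the tree builder actually used: the name `best_split`, the representation of questions as named predicates over a feature dictionary, and the rule of skipping questions that leave one side empty are assumptions. A full CART builder would apply the step recursively with a stopping criterion.

```python
def best_split(units, questions, distance):
    """Pick the question whose yes/no partition has the lowest summed impurity.

    units     : list of (features, unit) pairs
    questions : dict mapping a question name to a predicate on features
    distance  : callable taking two units and returning a number
    Returns (score, name, yes_units, no_units), or None if no question
    separates the data.
    """
    def impurity(cluster):
        size = len(cluster)
        total = sum(distance(a, b) for a in cluster for b in cluster)
        return total / (size * size)

    best = None
    for name, ask in questions.items():
        yes = [unit for feats, unit in units if ask(feats)]
        no = [unit for feats, unit in units if not ask(feats)]
        if not yes or not no:
            continue  # this question does not split the cluster
        score = impurity(yes) + impurity(no)
        if best is None or score < best[0]:
            best = (score, name, yes, no)
    return best


# Toy data: scalar "units" whose value depends on the previous letter.
units = [
    ({"prev": "a", "stress": True}, 0.0),
    ({"prev": "a", "stress": False}, 0.1),
    ({"prev": "b", "stress": True}, 5.0),
    ({"prev": "b", "stress": False}, 5.2),
]
questions = {
    "prev_is_a": lambda f: f["prev"] == "a",
    "stressed": lambda f: f["stress"],
}
result = best_split(units, questions, lambda a, b: abs(a - b))
```

In this toy case the question about the previous letter separates the two acoustically distinct groups, so it is chosen over the stress question.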

The acoustic distance between each pair of units is calculated from the Mahalanobis Euclidean distance between pitch-synchronous vectors of Mel cepstrum coefficients, plus coefficients for duration and F0.

This method is designed to automatically distinguish between acoustically distinct units based on context, and it is this property that we exploit here. Because we assume no phonetic knowledge, the acoustics and letter contexts (plus higher-level information) are used to define the units that will be selected at run time.


Alan W Black 2002-10-01