Bhiksha's Projects

Latent Variable Decompositions for Speech and Audio Processing

A Statistical Generative Model for Spectrographic Data

A spectrogram is composed of a sequence of spectral vectors, each of which is computed from a short window of speech and represents the energy at a discrete set of frequencies at a particular time. We can view each spectral vector as a histogram, with the value of the spectrum at any frequency representing the number of quanta of energy at that frequency at that time. From this perspective, each spectral vector is a histogram obtained from draws from a multinomial distribution. Our statistical model for speech is then obtained by analyzing the multinomials from which the histograms that compose the spectrogram were drawn.

If all this sounds a bit arcane, continue reading. Hopefully the idea becomes clearer below.

Let's start at the very beginning.


The multinomial distribution

[Figure: Multinomial]

A multinomial distribution is a distribution over a variable that can take a finite, discrete set of values. For instance, the distribution of balls drawn from an urn that contains balls of various colours is a multinomial. The probability that the ball drawn in any given draw is of a specific colour is proportional to the number of balls of that colour in the urn. If an urn has four times as many red balls as green balls, the probability of drawing a red ball is four times that of drawing a green ball.

In general, if P(x) is the fraction of balls in the urn that are of colour x, then P(x) is the probability that a randomly drawn ball will be of colour x.

We will use the ball and urn metaphor frequently in the rest of this discussion.
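
For concreteness, here is a tiny NumPy simulation of the urn. The 4:1 red-to-green urn is the example from above; the sample size is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Urn with four red balls for every green one: P(red) = 0.8, P(green) = 0.2.
p = np.array([0.8, 0.2])  # [P(red), P(green)]

# Draw 100000 balls with replacement and count each colour.
counts = rng.multinomial(100000, p)

# The empirical fractions recover the urn's composition.
empirical = counts / counts.sum()
print(empirical)
```

With this many draws, the empirical fraction of red balls lands very close to 0.8, as the urn metaphor suggests.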


The multinomial distribution model for a spectral vector

A spectrogram is a matrix composed of a number of columns. Each column is a spectral vector, representing the magnitude of the DFT of a short segment of speech (called a frame). In the figure below, the spectral vector at time t, St(f) (marked by the black rectangle in the spectrogram), is shown in the center panel. As we can see, it simply represents the energy at each frequency.

Any spectral vector can be viewed alternately as a histogram, where the height of the spectrum at any frequency represents the number of quanta of energy at that frequency. (We can account for non-integer values by assuming an unknown normalizing constant; this does not materially affect our discussion).

[Figure: Multinomial model for spectrogram]

By this model, the spectral vector at any time is obtained by draws from a multinomial distribution. Using the ball and urn metaphor, the balls in the urn now have frequencies marked on them (instead of being coloured). Each draw from the urn draws a particular frequency. For illustrative purposes, we will assume that the drawing is performed by a "picker" who draws balls from the urn, notes the frequency on each, and returns the ball to the urn. The spectral vector has been obtained by drawing a large number of balls from the urn (replacing each ball in the urn after it is drawn) and plotting a histogram of the frequencies marked on the drawn balls.

The probability of drawing a frequency f (or rather, a ball with f marked on it) from the urn is Pt(f). Note that this distribution is specific to the tth spectral vector (analysis frame). In other words, each spectral vector is drawn from a separate urn, with its own frame-specific distribution over frequencies.

It must be understood that under this model Pt(f) is not merely a normalized version of St(f); rather, the latter is drawn from Pt(f) and is subject to all the variation inherent in random draws.
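
A small simulation makes the distinction concrete: the observed histogram only approximates the underlying distribution. The number of frequencies, the number of draws, and the distribution itself are all arbitrary choices here:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical frame-specific distribution over 8 frequencies, Pt(f).
P_t = rng.random(8)
P_t /= P_t.sum()

# The observed spectral vector St(f): a histogram of N quanta drawn from Pt.
N = 1000
S_t = rng.multinomial(N, P_t)

# St/N approximates Pt but, being a random draw, does not equal it exactly.
print(np.abs(S_t / N - P_t).max())
```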


The mixture multinomial distribution model for spectral vectors

The multinomial model for the spectrum described above is very simple. A natural extension is a mixture multinomial model. The ball-and-urn model for this process is as follows: the "picker" now has several urns, each of which has a different distribution over frequencies. At each draw, the picker first randomly selects an urn, and then draws a ball from that urn. The spectral vector is the histogram obtained from several such draws.

[Figure: Mixture multinomial model for spectrogram]

The process is a mixture multinomial since it combines (mixes) several individual multinomials (i.e. urns). The histogram represents a union of draws from all the component multinomials.

Let z represent the index of the component multinomials (urns). The probability that a specific frequency f will be drawn in any draw is given by

P_t(f) = \sum_{z=1}^{n} w_{t,z} P_t(f|z)

Here wt,z represents the mixture weight of the zth component multinomial. It is the probability that the picker will select the zth urn in any draw. Note that all terms are specific to the distribution underlying the tth spectral vector.

The mixture multinomial is eventually just a multinomial. If one were to simply cover up all the urns in the above picture with a larger urn, as in the picture below, to an outside viewer it would appear to be merely a single urn.
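
A quick numerical check of the "cover all the urns with one big urn" argument: the mixture of the component multinomials is itself a single valid multinomial. The urn contents and weights below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

# Three hypothetical urns (component multinomials) over 6 frequencies,
# each row a distribution P(f|z) summing to 1.
urns = rng.random((3, 6))
urns /= urns.sum(axis=1, keepdims=True)

# Mixture weights: the probability of picking each urn.
w = np.array([0.5, 0.3, 0.2])

# Covering all the urns with one big urn yields a single multinomial:
# P(f) = sum_z w_z P(f|z).
P = w @ urns
print(P)
```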

[Figure: Mixture multinomial model for spectrogram]

As such, this model is no more interesting than a simple multinomial model. To make it interesting we make an additional assumption.


The mixture multinomial with shared components

We will now assume that the picker has only a fixed set of urns, which remain the same for all spectral frames. However, the probability distribution with which the picker selects the urns, namely the mixture weights, is different for each spectral vector.

The probability of drawing a frequency f in any draw for the tth spectral vector now becomes

P_t(f) = \sum_{z=1}^{n} w_{t,z} P(f|z)

Note above that the mixture weights wt,z are separate for each frame (i.e. they are dependent on t); however the component multinomials (the urns) are the same for all frames and are not dependent on t.

The component multinomials are thus characteristic of the source that produces the sound represented by the spectrogram, rather than being specific to any frame of the signal. Each component multinomial P(f|z) is a distribution over frequencies, having a value at every frequency, and can hence be viewed as a spectrum. In fact the component multinomials represent the spectral building blocks, i.e. the latent spectral structures in all sounds produced by the source.

[Figure: Multinomial components are bases]

Each spectral frame is simply a linear combination of these latent spectral structures, where the weights with which they are combined are the mixture weights for the frame.
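
In matrix terms, with the shared bases stacked as rows of a matrix B and the frame-specific weights as rows of W, the frame distributions are just the rows of W @ B. A sketch with arbitrary random values (all sizes and names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)

n_frames, n_bases, n_freqs = 10, 4, 64  # hypothetical sizes

# Shared bases: B[z, f] = P(f|z), one distribution over frequency per row.
B = rng.random((n_bases, n_freqs))
B /= B.sum(axis=1, keepdims=True)

# Frame-specific mixture weights: W[t, z] = w_{t,z}, one distribution per frame.
W = rng.random((n_frames, n_bases))
W /= W.sum(axis=1, keepdims=True)

# Each frame's distribution is a linear combination of the shared bases:
# P_t(f) = sum_z w_{t,z} P(f|z), i.e. row t of W @ B.
P = W @ B
print(P.shape)
```

Note that the bases (B) are fixed for the source, while only the small weight matrix W varies from frame to frame.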

So, if one were to somehow learn the component multinomials that compose the mixture multinomial distributions for all spectral vectors produced by a source, one would have learned the spectral building blocks for that source. Which, as you may recall, is what we actually set out to do. We will generally refer to these building blocks as "bases" henceforth, for brevity.

It turns out that one can indeed learn these bases quite easily from a small set, sometimes as little as 5 seconds, of recordings of typical sounds produced by the source. But before we proceed, we digress briefly to take a geometric view of the model.


A geometric view of mixture multinomials

A multinomial distribution can be viewed as a vector of numbers [x1 x2 x3 ...] where the numbers x1, x2, etc. sum to 1.0. If the multinomial can take K values, i.e. if the vector representing the multinomial has K components, all such vectors lie within a convex unit simplex on a (K-1)-dimensional plane within the larger K-dimensional space. The property of this plane is that the components of all vectors lying on it sum to 1.0. The additional property of the simplex is that, besides summing to 1.0, all components of all vectors lying within it are also positive.

Thus, each of our bases is in fact a K-dimensional vector lying within a (K-1)-dimensional unit simplex.

The outermost of the basis vectors will, in turn, form a smaller simplex. The distribution underlying each spectral vector is a linear combination of the bases and lies within this simplex. This is illustrated in the figure below -- the outermost corners of the simplex represent the bases and the point within it is a linear combination of these corners. It is easy to show that every point within the simplex is a weighted combination of the corners, where the weights are all positive and also sum to 1.0.
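
This claim is easy to verify numerically: positive weights summing to 1.0 applied to the corners yield a point whose components are again positive and sum to 1.0. The corners and weights below are made up for illustration:

```python
import numpy as np

# Three "corner" distributions in 4 dimensions (like our bases): each has
# positive components summing to 1, so each lies in the unit simplex.
corners = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.20, 0.70],
])

# A point inside the corners' simplex: positive weights summing to 1.
weights = np.array([0.2, 0.5, 0.3])
point = weights @ corners

# The weighted combination is again a valid distribution.
print(point, point.sum())
```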

[Figure: the simplex formed by the bases, with a point inside it]

In our model the bases from which the distributions for all spectral vectors are formed are the same. This means that the actual distribution underlying any spectral vector is a point within the simplex formed by the spectral bases for the source. As we progress along time (i.e. we progress sequentially through the spectral vectors that form a spectrogram) we are in fact tracing a trajectory through this simplex.

[Figure: trajectory traced through the simplex over time]

We will revisit this geometric perspective later.


Learning the bases

As mentioned earlier, the bases P(f|z) are learnt very easily from example (training) data. Given the spectrograms of training audio, they can be learned by iterations of the following three equations (derived via expectation maximization). Note that in the following equations we have assumed that there is only one example spectrogram whose tth spectral vector is given by St(f). However, the equations are trivially extended to the case where we have multiple training examples (e.g. simply by concatenating all their spectrograms together).

1. P_t(z|f) = \frac{w_{t,z} P(f|z)}{\sum_{z'} w_{t,z'} P(f|z')}

2. w_{t,z} = \frac{\sum_f P_t(z|f) S_t(f)}{\sum_{z'} \sum_f P_t(z'|f) S_t(f)}

3. P(f|z) = \frac{\sum_t P_t(z|f) S_t(f)}{\sum_{f'} \sum_t P_t(z|f') S_t(f')}

In the above equations, Pt(z|f) is merely an intermediate variable required for the iteration. It represents the a posteriori probability of the zth component multinomial given a frequency f (i.e. our best guess for the probability that a ball was drawn from the zth urn, given that we observe that it was marked with f).

The computation also gives us wt,z, the mixture weights of the component multinomials for each spectral vector in the training data. For now, we can simply discard them once they are computed, since they are characteristic only of the training data, not of the source.

So the overall procedure for learning latent spectral bases for any source is simply to obtain the spectrogram of some example audio generated by the source and to iterate the above three equations. The final P(f|z) terms obtained are the desired latent spectral building blocks.
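
The three update equations translate almost line-for-line into NumPy. The sketch below is a minimal, unoptimized implementation; the function name learn_bases, the frames-as-rows layout, and the random initialization are our own choices, not prescribed by the text:

```python
import numpy as np

def learn_bases(S, n_bases, n_iters=100, seed=0):
    """Learn latent spectral bases P(f|z) from a magnitude spectrogram.

    S[t, f] holds the t-th spectral vector S_t(f); rows are frames.
    Returns bases B[z, f] = P(f|z) and weights W[t, z] = w_{t,z}.
    """
    rng = np.random.default_rng(seed)
    n_frames, n_freqs = S.shape

    # Random (normalized) initialization of bases and weights.
    B = rng.random((n_bases, n_freqs))
    B /= B.sum(axis=1, keepdims=True)
    W = rng.random((n_frames, n_bases))
    W /= W.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        # Eq. 1: posterior P_t(z|f), proportional to w_{t,z} P(f|z).
        joint = W[:, :, None] * B[None, :, :]            # shape (t, z, f)
        post = joint / joint.sum(axis=1, keepdims=True)  # normalize over z

        # Expected number of draws from urn z at frame t, frequency f.
        counts = post * S[:, None, :]

        # Eq. 2: re-estimate mixture weights (normalize over z).
        W = counts.sum(axis=2)
        W /= W.sum(axis=1, keepdims=True)

        # Eq. 3: re-estimate shared bases (normalize over f).
        B = counts.sum(axis=0)
        B /= B.sum(axis=1, keepdims=True)

    return B, W
```

Each iteration computes the posterior (equation 1), then re-normalizes the expected counts over z to obtain the weights (equation 2), and over f, accumulated across frames, to obtain the bases (equation 3).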

Now let's see some examples of what we learn.


A Couple of Simple Demonstrations

Vowels:
In this example a female speaker has recorded a sequence of five vowels:


[Audio: data/aeiou.wav]

We modelled this recording as having been produced by a combination of six latent spectral building blocks. Here are the six bases discovered. The frequency axis is the vertical axis.

[Figure: the six bases discovered]

The above figure is not very informative (and not very clear in your browser either). Let's look at it differently.

Each basis contributes to every spectral vector in the recording. The contribution of the zth basis to the tth spectral vector is simply Et wt,z P(f|z), where Et = Σf St(f) represents the total energy in the tth frame.

In ball-and-urn terms, of all the draws that composed St(f), we estimate that Et wt,z P(f|z) were drawn from the zth urn. The contribution of each basis to the entire spectrogram can be computed in this manner, by computing its contribution to each frequency component of each spectral vector.
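
Computing these per-basis contributions is a one-line broadcast once the bases and weights are in matrix form. A sketch with hypothetical (randomly generated) quantities standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(4)

n_frames, n_bases, n_freqs = 5, 3, 8  # hypothetical sizes

# Stand-ins for learned quantities: bases B[z, f] = P(f|z), weights W[t, z].
B = rng.random((n_bases, n_freqs))
B /= B.sum(axis=1, keepdims=True)
W = rng.random((n_frames, n_bases))
W /= W.sum(axis=1, keepdims=True)

# Frame energies E_t (arbitrary positive values here).
E = rng.random(n_frames) * 100

# A spectrogram consistent with the model: S_t(f) = E_t * sum_z w_{t,z} P(f|z).
S = E[:, None] * (W @ B)

# Contribution of basis z to frame t at frequency f: E_t * w_{t,z} * P(f|z).
contrib = E[:, None, None] * W[:, :, None] * B[None, :, :]  # shape (t, z, f)

# Summing the per-basis contributions over z recovers the spectrogram.
recon = contrib.sum(axis=1)
print(np.abs(recon - S).max())
```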

The figure below shows the contributions of each of the six bases to the signal shown above. Each row represents the contribution of one of the bases.

The spectrograms can be inverted to produce a time-domain speech signal, using an inverse short-time Fourier transform. The sounds produced in this manner from each of the spectrograms in the figure are also given below. These sounds represent the building blocks we have discovered for the recording above.

[Figure: contributions of each of the six bases to the recording]

[Audio: data/aeiourec1.wav, data/aeiourec2.wav, data/aeiourec3.wav, data/aeiourec4.wav, data/aeiourec5.wav, data/aeiourec6.wav]

We note that each of the bases has captured one of the vowels uttered by the speaker. In other words, we have discovered, purely through analysis of the recording, that the building blocks for this recording are the five vowels. Interestingly, the remaining basis has simply captured the background noise. This too is one of the building blocks of the sound.

Musical Notes:
Here is another example, where we analyze a short bar from Bach's Fugue in G minor.


[Audio: data/bachfugue.wav]

We modelled the sound as having been produced by a mixture of four bases. Here are the bases discovered and their corresponding contributions to the overall sound.

[Figure: the four bases discovered and their corresponding contributions]

[Audio: data/bachcomp1.wav, data/bachcomp2.wav, data/bachcomp3.wav, data/bachcomp4.wav]

We have correctly discovered that the building blocks for the bar are the four notes obtained. We have even discovered that the second note peaks twice in the piece.