A spectrogram is composed of a sequence of spectral vectors, each computed from a short window of speech and representing the energy at a discrete set of frequencies at a particular time. We can view each spectral vector as a histogram, in which the value of the spectrum at any frequency represents the number of quanta of energy at that frequency at that time. From this perspective, each spectral vector is a histogram obtained from draws from a multinomial distribution, and our statistical model for speech is obtained by analyzing the multinomials from which the histograms that compose the spectrogram were drawn.
If all this sounds a bit arcane, continue reading. Hopefully the idea becomes clearer below.
Let's start at the very beginning.
A multinomial distribution is a distribution over a variable that can take a finite, discrete set of values. For instance, the distribution of balls drawn from an urn that contains balls of various colours is a multinomial. The probability that the ball that will be drawn in any given draw will be of a specific colour is proportional to the number of balls of that colour in the urn. If an urn has four times as many red balls as green balls, the probability of drawing a red ball is four times that of drawing a green ball.
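This proportionality is easy to sketch in a few lines. The ball counts below are invented for illustration, not taken from the text:

```python
# Toy urn with made-up ball counts: 8 red balls, 2 green balls.
counts = {"red": 8, "green": 2}
total = sum(counts.values())

# The probability of drawing each colour is proportional to its count.
probs = {colour: n / total for colour, n in counts.items()}

# Four times as many red balls means four times the draw probability.
assert probs["red"] == 4 * probs["green"]
```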
In general, if the urn contains N_c balls of colour c, the probability of drawing a ball of colour c in any draw is P(c) = N_c / Σ_c' N_c'.
We will use the ball and urn metaphor frequently in the rest of this discussion.
A spectrogram is a matrix composed of a number of columns. Each column is a spectral vector, representing the magnitude of the DFT of a short segment of speech (called a frame). In the figure below, the spectral vector at time t is shown as one highlighted column of the spectrogram.
Any spectral vector can alternatively be viewed as a histogram, where the height of the spectrum at any frequency represents the number of quanta of energy at that frequency. (We can account for non-integer values by assuming an unknown normalizing constant; this does not materially affect our discussion.)
By this model, the spectral vector at any time is obtained by draws from a multinomial distribution. Using the ball and urn metaphor, the balls in the urn now have frequencies marked on them (instead of being coloured). Each draw from the urn draws a particular frequency. For illustrative purposes, we will assume that the drawing is performed by a "picker" who draws balls from the urn, notes the frequency on each, and returns the ball to the urn. The spectral vector is then obtained by drawing a large number of balls from the urn (replacing the ball in the urn after each draw) and plotting a histogram of the frequencies marked on the drawn balls.
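The drawing process is easy to simulate: with enough draws, the normalized histogram approaches the urn's underlying distribution. The distribution below is arbitrary, made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up urn distribution over 5 frequency "bins".
p = np.array([0.1, 0.3, 0.4, 0.15, 0.05])

# Draw 10,000 balls with replacement and histogram the outcomes.
draws = rng.choice(len(p), size=10_000, p=p)
hist = np.bincount(draws, minlength=len(p))

# The normalized histogram approximates the underlying multinomial.
assert np.allclose(hist / hist.sum(), p, atol=0.02)
```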
The probability of drawing a frequency f in any single draw, P_t(f), is proportional to the number of balls in the urn marked with f; it is simply the normalized value of the spectrum at frequency f at time t.
It must be understood that by this model each spectral vector has its own urn, i.e. its own underlying multinomial distribution over frequencies.
The multinomial model for the spectrum described above is very simple. A simple extension is a mixture multinomial model. The ball-and-urn model for this process is as follows: the "picker" now has several urns, each of which has a different distribution over frequencies. At each draw, the picker now first randomly selects an urn, and then draws a ball from it. The spectral vector is the histogram obtained from several such draws.
The process is a mixture multinomial since it combines (mixes) several individual multinomials (i.e. urns). The histogram represents a union of draws from all the component multinomials.
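The two-stage draw (first pick an urn, then a ball from it) can be sketched as follows. The urn contents and selection weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two made-up urns (component multinomials) over 4 frequencies.
P_zf = np.array([[0.70, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.20, 0.70]])
# Made-up probability of picking each urn (mixture weights).
P_z = np.array([0.6, 0.4])

def draw():
    z = rng.choice(2, p=P_z)           # first pick an urn...
    return rng.choice(4, p=P_zf[z])    # ...then draw a ball from it

hist = np.bincount([draw() for _ in range(20_000)], minlength=4)

# The histogram follows the equivalent single multinomial
# P(f) = sum_z P(z) P(f|z).
P_f = P_z @ P_zf
assert np.allclose(hist / hist.sum(), P_f, atol=0.02)
```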
Let P_z(f) denote the probability of drawing frequency f from the zth urn, and P_t(z) the probability with which the picker selects the zth urn. The overall probability of drawing frequency f is then

P_t(f) = Σ_z P_t(z) P_z(f)

Here the P_t(z) terms, the a priori probabilities of selecting the urns, are the mixture weights, and the P_z(f) terms are the component multinomials.
The mixture multinomial is ultimately just a multinomial. If one were to simply cover all the urns in the above picture with a larger urn, as in the picture below, to an outside viewer it would be merely a single urn.
As such, this model is no more interesting than a simple multinomial model. To make it interesting we make an additional assumption.
We will now assume that the picker has only a fixed set of urns, which remain the same for all spectral frames. However, the probability distribution with which he selects the urns, namely the mixture weights, is different for each spectral vector.
The probability of drawing a frequency f in any draw for the tth spectral vector now becomes

P_t(f) = Σ_z P_t(z) P_z(f)

where the component multinomials P_z(f) are common to all frames.
Note above that the mixture weights P_t(z) carry the frame index t, whereas the component multinomials P_z(f) do not: only the weights change from frame to frame.
The component multinomials are thus characteristic of the source that produces the sound represented by the spectrogram, rather than being specific to any frame of the signal.
Each component multinomial P_z(f) is, in effect, a normalized spectral vector in its own right: a latent spectral structure characteristic of the source.
Each spectral frame is simply a linear combination of these latent spectral structures, where the weights with which they are combined are the mixture weights for the frame.
So, if one were to somehow learn the component multinomials that compose the mixture multinomial distributions for all spectral vectors produced by a source, one would have learned the spectral building blocks for the source. Which, as you may recall, is what we actually set out to do. For brevity, we will refer to these building blocks as "bases" henceforth.
It turns out that one can indeed learn these bases quite easily from a small set, sometimes as little as 5 seconds, of recordings of typical sounds produced by the source. But before we proceed, we digress briefly to take a geometric view of the model.
A multinomial distribution over K frequencies can be viewed as a vector of K numbers that are all non-negative and sum to 1.0.
Thus, each of our bases is in fact a K-dimensional vector lying within a (K-1)-dimensional unit simplex.
The outermost of the basis vectors will, in turn, form a smaller simplex. The distribution underlying each spectral vector is a linear combination of the bases and lies within this simplex. This is illustrated in the figure below -- the outermost corners of the simplex represent the bases and the point within it is a linear combination of these corners. It is easy to show that every point within the simplex is a weighted combination of the corners, where the weights are all positive and also sum to 1.0.
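The claim that any convex combination of the corners stays within the simplex is easy to check numerically. The bases and weights below are made up for illustration:

```python
import numpy as np

# Three made-up bases: points on the 2-simplex in 3 dimensions
# (non-negative entries summing to 1).
bases = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])

# Made-up mixture weights: also non-negative and summing to 1.
w = np.array([0.5, 0.3, 0.2])

point = w @ bases

# The combination stays on the simplex: non-negative entries summing to 1.
assert np.all(point >= 0)
assert np.isclose(point.sum(), 1.0)
```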
In our model the bases from which the distributions for all spectral vectors are formed are the same. This means that the actual distribution underlying any spectral vector is a point within the simplex formed by the spectral bases for the source. As we progress along time (i.e. we progress sequentially through the spectral vectors that form a spectrogram) we are in fact tracing a trajectory through this simplex.
We will revisit this geometric perspective later.
As mentioned earlier, the bases P_z(f) can be learned from example recordings of the source, by iterating the following three equations, which are the expectation-maximization updates for the mixture multinomial model:

1. P_t(z|f) = P_t(z) P_z(f) / Σ_z' P_t(z') P_z'(f)

2. P_z(f) = Σ_t S_t(f) P_t(z|f) / Σ_f' Σ_t S_t(f') P_t(z|f')

3. P_t(z) = Σ_f S_t(f) P_t(z|f) / Σ_z' Σ_f S_t(f) P_t(z'|f)
In the above equations, S_t(f) denotes the magnitude of the spectrogram at frequency f in the tth frame, and P_t(z|f) is the a posteriori probability that a draw which produced frequency f in the tth frame came from the zth urn.
The computation also gives us the mixture weights P_t(z) for every frame of the example audio, although for learning the bases themselves these are only a by-product.
So the overall procedure for learning latent spectral bases for any source is simply to obtain the spectrogram of some example audio generated by the source and to iterate the above three equations until convergence. The final P_z(f) values obtained are the desired bases.
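A minimal sketch of this procedure, assuming the three updates are the standard EM updates for a mixture multinomial. The array P_zf holds the bases P_z(f), P_tz holds the per-frame weights P_t(z), and S is the magnitude spectrogram; names and defaults are illustrative, not a tuned implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

def learn_bases(S, K, iters=300):
    """Learn K multinomial bases from a magnitude spectrogram S of shape
    (frequencies, frames) by iterating the three updates in the text."""
    F, T = S.shape
    # Random non-negative initialization, normalized into distributions.
    P_zf = rng.random((K, F)); P_zf /= P_zf.sum(axis=1, keepdims=True)  # bases P_z(f)
    P_tz = rng.random((T, K)); P_tz /= P_tz.sum(axis=1, keepdims=True)  # weights P_t(z)
    for _ in range(iters):
        # Eq. 1: posterior P_t(z|f) -- which urn a draw of f in frame t came from.
        joint = P_tz[:, :, None] * P_zf[None, :, :]        # shape (T, K, F)
        post = joint / joint.sum(axis=1, keepdims=True)
        weighted = S.T[:, None, :] * post                  # S_t(f) P_t(z|f)
        # Eq. 2: re-estimate the bases P_z(f).
        P_zf = weighted.sum(axis=0)
        P_zf /= P_zf.sum(axis=1, keepdims=True)
        # Eq. 3: re-estimate the per-frame mixture weights P_t(z).
        P_tz = weighted.sum(axis=2)
        P_tz /= P_tz.sum(axis=1, keepdims=True)
    return P_zf, P_tz
```

On synthetic data generated exactly as a mixture of a few multinomials, the reconstruction Σ_z P_t(z) P_z(f) recovers the normalized spectrogram columns closely.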
Now let's see some examples of what we learn.
Vowels:
In this example a female speaker has recorded a sequence of five vowels: