Bhiksha's Projects

Latent Variable Decompositions for Speech and Audio Processing

Most data are composed of building blocks

Consider the following sound. It's just a sequence of notes.

Piano notes.

<BGSOUND src="data/pianonotes.wav">

Now consider the following sound. It's a set of chords composed of the notes from the previous example.

Piano chord.

<BGSOUND src="data/pianochord.wav">

Here is another example composed of the same notes (plus a few other notes).

Piano music.

<BGSOUND src="data/pianomusic.wav">

Consider the following example. It comprises a number of phonemes.

Phonemes.

<BGSOUND src="data/phonemepieces.wav">

This sound is a complete, recognizable sentence built only from the phonemes in the above example.

Utterance.

<BGSOUND src="data/wholeutt.wav">

In both of the above examples, we see that a more complex sound has been built from a number of simpler building blocks -- music from notes, speech from phonemes.

In this work we attempt to derive (and manipulate) these building blocks directly from data. We will work primarily with sounds, but the techniques apply generically to other types of data as well.


Representing Sounds

In order to learn the building blocks properly, we need to represent sounds appropriately. We impose two requirements here: the contributions of individual building blocks must combine purely additively, and the representation must be non-negative, since a building block with negative energy is physically meaningless. We note that spectral representations satisfy these requirements: the power spectra of two uncorrelated signals add. We will therefore represent signals spectrographically, as a sequence of power spectral vectors estimated from (overlapping) segments of the signal. This converts a time signal such as the one below
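The additivity of power spectra can be checked directly. The following minimal NumPy sketch uses two sinusoids at distinct, exactly periodic frequencies as a deterministic stand-in for uncorrelated signals; since their spectra occupy disjoint frequency bins, the cross terms vanish and the power spectrum of the mixture equals the sum of the individual power spectra.

```python
import numpy as np

n = 1024
t = np.arange(n)
# Two signals with no spectral overlap: sinusoids at different bin frequencies
x = np.sin(2 * np.pi * 16 * t / n)
y = np.sin(2 * np.pi * 48 * t / n)

Px = np.abs(np.fft.rfft(x)) ** 2
Py = np.abs(np.fft.rfft(y)) ** 2
Pxy = np.abs(np.fft.rfft(x + y)) ** 2

# The power spectrum of the mixture equals the sum of the individual
# power spectra: the cross terms vanish for uncorrelated components
print(np.allclose(Pxy, Px + Py))  # True
```

Note that the time-domain mixture x + y is not the sample-wise sum of magnitudes; it is only in the power-spectral domain that the components add cleanly.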

Signal.

to an image such as this

Specgram.

Here, the horizontal axis represents time and the vertical axis represents frequency. The colour at any time and frequency represents the energy at that time, in that frequency (red is more, blue is less, anything else is in-between; in reality it is a numeric value -- a real number -- that is rendered as colour in these pictures).

The spectrogram of any signal can be derived by a short-time Fourier transform (STFT) of the signal. The STFT typically returns a complex matrix of numbers (where the columns represent time and the rows represent frequencies); however, we retain only the magnitude of these numbers and do not operate on the phase. Thus, each signal is converted to a matrix that can be rendered as a two-dimensional image such as the one above.
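The procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code used in this work: the function name `magnitude_spectrogram` and the frame/hop sizes are our own choices.

```python
import numpy as np

def magnitude_spectrogram(x, frame_len=512, hop=256):
    """Magnitude spectrogram from overlapping, Hann-windowed frames.

    Each column is the magnitude of the FFT of one frame; the phase
    is discarded, as described in the text.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies (rows = frequency bins)
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A one-second 440 Hz sine at a 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
S = magnitude_spectrogram(x)
print(S.shape)  # (frequency bins, time frames) = (257, 61)
```

Rendering `S` (or its square, the power spectrogram) on a colour scale produces images like the one shown above, with frequency on the vertical axis and time on the horizontal axis.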


Conventional methods of learning building blocks

The spectrographic representation of the speech signal is just a matrix. A number of techniques, such as SVD, PCA and ICA, can derive "bases" that can be viewed as linearly combinable building blocks for the data. Unfortunately, these do not satisfy the requirements we posited above.

Here, for example, is the spectrogram of a mixture of two intermittent tones. The two obvious building blocks are the tones themselves.

Two tones (left); PCA bases (right).

The right panel shows the two bases discovered using PCA; the vertical axis in this panel is frequency. Both bases contain both frequency components -- PCA is unable to learn the two separate frequency components. Worse still, the second basis has a negative value at one of the frequencies where the other basis has a positive value. Any combination of the two will result in cancellation, which conflicts with our requirement of pure additivity. And in any case, what does it mean to say that a building block has negative energy at some frequency? The bases are physically meaningless.
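This failure is easy to reproduce on synthetic data. The sketch below (our own toy construction, not the exact experiment shown in the figure) builds a non-negative "spectrogram" of two correlated intermittent tones and extracts PCA bases via an SVD of the mean-centered data: each basis spans both frequencies, and one basis takes opposite signs at the two frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "spectrogram": 8 frequency bins, 2000 frames.  Tone A occupies bin 2
# and tone B bin 5; B tends to sound together with A, so the two tones are
# statistically correlated, as in a mixture of intermittent tones.
n_bins, n_frames = 8, 2000
a = rng.integers(0, 2, n_frames).astype(float)                    # tone A on/off
b = (rng.random(n_frames) < (0.8 * a + 0.2 * (1 - a))).astype(float)
V = np.zeros((n_bins, n_frames))
V[2] = a
V[5] = b

# PCA over frequency: principal directions of the mean-centered spectral vectors
V_centered = V - V.mean(axis=1, keepdims=True)
U, s, _ = np.linalg.svd(V_centered, full_matrices=False)
basis1, basis2 = U[:, 0], U[:, 1]

# Both bases are non-zero at BOTH frequencies (bins 2 and 5), and one basis
# has opposite signs at the two frequencies -- the failure described above.
print(np.round(basis1[[2, 5]], 2), np.round(basis2[[2, 5]], 2))
```

The negative entries arise because PCA only constrains the bases to be orthogonal; nothing forces them, or their combination weights, to be non-negative.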