Bhiksha's Projects

Latent Variable Decompositions for Speech and Audio Processing

Discrete data such as text are often modelled as having been generated by draws from a discrete random variable (RV). Continuous-valued data such as images and sound spectra, on the other hand, are commonly modelled as draws from a continuous-valued RV. But what about data that fall between the two?

In this project we investigate this intermediate space. We model the discrete-valued support of the continuous-valued RV as a discrete-valued RV, and the continuous value at each point of the support as a normalized count of the number of draws of that discrete element.

For example, the spectrogram of a speech signal shows the energy at a discrete set of frequencies, at a discrete set of time indices. In our model, time and frequency are treated as RVs, and the value of the spectrogram at any time-frequency location is treated as the count of the number of draws of that time-frequency pair from a discrete random process.
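The count interpretation can be illustrated with a minimal sketch. It assumes a tiny hand-made "spectrogram" (the matrix `toy_spec` and the helper names `normalize` and `draw_histogram` are hypothetical, chosen only for this example). Time-frequency pairs are drawn with probability proportional to the energy in each cell, and the normalized counts are compared with the normalized spectrogram.

```python
import random
from collections import Counter

def normalize(spec):
    """Scale a non-negative matrix so that its entries sum to 1."""
    total = sum(sum(row) for row in spec)
    return [[v / total for v in row] for row in spec]

def draw_histogram(spec, n_draws, seed=0):
    """Draw (frequency, time) index pairs with probability proportional to
    the matrix entries and return the normalized count at each pair."""
    rng = random.Random(seed)
    n_freq, n_time = len(spec), len(spec[0])
    cells = [(f, t) for f in range(n_freq) for t in range(n_time)]
    weights = [spec[f][t] for f, t in cells]
    counts = Counter(rng.choices(cells, weights=weights, k=n_draws))
    return [[counts[(f, t)] / n_draws for t in range(n_time)]
            for f in range(n_freq)]

# Hypothetical 3-frequency x 3-frame energy matrix (illustrative values only).
toy_spec = [[4.0, 0.0, 1.0],
            [1.0, 2.0, 0.0],
            [0.0, 1.0, 1.0]]

target = normalize(toy_spec)
estimate = draw_histogram(toy_spec, n_draws=50000)

# Largest gap between the histogram of draws and the normalized spectrogram.
max_err = max(abs(estimate[f][t] - target[f][t])
              for f in range(3) for t in range(3))
```

As the number of draws grows, the normalized histogram approaches the normalized spectrogram, which is the sense in which the spectrogram can be read as a count of draws.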

This model has some surprising properties. It yields remarkably simple algorithms for tasks such as monaural source separation and the determination of atomic units from sounds, images, video and text, and even potential solutions to problems such as deblurring of images and deconvolution of sounds.

Mathematically, it can be shown to be identical to the popular technique of non-negative matrix factorization. However, it also provides us with a simple framework for the application of various priors, and enables us to employ various statistical models and methods that have been developed for discrete data such as text. Conversely, the techniques we develop, particularly the model that obtains sparse overcomplete decompositions, are observed to be effective models for discrete data.
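The link to non-negative matrix factorization can be sketched in a few lines. The form of the decomposition is an assumption here, since this page does not spell it out: the normalized spectrogram P(f,t) is approximated by a mixture over a latent variable z, P(f,t) ≈ Σ_z P(z) P(f|z) P(t|z), and fitted with expectation-maximization. The function names `decompose` and `reconstruct` and the toy matrix are hypothetical. Under this assumption the columns P(f|z) play the role of the NMF basis matrix and P(z) P(t|z) the role of the activation matrix.

```python
import random

def random_distribution(n, rng):
    """Return n strictly positive random weights that sum to 1."""
    weights = [rng.random() + 0.1 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def decompose(V, n_components, n_iter=300, seed=0):
    """EM fit of V / sum(V) ~= sum_z Pz[z] * Pf[z][f] * Pt[z][t]
    (assumed model form, not taken from this page)."""
    rng = random.Random(seed)
    n_freq, n_time = len(V), len(V[0])
    total = sum(sum(row) for row in V)
    P = [[v / total for v in row] for row in V]  # observed distribution
    Pz = random_distribution(n_components, rng)
    Pf = [random_distribution(n_freq, rng) for _ in range(n_components)]
    Pt = [random_distribution(n_time, rng) for _ in range(n_components)]
    for _ in range(n_iter):
        acc_z = [0.0] * n_components
        acc_f = [[0.0] * n_freq for _ in range(n_components)]
        acc_t = [[0.0] * n_time for _ in range(n_components)]
        for f in range(n_freq):
            for t in range(n_time):
                # E-step: posterior over z for this cell, weighted by P[f][t].
                joint = [Pz[z] * Pf[z][f] * Pt[z][t]
                         for z in range(n_components)]
                norm = sum(joint)
                if P[f][t] == 0.0 or norm == 0.0:
                    continue
                for z in range(n_components):
                    share = P[f][t] * joint[z] / norm
                    acc_z[z] += share
                    acc_f[z][f] += share
                    acc_t[z][t] += share
        # M-step: renormalize the accumulated mass into distributions.
        mass = sum(acc_z)
        for z in range(n_components):
            if acc_z[z] > 0.0:
                Pf[z] = [x / acc_z[z] for x in acc_f[z]]
                Pt[z] = [x / acc_z[z] for x in acc_t[z]]
            Pz[z] = acc_z[z] / mass
    return Pz, Pf, Pt

def reconstruct(Pz, Pf, Pt):
    """Rebuild the model distribution over (frequency, time) cells."""
    n_freq, n_time = len(Pf[0]), len(Pt[0])
    return [[sum(Pz[z] * Pf[z][f] * Pt[z][t] for z in range(len(Pz)))
             for t in range(n_time)] for f in range(n_freq)]

# Hypothetical toy matrix made of two non-overlapping "notes".
toy = [[1.0, 1.0, 0.0],
       [2.0, 2.0, 0.0],
       [0.0, 0.0, 6.0],
       [0.0, 0.0, 2.0]]

Pz, Pf, Pt = decompose(toy, n_components=2)
model = reconstruct(Pz, Pf, Pt)
toy_total = sum(sum(row) for row in toy)
max_err = max(abs(model[f][t] - toy[f][t] / toy_total)
              for f in range(4) for t in range(3))
```

Because every factor is a probability distribution, priors can be attached to P(z), P(f|z) or P(t|z) directly, which is the flexibility the paragraph above refers to.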

Introduction: The notion of building blocks

The basic model: Modelling the support as a random variable

A simple demonstration: Discovering notes and vowel sounds

Evaluating the model

Expanding the bandwidth of a narrowband signal (includes demo)
Separating signals from mixtures (includes demos with speech+music, speech+noise, speech+speech)
Images and text

Is this model sufficient?

Sparse and Overcomplete Decompositions

Evaluating the overcomplete model

Separating signals from mixtures (includes demos with speech+music, speech+noise, speech+speech)
Images and text

Convolutive and transform-invariant extensions

Examples and tests

Relations to other models

Extensions and other applications

A Fun Demo: Song personalization

Do you ever hear a song and think "this would sound so much better if the singer sang at a different pitch", or "I'd like to change the chord right here"? In this work, presented at FRSM 2007, we tried this in the context of Indian pop music. [Link to appear]

Collaborators

All of this work was done in collaboration with (and co-authored by):

Paris Smaragdis, Adobe Inc.
Madhusudana Shashanka, Mars Inc.

The song personalization work and all work related to automatic speech recognition were also co-authored by Rita Singh (CMU).

List of publications on the topic