Bhiksha's Projects

Latent Variable Decompositions for Speech and Audio Processing

Discrete data such as text are often modelled as having been generated by draws from a discrete random variable. Continuous-valued data such as images and sound spectra, on the other hand, are commonly modelled as draws from a continuous-valued RV. But how about the intersection of the two?

In this project we investigate this intermediate space. We model the discrete-valued support of the continuous-valued RV as a discrete-valued RV in its own right, and the continuous value at each point of the support as a normalized count of the number of draws of that discrete element.

For example, the spectrogram of a speech signal shows the energy at a discrete set of frequencies, at a discrete number of time indices. In our model, time and frequency are treated as RVs, and the value of the spectrogram at any time-frequency location is treated as the count of the number of draws of that time-frequency pair from a discrete random process.
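The count interpretation above can be illustrated with a small sketch. It assumes a non-negative magnitude spectrogram stored as a NumPy array with frequency along rows and time along columns; the function names and the toy array are hypothetical and are not taken from the project itself.

```python
import numpy as np

def spectrogram_to_distribution(S):
    """Normalize a non-negative spectrogram so its entries sum to 1.

    Under the count interpretation, entry (f, t) then approximates the
    probability of drawing the time-frequency pair (f, t).
    """
    S = np.asarray(S, dtype=float)
    if np.any(S < 0):
        raise ValueError("spectrogram entries must be non-negative")
    total = S.sum()
    if total <= 0:
        raise ValueError("spectrogram must contain some energy")
    return S / total

def draw_counts(P, n_draws, seed=0):
    """Draw n_draws time-frequency pairs from P and histogram them."""
    rng = np.random.default_rng(seed)
    counts = rng.multinomial(n_draws, P.ravel())
    return counts.reshape(P.shape)

# Toy "spectrogram": 3 frequency bins by 4 time frames (made-up values).
S = np.array([[1.0, 2.0, 0.5, 0.0],
              [4.0, 1.0, 0.5, 2.0],
              [0.0, 3.0, 1.0, 1.0]])
P = spectrogram_to_distribution(S)
counts = draw_counts(P, n_draws=200_000)
# With many draws, the normalized histogram approaches P.
empirical = counts / counts.sum()
```

With enough draws, `empirical` comes close to `P`, which is the sense in which a spectrogram can be read as a scaled histogram of draws.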

This model has some surprising properties. It yields remarkably simple algorithms for tasks such as monaural source separation and the determination of atomic units from sounds, images, video and text, and even suggests potential solutions to problems such as deblurring of images and deconvolution of sounds.

Mathematically, it can be shown to be identical to the popular technique of non-negative matrix factorization. However, it also provides a simple framework for applying various priors, and enables us to employ statistical models and methods that have been developed for discrete data such as text. Conversely, the techniques we develop, particularly the model that obtains sparse overcomplete decompositions, are observed to be effective models for discrete data.
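To make the connection to non-negative matrix factorization concrete, here is a minimal sketch under stated assumptions. It assumes one common form of latent variable decomposition, in which each spectral frame is a mixture P_t(f) = sum_z P_t(z) P(f|z), fitted by expectation-maximization. This form, the function name `plca_decompose` and all parameter names are illustrative assumptions, not the exact formulation used in this project.

```python
import numpy as np

def plca_decompose(V, n_components, n_iter=100, seed=0):
    """EM for the assumed mixture model P_t(f) = sum_z P_t(z) P(f|z).

    V: non-negative array, shape (n_freq, n_time), read as counts.
    Returns W (columns are P(f|z)) and H (columns are P_t(z)).
    """
    V = np.asarray(V, dtype=float)
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    eps = 1e-12

    # Random initialization, normalized so every column is a distribution.
    W = rng.random((n_freq, n_components))
    W /= W.sum(axis=0, keepdims=True)
    H = rng.random((n_components, n_time))
    H /= H.sum(axis=0, keepdims=True)

    for _ in range(n_iter):
        # Ratio of observed counts to the current model. The E-step
        # posterior over z is folded into the two products below.
        R = V / (W @ H + eps)
        W_new = W * (R @ H.T)   # proportional to sum_t V[f,t] P(z|f,t)
        H_new = H * (W.T @ R)   # proportional to sum_f V[f,t] P(z|f,t)
        W = W_new / (W_new.sum(axis=0, keepdims=True) + eps)
        H = H_new / (H_new.sum(axis=0, keepdims=True) + eps)
    return W, H
```

Apart from the normalization steps, these updates have the same form as the multiplicative updates for non-negative matrix factorization under a Kullback-Leibler style cost, with `W @ H` approximating the column-normalized input.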

  • Introduction: The notion of building blocks
  • The basic model: Modelling the support as a random variable
  • Evaluating the model
  • Is this model sufficient?
  • Sparse and Overcomplete Decompositions
  • Evaluating the overcomplete model
  • Convolutive and transform-invariant extensions
  • Relations to other models
  • Extensions and other applications

    A Fun Demo: Song personalization

    Do you ever hear a song and think "this would sound so much better if the singer sang at a different pitch", or "I'd like to change the chord right here"? In this work, presented at FRSM 2007, we tried this in the context of Indian pop music. [Link to appear]


    All of this work was done in collaboration with (and co-authored by):

    The song personalization work and all work related to automatic speech recognition were also co-authored by Rita Singh (CMU).

    List of publications on the topic