Speaker tracking using audio-visual statistical models

Abstract

I'll present excerpts of some work on a self-calibrating algorithm for audio-assisted visual tracking using a combination of simple audio and video models.
In most systems that handle digital media, the audio and video parts are treated separately. Such systems usually have subsystems specialised for tracking based on one modality or the other, and these are optimised independently. But in principle, a tracker that exploits both modalities may achieve better performance than one that exploits only one. Not only could one modality compensate for any temporary weakness of the other, but a combined model could also exploit correlations between the modalities.
With some video clips of simple tracking scenarios, I'll show that a single probabilistic generative model can improve on the performance of the simple individual models. This model combines simple audio and video models, treating the true location as a hidden (unobserved, latent) variable which is inferred for tracking.
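
To illustrate the idea of inferring a hidden location from two modalities, here is a minimal sketch. It is not the model from the talk: it assumes a scalar location, a Gaussian random-walk prior, and independent Gaussian noise on an audio-derived and a video-derived position estimate. All names (`fuse_gaussian`, `track`, `var_a`, `var_v`, `var_drift`) are hypothetical.

```python
def fuse_gaussian(y_a, var_a, y_v, var_v, mu_0, var_0):
    """Posterior over a scalar latent location x.

    Assumed (illustrative) model: x ~ N(mu_0, var_0),
    audio estimate y_a ~ N(x, var_a), video estimate y_v ~ N(x, var_v).
    The posterior is Gaussian, with precisions adding up.
    """
    precision = 1.0 / var_0 + 1.0 / var_a + 1.0 / var_v
    mean = (mu_0 / var_0 + y_a / var_a + y_v / var_v) / precision
    return mean, 1.0 / precision


def track(audio_obs, video_obs, var_a, var_v, var_drift, mu_0=0.0, var_0=1e6):
    """Track the latent location over time with a random-walk prior.

    At each frame the previous posterior is widened by var_drift
    (the location may have moved), then both observations are fused.
    """
    mean, var = mu_0, var_0
    path = []
    for y_a, y_v in zip(audio_obs, video_obs):
        var += var_drift  # prediction step: allow for motion
        mean, var = fuse_gaussian(y_a, var_a, y_v, var_v, mean, var)
        path.append(mean)
    return path


if __name__ == "__main__":
    # Hypothetical per-frame position estimates from each modality.
    audio = [0.9, 1.2, 1.6, 2.1]
    video = [1.0, 1.1, 1.5, 2.0]
    print(track(audio, video, var_a=0.5, var_v=0.1, var_drift=0.05))
```

In this simplified setting, a modality with a large noise variance contributes little to the estimate, which is one way the other modality can compensate for its temporary weakness.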
This work was done at Microsoft Research in summer 2001, with Hagai Attias and Nebojsa Jojic.


Charles Rosenberg

Last modified: Tue Mar 12 18:01:04 EST 2002