Cheating with Imperfect Transcripts


Most speech recognition systems try to reconstruct a word sequence from an acoustic input, using prior information about the language being spoken. In some cases, more information is available to the decoder than the acoustics alone. When decoding a television news broadcast, for example, the closed-caption text that is often recorded for hearing-impaired viewers may also be available. While these captions are generally not completely accurate transcriptions, they can be treated as a strong hint as to what was actually spoken.

In this paper, we present a formalization of this problem in terms of the source-channel paradigm. We propose a simple translation model for mapping caption sequences to word sequences, which updates the language model with the prior information inherent in the captions. We also describe an efficient implementation of the search in a Viterbi decoder, and present results using this system in the broadcast news domain.
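As a sketch of how such a source-channel formalization typically looks (the exact decomposition and notation here are assumptions, not necessarily those used in this paper), let $A$ denote the acoustic observations, $C$ the caption sequence, and $W$ the word sequence to be recovered. Decoding then seeks

```latex
\hat{W} \;=\; \arg\max_{W} \; P(W \mid A, C)
       \;=\; \arg\max_{W} \; P(A \mid W)\, P(W \mid C),
```

where the acoustic term $P(A \mid W)$ is the standard acoustic model (assuming $A$ is conditionally independent of $C$ given $W$), and $P(W \mid C)$ plays the role of a caption-conditioned language model. A translation model relating captions to words, such as one based on $P(C \mid W)$ via Bayes' rule, is one way this conditional term can update the baseline language model $P(W)$ with the prior information carried by the captions.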