In a number of applications, one often has access to distinct but overlapping views of the same information. For instance, a lecture may be supported by slides, a TV series may be accompanied by subtitles, or a conference in one language may be interpreted into another. Since speech and language processing technologies such as speech recognition are imperfect, it would be desirable to fuse these different perspectives in order to obtain improved performance.
In this work, we present a general method for combining multiple information streams that are, in part or as a whole, translations of each other. The algorithms developed for this purpose rely on both word lattices, which represent posterior probability distributions over word sequences, and phrase tables, which map word sequences to their respective translations, to generate an alignment of the streams. From this alignment, we extract phrase pairs and use them to compute a new most likely decoding of each stream, biased towards phrases in the alignment. This method was applied in two different scenarios: the transcription of simultaneously interpreted speeches in the European Parliament and of lectures supported by slides. In both scenarios, we achieved performance improvements over speech-recognition-only baselines. We also demonstrate how recovering acronyms and words that cannot be found in the lattices can enhance overall speech recognition performance, and propose a scheme for adding new pronunciations to the recognition lexicon. Both of these techniques are likewise based on cross-stream information.
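The biased re-decoding step described above can be illustrated with a minimal sketch. The toy lattice below is reduced to a list of (hypothesis, posterior) pairs, and the phrase pairs extracted from the cross-stream alignment are reduced to a set of target-side phrases; all names and the scoring scheme (a fixed log-domain bonus per matched phrase) are illustrative assumptions, not the actual implementation.

```python
import math

PHRASE_BONUS = 2.0  # log-domain reward for each aligned phrase found (assumed value)

def rescore(hypotheses, aligned_phrases):
    """Re-rank lattice hypotheses, biasing towards aligned phrases."""
    scored = []
    for words, posterior in hypotheses:
        text = " ".join(words)
        # Reward hypotheses containing phrases supported by the other stream.
        bonus = sum(PHRASE_BONUS for p in aligned_phrases if p in text)
        scored.append((math.log(posterior) + bonus, words))
    scored.sort(reverse=True)
    return [words for _, words in scored]

# Toy example: the aligned phrase "european parliament" promotes the
# second hypothesis over the first, despite its lower posterior.
hyps = [(["european", "apartment"], 0.6),
        (["european", "parliament"], 0.4)]
print(rescore(hyps, {"european parliament"})[0])  # → ['european', 'parliament']
```

In practice the bias would be applied over the full lattice rather than an n-best list, but the principle is the same: evidence from a parallel stream reweights the recognizer's posterior distribution.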
Towards the completion of the current work, we will also explore how rich transcription techniques, namely sentence segmentation and the detection/recovery of disfluencies, can benefit from the information contained in parallel streams. Cues extracted from other streams will be used to supplement existing methods for each of these problems. Finally, we will investigate ways to use the information extracted from these streams to perform lightly supervised acoustic model training.
Alan Black (Chair)
João Neto (Co-Chair, IST)
Luisa Coheur (IST)
Isabel Trancoso (IST)
Steve Renals (University of Edinburgh)
staceyy [atsymbol] cs.cmu.edu