As the amount of available speech data increases rapidly, so does the need for efficient search and understanding. Techniques such as Spoken Term Detection (STD), which focuses on finding instances of a particular spoken word or phrase in a corpus, try to address this problem by locating the query word with the desired meaning. However, STD may not provide the desired result if the Automatic Speech Recognition (ASR) system in the STD pipeline has limited performance, or if the meaning of the retrieved item is not the one intended. In this thesis, we propose several features that improve search and understanding of noisy conversational speech.
First, we describe the Word Burst phenomenon, which leverages a structural property of conversational speech. Word Burst refers to the tendency of particular content words in conversation to occur in close proximity to one another as a byproduct of the topic under discussion. We design a decoder output rescoring algorithm based on the Word Burst phenomenon to refine our recognition results for better STD performance. Our rescoring algorithm significantly reduces the false alarms produced by the STD system. We also leverage Word Burst as a feature for identifying recognition errors in conversational speech, and our experiments show that including this feature provides significant improvement. With this feature, we demonstrate that higher-level information, such as this structural property, can improve search and understanding without the need for language-specific resources or external knowledge.
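The rescoring idea above can be illustrated with a minimal sketch. The function below is a hypothetical simplification, not the thesis's actual algorithm: it assumes each STD hypothesis is a (word, time, score) tuple and raises the score of a hypothesized word whenever other hypotheses of the same word fall within a temporal window, following the Word Burst intuition that bursty content words are more likely to be correct.

```python
from collections import defaultdict

def word_burst_rescore(hypotheses, window=30.0, boost=0.1):
    """Illustrative Word Burst rescoring (a sketch, not the thesis method).

    hypotheses: list of (word, start_time, score) tuples from a decoder.
    A hypothesis gains `boost` for every other hypothesis of the same
    word whose start time lies within `window` seconds of it.
    Returns a new list with adjusted scores.
    """
    # Index the occurrence times of each hypothesized word.
    times = defaultdict(list)
    for word, t, _ in hypotheses:
        times[word].append(t)

    rescored = []
    for word, t, score in hypotheses:
        # Count supporting occurrences of the same word nearby in time.
        support = sum(1 for u in times[word]
                      if u != t and abs(u - t) <= window)
        rescored.append((word, t, score + boost * support))
    return rescored
```

Isolated one-off detections receive no boost, so spurious false alarms fall below bursty, topically supported detections after rescoring.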
Second, we show that the mismatch between different decoder outputs created by the same ASR system can be leveraged as a feature for better STD performance. After the decoding process of an ASR system, the result can be stored as a lattice or as a confusion network. The lattice retains richer history for each word, while the confusion network maintains a simpler, more compact format. Each format contains unique information that is not present in the other. By combining the STD results generated from these two decoder outputs, we achieve further improvement in STD performance. This feature shows that unexplored information can reside in different outputs generated by the same ASR system.
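A minimal sketch of this combination step follows. It is an assumed simplification (the representation of hits and the fusion rule are hypothetical): each hit list maps a (query, utterance) key to a detection score, scores for hits found in both outputs are linearly interpolated, and hits unique to one output are kept, since each output may contain information missing from the other.

```python
def combine_std_results(lattice_hits, confnet_hits, alpha=0.5):
    """Illustrative fusion of STD hit lists from two decoder outputs
    (lattice and confusion network) of the same ASR system.

    lattice_hits, confnet_hits: dicts mapping (query, utterance_id)
    to a detection score. Shared hits are interpolated with weight
    `alpha`; hits unique to either output are retained as-is.
    """
    combined = {}
    for key in set(lattice_hits) | set(confnet_hits):
        if key in lattice_hits and key in confnet_hits:
            # Both outputs agree a hit exists: fuse the two scores.
            combined[key] = (alpha * lattice_hits[key]
                             + (1 - alpha) * confnet_hits[key])
        else:
            # Keep hits found in only one output; each format can
            # surface detections the other misses.
            combined[key] = lattice_hits.get(key, confnet_hits.get(key))
    return combined
```

Keeping the single-source hits is what exploits the mismatch: a detection present only in the lattice (or only in the confusion network) would be lost by an intersection-style combination.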
Last but not least, we present a feature based on distributed representations of spoken utterances. Distributed representations group similar words close together in a vector space according to their contexts: words that appear in similar contexts are projected near one another. In this feature space, we project not only individual words but also entire utterances containing multiple words. We apply this feature to the Spoken Word Sense Induction (SWSI) task, which differentiates instances of a target keyword by clustering them according to context. We compare this approach with several existing approaches and show that it achieves the best performance, regardless of ASR quality.
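The utterance-level projection can be sketched as follows. The word vectors here are toy stand-ins for learned distributed representations, and averaging word vectors is one common, assumed way to embed an utterance (not necessarily the thesis's exact composition method); instances of a target keyword whose utterance vectors are close then fall into the same induced sense cluster.

```python
import math

# Toy 2-d word vectors standing in for learned distributed
# representations (hypothetical values, for illustration only).
EMB = {
    "bank":  [0.5, 0.5],
    "money": [0.9, 0.1],
    "loan":  [0.8, 0.2],
    "river": [0.1, 0.9],
    "water": [0.2, 0.8],
}

def utterance_vector(words):
    """Embed an utterance as the mean of its known word vectors."""
    vecs = [EMB[w] for w in words if w in EMB]
    dim = len(next(iter(EMB.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

For SWSI, each occurrence of the target keyword (e.g. "bank") is represented by the vector of its surrounding utterance; occurrences used in a financial context end up closer to each other than to occurrences used in a river context, so a standard clustering algorithm over these vectors separates the senses.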
Alexander Rudnicky (Chair)
Alan W Black
Alexander G Hauptmann
Gareth J.F. Jones (Dublin City University)