SCS Special Seminar

  • Shinji Watanabe
  • Associate Research Professor
  • Department of Electrical and Computer Engineering
  • Johns Hopkins University

End-to-End Speech Processing: From Pipeline to Integrated Architecture

Recently, the end-to-end automatic speech recognition (ASR) paradigm has attracted great research interest as an alternative to the conventional hybrid framework of deep neural networks and hidden Markov models. This novel paradigm simplifies the conventional ASR pipeline by integrating ASR components such as acoustic, phonetic, and language models into a single neural network and optimizing the whole system for the ultimate ASR objective: generating a correct label sequence. This talk introduces extensions of the basic end-to-end architecture, focusing on its integration capability, to tackle major problems faced by current ASR technologies in adverse environments, including multilingual, multi-speaker, distant-talk, and sparse-data conditions. For multilingual issues, we fully exploit the end-to-end ASR advantage of eliminating the need for linguistic information such as pronunciation dictionaries, and build a monolithic multilingual ASR system with a language-independent neural network architecture that can recognize speech in 10 different languages. We also extend the framework to multi-speaker ASR, where the system directly decodes multiple label sequences from a single speech sequence by unifying source separation and speech recognition in an end-to-end manner. Another extension encompasses microphone-array signal processing, such as state-of-the-art neural dereverberation and beamforming, within the end-to-end framework. This architecture allows the speech enhancement and ASR components to be jointly optimized to improve the ASR objective, yielding an end-to-end framework that works well in the presence of strong background noise and reverberation. Finally, we introduce our ongoing work on semi-supervised training using cycle consistency, which enables us to leverage unpaired speech and/or text data by integrating ASR with text-to-speech within the end-to-end framework.

Shinji Watanabe is an Associate Research Professor at Johns Hopkins University, Baltimore, MD, USA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan, in 1999, 2001, and 2006, respectively. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at the Georgia Institute of Technology, Atlanta, GA, USA, in 2009, and a Senior Principal Research Scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published more than 200 papers in major speech and machine learning journals and conferences. He has served as an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing and is a member of several technical committees, including the IEEE Signal Processing Society Speech and Language Technical Committee.

Faculty Host: Bhiksha Raj

Language Technologies Institute

For More Information, Please Contact: