Yoshiaki Ohshima, "Environmental Robustness in Speech Recognition using Physiologically-Motivated Signal Processing", Ph.D. Thesis, CMU, December 1993.

Abstract

Environmental robustness is one of the most important factors determining the success of a speech recognition system. Several approaches to environmental robustness have been developed in recent years, including cepstral normalization to compensate for acoustical differences between the training and testing conditions of a recognition system, the use of microphone arrays to separate speech from noise sources arriving from different directions in space, and the use of signal processing schemes based on knowledge of the human auditory periphery. This thesis examines methods by which speech recognition systems can be made more environmentally robust, by analyzing the performance of a representative model of the auditory periphery developed by Seneff. The goals of the thesis are threefold. First, we document, more thoroughly than had been done previously, the extent to which the Seneff model reduces the degradation in speech recognition accuracy incurred when testing conditions include additive noise and/or distortion introduced by linear filtering that was not present when the system was trained. Second, we examine the extent to which individual components of the nonlinear neural transduction stage of the Seneff model contribute to recognition accuracy, by evaluating recognition with individual components removed from the processing. Third, we determine the extent to which the robustness provided by the Seneff model is complementary to, and independent of, the improvement in recognition accuracy already provided by existing successful acoustical pre-processing algorithms such as the CDCN algorithm.
The Seneff model provides two types of outputs: the putative mean rate of neural activity within a narrow range of frequencies, and the temporal synchrony of this activity to the incoming speech sound. These outputs can be regarded as different estimates of the spectral energy in the incoming speech. We find that both the mean-rate and synchrony outputs of the Seneff model provide better recognition accuracy than is obtained with conventional signal processing based on cepstral coefficients derived from linear prediction, both when speech is degraded by artificially-added noise and when it is modified by unknown linear filtering. Although the original Seneff model has 40 frequency-specific mean-rate and synchrony outputs, we find that no loss in recognition accuracy is incurred if classification decisions are made on the basis of 5 principal components of the mean-rate outputs or 10 principal components of the synchrony outputs. The neural transduction (NT) stage of the Seneff model consists of a cascade of components that perform half-wave rectification, short-term adaptation, lowpass filtering, and automatic gain control (AGC). Of these components, short-term adaptation appears to be the most important; nevertheless, no component could be removed from the model without sacrificing recognition accuracy. We develop several ways of combining auditory processing with conventional cepstral processing using environmental normalization techniques such as CDCN, either by normalizing the cepstra derived from the outputs of the auditory model or by normalizing the input speech waveform directly. We show that the recognition accuracy provided by physiologically-motivated signal processing can, in some circumstances, be further improved by combination with environmentally-normalized cepstral processing.
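The NT-stage cascade named above can be sketched per channel as follows. This is a minimal illustrative sketch only: the component order (rectification, short-term adaptation, lowpass filtering, AGC) follows the abstract, but the specific nonlinearities, the one-pole filter structures, and all time constants are placeholder assumptions, not Seneff's published parameters.

```python
import numpy as np

def nt_stage(x, fs=16000, adapt_tau=0.03, lp_cutoff=1000.0,
             agc_tau=0.3, agc_eps=1e-3):
    """Simplified sketch of one channel of a Seneff-style neural
    transduction stage. All constants are illustrative placeholders."""
    # 1. Half-wave rectification (the actual model uses a smooth,
    #    saturating rectifier rather than a hard max).
    y = np.maximum(x, 0.0)

    # 2. Short-term adaptation, sketched as subtracting a slowly
    #    decaying running average of the rectified signal (the actual
    #    model uses a reservoir/membrane circuit).
    a = np.exp(-1.0 / (adapt_tau * fs))
    acc = 0.0
    for i, v in enumerate(y):
        acc = a * acc + (1.0 - a) * v
        y[i] = max(v - acc, 0.0)

    # 3. One-pole lowpass filter, suppressing synchrony to components
    #    above roughly lp_cutoff.
    b = np.exp(-2.0 * np.pi * lp_cutoff / fs)
    z = 0.0
    for i, v in enumerate(y):
        z = b * z + (1.0 - b) * v
        y[i] = z

    # 4. Automatic gain control: divide by a slow envelope estimate.
    g = np.exp(-1.0 / (agc_tau * fs))
    env = 0.0
    out = np.empty_like(y)
    for i, v in enumerate(y):
        env = g * env + (1.0 - g) * v
        out[i] = v / (env + agc_eps)
    return out
```

In the full model, an array of such channels (one per bandpass filter of a critical-band filterbank) produces the 40 mean-rate outputs discussed above.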
To combine auditory processing with environmentally-normalized cepstral processing, we develop methods to resynthesize speech both from its cepstral representation and from the outputs of the auditory model. Speech resynthesized from cepstral coefficients provides a modest improvement in recognition accuracy. Speech resynthesized from the outputs of the Seneff model does not improve accuracy, but it is intelligible and may be useful for speech enhancement.
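One standard building block for resynthesis from a cepstral representation is reconstructing a minimum-phase impulse response from real cepstral coefficients by cepstral folding. The sketch below illustrates that textbook technique only; it is an assumption for illustration, not the thesis's actual resynthesis pipeline, and the frame-by-frame excitation and overlap-add needed for a full waveform are omitted.

```python
import numpy as np

def cepstrum_to_min_phase(c, nfft=1024):
    """Reconstruct a minimum-phase impulse response from real cepstral
    coefficients c[0..M] via cepstral folding (a standard recipe; an
    illustrative building block, not the thesis's full method)."""
    chat = np.zeros(nfft)
    # Fold the symmetric real cepstrum into the complex cepstrum of the
    # minimum-phase equivalent: keep c[0], double the causal part.
    chat[0] = c[0]
    chat[1:len(c)] = 2.0 * np.asarray(c[1:], dtype=float)
    # Exponentiate the log-spectrum and invert to get the impulse response.
    h = np.fft.ifft(np.exp(np.fft.fft(chat)))
    return h.real
```

Applied frame by frame, such impulse responses can be driven by an excitation signal and overlap-added to yield an audible waveform from a cepstral sequence.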