Thomas M. Sullivan, Multi-Microphone Correlation-Based Processing for Robust Automatic Speech Recognition, Ph.D. Thesis, ECE Department, CMU, August 1996.

Abstract

Speech recognition systems suffer from degradation in recognition accuracy when faced with input from noisy and reverberant environments. While most users prefer a microphone that is placed in the middle of a conference table, on top of a computer monitor, or mounted in a wall, the recognition accuracy obtained with such microphones is generally much worse than the accuracy obtained using a close-talking headset-mounted microphone. Unfortunately, headset-mounted microphones are often uncomfortable or impractical for users. Research in recent years on environmental robustness in speech recognition has concentrated on signal processing using the output of a single microphone to correct for differences in spectral coloration between microphones used in the training and testing environments, and to account for the effects of linear filtering and additive noise present in real testing environments.

This thesis explores the use of microphone arrays to provide further improvements in speech recognition accuracy. A novel approach to multiple-microphone processing for the enhancement of speech input to an automatic speech recognition system is described and discussed. The system is loosely based on the processing of the binaural hearing system, but with extensions to an arbitrary number of input microphones. The processing includes bandpass filtering and nonlinear rectification of the signals from the microphones to model the effects of the peripheral auditory system, followed by cross-correlation within each frequency band of the outputs of the rectifiers from microphone to microphone. Estimates of the correlated energy within each frequency band are used as the basis for a feature set for an automatic speech recognition system.
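The processing chain described above (bandpass filtering, rectification, cross-correlation across microphones within each band) can be illustrated with a minimal sketch. It is not the implementation used in the thesis. The second-order band-pass filter, the half-wave rectifier, the zero-lag product across microphones (which presumes the desired signal arrives in phase at every microphone), and all names and parameter values below are assumptions chosen for illustration.

```python
import math
import random

def bandpass(x, f0, fs, q=4.0):
    """Second-order band-pass (biquad) centred on f0 Hz.
    Assumed stand-in for one channel of an auditory filterbank."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b0 * xn + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y

def half_wave(x):
    """Half-wave rectification: keep positive values, zero the rest."""
    return [max(v, 0.0) for v in x]

def correlated_energy(mics, centre_freqs, fs):
    """For each band, average over time the product of the rectified
    band outputs of all microphones (zero-lag correlation, assumed)."""
    feats = []
    for f0 in centre_freqs:
        rect = [half_wave(bandpass(m, f0, fs)) for m in mics]
        n = min(len(r) for r in rect)
        total = 0.0
        for t in range(n):
            prod = 1.0
            for r in rect:
                prod *= r[t]
            total += prod
        feats.append(total / n)
    return feats

# Synthetic check: a 1 kHz tone common to both microphones,
# plus noise that is independent from microphone to microphone.
fs = 16000
rng = random.Random(0)
tone = [math.sin(2.0 * math.pi * 1000.0 * t / fs) for t in range(3200)]
mics = [[s + rng.uniform(-0.3, 0.3) for s in tone] for _ in range(2)]
feats = correlated_energy(mics, [1000.0, 3000.0], fs)
print(feats)
```

In this synthetic case the band centred on the shared tone should yield a much larger value than the band that contains mostly uncorrelated noise, which is the property that makes correlated energy a plausible basis for noise-robust features.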
Speech recognition accuracy in natural environments and with artificially added noise was compared using the new correlation-based system, conventional delay-and-sum beamforming, and traditional adaptive filtering using the Griffiths-Jim algorithm. It was found that the more computationally costly correlation-based system provided substantially better recognition accuracy than previous approaches in pilot experiments using artificial stimuli, and in experiments using natural speech signals that were artificially corrupted by additive noise. The correlation-based system provided a consistent, but much smaller, improvement in recognition accuracy (relative to previous approaches) for experiments conducted using speech in two natural environments. It is also demonstrated that the benefit provided by microphone array processing is complementary to the benefit provided by single-channel environmental adaptation algorithms such as codeword-dependent cepstral normalization, regardless of which adaptation procedure is employed.
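For reference, the delay-and-sum baseline named above can be sketched in a few lines. This is a generic illustration and not the configuration evaluated in the thesis. It assumes the per-microphone arrival delays are already known and are whole numbers of samples; the function name and arguments are hypothetical.

```python
def delay_and_sum(mics, delays):
    """Advance each channel by its known integer sample delay so the
    desired signal lines up, then average across channels."""
    n = min(len(m) - d for m, d in zip(mics, delays))
    return [
        sum(m[t + d] for m, d in zip(mics, delays)) / len(mics)
        for t in range(n)
    ]

# The same source reaches the second microphone three samples later.
source = [0.0, 1.0, 0.5, -0.5, -1.0, 0.25]
mic_a = list(source)
mic_b = [0.0, 0.0, 0.0] + source
aligned = delay_and_sum([mic_a, mic_b], [0, 3])
print(aligned)
```

After alignment the desired signal adds coherently while noise that differs between channels tends to average down, which is the basic mechanism a delay-and-sum array relies on.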