Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!pipex!uunet!munnari.oz.au!cs.mu.OZ.AU!mullauna.cs.mu.OZ.AU!mfw
From: mfw@mullauna.cs.mu.OZ.AU (Michael John FEARN-WANNAN)
Subject: Re: Help on design of speech recognition
Message-ID: <9228109.19223@mulga.cs.mu.OZ.AU>
Keywords: speech recognition
Sender: news@cs.mu.OZ.AU
Organization: Computer Science, University of Melbourne, Australia
References: <lels.9.718355427@unpcs1.cs.unp.ac.za>
Date: Tue, 6 Oct 1992 23:55:33 GMT
Lines: 44

lels@unpcs1.cs.unp.ac.za (Leonard Els) writes:

>Hi,

>I am working on a Speaker Independent Speech Recognition system and was
>wondering what the best measurements are to use in the analysis.

>I have come across the following: Zero-Crossing rate, Energy, LPC,
>Cepstral Coefficients, and formant analysis.

>Can I use formant analysis for speaker independent recognition? Or
>should I rather use something else? If formant analysis is not a good
>idea, why not?

If you were to take human speech recognition as a model, you would
probably find that all of these analyses, and more, are being used. All
sorts of mechanisms are at work in real, live speech recognition. It is
unlikely that any one mathematical/engineering analysis of a waveform
will give you all the information you need.
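Two of the measurements mentioned, zero-crossing rate and short-time
energy, are simple enough to sketch. This is only an illustrative toy
(the frame contents, frame length, and the sign convention at zero are
arbitrary choices of mine), but it shows why the two features complement
each other: a fricative-like noisy frame has a high crossing rate and low
energy, while a voiced-like frame behaves the other way around.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude over the frame."""
    return sum(x * x for x in frame) / len(frame)

# Invented test signals: a slow sinusoid standing in for a voiced frame,
# and a low-amplitude alternating signal standing in for frication noise.
voiced = [math.sin(2 * math.pi * 2 * t / 100) for t in range(100)]
noisy = [(-1) ** t * 0.1 for t in range(100)]
```

Running both measures over both frames separates them cleanly: the noisy
frame crosses zero on every sample pair, the voiced frame only a few
times per period, and the energy ordering is reversed.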

Aside from anything else, you should remember that the whole
hearing/listening/understanding mechanism is dynamic. It appears that we
can focus our ears in much the same way that we can focus our eyes. There
are preattentive mechanisms which cause us to foveate (to borrow a term
from vision) onto potential events in a waveform, perhaps to help
identify features in an acoustic scene. The point is that in most human
perception we use multiple cues wherever they are available, and we apply
our intelligence/world knowledge/experience to interpreting our sense
data. We often hear what we expect to hear, and we listen or look for
what we expect to find.

Perhaps you should consider using multiple analyses of a waveform and
applying them to some kind of probabilistic model. Maybe get each of them
to cast a vote on what the segment might represent. A neural network
might be useful, or maybe some fuzzy logic; I'm only guessing here. Some
of the features you could extract from the waveform are prosody (i.e.
pitch, intensity, duration) and formants, but you should also look at the
lexical, semantic, pragmatic, phonological, morphological, and syntactic
aspects of the language in question for assistance with such things as
determining the locations of word boundaries, disambiguating
word/phrase/sentence meanings, etc.
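The voting idea above could be sketched as a weighted combination of
per-analysis scores. Everything concrete here is invented for
illustration: the phone labels, the scores, and the idea that each
analysis emits a score per candidate are my assumptions, not a real
recognizer's design.

```python
def combine_votes(votes, weights=None):
    """Combine per-analysis score dicts into one winning label.

    votes: list of {label: score} dicts, one per analysis.
    weights: optional per-analysis weights (default: equal).
    """
    if weights is None:
        weights = [1.0] * len(votes)
    totals = {}
    for vote, w in zip(votes, weights):
        for label, score in vote.items():
            totals[label] = totals.get(label, 0.0) + w * score
    # The label with the highest weighted total wins the vote.
    return max(totals, key=totals.get)

# Hypothetical example: three analyses scoring two phone candidates.
votes = [
    {"/s/": 0.9, "/f/": 0.1},  # zero-crossing rate strongly favours /s/
    {"/s/": 0.4, "/f/": 0.6},  # formant evidence weakly favours /f/
    {"/s/": 0.7, "/f/": 0.3},  # energy contour favours /s/
]
best = combine_votes(votes)
```

The point of the weights is that less reliable analyses (or ones degraded
by noise in a given segment) can be discounted without being discarded,
which is roughly what a probabilistic model would do more formally.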

Regards,
Michael

