Newsgroups: comp.speech
Path: lyra.csx.cam.ac.uk!doc.ic.ac.uk!daresbury!keele!uknet!EU.net!howland.reston.ans.net!agate!msuinfo!uchinews!ellis!afrancis
From: afrancis@ellis.uchicago.edu (alexander l francis)
Subject: Re: Tomorrows World Speech Recogniser
Message-ID: <1994Apr27.172257.18098@midway.uchicago.edu>
Sender: news@uchinews.uchicago.edu (News System)
Reply-To: afrancis@midway.uchicago.edu
Organization: University of Chicago
References: <1994Apr19.111807.8191@leeds.ac.uk> <jimn8CoIv7D.JJ@netcom.com> <berry.debruin.65.2DBBB2BC@mi.rulimburg.nl>
Date: Wed, 27 Apr 1994 17:22:57 GMT
Lines: 41


I also did not see the program, and am not a specialist in voice
recognition, but I am a phonetics student working with speech perception
and speech recognition.

I think that the sampling-rate of the DAT (as well as the overall 
frequency response range of the speakers used for playback) is 
crucial here -- DATs only play back frequencies up to 1/2 the sampling
rate, and it is quite likely that the speaker produced measurable
sound at frequencies around 10 kHz, which would require a sampling rate
above 20 kHz to record accurately.  It would be a simple matter for the
VR system to check for relatively high-frequency sound in the incoming
signal and reject any signal whose spectrum cuts off abruptly above some
frequency boundary.  This would, however, not be as effective against
an analog recording!  I suspect the real cue for the VR system was simply
the different frequency distribution introduced by the shape and size
of the sound source used by the DAT (as opposed to that of the 
human).  I expect that any VR system that is versatile enough to 
cope with a gobstopper makes use of very little segmental information 
(identifying individual consonants/vowels) and depends much more 
heavily on prosodic information (fundamental frequency, timing/duration
of syllables and breath groups, etc.).  It is clear that human beings
make use of both of these types of cues, as well as linguistic
information (e.g. "she must have said 'sing', not 'thing', because
the word in question is in the right place for a verb") to identify
speech, and probably voices as well.
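
For what it's worth, the high-frequency check I mentioned above could be
sketched roughly as follows.  This is only a toy illustration, not a claim
about what the actual VR system does: the function names, the 10 kHz
boundary, the 48 kHz analysis rate, the energy-ratio threshold and the two
test signals are all my own assumptions.  It compares the energy above the
boundary with the energy below it and flags the signal if almost nothing
is left up top.

```python
import cmath
import math


def tone(freq_hz, amplitude, n_samples, sample_rate):
    """Return n_samples of a sine wave at freq_hz (toy test signal)."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n_samples)]


def band_energy(samples, sample_rate, lo_hz, hi_hz):
    """Sum of squared DFT magnitudes over bins with lo_hz <= f < hi_hz."""
    n = len(samples)
    total = 0.0
    for k in range(n // 2 + 1):  # bins from 0 Hz up to half the sampling rate
        if lo_hz <= k * sample_rate / n < hi_hz:
            # Naive DFT coefficient for bin k (slow, but fine for short frames).
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(samples))
            total += abs(coeff) ** 2
    return total


def looks_band_limited(samples, sample_rate, cutoff_hz, threshold=1e-3):
    """True if almost no energy lies above cutoff_hz (hypothetical threshold)."""
    low = band_energy(samples, sample_rate, 0.0, cutoff_hz)
    high = band_energy(samples, sample_rate, cutoff_hz, sample_rate / 2 + 1)
    return low > 0.0 and high / low < threshold


RATE = 48000   # assumed analysis sampling rate, in Hz
N = 480        # frame length, giving 100 Hz per DFT bin

# "Live" stand-in: a strong 1 kHz component plus a weaker one at 12 kHz.
live = [a + b for a, b in zip(tone(1000, 1.0, N, RATE),
                              tone(12000, 0.2, N, RATE))]
# "Replayed" stand-in: the same signal with everything above 10 kHz missing.
replayed = tone(1000, 1.0, N, RATE)

print(looks_band_limited(live, RATE, 10000))      # False
print(looks_band_limited(replayed, RATE, 10000))  # True
```

A real check would have to look at actual speech frames rather than pure
tones, and, as I said, it would do little against an analog recording that
has no sharp cutoff.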

It would be interesting to find out if the VR system still recognized
the speaker after he/she smoked a few cigarettes, or had a bad cold,
or was interrupted in the middle of the phrase, or had to talk over
white noise, or another speaker.

Of course, the VR system might also just check for the noise of a motor 
driving the tape player.... 

-alex
-- 
     afrancis@midway.uchicago.edu     alex francis     (312)-752-2340

"Arguably, linguistics is ... the most hotly contested property in the realm...
 It is soaked with the blood of poets, theologians, philosophers, philologists,
