Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!doc.ic.ac.uk!aixssc.uk.ibm.com!watnews.watson.ibm.com!newsgate.watson.ibm.com!news.ans.net!howland.reston.ans.net!wupost!gumby!newsxfer.itd.umich.edu!nntp.cs.ubc.ca!alberta!quartz.ucs.ualberta.ca!acs.ucalgary.ca!cpsc.ucalgary.ca!hill
From: hill@cpsc.ucalgary.ca (David Hill)
Subject: Re: Bandwidth, Telephones, and Speech Recognition
Message-ID: <CMrqJt.Du2@cpsc.ucalgary.ca>
Sender: news@cpsc.ucalgary.ca (News Manager)
Organization: University of Calgary Computer Science
References: <2ln6i1$cn2@risky.ecs.umass.edu> <1994Mar11.040356.8720@Princeton.EDU>
Date: Wed, 16 Mar 1994 17:43:04 GMT
Lines: 57

In article <1994Mar11.040356.8720@Princeton.EDU> Devin Hosea <0428870@phoenix.princeton.edu>  writes:
>Does anyone know the bandwidth of audio transmission over typical phone
>lines?  I heard that it can be as little as 3,000hz.  If so, how does this
>affect the prospects for good speech recognition over the phone?

Standard bandwidth of the phone system is 300 to 3400 Hz.  This covers the
first three formants (vocal resonances) pretty well.  It is generally agreed
that the higher formants contribute little to speech intelligibility.  However,
you lose most of the information about the spectrum of fricatives.

Since related but distinct fricatives have rather similar formant transitions,
it will be pretty difficult to distinguish them (e.g. "f" "s" "th" or "v"
"z" voiced-"th").  Distinctions between related sounds that depend partly on
the presence versus absence of frication will also be more difficult (e.g. "b"
and "v").

JCR Licklider looked at the intelligibility of speech in relation to the bands
transmitted (The manner in which and the extent to which speech can be
distorted and remain intelligible: JCR Licklider, Ann. Telecomm. 106801
July-August 1958).  He quotes the bands contributing equally to intelligibility
as: 1. up to 160 Hz; 2. 160-400 Hz; 3. 400-670 Hz; 4. 670-1000 Hz;
5. 1000-1420 Hz; 6. 1420-1900 Hz; 7. 1900-2450 Hz; 8. 2450-3100 Hz;
9. 3100-4000 Hz; 10. 4000-5100 Hz; 11. 5100-(approximately)9000 Hz.

A lot of this early work was done in the context of the intelligibility of
speech transmission systems.  You can see that the phone throws away roughly
3/11ths of the speech intelligibility, but is still usable by people because
spoken language has enough redundancy.  However, don't try staking your life
on someone at the other end of a telephone recognising the difference between
"thin", "sin" and "fin", even if the recogniser is human.
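
As a rough check on that fraction, the band list above can be compared with
the 300 to 3400 Hz telephone passband.  This is only a sketch: taking the
lower edge of band 1 as 0 Hz, taking the upper edge of band 11 as exactly
9000 Hz, and crediting partly covered bands in proportion to their overlap
in Hz are all assumptions made here for illustration, not part of
Licklider's data.

```python
# Licklider's equal-intelligibility band edges as quoted above (Hz).
# Assumptions: band 1 starts at 0 Hz; band 11 ends at exactly 9000 Hz.
BAND_EDGES_HZ = [0, 160, 400, 670, 1000, 1420, 1900,
                 2450, 3100, 4000, 5100, 9000]

PHONE_LOW_HZ = 300
PHONE_HIGH_HZ = 3400


def bands_wholly_lost(low_hz, high_hz):
    """Count bands lying entirely outside the passband."""
    lost = 0
    for lo, hi in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:]):
        if hi <= low_hz or lo >= high_hz:
            lost += 1
    return lost


def bands_retained(low_hz, high_hz):
    """Estimate retained bands, crediting partial bands by their
    linear overlap in Hz (a simplifying assumption)."""
    total = 0.0
    for lo, hi in zip(BAND_EDGES_HZ, BAND_EDGES_HZ[1:]):
        overlap = max(0.0, min(hi, high_hz) - max(lo, low_hz))
        total += overlap / (hi - lo)
    return total


n_bands = len(BAND_EDGES_HZ) - 1
wholly_lost = bands_wholly_lost(PHONE_LOW_HZ, PHONE_HIGH_HZ)
retained = bands_retained(PHONE_LOW_HZ, PHONE_HIGH_HZ)

print(wholly_lost, "of", n_bands, "bands wholly lost")
print(round(retained, 2), "of", n_bands, "bands retained (linear estimate)")
```

Under these assumptions three bands (1, 10 and 11) fall wholly outside the
passband, and bands 2 and 9 are only partly covered, so the linear estimate
retains about 6.75 of the 11 bands.  The "roughly 3/11ths" figure is
therefore, if anything, on the low side.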

Since machine recognition/understanding is much more fragile than human
recognition (partly because it has a less adequate basis for exploiting the
redundancy, and partly because it is less acoustically sophisticated),
telephone speech presents a much harder problem for machines than high
fidelity speech.

In information theory terms, the raw information (i.e. neglecting any
redundancy) in telephone speech is of the order of 30,000 bits/second
while hi-fi speech requires more like 120,000 bits per second.
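
Figures of this order can be reproduced with the Shannon-Hartley channel
capacity formula, C = B * log2(1 + S/N).  The sketch below is only one
plausible reading: the 30 dB signal-to-noise ratio and the 12 kHz hi-fi
bandwidth are assumed values chosen for illustration, not measurements and
not necessarily the basis of the numbers quoted above.

```python
import math


def shannon_capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley capacity in bits/second for a given bandwidth
    and signal-to-noise ratio expressed in dB."""
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)


ASSUMED_SNR_DB = 30.0                # assumption, same for both channels
TELEPHONE_BANDWIDTH_HZ = 3400 - 300  # passband quoted earlier in the post
ASSUMED_HIFI_BANDWIDTH_HZ = 12000    # assumption for "hi-fi" speech

telephone_bps = shannon_capacity_bps(TELEPHONE_BANDWIDTH_HZ, ASSUMED_SNR_DB)
hifi_bps = shannon_capacity_bps(ASSUMED_HIFI_BANDWIDTH_HZ, ASSUMED_SNR_DB)

print("telephone: about", round(telephone_bps), "bits/second")
print("hi-fi:     about", round(hifi_bps), "bits/second")
```

With these assumed values the telephone channel comes out near 31,000
bits/second and the hi-fi channel near 120,000 bits/second, which is the
same order as the figures above.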

Telephone companies employ permanent teams of test subjects to listen to their
telephone systems.  Nonsense syllables in phonetically balanced lists of words
(i.e. reflecting normal language statistics of sounds) provide one basis for
assessing the degree of impairment of intelligibility produced by the complete
system (handset, lines, etc.).  The bandwidth limitation is imposed
by line filters and, even now that we are going digital, serves to reduce
the information bandwidth needed to transmit voice.  However, I am not up with
the latest developments in this area.  In principle, there's no reason, other
than cost, why we can't talk to each other in hi-fi speech.  That would make
machine recognition of telephone speech just that bit easier!

david

-- 
david hill: hill@cpsc.ucalgary.ca	|	Imagination is more
voice: 403-282-6481, fax: 403-282-6778	|	important than knowledge.
nextmail: hill@trillium.ab.ca		|		(Albert Einstein)
