Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!doc.ic.ac.uk!agate!usenet.ins.cwru.edu!magnus.acs.ohio-state.edu!zaphod.mps.ohio-state.edu!cs.utexas.edu!uunet!munnari.oz.au!metro!parlo!andrewh
From: andrewh@ee.su.OZ.AU (Andrew Hunt)
Subject: comp.speech FAQ - 2nd draft
Message-ID: <1992Nov26.233939.17328@ucc.su.OZ.AU>
Sender: news@ucc.su.OZ.AU
Nntp-Posting-Host: parlo.ee.su.oz.au
Reply-To: andrewh@ee.su.OZ.AU (Andrew Hunt)
Organization: University of Sydney, Australia
Date: Thu, 26 Nov 1992 23:39:39 GMT
Lines: 1184


                       comp.speech

                Frequently Asked Questions
                ==========================

  (FAQ = "Frequently Asked Questions")

  Compiled: 27-Nov-92

This document is an attempt to answer commonly asked questions and to
reduce the bandwidth taken up by these posts and their associated replies.
If you have a question, please check this file before you post.

The FAQ is not meant to discuss any topic exhaustively.  It will hopefully
provide readers with pointers on where to find useful information.  It also
tries to list as much as possible of the useful information and material
available elsewhere on the Internet.


This document is available for anonymous ftp from the comp.speech archive site
	svr-ftp.eng.cam.ac.uk:/comp.speech/FAQ


If you have not already read the Usenet introductory material posted to 
"news.announce.newusers", please do.  For help with FTP (file transfer
protocol) look for a regular posting of "Anonymous FTP List - FAQ" in
comp.misc, comp.archives.admin and news.answers amongst others.


The document is an EARLY DRAFT.  There are many unanswered questions and 
some answers are not comprehensive.  I hope that it will improve over the 
next few months.  If you have any comments, suggestions for inclusions, or 
answers then please post or email.

Thanks to all the people who mailed comments and suggestions since the
first posting of the FAQ.  There is enough new information to justify a
second posting.  The diffs are nearly as long as the original so I am
not posting a separate diff file.


Andrew Hunt
Speech Technology Research Group	email: andrewh@ee.su.oz.au
Department of Electrical Engineering	Ph:  61-2-692 4509
University of Sydney, NSW, Australia.	Fax: 61-2-692 3847


========================== Acknowledgements ===========================

Thanks to the following for their significant comments and contributions.

Barry Arons		<barons@media-lab.mit.edu>
Joe Campbell		<jpcampb@afterlife.ncsc.mil>
Oliver Jakobs		<jakobs@ldv01.Uni-Trier.de>
Tony Robinson		<ajr@eng.cam.ac.uk>
Mike S[?]		<mike%jim.uucp@wupost.wustl.edu>

Many others have provided useful information.  Thanks to all.


============================ Contents =================================

PART 1 - General
 
Q1.1: What is comp.speech?
Q1.2: Where are the comp.speech archives?
Q1.3: Common abbreviations and jargon.
Q1.4: What are related newsgroups and mailing lists?
Q1.5: What are related journals and conferences?
Q1.6: What speech databases are available?
Q1.7: Speech File Formats, Conversion and Playing.
Q1.8: What "Speech Laboratory Environments" are available?
 
PART 2 - Signal Processing for Speech
 
Q2.1: What speech sampling and signal processing hardware can I use?
Q2.2: What signal processing techniques are used for speech technology?
Q2.3: How do I convert to/from mu-law format?
 
PART 3 - Speech Coding and Compression
 
Q3.1: Speech compression techniques.
Q3.2: What are some good references/books on coding/compression?
Q3.3: What software is available?
 
PART 4 - Speech Synthesis
 
Q4.1: What is speech synthesis?
Q4.2: How can speech synthesis be performed?
Q4.3: What are some good references/books on synthesis?
Q4.4: What software/hardware is available?
 
PART 5 - Speech Recognition
 
Q5.1: What is speech recognition?
Q5.2: How can I build a very simple speech recogniser?
Q5.3: What does speaker dependent/adaptive/independent mean?
Q5.4: What does small/medium/large/very-large vocabulary mean?
Q5.5: What does continuous speech or isolated-word mean?
Q5.6: How is speech recognition done?
Q5.7: What are some good references/books on recognition?
Q5.8: What packages are available?
 
PART 6 - Speaker Recognition/Verification
 
Q6.1: What is speaker recognition/verification?
Q6.2: Where is speaker recognition used?
Q6.3: What are techniques for speaker recognition?
Q6.4: How good is speaker recognition?
Q6.5: What are some good references/books on speaker recognition?
Q6.6: What packages are available?
 
PART 7 - Natural Language Processing
 
Q7.1: What is NLP?
Q7.2: What are some good references/books on NLP?
Q7.3: What software is available?
 
=======================================================================

PART 1 - General

Q1.1: What is comp.speech?

comp.speech is a newsgroup for discussion of speech technology and 
speech science.  It covers a wide range of issues from application of 
speech technology, to research, to products and lots more.

By nature speech technology is an inter-disciplinary field and the 
newsgroup reflects this.  However, computer application should be the 
basic theme of the group.

The following is a list of topics but does not cover all matters related 
to the field - no order of importance is implied.

[1] Speech Recognition - discussion of methodologies, training, techniques, 
results and applications.  This should cover the application of techniques 
including HMMs, neural-nets and so on to the field.

[2] Speech Synthesis - discussion concerning theoretical and practical
issues associated with the design of speech synthesis systems.

[3] Speech Coding and Compression - both research and application matters.

[4] Phonetic/Linguistic Issues - coverage of linguistic and phonetic issues 
which are relevant to speech technology applications.  Could cover parsing, 
natural language processing, phonology and prosodic work.

[5] Speech System Design - issues relating to the application of speech
technology to real-world problems.  Includes the design of user interfaces, 
the building of real-time systems and so on.

[6] Other matters - relevant conferences, books, public domain software, 
hardware and related products.

------------------------------------------------------------------------

Q1.2: Where are the comp.speech archives?

comp.speech is being archived for anonymous ftp.

	ftp site:	svr-ftp.eng.cam.ac.uk (or 129.169.24.20).  
	directory:	comp.speech/archive

comp.speech/archive contains the articles as they arrive.  Batches of 100
articles are grouped into a shar file, along with an associated file of
Subject lines.

Other useful information is also available in comp.speech/info.

------------------------------------------------------------------------

Q1.3: Common abbreviations and jargon.

ANN   - Artificial Neural Network.
ASR   - Automatic Speech Recognition.
ASSP  - Acoustics, Speech and Signal Processing
AVIOS - American Voice I/O Society
CELP  - Code excited linear prediction.
COLING - Computational Linguistics
DTW   - Dynamic time warping.
FAQ   - Frequently asked questions.
HMM   - Hidden Markov model.
IEEE  - Institute of Electrical and Electronics Engineers
JASA  - Journal of the Acoustical Society of America
LPC   - Linear predictive coding.
LVQ   - Learning vector quantisation.
NLP   - Natural Language Processing.
NN    - Neural Network.
TTS   - Text-To-Speech (i.e. synthesis).
VQ    - Vector Quantisation.

------------------------------------------------------------------------

Q1.4: What are related newsgroups and mailing lists?


NEWSGROUPS

comp.ai - Artificial Intelligence newsgroup.  
     Postings on general AI issues, language processing and AI techniques.
     Has a good FAQ including NLP, NN and other AI information.

comp.ai.nlang-know-rep - Natural Language Knowledge Representation
     Moderated group covering Natural Language.

comp.ai.neural-nets - discussion of Neural Networks and related issues.  
     There are often postings on speech-related matters - phonetic recognition,
     connectionist grammars and so on.

comp.compression - occasional articles on compression of speech.
     FAQ for comp.compression has some info on audio compression standards.

comp.dcom.telecom - Telecommunications newsgroup.
     Has occasional articles on voice products.

comp.dsp - discussion of signal processing - hardware and algorithms and more.
     Has a good FAQ posting.
     Has a regular posting of a comprehensive list of Audio File Formats.

comp.multimedia - Multi-Media discussion group.
     Has occasional articles on voice I/O.

sci.lang - Language.  
     Discussion about phonetics, phonology, grammar, etymology and lots more.

alt.sci.physics.acoustics - some discussion of speech production & perception.

alt.binaries.sounds.misc - posting of various sound samples
alt.binaries.sounds.d - discussion about sound samples, recording and playback.


MAILING LISTS

ECTL - Electronic Communal Temporal Lobe
     Founder & Moderator: David Leip
     Moderated mailing list for researchers with interests in computer speech 
     interfaces. This list serves a broad community including persons from 
     signal processing, AI, linguistics and human factors.
     
     To subscribe, send the following information to: 
        ectl-request@snowhite.cis.uoguelph.ca
        name, institute, department, daytime phone & e-mail address

     To access the archive, ftp snowhite.cis.uoguelph.ca, login as anonymous,
     and supply your local userid as a password.  All the ECTL things can be
     found in pub/ectl.

Prosody Mailing List
	Unmoderated mailing list for discussion of prosody.  The aim is
	to facilitate the spread of information relating to the research
	of prosody by creating a network of researchers in the field.
	If you want to participate, send the following one-line
	message to "listserv@purccvm.bitnet" :-

		subscribe prosody Your Name

Digital Mobile Radio
     Covers lots of areas, including some speech topics such as speech
     coding and speech compression.
     Mail Peter Decker (dec@dfv.rwth-aachen.de) to subscribe.

------------------------------------------------------------------------

Q1.5: What are related journals and conferences?

Try the following commercially oriented magazines...

	Speech Technology

Try the following technical journals...

	IEEE Speech Processing (from Jan 93)
	Computational Linguistics (COLING)
	Computer Speech and Language
	Journal of the Acoustical Society of America (JASA)
	Transactions of IEEE ASSP
	AVIOS Journal

Try the following conferences...

 ICASSP	    Intl. Conference on Acoustics Speech and Signal Processing (IEEE)
 ICSLP	    Intl. Conference on Spoken Language Processing
 EUROSPEECH European Conference on Speech Communication and Technology
 AVIOS      American Voice I/O Society Conference
 SST        Australian Speech Science and Technology Conference

------------------------------------------------------------------------

Q1.6: What speech databases are available?

A wide range of speech databases have been collected.  These databases 
are primarily for the development of speech synthesis/recognition and for 
linguistic research.  Unfortunately, almost all the information listed
here refers to the English language.  

Some databases are free but most appear to be available for a small cost.
The databases normally require lots of storage space - do not expect to be 
able to ftp all the data you want.

[There are too many to list here in detail - perhaps someone would like to 
 set up a special posting on speech databases?]


LINGUISTIC DATA CONSORTIUM (LDC)

Information about the Linguistic Data Consortium is available via
anonymous ftp from:	ftp.cis.upenn.edu	(130.91.6.8)
in the directory:	/pub/ldc

Here are some excerpts from the files in that directory:

Briefly stated, the LDC has been established to broaden the collection
and distribution of speech and natural language data bases for the
purposes of research and technology development in automatic speech
recognition, natural language processing and other areas where large
amounts of linguistic data are needed.

Here is the brief list of corpora:

   * The TIMIT and NTIMIT speech corpora
   * The Resource Management speech corpus (RM1, RM2)
   * The Air Travel Information System (ATIS0) speech corpus
   * The Association for Computational Linguistics - Data Collection 
     Initiative text corpus (ACL-DCI)
   * The TI Connected Digits speech corpus (TIDIGITS)
   * The TI 46-word Isolated Word speech corpus (TI-46)
   * The Road Rally conversational speech corpora (including "Stonehenge" 
     and "Waterloo" corpora)
   * The Tipster Information Retrieval Test Collection
   * The Switchboard speech corpus ("Credit Card" excerpts and portions
     of the complete Switchboard collection)

Further resources to be made available within the first year (or two):

   * The Machine-Readable Spoken English speech corpus (MARSEC)
   * The Edinburgh Map Task speech corpus
   * The Message Understanding Conference (MUC) text corpus of FBI 
     terrorist reports
   * The Continuous Speech Recognition - Wall Street Journal speech 
     corpus (WSJ-CSR)
   * The Penn Treebank parsed/tagged text corpus
   * The Multi-site ATIS speech corpus (ATIS2)
   * The Air Traffic Control (ATC) speech corpus
   * The Hansard English/French parallel text corpus
   * The European Corpus Initiative multi-language text corpus (ECI) 
   * The Int'l Labor Organization/Int'l Trade Union multi-language 
     text corpus (ILO/ITU)
   * Machine-readable dictionaries/lexical data bases (COMLEX, CELEX)

The files in the directory include more detailed information on the 
individual databases.  For further information contact

	Elizabeth Hodas
	441 Williams Hall
	University of Pennsylvania
	Philadelphia, PA 19104-6305
	Phone:   (215) 898-0464
	Fax:     (215) 573-2175
	e-mail:  ehodas@walnut.ling.upenn.edu


Center for Spoken Language Understanding (CSLU)

1. The ISOLET speech database of spoken letters of the English alphabet. 
The speech is high quality (16 kHz with a noise cancelling microphone).  
150 speakers x 26 letters of the English alphabet twice in random order.  
The "ISOLET" data base can be purchased for $100 by sending an email request 
to vincew@cse.ogi.edu.  (This covers handling, shipping and medium costs).  
The data base comes with a technical report describing the data.

2. CSLU has a telephone speech corpus of 1000 English alphabets.  Callers 
recite the alphabet with brief pauses between letters.  This database is 
available to not-for-profit institutions for $100. The data base is described 
in the proceedings of the International Conference on Spoken Language 
Processing.  Contact vincew@cse.ogi.edu if interested.

------------------------------------------------------------------------

Q1.7: Speech File Formats, Conversion and Playing.

Section 2 of this FAQ has information on mu-law coding.

A very good and very comprehensive list of audio file formats is prepared
by Guido van Rossum.  The list is posted regularly to comp.dsp and
alt.binaries.sounds.misc, amongst others.  It includes information on 
sampling rates, hardware, compression techniques, file format definitions, 
format conversion, standards, programming hints and lots more.  It is much
too long to include within this posting.

It is also available by ftp 
	from: 		ftp.cwi.nl
	directory:	/pub 
	file:	 	AudioFormats<version>

------------------------------------------------------------------------

Q1.8: What "Speech Laboratory Environments" are available?

First, what is a Speech Laboratory Environment?  A speech lab is a
software package which provides the capability of recording, playing,
analysing, processing, displaying and storing speech.  Your computer
will require audio input/output capability.  The different packages
vary greatly in features and capability - best to know what you want
before you start looking around.

Most general purpose audio processing packages will be able to process speech
but do not necessarily have the specialised capabilities needed for speech
work (e.g. formant analysis).

Apparently, the following article provides a reasonable survey.

  Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An Evaluation"
  Journal of Speech and Hearing Research, pp 314-332, April 1992.


Package: Entropic Signal Processing System (ESPS) and Waves
Description: ESPS is a very comprehensive set of speech analysis/processing 
	tools for the UNIX environment.  The package includes UNIX commands, 
	and a comprehensive C library (which can be accessed from other 
	languages).  Waves is a graphical front-end for speech processing.  
	Speech waveforms, spectrograms, pitch traces etc can be displayed, 
	edited and processed in X windows (and Sunview ?).
Cost: ?
Contact: Entropic Research Laboratory, Washington Research Laboratory,
	600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
	(202) 547-1420.  email - ?


Can anyone provide information on capability and availability of the
following packages?

	VIEW (for UNIX?)
	Khoros,
	Ptolemy,
	ILS ("Interactive Laboratory System")
	CLS
	Signalyse (for Mac)
	MacSpeech Lab (for Mac)
	Audlab
	SpeechViewer (PC)



=======================================================================

PART 2 - Signal Processing for Speech

Q2.1: What speech sampling and signal processing hardware can I use?

In addition to the following information, have a look at the Audio File
format document prepared by Guido van Rossum (referred to above).


Product: Sun standard audio port (SPARC 1 & 2)
Input:  1 channel, 8 bit mu-law encoded (telephone quality)
Output: 1 channel, 8 bit mu-law encoded (telephone quality)

Product:  Ariel
Platform: Sun + others?
Input:  2 channels, 16bit linear, sample rate 8-96kHz (inc 32, 44.1, 48kHz).
Output: 2 channels, 16bit linear, sample rate 8-50kHz (inc 32, 44.1, 48kHz).
Contact:

Can anyone provide information on Soundblaster, Mac, NeXT and other hardware?

[Help is needed to source more info.  How about the following format?]

Product:  xxx
Platform: PC, Mac, Sun, ...
Rough Cost (pref $US):
Input: e.g. 16bit linear, 8,10,16,32kHz.
Output: e.g. 16bit linear, 8,10,16,32kHz.
DSP: signal processing support
Other:
Contact:

------------------------------------------------------------------------

Q2.2: What signal processing techniques are used for speech technology?

This question is far too big to be answered in a FAQ posting.  Fortunately
there are many good books which answer the question!

Some good introductory books include

   Digital processing of speech signals; L. R. Rabiner, R. W. Schafer.
   Englewood Cliffs; London: Prentice-Hall, 1978

   Voice and Speech Processing; T. W. Parsons.
   New York; McGraw Hill 19??

   Computer Speech Processing; ed Frank Fallside, William A. Woods
   Englewood Cliffs: Prentice-Hall, c1985

   Digital speech processing : speech coding, synthesis, and recognition
   edited by A. Nejat Ince; Kluwer Academic Publishers, Boston, c1992

   Speech science and technology; edited by Shuzo Saito
   pub. Ohmsha, Tokyo, c1992

   Speech analysis; edited by Ronald W. Schafer, John D. Markel
   New York, IEEE Press, c1979

   Douglas O'Shaughnessy -- Speech Communication: Human and Machine
   Addison Wesley series in Electrical Engineering: Digital Signal Processing,
   1987.

------------------------------------------------------------------------

Q2.3: How do I convert to/from mu-law format?

Mu-law coding is a form of compression for audio signals including speech.
It is widely used in the telecommunications field because it improves the
signal-to-noise ratio without increasing the amount of data.  Typically,
mu-law compressed speech is carried in 8-bit samples.  It is a companding
technique, which means it carries more information about the smaller signals
than about the larger signals.  Mu-law coding is provided as standard for the
audio input and output of the SUN Sparc stations 1&2 (Sparc 10's are linear).


On SUN Sparc systems have a look in the directory /usr/demo/SOUND.  Included
are table lookup macros for ulaw conversions.  [Note however that not all
systems will have /usr/demo/SOUND installed as it is optional - see your
system admin if it is missing.]


OR, here is some sample conversion code in C.

# include <stdio.h>

unsigned char linear2ulaw(/* int */);
int ulaw2linear(/* unsigned char */);

/*
** This routine converts from linear to ulaw.
**
** Craig Reese: IDA/Supercomputing Research Center
** Joe Campbell: Department of Defense
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) "A New Digital Technique for Implementation of Any
**     Continuous PCM Companding Law," Villeret, Michel,
**     et al. 1973 IEEE Int. Conf. on Communications, Vol 1,
**     1973, pg. 11.12-11.17
** 3) MIL-STD-188-113, "Interoperability and Performance Standards
**     for Analog-to-Digital Conversion Techniques,"
**     17 February 1987
**
** Input: Signed 16 bit linear sample
** Output: 8 bit ulaw sample
*/

#define ZEROTRAP    /* turn on the trap as per the MIL-STD */
#undef ZEROTRAP
#define BIAS 0x84   /* define the add-in bias for 16 bit samples */
#define CLIP 32635

unsigned char linear2ulaw(sample) int sample; {
  static int exp_lut[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,
                             4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
                             5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
                             5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
                             6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                             6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                             6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                             6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
                             7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};
  int sign, exponent, mantissa;
  unsigned char ulawbyte;

  /* Get the sample into sign-magnitude. */
  sign = (sample >> 8) & 0x80;		/* set aside the sign */
  if(sign != 0) sample = -sample;		/* get magnitude */
  if(sample > CLIP) sample = CLIP;		/* clip the magnitude */

  /* Convert from 16 bit linear to ulaw. */
  sample = sample + BIAS;
  exponent = exp_lut[( sample >> 7 ) & 0xFF];
  mantissa = (sample >> (exponent + 3)) & 0x0F;
  ulawbyte = ~(sign | (exponent << 4) | mantissa);
#ifdef ZEROTRAP
  if (ulawbyte == 0) ulawbyte = 0x02;	/* optional CCITT trap */
#endif

  return(ulawbyte);
}

/*
** This routine converts from ulaw to 16 bit linear.
**
** Craig Reese: IDA/Supercomputing Research Center
** 29 September 1989
**
** References:
** 1) CCITT Recommendation G.711  (very difficult to follow)
** 2) MIL-STD-188-113, "Interoperability and Performance Standards
**     for Analog-to-Digital Conversion Techniques,"
**     17 February 1987
**
** Input: 8 bit ulaw sample
** Output: signed 16 bit linear sample
*/

int ulaw2linear(ulawbyte) unsigned char ulawbyte; {
  static int exp_lut[8] = { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
  int sign, exponent, mantissa, sample;

  ulawbyte = ~ulawbyte;
  sign = (ulawbyte & 0x80);
  exponent = (ulawbyte >> 4) & 0x07;
  mantissa = ulawbyte & 0x0F;
  sample = exp_lut[exponent] + (mantissa << (exponent + 3));
  if(sign != 0) sample = -sample;

  return(sample);
}
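
A more compact formulation of the same encoder, intended to produce the same
bytes as the table-driven routines above, finds the exponent with a loop
instead of the 256-entry lookup table.  The function names here are
illustrative, not part of any library:

```c
/* Compact mu-law encoder/decoder pair, intended to match the table-driven
 * linear2ulaw()/ulaw2linear() above.  Slower, but the segment search is
 * easier to see. */

#define MU_BIAS 0x84            /* add-in bias for 16-bit samples */
#define MU_CLIP 32635

static unsigned char mulaw_encode(int sample)
{
    int sign, exponent, mantissa;

    sign = (sample >> 8) & 0x80;             /* set aside the sign  */
    if (sign != 0) sample = -sample;         /* take the magnitude  */
    if (sample > MU_CLIP) sample = MU_CLIP;  /* clip the magnitude  */
    sample += MU_BIAS;

    /* Find the highest set bit among bits 8..14 of the biased sample. */
    for (exponent = 7; exponent > 0; exponent--)
        if (sample & (1 << (exponent + 7)))
            break;

    mantissa = (sample >> (exponent + 3)) & 0x0F;
    return (unsigned char) ~(sign | (exponent << 4) | mantissa);
}

static int mulaw_decode(unsigned char ulawbyte)
{
    static const int exp_lut[8] =
        { 0, 132, 396, 924, 1980, 4092, 8316, 16764 };
    int sign, exponent, mantissa, sample;

    ulawbyte = (unsigned char) ~ulawbyte;
    sign     = ulawbyte & 0x80;
    exponent = (ulawbyte >> 4) & 0x07;
    mantissa = ulawbyte & 0x0F;
    sample   = exp_lut[exponent] + (mantissa << (exponent + 3));

    return sign ? -sample : sample;
}
```

Round-tripping a 16-bit sample through these routines reproduces it only to
within the local quantisation step - mu-law is a lossy 16-to-8 bit mapping.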



=======================================================================

PART 3 - Speech Coding and Compression

Q3.1: Speech compression techniques.

Can anyone provide a 1-2 page summary on speech compression?  Topics to
cover might include common techniques, where speech compression might be 
used and perhaps something on why speech is difficult to compress.

  [The FAQ for comp.compression includes a few questions and answers
   on the compression of speech.]

------------------------------------------------------------------------

Q3.2: What are some good references/books on coding/compression?

 Douglas O'Shaughnessy -- Speech Communication: Human and Machine
 Addison Wesley series in Electrical Engineering: Digital Signal Processing,
 1987.

------------------------------------------------------------------------

Q3.3: What software is available?

Note: there are two types of speech compression technique referred to below. 
Lossless techniques preserve the speech exactly through a
compression-decompression cycle.  Lossy techniques do not preserve the speech
perfectly.  As a general rule, the more you compress speech, the more the
quality degrades.


Package:     shorten - a lossless compressor for speech signals
Platform:    UNIX
Description: A lossless compressor for speech signals.  It will compile and 
             run on UNIX workstations and will cope with a wide variety of
             formats.  Compression is typically 50% for 16bit clean speech 
             sampled at 16kHz.
Availability: Anonymous ftp svr-ftp.eng.cam.ac.uk: /misc/shorten-0.4.shar

Package:     CELP 3.2 (U.S. Fed-Std-1016 compatible coder)
Platform:    Sun (the makefiles & source can be modified for other platforms)
Description: CELP is a lossy compression technique.
	     The U.S. DoD's Federal-Standard-1016 based 4800 bps code excited
             linear prediction voice coder version 3.2 (CELP 3.2) Fortran and
             C simulation source codes.
Contact:     Joe Campbell <jpcampb@afterlife.ncsc.mil>
Availability: Anonymous ftp to furmint.nectar.cs.cmu.edu (128.2.209.111):
             celp.audio.compression (C src in celp.audio.compression/celp32c).
             Thanks to Vince Cate <vac+@cs.cmu.edu> for providing this site :-)
             The CELP release package is also available, at no charge,
             on DOS disks from:
                Bob Fenichel
                National Communications System, Washington, D.C. 20305, USA
                Ph: 1-703-692-2124    Fax: 1-703-746-4960
             The following documents are vital to successful real-time
             implementations and they are also available from Bob Fenichel
             (they're unavailable electronically):
             "Details to Assist in Implementation of Federal Standard 1016
                CELP," National Communications System, Office of Technology &
                Standards, 1992. Technical Information Bulletin 92-1.
             "Telecommunications: Analog-to-Digital Conversion of Radio
                Voice by 4,800 bit/second Code Excited Linear Prediction
                (CELP)," National Communications System, Office of
                Technology & Standards, 1991. Federal Standard 1016.
 
Package:     32 kbps ADPCM
Platform:    SGI and Sun Sparcs
Description: 32 kbps ADPCM C-source code (G.721 compatibility is uncertain)
Contact:     Jack Jansen
Availability: Anonymous ftp to ftp.cwi.nl: pub/adpcm.shar
 
Package:     GSM 06.10
Platform:    Runs faster than real time on most Sun SPARCstations
Description: GSM 06.10 is a lossy compression technique.
	     European GSM 06.10 provisional standard for full-rate speech
             transcoding, prI-ETS 300 036, which uses RPE/LTP (residual
             pulse excitation/long term prediction) coding at 13 kbit/s.
Contact:     Carsten Bormann <cabo@cs.tu-berlin.de>
Availability: An implementation can be ftp'ed from:
                tub.cs.tu-berlin.de: /pub/tubmik/gsm-1.0.tar.Z
                                    +/pub/tubmik/gsm-1.0-patch1
                or as a faster but not always up-to-date alternative:
                       liasun3.epfl.ch: /pub/audio/gsm-1.0pl1.tar.Z
 

=======================================================================

PART 4 - Speech Synthesis

Q4.1: What is speech synthesis?

Speech synthesis is the task of transforming written input to spoken output.
The input can either be provided in a graphemic/orthographic or a phonemic
script, depending on its source.

------------------------------------------------------------------------

Q4.2: How can speech synthesis be performed?

There are several algorithms.  The choice depends on the task they're used
for.  The easiest way is to just record the voice of a person speaking the
desired phrases.  This is useful if only a restricted volume of phrases and
sentences is used, e.g. messages in a train station, or schedule information
via phone.  The quality depends on the way recording is done.

More sophisticated, but poorer in quality, are algorithms which split the 
speech into smaller pieces.  The smaller those units are, the fewer of them
are needed, but the quality also decreases.  An often used unit is the phoneme,
the smallest linguistic unit.  Depending on the language used there are about
35-50 phonemes in western European languages, i.e. there are 35-50 single
recordings.  The problem is that combining them as fluent speech requires
fluent transitions between the elements.  The intelligibility is therefore
lower, but the memory required is small.

A solution to this dilemma is using diphones. Instead of splitting at the 
transitions, the cut is done at the center of the phonemes, leaving the 
transitions themselves intact. This needs roughly the square of the number
of phonemes (some 1200-2500 elements in principle) and the quality increases.

The longer the units become, the more of them there are, but the quality 
increases along with the memory required. Other units which are widely used
are half-syllables, syllables, words, or combinations of them, e.g. word stems
and inflectional endings.
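
The unit-concatenation idea above can be sketched in a few lines of C.
Everything here (the struct, the function name, the data in the example) is
illustrative, not taken from any particular package:

```c
#include <stddef.h>
#include <string.h>

/* One prerecorded unit (a phrase, word, syllable, phoneme or diphone). */
struct unit {
    const short *samples;   /* 16-bit linear audio            */
    size_t       nsamples;  /* number of samples in this unit */
};

/* Butt the selected units together into 'out' (capacity 'outmax' samples).
 * Returns the number of samples written, or 0 if 'out' is too small.
 * A real synthesiser would also smooth each join and adjust pitch and
 * duration; plain concatenation is the crudest possible scheme. */
size_t concat_units(const struct unit *units, size_t nunits,
                    short *out, size_t outmax)
{
    size_t total = 0, i;

    for (i = 0; i < nunits; i++) {
        if (total + units[i].nsamples > outmax)
            return 0;
        memcpy(out + total, units[i].samples,
               units[i].nsamples * sizeof(short));
        total += units[i].nsamples;
    }
    return total;
}
```

The audible clicks at the joins are exactly the "transition" problem the text
describes, and are why diphones (which keep the transitions inside each unit)
sound better than raw phoneme concatenation.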

------------------------------------------------------------------------

Q4.3: What are some good references/books on synthesis?

The following are good introductory books/articles.

   Douglas O'Shaughnessy -- Speech Communication: Human and Machine
   Addison Wesley series in Electrical Engineering: Digital Signal Processing,
   1987.

   D. H. Klatt, "Review of Text-To-Speech Conversion for English", Journal of
   the Acoustical Society of America (JASA), v82, Sept. 1987, pp 737-793.


MITalk (a forerunner of the commercial DECtalk system) is one of the most
successful speech synthesis systems around.  It is described in:

   John Allen, Sharon Hunnicutt and Dennis H. Klatt, "From Text to Speech: 
   The MITalk System", Cambridge University Press, 1987.

------------------------------------------------------------------------

Q4.4: What software/hardware is available?

There appears to be very little Public Domain or Shareware speech synthesis 
related software available for FTP.  However, the following are available.
Strictly speaking, not all of the following are speech synthesis systems,
though all are speech output systems.


SIMTEL-20
The following is a list of speech related software available from SIMTEL-20 
and its mirror sites for PCs.  

The SIMTEL internet address is WSMR-SIMTEL20.Army.Mil [192.88.110.20].
Try looking at your nearest archive site first.

Directory PD1:<MSDOS.VOICE>
 Filename   Type Length   Date   Description
 ==============================================
 AUTOTALK.ARC  B   23618  881216  Digitized speech for the PC
 CVOICE.ARC    B   21335  891113  Tells time via voice response on PC
 HEARTYPE.ARC  B   10112  880422  Hear what you are typing, crude voice synth.
 HELPME2.ARC   B    8031  871130  Voice cries out 'Help Me!' from PC speaker
 SAY.ARC       B   20224  860330  Computer Speech - using phonemes
 SPEECH98.ZIP  B   41003  910628  Build speech (voice) on PC using 98 phonemes
 TALK.ARC      B    8576  861109  BASIC program to demo talking on a PC speaker
 TRAN.ARC      B   39766  890715  Repeats typed text in digital voice
 VDIGIT.ZIP    B  196284  901223  Toolkit: Add digitized voice to your programs
 VGREET.ARC    B   45281  900117  Voice says good morning/afternoon/evening


Other Sources

Package:     Text to phoneme program
Platform:    unknown
Description: Text to phoneme program.  Possibly based on Naval Research Lab's 
	     set of text to phoneme rules.
Availability: By FTP from "shark.cse.fau.edu" (131.91.80.13) in the directory
		/pub/src/phon.tar.Z
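
As a rough illustration of how a rule-based text-to-phoneme program of this
sort works (the rules below are toy examples chosen for this sketch, NOT the
actual Naval Research Lab rule set):

```python
# Toy sketch of rule-based text-to-phoneme conversion.  A real rule set
# (such as the NRL rules) also uses left/right context conditions; here we
# just do greedy longest-fragment-first matching.

RULES = [            # (spelling fragment, phoneme), longest fragments first
    ("ch", "CH"),
    ("th", "TH"),
    ("ee", "IY"),
    ("a", "AE"),
    ("c", "K"),
    ("e", "EH"),
    ("h", "HH"),
    ("p", "P"),
    ("s", "S"),
    ("t", "T"),
]

def to_phonemes(word):
    """Greedy left-to-right rule matching; unmatched letters pass through."""
    out, i = [], 0
    word = word.lower()
    while i < len(word):
        for frag, ph in RULES:
            if word.startswith(frag, i):
                out.append(ph)
                i += len(frag)
                break
        else:
            out.append(word[i].upper())   # no rule matched this letter
            i += 1
    return out
```

The phoneme string produced this way would then drive a synthesiser
(formant-based or concatenative) to produce the actual audio.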

Package:     "Speak" - a Text to Speech Program
Platform:    Sun SPARC
Description: Text to speech program based on concatenation of pre-recorded
	     speech segments.  A function library can be used to integrate
	     speech output into other code.
Hardware:    SPARC audio I/O
Availability: by FTP from "wilma.cs.brown.edu" as /pub/speak.tar.Z
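
The concatenation approach used by packages like "Speak" can be sketched very
roughly as follows (the segment dictionary and sample values are hypothetical;
a real system also smooths the joins between segments):

```python
# Rough sketch of concatenative speech output: look up a pre-recorded
# waveform (list of samples) for each word and join them with short silences.

def synthesise(text, segments, silence_len=800):
    """Return one sample list: recorded word waveforms joined by silence."""
    out = []
    for word in text.lower().split():
        samples = segments.get(word)
        if samples is None:
            continue   # a real system would fall back to letter/phoneme rules
        out.extend(samples)
        out.extend([0] * silence_len)   # inter-word gap
    return out
```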


Package:     xxx
Platform:    (PC, Mac, Sun, NeXt etc)
Rough Cost:  (if appropriate)
Description: (keep it brief)
Hardware:    (requirement list)
Availablity: (ftp info, email contact or company contact)
Availability: (ftp info, email contact or company contact)


Can anyone provide information on the following packages?

    MacinTalk (Mac) - formant based speech synthesis
    Narrator (Amiga) - formant based synthesis
    Bliss software
    CSRE software
    JSRU software
    Klatt Software
    Sensimetrics products

Can anyone provide information on speech synthesis chip sets?

Please email or post suitable information for this list.  Commercial and 
research packages are both appropriate.  

[This list may be large enough to justify a separate posting]


=======================================================================

PART 5 - Speech Recognition

Q5.1: What is speech recognition?

Automatic speech recognition is the process by which a computer maps an 
acoustic speech signal to text.

Automatic speech understanding is the process by which a computer maps an 
acoustic speech signal to some form of abstract meaning of the speech.

------------------------------------------------------------------------

Q5.2: How can I build a very simple speech recogniser?

Doug Danforth provides a detailed account in article 253 in the comp.speech
archives - also available as file info/DIY_Speech_Recognition.

The first part is reproduced here.

  QUICKY RECOGNIZER sketch:
  
  Here is a simple recognizer that should give you 85%+ recognition
  accuracy.  The accuracy is a function of WHAT words you have in
  your vocabulary.  Long distinct words are easy.  Short similar
  words are hard.  You can get 98+% on the digits with this recognizer.
  
  Overview:
  (1) Find the beginning and end of the utterance.
  (2) Filter the raw signal into frequency bands.
  (3) Cut the utterance into a fixed number of segments.
  (4) Average data for each band in each segment.
  (5) Store this pattern with its name.
  (6) Collect training set of about 3 repetitions of each pattern (word).
  (7) Recognize unknown by comparing its pattern against all patterns
      in the training set and returning the name of the pattern closest
      to the unknown.

Many variations upon the theme can be made to improve the performance.
Try different filtering of the raw signal and different processing methods.
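
To make the sketch concrete, here is a rough Python illustration of steps
3-7, assuming steps 1-2 have already produced per-band energy frames
(frames[t][b] = energy of band b at time t).  The function names and the
choice of 8 segments and Euclidean distance are assumptions for this sketch,
not part of the original recipe:

```python
# Steps 3-4: cut the utterance into a fixed number of segments and average
# the energy of each frequency band within each segment.
def make_pattern(frames, n_segments=8):
    n_bands = len(frames[0])
    pattern = []
    for s in range(n_segments):
        lo = s * len(frames) // n_segments
        hi = max(lo + 1, (s + 1) * len(frames) // n_segments)
        seg = frames[lo:hi]
        pattern.append([sum(f[b] for f in seg) / len(seg)
                        for b in range(n_bands)])
    return pattern

# Step 7: simple Euclidean distance between two stored patterns.
def distance(p, q):
    return sum((a - b) ** 2
               for ps, qs in zip(p, q)
               for a, b in zip(ps, qs)) ** 0.5

# Steps 5-7: training_set is a list of (name, pattern) pairs; return the
# name of the stored pattern closest to the unknown.
def recognise(unknown, training_set):
    return min(training_set, key=lambda np: distance(unknown, np[1]))[0]
```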

------------------------------------------------------------------------

Q5.3: What does speaker dependent/adaptive/independent mean?

A speaker dependent system is developed (trained) to operate for a single
speaker.  These systems are usually easier to develop, cheaper to buy and
more accurate, but are not as flexible as speaker adaptive or speaker
independent systems.

A speaker independent system is developed (trained) to operate for any
speaker of a particular type (e.g. male/female, American/British English).
These systems are the most difficult to develop, the most expensive, and
currently the least accurate, but they are the most flexible.

A speaker adaptive system is developed to adapt its operation to the new 
speakers that it encounters, usually starting from a general model of speaker
characteristics.  In difficulty and flexibility it lies somewhere between 
speaker dependent and speaker independent systems.

Each type of system is suited to different applications and domains.

------------------------------------------------------------------------

Q5.4: What does small/medium/large/very-large vocabulary mean?

The size of vocabulary of a speech recognition system affects the complexity,
processing requirements and the accuracy of the system.  Some applications
only require a few words (e.g. numbers only), others require very large 
dictionaries (e.g. dictation machines).

There are no established definitions but the following may be a helpful guide.

	small vocabulary - tens of words
	medium vocabulary - hundreds of words
	large vocabulary - thousands of words
	very-large vocabulary - tens of thousands of words.

------------------------------------------------------------------------

Q5.5: What does continuous speech or isolated-word mean?

An isolated-word system operates on single words at a time - requiring a 
pause between saying each word.  This is the simplest form of recognition 
to perform because the pronunciation of one word tends not to affect that
of the others.  Because the occurrences of each particular word are similar,
they are easier to recognise.

A continuous speech system operates on speech in which words are connected
together, i.e. not separated by pauses.  Continuous speech is more difficult
to handle for a variety of reasons.  First, it is difficult to find the
start and end points of words.  Another problem is "coarticulation" - the
production of each phoneme is affected by the production of surrounding
phonemes, and so the start and end of each word are affected by the preceding
and following words.  The recognition of continuous speech is also affected
by the rate of speech (fast speech tends to be harder).

------------------------------------------------------------------------

Q5.6: How is speech recognition done?

A wide variety of techniques are used to perform speech recognition, and
systems operate at many different levels of recognition, processing and
understanding.

Typically speech recognition starts with the digital sampling of speech.
The next stage would be acoustic signal processing.  Common techniques 
include a variety of spectral analyses, LPC analysis, the cepstral transform,
cochlea modelling and many, many more.

The next stage will typically try to recognise phonemes, groups of phonemes 
or words.  This stage can be achieved by many processes such as DTW (Dynamic
Time Warping), HMM (hidden Markov modelling), NNs (Neural Networks), and
sometimes expert systems.  In crude terms, all of these processes attempt to
match the patterns of the incoming speech against stored models.  The most
advanced systems are statistically motivated.
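
As an illustration of the DTW idea mentioned above, here is a minimal
(unoptimised) Python sketch of the classic dynamic-programming recurrence.
Real recognisers add path constraints and better local distance measures;
this is only the core alignment computation:

```python
# Dynamic Time Warping: distance between two sequences of feature frames,
# allowing the time axis of one to stretch/compress against the other.

def dtw(seq_a, seq_b):
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between frame i of A and frame j of B.
            d = sum((x - y) ** 2 for x, y in zip(seq_a[i - 1], seq_b[j - 1]))
            cost[i][j] = d + min(cost[i - 1][j],        # A frame repeated
                                 cost[i][j - 1],        # B frame repeated
                                 cost[i - 1][j - 1])    # frames aligned
    return cost[n][m]
```

An isolated-word recogniser based on DTW computes this distance between the
unknown utterance and each stored template, and picks the smallest.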

Some systems utilise knowledge of grammar to help with the recognition 
process.

Some systems attempt to utilise prosody (pitch, stress, rhythm etc) to
process the speech input.

Some systems try to "understand" speech.  That is, they try to convert the
words into a representation of what the speaker intended to mean or achieve
by what they said.

------------------------------------------------------------------------

Q5.7: What are some good references/books on recognition?

Some general introduction books on speech recognition:

   Speech recognition by machine; W.A. Ainsworth
   London: Peregrinus for the Institution of Electrical Engineers, c1988

   Speech synthesis and recognition; J.N. Holmes
   Wokingham: Van Nostrand Reinhold, c1988

   Douglas O'Shaughnessy -- Speech Communication: Human and Machine
   Addison Wesley series in Electrical Engineering: Digital Signal Processing,
   1987.

   Electronic speech recognition: techniques, technology and applications
   edited by Geoff Bristow,  London: Collins, 1986

   Readings in Speech Recognition; edited by Alex Waibel & Kai-Fu Lee.
   San Mateo: Morgan Kaufmann, c1990

More specific books/articles:

   Hidden Markov models for speech recognition; X.D. Huang, Y. Ariki, M.A. Jack.
   Edinburgh: Edinburgh University Press, c1990

   Automatic speech recognition: the development of the SPHINX system;
   by Kai-Fu Lee; Boston; London: Kluwer Academic, c1989

   Prosody and speech recognition; Alex Waibel
   (Pitman: London) (Morgan Kaufmann: San Mateo, Calif) 1988

   S. E. Levinson, L. R. Rabiner and M. M. Sondhi, "An Introduction to the 
   Application of the Theory of Probabilistic Functions of a Markov Process 
   to Automatic Speech Recognition" in Bell Syst. Tech. Jnl. v62(4),
   pp1035--1074, April 1983

   R. P. Lippmann, "Review of Neural Networks for Speech Recognition", in
   Neural Computation, v1(1), pp 1-38, 1989.

------------------------------------------------------------------------

Q5.8: What packages are available?

Package Name: Votan
Platform: MS-DOS, SCO UNIX
Description: Isolated word and continuous speech modes, speaker dependent
	and (limited) speaker independent.  Vocab size is 255 words or up to a 
	fixed memory limit - but different words can be loaded dynamically for 
	an effectively unlimited vocabulary.
Rough Cost: Approx US $1,000-$1,500
Requirements: Cost includes one Votan Voice Recognition ISA-bus board
	for 386/486-based machines.  A software development system is also 
	available for DOS and Unix.
Misc:	Up to 8 Votan boards may co-exist for 8 simultaneous voice users. 
	A telephone interface is also available. There is also a 4GL and a 
	software development system.
	Apparently there is more than one version - more info required.
Contact: 800-877-4756, 510-426-5600


Package: HTK (HMM Toolkit) V1.4
Platform: Any 32 bit machine with an ANSI C compiler.  Can run under Unix, 
	DEC VAX/VMS, MPW/Apple Macintosh, PC compatibles under OS/2 etc. 
Description: HTK is a software toolkit for building continuous density HMM
	based speech recognisers, developed by the Cambridge University 
	Engineering Department Speech Group.  It consists of a 
	number of library modules and a number of tools.  Functions include 
	speech analysis, training tools, recognition tools, results analysis, 
	and an interactive tool for speech labelling. Many standard forms of 
	continuous density HMM are possible.  Can perform isolated word or 
	connected word speech recognition.  Can model whole words, sub-word 
	units.  Can perform speaker verification and other pattern recognition 
	work using HMMs.
Cost: 950 pounds sterling (industrial and commercial); 450 pounds sterling 
	(academic). Shipping 5 pounds (UK), 20 pounds (overseas).  
	VAT at 17.5% is payable on UK orders. The price includes full source 
	and documentation and a site license for the use of HTK.
Requirements: A 32 bit machine and a true 32 bit ANSI C compiler. 
	Systems containing a very large number of HMMs can require large 
	amounts of memory.
Misc: Details on HTK are available on the comp.speech archive site
	svr-ftp.eng.cam.ac.uk: /comp.speech/info/HTK_recognition
	HTK is supplied as ANSI C source (about 30,000 lines) with 
	extensive documentation (about 250 pages). HTK is currently in 
	use at more than 50 sites worldwide.
Contact: For further information email either Phil Woodland (pcw@eng.cam.ac.uk) 
	or Steve Young (sjy@eng.cam.ac.uk).
	Orders should be sent to: Lynxvale Limited, 20 Trumpington Street,
	Cambridge, CB2 1QA, England (FAX: +44 223 332797) stating choice of 
	distribution media (MS-DOS or Apple Macintosh 3.5" diskette).

Package Name: xxx
Platform:     PC, Mac, UNIX, ....
Description:  (e.g. isolated word, speaker independent...)
Rough Cost:   (if applicable)
Requirements: (hardware/software needs - if applicable)
Misc:
Contact:      (email, ftp or address)


Can anyone provide info on

	DragonDictate
	SayIt (from Qualix)
	Voice Navigator (from Articulate Systems)
	IN3 Voice Command


I would like information on any software/hardware/packages that you know about.
Commercial, public domain and research packages are all appropriate.

[If there is enough information a separate posting could be started.]


=======================================================================

PART 6 - Speaker Recognition/Verification

Q6.1: What is speaker recognition/verification?

------------------------------------------------------------------------

Q6.2: Where is speaker recognition used?

------------------------------------------------------------------------

Q6.3: What are techniques for speaker recognition?

------------------------------------------------------------------------

Q6.4: How good is speaker recognition?

------------------------------------------------------------------------

Q6.5: What are some good references/books on speaker recognition?

------------------------------------------------------------------------

Q6.6: What packages are available?


=======================================================================

PART 7 - Natural Language Processing

There is a lot of useful information on the following questions in the 
FAQ for comp.ai.  The FAQ lists available software and useful references.
Included is a substantial list of software, documentation and other info
which is available by ftp.

------------------------------------------------------------------------

Q7.1: What is NLP?

Natural Language Processing is a field of great breadth.  It covers areas
from syntax and semantic analysis of text, to methods of "understanding"
texts, to methods of generating text from abstract representations, to 
language translation and more.

------------------------------------------------------------------------

Q7.2: What are some good references/books on NLP?

Any recommendations?  A few references/books for each area such as parsing,
translation, knowledge representation etc, would be suitable.

The FAQ for the "comp.ai" newsgroup includes some useful refs also.

------------------------------------------------------------------------

Q7.3: What software is available?

The FAQ for the "comp.ai" newsgroup lists a variety of language processing 
software that is available.  That FAQ is posted monthly.

Natural Language Software Registry
The Natural Language Software Registry is available from the German Research 
Institute for Artificial Intelligence (DFKI) in Saarbrucken.

The current version details:
 + speech signal processors, e.g. Computerized Speech Lab (Kay Electronics)
 + morphological analyzers, e.g. PC-KIMMO (Summer Institute for Linguistics)
 + parsers, e.g. Alveytools (University of Edinburgh)
 + knowledge representation systems, e.g. Rhet (University of Rochester)
 + multicomponent systems, such as ELU (ISSCO), PENMAN (ISI), Pundit (UNISYS),
        SNePS (SUNY Buffalo),
 + applications programs (misc.)

This document is available on-line by anonymous ftp from ftp.dfki.uni-sb.de 
(directory: registry) or from tira.uchicago.edu (IP 128.135.96.31), or by 
email from registry@dfki.uni-sb.de.

If you have developed a piece of software for natural language processing 
that other researchers might find useful, you can include it by returning 
a description form, available from the same source.


===========================================================================

Cheers,

Andrew Hunt
Speech Technology Research Group		Ph:  61-2-692 4509
Dept. of Electrical Engineering			Fax: 61-2-692 3847
University of Sydney, NSW, 2006, Australia	email: andrewh@ee.su.oz.au
