Archive-name: comp-speech-faq/part1
Last-modified: 1995/01/11


              COMP.SPEECH FAQ POSTING - PART 1/3


[Note: this document has been automatically extracted from a WWW site:
        http://www.speech.su.oz.au/comp.speech
This may introduce some formatting errors.]

   
Comp.Speech Frequently Asked Questions

   The Frequently Asked Questions (FAQ) is a regular posting to
   comp.speech which attempts to answer some of the regular questions in
   the comp.speech newsgroup.
   
   The FAQ is not meant to discuss any topic exhaustively. It will
   hopefully provide readers with pointers on where to find useful
   information, especially material available on the Internet.
   
   If you have not already read the Usenet introductory material posted
   to "news.announce.newusers", please do. For help with FTP (file
   transfer protocol) look for a regular posting of "Anonymous FTP List -
   FAQ" in comp.misc, comp.archives.admin or news.answers.
   
   This FAQ is posted every 4 weeks to comp.speech, comp.answers &
   news.answers.
   
   It is also available by anonymous ftp from the comp.speech archive
   site:
     * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/FAQ-complete
       
   Or from the news.answers ftp site (and its mirrors)
     * ftp://rtfm.mit.edu/pub/usenet/news.answers/comp-speech-faq/*
       
   Or on the World Wide Web
     * http://www.speech.su.oz.au/comp.speech
       
   Or by sending email to mail-server@rtfm.mit.edu with the following
   line in the body of the message:
     * send usenet/news.answers/comp-speech-faq/*
       
Admin

   Not much to report this month. Hopefully, February should see some
   major catch-up work.
   
FAQ Sections

   The FAQ is divided into the following sections:
     * FAQ Contents
       
     * List of Speech Technology Products and Software
       
     * FAQ Section 1: General Information on Speech Technology
     * FAQ Section 2: Signal Processing
     * FAQ Section 3: Speech Coding and Compression
     * FAQ Section 4: Natural Language Processing
     * FAQ Section 5: Speech Synthesis
     * FAQ Section 6: Speech Recognition
       
Comp.Speech FTP Site

   The comp.speech ftp site (which is described in Q1.2) contains the
   following:
     * Newsgroup Archives
     * Data Resources
     * General Information
     * Software
       
Acknowledgements

   Hundreds of people have made contributions to the comp.speech FAQ over
   the last two years; there are too many to name individually. Special
   thanks go to Tony Robinson and Joe Campbell who have been particularly
   helpful.
   
Maintenance

   The FAQ posting and the Comp.Speech WWW Site are maintained by
    
    Andrew Hunt
    ---
    Speech Technology Research Group
    Dept. of Electrical Engineering
    University of Sydney, NSW, 2006, Australia
    Ph: 61-2-351 4509
    Fax: 61-2-351 3847
    email: andrewh@speech.su.oz.au


===========================================================================

   
                           COMP.SPEECH FAQ CONTENTS
                                       
Introduction

     * Overview
     * List of Packages
       
Section 1 : General Information on Speech Technology

     * Q1.1 What is comp.speech?
     * Q1.2 Where are the comp.speech archives?
     * Q1.3 Common abbreviations and jargon.
     * Q1.4 What are related newsgroups and mailing lists?
     * Q1.5 What are related journals and conferences?
     * Q1.6 What resources are available as handicap aids?
     * Q1.7 What speech data is available?
     * Q1.8 Speech File Formats, Conversion and Playing.
     * Q1.9 What "Speech Laboratory Environments" are available?
     * Q1.10 Miscellaneous Software and Other Resources.
       
Section 2 : Signal Processing for Speech

     * Q2.1 What sampling do I need for speech?
     * Q2.2 How do I find the pitch of a speech signal?
     * Q2.3 How do I find the start and end points of a speech signal?
     * Q2.4 Where can I find FFT software?
     * Q2.5 What signal processing techniques are used in speech
       technology?
     * Q2.6 What speech sampling and signal processing hardware can I
       use?
     * Q2.7 How do I convert to/from mu-law format?
       
Section 3 : Speech Coding and Compression

     * Q3.1 Speech compression techniques.
     * Q3.2 What are some good references/books on coding/compression?
     * Q3.3 What software is available? (Includes CELP & G.7xx)
       
Section 4 : Natural Language Processing

     * Q4.1 What are some good references/books on NLP?
     * Q4.2 What NLP software is available?
       
Section 5 : Speech Synthesis

     * Q5.1 What is speech synthesis?
     * Q5.2 How can speech synthesis be performed?
     * Q5.3 What are some good references/books on synthesis?
     * Q5.4 What software/hardware is available?
       
Section 6 : Speech Recognition

     * Q6.1 What is speech recognition?
     * Q6.2 How can I build a very simple speech recogniser?
     * Q6.3 What does speaker dependent/adaptive/independent mean?
     * Q6.4 What does small/medium/large/very-large vocabulary mean?
     * Q6.5 What does continuous speech or isolated-word mean?
     * Q6.6 How is speech recognition done?
     * Q6.7 What are some good references/books on recognition?
     * Q6.8 What speech recognition packages are available?


===========================================================================

   
FAQ: List of Packages

   The comp.speech FAQ provides information on a range of software,
   hardware and resources.
   
Speech Data

     * Phonemic Samples
     * Linguistic Data Consortium (LDC)
     * Center for Spoken Language Understanding (CSLU)
     * PhonDat - A Large Database of Spoken German
     * Oxford Acoustic Phonetic Database
       
Speech Processing Environments

     * Entropic Signal Processing System (ESPS) and Waves
     * CSRE: Canadian Speech Research Environment
     * OGI Speech Tools
     * Matlab plus Signal Processing Toolbox
     * Signalyze 3.0 from InfoSignal
     * Kay Elemetrics CSL (Computer Speech Lab) 4300
     * MacSpeech Lab II (MSL II)
     * N!Power
     * Ptolemy
     * Khoros
     * SpeechViewer II
       
Other Resources

     * CMU Dictionary
     * Another Dictionary
     * BEEP dictionary
     * CUVOLAD dictionary
     * MRC database
     * Network Audio System
     * NEVOT (1.4v) from AT&T BL
     * Human Audio Perception Document
     * Homophone List
     * Auditory Toolbox for Matlab
     * Auditory Modeller 1
     * Auditory Modeller 2
       
Audio I/O Hardware

     * Sun standard audio port (SPARC I & II)
     * Sun standard audio port (SPARC 10 & 20)
     * Ariel Signal Processors
     * IBM RS/6000 ACPA (Audio Capture and Playback Adapter)
     * Sound Galaxy NX, Aztech Systems
     * Sound Galaxy NX PRO, Aztech Systems
     * ATI Stereo F/X Sound Board
     * Various PC Sound Cards
       
Compression Software and Hardware

     * File format conversion
     * shorten - a lossless compressor for speech signals
     * 32 kbps ADPCM
     * GSM 06.10 Compression
     * G.721/722/723 Compression
     * G.728 Compression
     * G.728 LD-CELP vocoder
     * U.S.F.S. 1016 CELP vocoder for DSP56001
     * 8 Kbit/s CELP on the TMS320C5x family of DSP chips
     * CELP 3.2a & LPC
       
Natural Language Processing

     * Natural Language Software Registry (NLSR) - NLP Tools
     * Part of Speech Tagger
       
Speech Synthesis

     * Orator Text-to-Speech Synthesizer
     * Text to phoneme program (1)
     * Text to phoneme program (2)
     * Text to phoneme program (3)
     * Text to speech program
     * "Speak" - a Text to Speech Program
     * TheBigMouth - a Text to Speech Program
     * TextToSpeech Kit
     * SGI Developers Toolbox Synthesiser
     * rsynth
     * SENSYN speech synthesizer
     * spchsyn.exe
     * CSRE: Canadian Speech Research Environment
     * Eloquence (currently an alpha release)
     * JSRU
     * Klatt-style synthesiser
     * DECTalk
     * Speech Manager and PlainTalk
     * Various Mac Speech Output Applications
     * MacinTalk
     * Monologue by Creative Labs
     * Lernout & Hauspie Text-To-Speech SDK
     * Tinytalk
     * Narrator - narrator.device
     * Infovox Product Range
     * SIMTEL-20
       
Speech Recognition

     * HM2007 - Speech Recognition Chip
     * Voice Blaster Ver. 4.0
     * Votan
     * Entropic's HTK (HMM Toolkit)
     * DragonDictate version 3.0
     * DragonDictate for Windows
     * DragonVoiceTools
     * IBM Personal Dictation System
     * Osborne Personal Dictation System (in Australia)
     * VoiceServer for Windows
     * IN3 Voice Command for Windows
     * IN3 Voice Command
     * Phonetic Engine 400 (PE400) - Speech Systems, Inc.
     * SayIt
     * Kurzweil Voice for Windows 1.0
     * D6006 Voice Control Processor
     * Speech Commander - Listen for Windows
     * Voice-Trek 2.0
     * Visus SpeechKit
     * recnet
     * Lotec Speech Recognition Package
     * Myers' Hidden Markov Model software
     * Voice Command Line Interface
     * DATAVOX - French
     * PowerSecretary
     * ICSS system from IBM
     * Creative VoiceAssist


===========================================================================

   
FAQ SECTION 1 - General

  Q1.1: WHAT IS COMP.SPEECH?
  
   Comp.speech is a newsgroup for discussion of speech technology and
   speech science. It covers a wide range of issues from application of
   speech technology, to research, to products and lots more. By nature
   speech technology is an inter-disciplinary field and the newsgroup
   reflects this. However, computer application is the basic theme of the
   group.
   
   The following is a list of topics but does not cover all matters
   related to the field (no order of importance is implied).
     * Speech Recognition - discussion of methodologies, training,
       techniques, results and applications. This should cover the
       application of techniques including HMMs, neural-nets and so on to
       the field.
       
     * Speech Synthesis - discussion concerning theoretical and
       practical issues associated with the design of speech synthesis
       systems.
       
     * Speech Coding and Compression - both research and application
       matters.
       
     * Phonetic/Linguistic Issues - coverage of linguistic and phonetic
       issues which are relevant to speech technology applications. Could
       cover parsing, natural language processing, phonology and prosodic
       work.
       
     * Speech System Design - issues relating to the application of
       speech technology to real-world problems. Includes the design of
       user interfaces, the building of real-time systems and so on.
       
     * Other matters - relevant conferences, jobs, books, software,
       hardware, and products.
       
     _________________________________________________________________
   
  Q1.2: WHERE ARE THE COMP.SPEECH ARCHIVES?
  
   comp.speech is being archived for anonymous ftp.
     * ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/archive/
       
   comp.speech/archive contains the articles as they arrive. Batches of
   100 articles are grouped into a shar file, along with an associated
   file of Subject lines.
   
   Other useful information is also available in comp.speech/info.
     _________________________________________________________________
   
  Q1.3: COMMON ABBREVIATIONS AND JARGON.
     * ANN - Artificial Neural Network.
     * ASR - Automatic Speech Recognition.
     * ASSP - Acoustics Speech and Signal Processing
     * AVIOS - American Voice I/O Society
     * CELP - Code-book Excited Linear Prediction.
     * COLING - Computational Linguistics
     * DTW - Dynamic Time Warping.
     * FAQ - Frequently Asked Questions.
     * HMM - Hidden Markov Model.
     * IEEE - Institute of Electrical and Electronics Engineers
     * JASA - Journal of the Acoustical Society of America
     * LPC - Linear Predictive Coding.
     * LVQ - Learning Vector Quantisation.
     * NLP - Natural Language Processing.
     * NN - Neural Network.
     * TI - Texas Instruments.
     * TIMIT - A large speech corpus from TI and MIT - see Q1.7
     * TTS - Text-To-Speech (i.e. synthesis).
     * VQ - Vector Quantisation.
       
     _________________________________________________________________
   
  Q1.4: WHAT ARE RELATED NEWSGROUPS AND MAILING LISTS?
  
      Newsgroups
      
   comp.ai - Artificial Intelligence newsgroup.
          Postings on general AI issues, language processing and AI
          techniques. Has a good FAQ including NLP, NN and other AI
          information.
          
   comp.ai.nat-lang - Natural Language Processing Group
          Postings regarding Natural Language Processing. Set up to cover
          a broad range of related issues and different viewpoints.
          
   comp.ai.nlang-know-rep - Natural Language Knowledge Representation
          Moderated group covering Natural Language.
          
   comp.ai.neural-nets - discussion of Neural Networks and related
          issues.
          There are often postings on speech related matters - phonetic
          recognition, connectionist grammars and so on.
          
   comp.compression - occasional articles on compression of speech.
          FAQ for comp.compression has some info on audio compression
          standards.
          
   comp.dcom.telecom - Telecommunications newsgroup.
          Has occasional articles on voice products.
          
   comp.dsp - discussion of signal processing - hardware and algorithms
          and more.
          Has a good FAQ posting. Has a regular posting of a
          comprehensive list of Audio File Formats.
          
   comp.multimedia - Multi-Media discussion group.
          Has occasional articles on voice I/O.
          
   sci.lang - Language.
          Discussion about phonetics, phonology, grammar, etymology and
          lots more.
          
   alt.sci.physics.acoustics
          Some discussion of speech production & perception.
          
   alt.binaries.sounds.misc - posting of various sound samples
          
   alt.binaries.sounds.d - discussion about sound samples, recording
          and playback.
          
      Mailing Lists
      
   ECTL - Electronic Communal Temporal Lobe
          Founder & Moderator: David Leip. Moderated mailing list for
          researchers with interests in computer speech interfaces. This
          list serves a broad community including persons from signal
          processing, AI, linguistics and human factors. To subscribe,
          send your name, institute, department, daytime phone and email
          address to:
          
          + ectl-request@snowhite.cis.uoguelph.ca
            
   The ECTL archive site is
          
          + ftp://snowhite.cis.uoguelph.ca/pub/ectl
            
   Prosody Mailing List
          Unmoderated mailing list for discussion of prosody. The aim is
          to facilitate the spread of information relating to the
          research of prosody by creating a network of researchers in the
          field. If you want to participate, send the following one-line
          message to
          
          + listserv@msu.edu
          + subscribe prosody Your Name
            
   foNETiks
          A moderated monthly newsletter distributed by e-mail. It
          carries job advertisements, notices of conferences, and other
          news of general interest to phoneticians, speech scientists and
          others. The editors are Linda Shockey and Gerry Docherty. To
          subscribe send the following 1 line message to
          
          + mailbase@mailbase.ac.uk
          + join fonetiks your_first_name your_second_name
            
   Digital Mobile Radio
          Covers many areas, including speech topics such as speech
          coding and speech compression. Mail Peter Decker
          (dec@dfv.rwth-aachen.de) to subscribe.
          
     _________________________________________________________________
   
  Q1.5: WHAT ARE RELATED JOURNALS AND CONFERENCES?
  
   Try the following commercially oriented magazines:
     * Voice News - monthly industry newsletter
    Stoneridge Technical Services
    PO Box 1891, Rockville, MD, 20850, USA
    Phone: (301) 424-0114
     * Voice Technology News
     * Voice Processing Magazine (1-800-854-3112)
     * Speech Technology (no longer published)
       
   Try the following technical journals (some contact addresses below):-
     * IEEE Transactions on Speech and Audio Processing (from Jan 93)
     * IEEE Signal Processing Magazine (from Jan 93)
     * IEEE Transactions on Acoustics, Speech, and Signal Processing
       (ASSP) (now obsolete)
     * Computational Linguistics (COLING)
     * Computer Speech and Language
     * Journal of the Acoustical Society of America (JASA)
     * AVIOS Journal
     * ASR News
       
   Try the following conferences:-
     * ICASSP Intl. Conference on Acoustics Speech and Signal Processing
       (IEEE)
     * ICSLP Intl. Conference on Spoken Language Processing
     * EUROSPEECH European Conference on Speech Communication and
       Technology
     * AVIOS American Voice I/O Society Conference
     * SST Australian Speech Science and Technology Conference
       
   Here are a few contact addresses:- 
   
   Publications:
          IEEE Transactions on Speech and Audio Processing (from Jan 93)
          IEEE Transactions on Acoustics, Speech, and Signal Processing
          (ASSP) - now obsolete.
          
   Organization:
          Institute of Electrical and Electronics Engineers (IEEE)
          
   Contact:
          IEEE Service Center
          445 Hoes Lane, PO Box 1331, Piscataway, NJ 08855, USA
          Phone: 1-800-678-IEEE or (201)981-0060 
          
   Publications:
          Computer Speech and Language
          
   Contact:
          Academic Press, Ltd.
          24-28 Oval Rd, London NW1, England
          
   Price:
          $136 (Institutions), $58 (Individuals) 
          
   Publications:
          Computational Linguistics
          
   Organization:
          Association for Computational Linguistics
          MIT Press Journals
          55 Hayward St, Cambridge, MA 02142, USA
          Phone: (617)253-2889
          
     _________________________________________________________________
   
  Q1.6: WHAT RESOURCES ARE AVAILABLE AS HANDICAP AIDS?
  
   Can anyone provide information on speech technology aids for the deaf,
   blind, speech impaired, physically impaired and other groups who may
   benefit from speech technology?
   
    SpeechViewer II
     * Platform: IBM Machines from Mod 25 on.
     * Description: SpeechViewer II is a speech therapy tool. It
        provides graphical feedback of various speech features so that
        speech impaired individuals can improve their speech. It works
        with an audio bandwidth of 7.3 kHz and thus allows the therapist
        to work with sustained vowels and fricatives. A wide range of
        graphics are used to provide adequate variability to hold client
        interest. An extensive set of statistics is gathered, which allows
        a therapist to do research or keep therapy records. The speech
       therapy modules are:
          + Awareness - Sound, Loudness, Pitch, Voicing Onset, Voicing
          + Skill Building - Pitch, Voicing, Phonology
          + Patterning - Pitch & Loudness - Waveform & Spectrogram,
            Spectra
          + Clinical Management - Profiles, Models, Client Data
     * Hardware: Requires an IBM M-ACPA (Multimedia-Audio Capture
       Playback Adapter). It has a TI TMS320C25 DSP chip. The input
        sampling rate is 44.1 kHz stereo, 88.2 kHz mono. This is a 16 bit
       card. It has the following jacks: mic in, stereo line in, stereo
       line out, speaker out. Note: This card is being replaced by Mwave
       technology. For more info on Mwave contact Texas Instruments.
     * Price:
          + The software is $2130 list, $1491 educational, part number
            92F2066.
          + The M-ACPA is $370 list, $222 educational, part number
            92F3378.
          + The MicroChannel adapter part number is 92F3379 (same price).
     * Contact: The Psychological Corporation (TPC) [IBM Authorized
       Remarketer]
    Phone: 1-800-228-0752 or contact IBM on 1-800-426-4832.
    
     _________________________________________________________________
   
  Q1.7: WHAT SPEECH DATA IS AVAILABLE?
  
   A wide range of speech databases have been collected. These databases
   are primarily for the development of speech synthesis/recognition and
   for linguistic research.
   
   Some databases are free but most appear to be available for a small
   cost. The databases normally require lots of storage space - do not
   expect to be able to ftp all the data you want.
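
   To see why, it helps to estimate the size of uncompressed speech. The
   sketch below is only an illustration: the function name pcm_bytes is
   made up for this example, and the default of 16 kHz, 16-bit, single
   channel audio is an assumption chosen to match the recording quality
   quoted for the ISOLET and PhonDat databases further down.

```python
def pcm_bytes(seconds, sample_rate_hz=16000, bits_per_sample=16, channels=1):
    """Return the size in bytes of uncompressed linear PCM audio.

    size = duration * samples per second * bytes per sample * channels
    """
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# One second and one hour of 16 kHz, 16-bit mono speech (assumed settings).
one_second = pcm_bytes(1)
one_hour = pcm_bytes(3600)
print("one second:", one_second, "bytes")
print("one hour:  ", one_hour / 1e6, "million bytes")
```

   At these assumed settings an hour of speech comes to more than 100
   million bytes before any label files or documentation are added.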
   
    Phonemic Samples
     * First, some basic data. The following ftp sites have samples of
       English phonemes (American accent I believe) in Sun audio format
       files. See Question 1.8 for information on audio file formats.
          + ftp://sounds.sdsu.edu/.1/phonemes: This ftp site appears to
            be obsolete. Does anyone know a new address?
          + ftp://phloem.uoregon.edu/pub/Sun4/lib/phonemes : There
            appears to be some config problem with this ftp server.
          + ftp://sunsite.unc.edu/pub/multimedia/sun-sounds/phonemes
            
    Linguistic Data Consortium (LDC)
     * Briefly stated, the LDC has been established to broaden the
       collection and distribution of speech and natural language data
       bases for the purposes of research and technology development in
       automatic speech recognition, natural language processing and
       other areas where large amounts of linguistic data are needed.
        Here is a list of some of the corpora:
          + The TIMIT and NTIMIT speech corpora
          + The Resource Management speech corpus (RM1, RM2)
          + The Air Travel Information System (ATIS0) speech corpus
          + The Association for Computational Linguistics - Data
            Collection Initiative text corpus (ACL-DCI)
          + The TI Connected Digits speech corpus (TIDIGITS)
          + The TI 46-word Isolated Word speech corpus (TI-46)
          + The Road Rally conversational speech corpora (including
            "Stonehenge" and "Waterloo" corpora)
          + The Tipster Information Retrieval Test Collection
          + The Switchboard speech corpus ("Credit Card" excerpts and
            portions of the complete Switchboard collection)
     * Further resources made available in the first year (or two):
          + The Machine-Readable Spoken English speech corpus (MARSEC)
          + The Edinburgh Map Task speech corpus
          + The Message Understanding Conference (MUC) text corpus of FBI
            terrorist reports
          + The Continuous Speech Recognition - Wall Street Journal
            speech corpus (WSJ-CSR)
          + The Penn Treebank parsed/tagged text corpus
          + The Multi-site ATIS speech corpus (ATIS2)
          + The Air Traffic Control (ATC) speech corpus
          + The Hansard English/French parallel text corpus
          + The European Corpus Initiative multi-language text corpus
            (ECI)
          + The Int'l Labor Organization/Int'l Trade Union multi-language
            text corpus (ILO/ITU)
          + Machine-readable dictionaries/lexical data bases (COMLEX,
            CELEX)
     * Detailed information about the Linguistic Data Consortium is
        available by anonymous ftp from the address below. The files in the
       directory include more detailed information on the individual
       databases.
          + ftp://ftp.cis.upenn.edu/pub/ldc
     * For further information contact
    Linguistic Data Consortium
    441 Williams Hall, University of Pennsylvania
    Philadelphia, PA 19104-6305
    Phone: +1 (215) 898-0464
    Fax: +1 (215) 573-2175
    e-mail: ldc@unagi.cis.upenn.edu
    
    Center for Spoken Language Understanding (CSLU)
     * The ISOLET speech database of spoken letters of the English
       alphabet. The speech is high quality (16 kHz with a noise
       cancelling microphone). 150 speakers x 26 letters of the English
       alphabet twice in random order. The ISOLET data base can be
       purchased for $100 by sending an email request to
       vincew@cse.ogi.edu. (This covers handling, shipping and medium
       costs). The data base comes with a technical report describing the
       data.
     * CSLU has a telephone speech corpus of 1000 English alphabets.
       Callers recite the alphabet with brief pauses between letters.
       This database is available to not-for-profit institutions for
       $100. The data base is described in the proceedings of the
       International Conference on Spoken Language Processing.
          + Contact vincew@cse.ogi.edu if interested.
     * CSLU has released for universities its Continuous English Speech
       Corpus. The corpus contains recorded speech from 690 different
       speakers, with label files at various levels - including word
       level and phonetic labels. The data were collected as part of the
       OGI Multi-language telephone corpus. CSLU provides speech corpora
       to all universities without charge. To order a corpus, print the
       license agreement/order form, complete it, and fax it to the CSLU.
       A description of the corpora and an order form are available by
       anonymous ftp:
          + ftp://speech.cse.ogi.edu/pub/releases
     * Contact: Mike Noel -
    email: noel@cse.ogi.edu Phone: (503) 690-1309
    
    PhonDat - A Large Database of Spoken German
     * The PhonDat continuous speech corpora are now available on CD-ROM
       media (ISO 9660 format).
          + PhonDat I (Diphone Corpus) : 6 CDs (1140.- DM)
          + PhonDat II (Train Enquiries Corpus): 1 CD ( 190.- DM)
      * PhonDat I comprises approx. 20,000 and PhonDat II approx. 1,500 signal
        files in high quality 16-bit, 16 kHz recordings. The corpora come
       with documentation containing the orthographic transcription and a
       citation form of the utterances, as well as a detailed file format
       description. A narrow phonetic transcription is available for
       selected files from corpus I and II.
     * For information and orders contact
    Barbara Eisen
    Institut fuer Phonetik
    Schellingstr. 3 / II
    D 80799 Munich 40
    Tel: +49 / 89 / 2180 -2454 or -2758
    Fax: +49 / 89 / 280 03 62
    
    Oxford Acoustic Phonetic Database
     * Available on compact disc, from J. Pickering and B. Rosner. It
       contains data on vowel-consonant and consonant-vowel combinations
        in both stressed and unstressed locations. The languages covered
       include French, German, Hungarian, Italian, Japanese, British
       English, Spanish and English. For further information write to
    Electronic Publishing, Oxford University
    Press, Walton Street, Oxford OX2 6DP, UK.
    The ISBN is 0-19-268086-2
     * Contact:
    Prof. B. Rosner
    Dept. of Experimental Psychology
    South Parks Rd, Oxford, OX1 3UD, UK
    email: burton.rosner@wolfson.ox.ac.uk
    
     _________________________________________________________________
   
  Q1.8: SPEECH FILE FORMATS, CONVERSION AND PLAYING.
  
   Section 2 of this FAQ has information on mu-law coding.
   
   A very good and very comprehensive list of audio file formats is
   prepared by Guido van Rossum. The list is posted regularly to comp.dsp
   and alt.binaries.sounds.misc, amongst others. It includes information
   on sampling rates, hardware, compression techniques, file format
   definitions, format conversion, standards, programming hints and lots
   more. It is also available by ftp from
     * ftp://ftp.cwi.nl/pub/audio/AudioFormats.part1,2
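
   As a rough illustration of what a file format definition involves,
   the sketch below decodes the header of a Sun audio file, the format
   used by the phoneme samples in Q1.7. The layout shown (six big-endian
   32-bit fields starting with the ".snd" magic number) is the commonly
   documented Sun/NeXT header and is not taken from this FAQ, so check
   it against the Audio File Formats list before relying on it. The
   function name parse_au_header and the short encoding table are
   assumptions made for this example.

```python
import struct

AU_MAGIC = 0x2E736E64  # the ASCII bytes ".snd"

# Only a few common encoding codes are listed here; others exist.
ENCODING_NAMES = {
    1: "8-bit mu-law",
    2: "8-bit linear PCM",
    3: "16-bit linear PCM",
}


def parse_au_header(data):
    """Decode the fixed 24-byte part of a Sun/NeXT audio header."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes of header")
    # Six unsigned 32-bit integers, all stored big-endian.
    (magic, data_offset, data_size,
     encoding, sample_rate, channels) = struct.unpack(">6I", data[:24])
    if magic != AU_MAGIC:
        raise ValueError("missing .snd magic number")
    return {
        "data_offset": data_offset,   # where the samples start in the file
        "data_size": data_size,       # sample data length in bytes
        "encoding": ENCODING_NAMES.get(encoding, "code %d" % encoding),
        "sample_rate": sample_rate,   # samples per second
        "channels": channels,
    }


# A hand-built header for 8 kHz mono mu-law, used only as a demonstration.
example = struct.pack(">6I", AU_MAGIC, 24, 8000, 1, 8000, 1)
info = parse_au_header(example)
print(info["encoding"], info["sample_rate"], info["channels"])
```

   To inspect a real file, read its first 24 bytes in binary mode and
   pass them to the function.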
       
     _________________________________________________________________
   
  Q1.9: WHAT "SPEECH LABORATORY ENVIRONMENTS" ARE AVAILABLE?
  
   First, what is a Speech Laboratory Environment? A speech lab is a
   software package which provides the capability of recording, playing,
   analysing, processing, displaying and storing speech. Your computer
   will require audio input/output capability. The different packages
   vary greatly in features and capability - best to know what you want
   before you start looking around.
   
   Most general purpose audio processing packages will be able to process
   speech but do not necessarily have the specialised capabilities needed
   for speech work (e.g. formant analysis).
   
   The following article provides a good survey.
     * Read, C., Buder, E., & Kent, R. "Speech Analysis Systems: An
       Evaluation" Journal of Speech and Hearing Research, pp 314-332,
       April 1992.
       
    Entropic Signal Processing System (ESPS) and Waves
     * Platform: Range of Unix platforms.
     * Description: ESPS is a comprehensive set of speech
       analysis/processing tools for the UNIX environment. The package
       includes UNIX commands, and a comprehensive C library (which can
       be accessed from other languages). Waves is a graphical front-end
       for speech processing. Speech waveforms, spectrograms, pitch
       traces etc can be displayed, edited and processed in X windows and
       Openwindows (versions 2 & 3). Waves also includes a signal
       labelling utility which provides multiple feature labelling and
       useful features for fast labelling of large speech databases.
       Entropic also distributes HTK (the Hidden Markov Model Toolkit).
       HTK is described in Section 6 of this FAQ.
     * Cost: On request.
     * Contact:
    Entropic Research Laboratory, Washington Research Laboratory
    600 Pennsylvania Ave, S.E. Suite 202, Washington, D.C. 20003
    (202) 547-1420
    email - info@entropic.com
    
    CSRE: Canadian Speech Research Environment
     * Platform: IBM/AT-compatibles
     * Description: CSRE is a microcomputer-based system designed to
       support speech research. CSRE provides a low-cost facility in
       support of speech research, using mass-produced and
       widely-available hardware. The project is non-profit, and relies
       on the cooperation of researchers at a number of institutions and
       fees generated when the software is distributed. Functions include
       speech capture, editing, and replay; several alternative spectral
       analysis procedures, with color and surface/3D displays; parameter
       extraction/ tracking and tools to automate measurement and support
       data logging; alternative pitch-extraction systems; parametric
       speech (KLATT80) and non-speech acoustic synthesis, with a variety
       of supporting productivity tools; and an experiment generator, to
       support behavioral testing using a variety of common testing
       protocols. A paper about the whole package can be found in:
          + Jamieson D.G. et al, "CSRE: A Speech Research Environment",
            Proc. of the Second Intl. Conf. on Spoken Language
            Processing, Edmonton: University of Alberta, pp. 1127-1130.
     * Hardware: Can use a range of data acquisition/DSP hardware
     * Cost: Distributed on a cost recovery basis.
     * Availability: For more information on availability contact
    Krystyna Marciniak
    email march@uwovax.uwo.ca
    Tel (519) 661-3901 Fax (519) 661-3805.
   For technical information
    email ramji@uwovax.uwo.ca
     * Note: Also included in Q5.4 on speech synthesis packages.
       
    OGI Speech Tools
     * Developers from the Center for Spoken Language Understanding
       (CSLU) at the Oregon Graduate Institute of Science and Technology
       (Portland Oregon)
     * Platform: Unix
     * Description: The OGI Speech tools include :
          + An X windows display tool (LYRE) for displaying data in a
            time synchronous fashion for a. the speech signal b.
            spectrograms c. phoneme labels, and other information.
          + A Neural Network (NOPT) training package.
          + A set of C library routines (LIBNSPEECH) for the
            manipulation of speech data, including: a. PLP Analysis, b.
            Rasta PLP Analysis, c. Linear Predictive Coding, d. Mel
            Cepstrum Coding, e. Fast Fourier Transform
          + A set of utilities for converting file formats such as ADC,
            NIST, mu-law, binary files, and ascii. Includes filtering.
          + A database utility (find_phone) to automate speech database
            related enquiries. It allows the user to specify a particular
            label or set of labels in a given context, display all
            occurrences of the label, and relabel the occurrences if
            desired.
          + A Vector-Quantizer based on the Linde Buzo and Gray (LBG)
            algorithm.
          + A set of PERL Scripts which have been used mainly to automate
            the use of the OGI Speech Tools.
          + MAN Pages for all routines and programs developed, as well as
            a User manual in both postscript and tex format.
     * Misc: Software is written in ANSI C.
     * Availability: By anonymous ftp from
          + ftp://speech.cse.ogi.edu/pub/tools/
     * Contact: Try tools@cse.ogi.edu
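     * Example: The Linde-Buzo-Gray procedure behind the vector
       quantizer listed above can be sketched roughly as follows. This
       is an illustrative Python sketch, not code from the OGI
       distribution; the function names (nearest, centroid, lbg), the
       additive split offset eps and the stopping tolerance tol are all
       assumptions made for the example.

```python
def nearest(codebook, v):
    """Index of the codeword closest to v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def lbg(vectors, size, eps=0.01, tol=1e-6, max_iter=100):
    """Grow a codebook by repeated splitting and refinement.

    The codebook doubles at every split, so `size` should be a power
    of two; otherwise the result has the next power of two entries.
    """
    codebook = [centroid(vectors)]
    while len(codebook) < size:
        # Split: replace each codeword by two slightly offset copies.
        codebook = [[c + s * eps for c in cw]
                    for cw in codebook for s in (1, -1)]
        prev = float("inf")
        for _ in range(max_iter):
            # Assign every training vector to its nearest codeword.
            cells = [[] for _ in codebook]
            dist = 0.0
            for v in vectors:
                i = nearest(codebook, v)
                cells[i].append(v)
                dist += sum((a - b) ** 2 for a, b in zip(codebook[i], v))
            dist /= len(vectors)
            # Move each codeword to the centroid of its cell
            # (an empty cell keeps its old codeword).
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
            # Stop refining once the mean distortion stops improving.
            if prev - dist <= tol * dist:
                break
            prev = dist
    return codebook
```

       In practice the training vectors would be frames of speech
       features (for example cepstral coefficients), and each frame is
       then coded as the index returned by nearest().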
       
    Matlab plus Signal Processing Toolbox
     * Platform: Wide range
     * Description: Matlab (MATrix LABoratory) is a technical computing
       environment for numerical computation and visualization based on a
       matrix oriented, interpreted programming language. The programming
       environment provides support for the development of customized
       operations, along with debugging facilities and a graphical user
       interface toolkit. Audio output is provided.
       
       A specialised Signal Processing Toolbox is available which
       provides many functions which are useful for speech analysis. It
       includes filter design, spectral estimation, statistical signal
       processing, waveform generation, and signal and spectrogram
       display.
       
       A specialised Auditory Toolbox is available which contains
       functions useful to people interested in auditory/cochlear models.
       A more detailed description is given in Q1.10.
     * Price: On request.
     * Contact: The Math Works Inc.
    24 Prime Park Way, Natick, MA 01760-1500 USA
    Ph: 1-508-653 1415 Fax: 1-508-653 6284
    Email: info@mathworks.com
     * FTP: ftp://ftp.mathworks.com
     * WWW: http://www.mathworks.com/
       
    Signalyze 3.0 from InfoSignal
     * Platform: Macintosh
     * Description: Signalyze's basic conception revolves around up to
       100 signals, displayed synchronously in HyperCard fashion on
       "cards". The program offers a complement of signal editing
       features, quite a few spectral analysis tools, manual scoring
       tools, pitch extraction routines, a good set of signal
       manipulation tools, and extensive input-output capacity.
       
       Handles multiple file formats: Signalyze, MacSpeech Lab,
       AudioMedia, SoundDesigner II, SoundEdit/MacRecorder, SoundWave,
       three sound resource formats, and ASCII-text. Sound I/O: Direct
       sound input from MacRecorder and similar devices, AudioMedia,
       AudioMedia II and AD IN, some MacADIOS boards and devices, Apple
       sound input (built-in microphone). Sound output via Macintosh
       internal sound, via SoundManager 3.0, some MacADIOS boards and
       devices as well as via the Digidesign 16-bit boards.
       
       It has a range of capabilities for creating, editing and
       manipulating label files with flexibility in labelling format.
     * Compatibility: MacPlus and higher (including II, IIx, IIcx,
       IIci, IIfx, IIvx, IIvi, Portable, all PowerBooks, Centris and
       Quadras). Takes advantage of large and multiple screens and 16/256
       color/grayscales. System 7.0 compatible. Runs in background with
       adjustable priority.
     * Misc: A demo available upon request. Manuals and tutorial
       included. It is available in English, French, and German. An
       UPDATER to version 2.48 is now available in:
          + The UNIL Gopher server (see last page of InfoSignal News 8)
          + The LAIP FTP server. Address: MACFL4082.unil.ch, machine
            no. 130.223.104.31
   Also available are a demo program, and current questions and answers.
     * Cost: Individual licence US$350, site license US$500, plus
       shipping. Upgrades from version 2.0 are available.
     * Contact:
    North America - Network Technology Corporation
    91 Baldwin St., Charlestown MA 02129
    Fax: 617-241-5064 Phone: 617-241-9205
   Elsewhere contact
    InfoSignal Inc.
    C.P. 73, 1015 LAUSANNE, Switzerland,
    FAX: +41 21 691-1372,
    Email: 76357.1213@COMPUSERVE.COM.
    
    Kay Elemetrics CSL (Computer Speech Lab) 4300
     * Platform: Minimum IBM PC-AT compatible with extended memory (min
       2MB) with at least VGA graphics. Optimal would be 386 or 486
       machine with more RAM for handling larger amounts of data.
     * Description: Speech analysis package, with optional separate LPC
       program for analysis/synthesis. Uses its own file format for data,
       but has some ability to export data as ascii. The main
       editing/analysis prog (but not the LPC part) has its own macro
       language, making it easy to perform repetitive tasks. Probably not
       much use without the extra LPC program, which also allows
       manipulation of pitch, formant and bandwidth parameters.
       
       Hardware includes an internal DSP board for the PC (requires ISA
       slot), and an external module containing signal processing chips
       which does A/D and D/A conversion.
     * Misc: A programmers kit is available for programming signal
       processing chips (experts only). A speaker and microphone are
       supplied. Manuals are included.
     * Cost: Recently approx 6000 pounds sterling.
     * Contact:
    UK distributors are Wessex Electronics,
    114-116 North Street, Downend, Bristol, B16 5SE
    Tel: 0272 571404.
   In the USA contact:
    Kay Elemetrics Corp,
    12 Maple Avenue, PO Box 2025, Pine Brook, NJ 07058-9798
    Tel:(201) 227-7760
    
    MacSpeech Lab II (MSL II)
     * Platform: Macintosh
     * Description: A sound analysis and acquisition package for Macs. MSL II
       delivers the most common functions for speech analysis (FFTs,
       LPCs, f0 extraction, etc.) & produces grayscale spectrographic
       displays. Can be used for various speech technology and phonetic
       training tasks. The software can trade off accuracy and speed.
     * Hardware: Requires MacADIOS ("Macintosh Analog/Digital
       Input/Output System") hardware for speech I/O at 12/16 bits.
     * Misc: Software no longer updated by GW Instruments; MSL
       soft/hardware will not perform input/output on Quadras, for
       example, though analysis seems fine. Known to operate properly on
       systems as high as IIcx & IIfx.
     * Cost: $4990 (in May '92 price list; no MSL soft/hardware package
       listed in January '93).
     * Contact:
    GW Instruments
    35 Medford Street, Somerville, MA 02143
    Phone: (617) 625-4096 Fax: (617) 625-1322
    
    N!Power
     * Platform: SUN, DEC and HP workstations.
     * Description: An object-oriented software package with a MOTIF
       GUI interface and a range of functionality for data
       analysis/editing, signal analysis, speech processing, real-time
       A/D and D/A, and 2D/3D interactive graphics. N!Power replaces ILS.
       
       N!Power can provide a Block Diagram user interface, menus,
       pop-ups, and a high-level IEEE standard symbolic scripting
       language. You can customize the blocks, menus and pop-ups with
       mouse point-and-click operations.
     * Contact:
    Signal Technology, Inc.
    104 W. Anapamu, Suite J, Santa Barbara, CA 93101-3126
    Phone: 805-899-8300 FAX: 805-899-4344
    email: larry@signal.com
    
    Ptolemy
     * Platform: Sun SPARC, DecStation (MIPS), HP (hppa).
     * Description: Ptolemy provides a highly flexible foundation for
       the specification, simulation, and rapid prototyping of systems.
       It is an object oriented framework within which diverse models of
       computation can co-exist and interact. Ptolemy can be used to
       model entire systems.
       
       Ptolemy has been used for a broad range of applications including
       signal processing, telecommunications, parallel processing,
       wireless communications, network design, radio astronomy, real
       time systems, and hardware/software co-design. Ptolemy has also
       been used as a lab for signal processing and communications
       courses. Ptolemy has been developed at UC Berkeley over the past 3
       years. Further information, including papers and the complete
       release notes, is available from the FTP site.
     * Cost: Free
     * Availability: The source code, binaries, and documentation are
       available by anonymous ftp from
          + ftp://ptolemy.berkeley.edu/pub/README
            
    Khoros
     * Description: Public domain image processing package with a basic
       DSP library. Not particularly applicable to speech, but not bad
       for the price.
     * Cost: Free
     * Availability: By anonymous ftp from ftp://pprg.eece.unm.edu
       
    SpeechViewer II
     * Description: Speech Therapy Tool. See the detailed description
       in the handicap section - Q1.6.
       
     _________________________________________________________________
   
  Q1.10: MISCELLANEOUS SOFTWARE AND OTHER RESOURCES.
  
    CMU dictionary
     * Description: Phonemic transcriptions of 100,000 words with
       American English pronunciation.
     * Availability: By anonymous ftp from the directory
          + ftp://ftp.cs.cmu.edu/project/fgdata/dict
   with the files README, cmudict.0.2.Z, cmulex.0.1.Z, phoneset.0.1
       
    Dictionary
     * Description: A comprehensive word list which should contain most
       common American words, abbreviations, hyphenations, and even
       incorrect spellings. The word lists were compiled from a number of
       sources: commercial news services, UseNet news postings, existing
       dictionaries, name lists, company lists, UNIX man pages, project
       Gutenberg's E-texts, project Wordnet, received mailings, etc. The
       current size is 460,000 words.
     * Availability: By anonymous ftp from
          + ftp://wocket.vantage.gte.com:/pub/standard_dictionary
   
       Note 1: There seems to be some sort of network problem reaching
       the server.
       Note 2: There is a README file which explains the file formats.
       
    BEEP dictionary
     * Description: Phonemic transcriptions of 100,000 English words.
       (British English pronunciations)
     * Availability: By anonymous ftp from the file
          + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/data/beep-0.3.tar.Z
            
    CUVOLAD dictionary
     * Description: Computer Usable Version of the Oxford Advanced
       Learner's Dictionary. Has British English pronunciations and parts
       of speech.
     * Availability: By anonymous ftp from the directory
          + ftp://black.ox.ac.uk/ota/dicts/710
            
    MRC database
     * Description: The Medical Research Council Psycholinguistic
       Database. Has British English pronunciations, parts of speech, word
       frequency and lots of other information.
     * Availability: By anonymous ftp from the directory
          + ftp://black.ox.ac.uk/ota/dicts/1054
            
    Network Audio System Release 1.1
     * Platforms: Various (includes SunOS, Solaris, SGI)
     * Description: A device-independent mechanism for transferring,
       playing and recording audio signals over a network. Has a range of
       features suited to networks.
     * Cost: Free
     * Availability: By anonymous ftp from
          + ftp://ftp.x.org:/contrib/audio/nas/netaudio-1.2.tar.gz
   Also available in the same directory are document files and some
       sample sounds.
       
    AF version AF3R1
     * Platforms: DEC workstations (Alpha and MIPS), SparcStation, SGI
     * Description: The AF System is a device-independent
       network-transparent system including client applications and audio
       servers. With AF, multiple audio applications can run
       simultaneously, sharing access to the actual audio hardware.
       
       The AF3R1 distribution of AF includes server support for Digital
       RISC systems running Ultrix, Digital Alpha AXP systems running
       OSF/1, SGI Indigo running IRIX 4.0.5, Sun Microsystems
       SPARCstations running SunOS 4.1.3, and Sun Microsystems
       SPARCstations running Solaris 2.3. The servers support audio
       hardware ranging from the built-in CODEC audio on SPARCstations
       and Personal DECstations to 48 KHz stereo audio using the DECaudio
       TURBOchannel module or the SPARCstation DBRI interface
     * Availability: The source kit is distributed by anonymous ftp
       from
          + ftp://crl.dec.com/pub/DEC/AF
     * Contact: af-request@crl.dec.com
          + http://www.research.digital.com/CRL/projects/AF/home.html
            
    NEVOT (1.4v) from AT&T BL
     * Platforms: Sun Sparc Station (SunOS 4.1.x) and Silicon Graphics
     * Description: Audio-conferencing tool which supports both
       point-to-point and broadcasting of audio using multicast IP. Audio
       encoding:
          + PCM 64kb/s 8-bits u-law encoded 8KHz PCM (G.711)
          + ADPCM 32 kb/s [Sun only] (G.721)
          + DVI ADPCM 32 kb/s
          + ADPCM 24 kb/s [Sun only] (G.723)
          + CELP 4.8 kb/s
          + LPC 2.4 kb/s
   Source is available.
     * Availability: by anonymous ftp from
          + ftp://gaia.cs.umass.edu/pub/hgschulz/nevot
     * Contact: Henning Schulzrinne (hgs@research.att.com)
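     * Example: The bit rates in the encoding list follow from sample
       rate times bits per sample, and the u-law PCM mode relies on
       logarithmic companding. The Python sketch below illustrates
       both. It is not NEVOT code: the function names are made up for
       the example, the ADPCM modes are assumed to run at the same 8 kHz
       sample rate as the PCM mode, and the companding functions
       implement the continuous u-law curve (mu = 255), not the
       bit-exact segmented encoder specified in G.711.

```python
import math

MU = 255  # companding constant of the u-law curve


def bit_rate_kbps(sample_rate_hz, bits_per_sample):
    """Bit rate of a fixed-rate waveform coder, in kb/s."""
    return sample_rate_hz * bits_per_sample / 1000


def mulaw_compress(x, mu=MU):
    """Compress a sample in [-1, 1] with the continuous u-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


# 8 kHz sampling at 8, 4 and 3 bits per sample gives 64, 32 and 24 kb/s.
for bits in (8, 4, 3):
    print(bits, "bits/sample:", bit_rate_kbps(8000, bits), "kb/s")
```

       The compression step gives quiet samples a larger share of the
       8-bit code space than loud ones, which is why 8-bit u-law speech
       sounds better than 8-bit linear PCM at the same bit rate.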
       
    Human Audio Perception Document
     * Description: Document prepared by Argiris Kranidiotis on the
       human audio perception system. It lists a number of references,
       gives plenty of numbers and some equations.
     * Availability: by anonymous ftp from the comp.speech archive site
          + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/info/HumanAudioPerception
     * Contact:
    Argiris A. Kranidiotis
    University Of Athens, Informatics Department
    email: akra@zeus.di.uoa.ariadne-t.gr
    
    Homophone List
     * A list of homophones in General American English is available by
       anonymous FTP from the comp.speech archive site:
          + ftp://svr-ftp.eng.cam.ac.uk/comp.speech/data/homophones-1.01.txt
            
    Auditory Toolbox for Matlab
     * Description: This toolbox provides extensions to Matlab which
       are useful to people interested in auditory/cochlear modeling.
       [Matlab is described in the previous section.] This toolbox has
       been tested on both Macintosh and Unix computers. It includes the
       following major models:
          + Lyon's Passive Long Wave Cochlear Model (our conventional
            model)
          + Patterson-Holdsworth ERB Filter bank with Meddis Hair cell
          + Seneff's Auditory Model (Stages I and II)
          + MFCC (Mel-scale frequency cepstral coefficients from the ASR
            world)
          + Spectrogram
          + Correlogram generation and pitch modeling
          + Simple vowel synthesis
     * Availability: By anonymous FTP from the following site:
          + ftp://ftp.apple.com/pub/malcolm
   The following files are available:
          + 419487 AuditoryToolbox.mif.Z
          + 1372976 AuditoryToolbox.psc.Z
          + 573215 AuditoryToolbox.sea.hqx
          + 92160 AuditoryToolbox.tar
          + 36405 AuditoryToolbox.tar.Z
   The ".mif.Z" file is a Unix compressed version of the FrameMaker
       documentation. The ".psc.Z" file is a Unix compressed version of
       the Postscript documentation. The ".tar" and ".tar.Z" files are
       Unix TAR archives containing all of the m-functions and C-MEX
       source code. Finally, the ".sea.hqx" file is a Macintosh
       self-extracting archive that has been encoded using BinHex. We do
       provide precompiled versions of the three MEX functions for the
       Macintosh.
     * Misc: Our lawyers ask us to remind you that there is no
       warranty. We've done some testing but we undoubtedly missed
       things.
     * Contact:
    Malcolm Slaney: Interval Research.
    Email: malcolm@interval.com
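     * Example: For readers unfamiliar with the mel scale behind the
       MFCC entry above, the Python sketch below shows one widely used
       mapping between Hz and mel and how filter band edges can be
       spaced evenly on it. This is an assumption-laden illustration,
       not code from the Auditory Toolbox: the 2595*log10(1 + f/700)
       formula is only one common textbook variant (the toolbox may
       space its filters differently), and the function names are
       invented for the example.

```python
import math


def hz_to_mel(f_hz):
    """Map frequency in Hz to mel (one common textbook formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_low_hz, f_high_hz, n_bands):
    """Edge frequencies (Hz) for n_bands overlapping triangular filters.

    The n_bands + 2 edges are equally spaced in mel, so the bands get
    wider in Hz as frequency increases.
    """
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Ten bands covering 0-4000 Hz, a typical range for 8 kHz speech.
for edge in mel_band_edges(0.0, 4000.0, 10):
    print(round(edge, 1))
```

       An MFCC front end would sum the power spectrum under each
       triangular band, take logarithms, and apply a cosine transform
       to obtain the cepstral coefficients.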
    
    Auditory Modeller 1
     * Description: John Holdsworth's implementation of a gammatone
       filter bank and Roy Patterson's spiral model, in C (with X-window
       display).
     * Availability: By anonymous ftp from
          + ftp://ftp.mrc-apu.cam.ac.uk/pub/aim
            
    Auditory Modeller 2
     * Description: Lowel O'Mard's implementation of peripheral
       filtering, Ray Meddis's hair cell model and other stuff in C (as a
       library of routines).
     * Availability: By anonymous ftp from
          + ftp://suna.lut.ac.uk/public/hulpo/lutear
            
     _________________________________________________________________




Andrew Hunt
  ---
Speech Technology Research Group		Ph:  61-2-351 4509
Dept. of Electrical Engineering			Fax: 61-2-351 3847
University of Sydney, NSW, 2006, Australia	email: andrewh@speech.su.oz.au