		      Corpora Available through
		    The Linguistic Data Consortium
				   
			       May 1994
				   
			       Contents

I.    The DARPA Resource Management Corpora (RM1 and RM2) Series

      Speaker-Dependent Training Data, September 1989

      Speaker-Independent Training Data, November 1989

      Development Test and Evaluation Test Data and Scoring and Speech
           Header Software, January 1990

      Extended Resource Management Continuous Speech Speaker-Dependent
           Corpus (RM2), September 1990

II.   DARPA Acoustic-Phonetic Continuous Speech Corpus (TIMIT),
           October 1990

III.  Texas Instruments-Developed Studio Quality Speaker-Independent
           Connected-Digit Corpus (TIDIGITS), February 1991

IV.   Air Travel Information System (ATIS0) Corpus

      Spontaneous Speech Pilot Corpus and Relational Database,
           November, 1990

      Read Versions of Spontaneous Data and
           Adaptation Data, November, 1990

      Speaker-Dependent Training Data,
           December, 1990 and September 1991

      Multi-Site Air Travel Information System (ATIS2) Corpora
           October 1993

      Multi-Site Air Travel Information System (ATIS3) Corpora
           Spring 1994

V.    ARPA Continuous Speech Recognition Pilot Corpus (WSJ0),
           August 1993

      ARPA Continuous Speech Recognition Corpus (WSJ1),
           Spring 1994

VI.    Road Rally Conversational Speech Corpora (RDRALLY1),
           September 1991

VII.   Texas Instruments-Developed 46-Word Speaker-Dependent Isolated Word
           Corpus (TI46)

VIII.  Switchboard Corpus Excerpts, Credit Card Conversations

IX.    Switchboard Corpus, Recorded Telephone Conversations

X.     NTIMIT Telephone Network Acoustic-Phonetic Continuous
           Speech Corpus

XI.    ACL/DCI

XII.   TIPSTER Information Retrieval

XIII.  Penn Treebank

XIV.   HCRC Map Task Corpus

XV.    CELEX Lexical Database

XVI.   United Nations Parallel Text Corpus

XVII.  OGI Spelled and Spoken Telephone Corpus

XVIII. OGI Multi-Language Corpus

XIX.   SPIDRE Speaker Identification Corpus

XX.    YOHO Database

XXI.   Future Plans


I.    The DARPA Resource Management Corpora (RM1 and RM2) Series

The DARPA Resource Management Continuous Speech Corpora consist of
digitized and transcribed speech for use in designing and evaluating
continuous speech recognition systems.  Speaker-dependent,
speaker-adaptive and speaker-independent recognition modes are
accommodated.  The corpus consists of read sentences from a
(nominally) 1000-word language model of a naval resource management
task.  The complete corpus consists of over 25,000 utterances from
more than 160 speakers representing a variety of American dialects.
The material was recorded using a Sennheiser HMD-414 headset
microphone at 20kHz with 16-bit quantization and then downsampled to
16kHz.  It is divided into sections for speech recognition system
training, development testing, and evaluation.

All RM sentences are consistent with a limited language model that
allows queries about ships, ports, etc., along with commands to
control a graphics display system, but little else.  There is no
"official" language model, but a simple non-probabilistic word-pair
grammar that provides complete coverage of the sentences in this
corpus is provided.
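
A non-probabilistic word-pair grammar of the kind described above can
be thought of as a list of which words are allowed to follow which.
The following is a minimal sketch in Python, not the BBN grammar
itself: the sentence-boundary markers "<s>" and "</s>" and the example
sentences are assumptions made for illustration and are not taken from
the corpus.

```python
def build_word_pair_grammar(sentences):
    """Collect every (previous word, next word) pair seen in the sentences."""
    pairs = set()
    for sentence in sentences:
        words = ["<s>"] + sentence.upper().split() + ["</s>"]
        pairs.update(zip(words, words[1:]))
    return pairs


def accepts(pairs, sentence):
    """True if every adjacent word pair in the sentence is allowed."""
    words = ["<s>"] + sentence.upper().split() + ["</s>"]
    return all(pair in pairs for pair in zip(words, words[1:]))


# Hypothetical sentences in the spirit of the task, not corpus sentences.
training = [
    "show the ships in port",
    "list the ports",
    "show the ports",
]
grammar = build_word_pair_grammar(training)

print(accepts(grammar, "list the ships in port"))  # allowed pairs only
print(accepts(grammar, "ships the show"))          # starts with a disallowed pair
```

Because the grammar is built from the sentences themselves, it covers
all of them by construction, but it also accepts new word sequences
(such as the first test sentence) that were never read.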

The Resource Management text corpus was designed at BBN Laboratories,
Inc. and SRI International.  BBN also developed and made available the
"Word-Pair" grammar that has been used in the benchmark tests.  Texas
Instruments, Inc. recruited the subjects and recorded and digitized
the speech.  For more information about the design and collection of
this corpus see: P. Price, W.M. Fisher, J. Bernstein and D.S. Pallett,
"The DARPA 1000-Word Resource Management Database for Continuous
Speech Recognition", Proceedings of the 1988 International Conference
on Acoustics, Speech and Signal Processing (Paper S.13.21,
pp. 651-654).

A series of benchmark speech recognition performance assessment tests
was conducted beginning in March 1987 using this corpus in
conjunction with standardized scoring software.  For more information
see D.S. Pallett, "Benchmark Tests for DARPA Resource Management
Database Performance Evaluations", in Proceedings of the 1989
International Conference on Acoustics, Speech and Signal Processing
(Paper S10.b.6, pp. 536-539) and related papers in the Proceedings of
the February 1989, October 1989, June 1990, and February 1991 DARPA
Speech and Natural Language Workshops [published by Morgan Kaufmann
Publishers, Inc., 2929 Campus Drive, San Mateo, CA 94403].

The first portion of the Resource Management Corpus (RM1) comprises
four CD-ROM discs conforming to the ISO-9660 data format, in three
releases.

The fourth release, a speaker-dependent extension of the Resource
Management Corpus (RM2), with four speakers each reading a total of
2400 sentence utterances for extended speaker-dependent system
training, has been added to the series and comprises two CD-ROM discs.


Ia.   DARPA Resource Management Continuous Speech Database (RM1) 
      Speaker-Dependent Training Data, September 1989
      NIST Speech Discs 2-1.1, 2-2.1 (2 discs)

A two-CD-ROM set containing the Speaker-Dependent Training Data: 12
subjects, each reading a set of 600 "training sentences", 2 "dialect"
sentences, and 10 "rapid adaptation" sentences, for a total of 7344
recorded sentence utterances.  The 600 sentences designated as
"training sentences" were selected to provide good coverage of the
lexicon, covering 97% of the lexical items in the corpus.  The 12
speaker-dependent subjects (7 male and 5 female) were chosen to
represent each of the 12 largest phonetic clusters and were relatively
fluent readers with no obvious speech problems.


Ib.   DARPA Resource Management Continuous Speech Database (RM1) 
      Speaker-Independent Training Data, November 1989
      NIST Speech Disc 2-3.1 (1 disc)

A single CD-ROM disc containing the Speaker-Independent Training Data:
80 speakers each read the 2 "dialect" sentences plus 40 sentences from
the Resource Management text corpus, for a total of 3360 recorded
sentence utterances.  Any given sentence from a set of 1600 Resource
Management sentence texts was recorded by two subjects, while no
sentence was read twice by the same subject.
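
The counts quoted above can be cross-checked with a few lines of
arithmetic.  This is only a consistency check on the figures given in
this description; the variable names are illustrative.

```python
speakers = 80
dialect_per_speaker = 2    # "dialect" sentences read by every speaker
rm_per_speaker = 40        # Resource Management sentences per speaker
sentence_texts = 1600      # distinct Resource Management sentence texts

total_utterances = speakers * (dialect_per_speaker + rm_per_speaker)
rm_readings = speakers * rm_per_speaker
readings_per_text = rm_readings / sentence_texts

print(total_utterances)    # 3360
print(readings_per_text)   # 2.0
```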

Ic.   DARPA Resource Management Continuous Speech Database 
      Development Test and Evaluation Test Data and Scoring and Speech
      Header Software, January 1990
      NIST Speech Disc 2-4.1 (1 disc)

A single CD-ROM disc containing all speaker-dependent and
speaker-independent system test material used in 5 DARPA benchmark
tests conducted in March and October of 1987, June 1988, and February
and October 1989, along with scoring and diagnostic software and
documentation for those tests.  Documentation is also provided
outlining other use of the Resource Management training and test
material at CMU in development of the SPHINX speaker-independent
system.  Example output and scored results for state-of-the-art
speaker-dependent and speaker-independent systems (i.e., the BBN
BYBLOS and CMU SPHINX systems) for the October 1989 benchmark tests
are included.

SPeech HEader REsources (SPHERE), the NIST-developed library of
software to manipulate the speech file header structure for
NIST-produced speech corpora on CD-ROM, is included.  In cooperation
with the European Multi-lingual Speech input/output Assessment
Methodology and Standardization Project (SAM), software is provided to
permit conversion of the speech data on this series of CD-ROMs into
the corresponding "associated file" format used within the SAM
community.
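
For readers without the SPHERE library at hand, the header can also be
inspected directly.  The sketch below is not part of SPHERE; it assumes
the commonly documented header layout (an ASCII block that starts with
the line "NIST_1A", then a line giving the header size in bytes, then
"name -type value" lines terminated by "end_head").  The example header
bytes are made up for illustration.

```python
def parse_sphere_header(data):
    """Parse the ASCII header at the start of a SPHERE file.

    Returns (header_size_in_bytes, {field_name: value}).
    """
    lines = data.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("missing NIST_1A marker")
    header_size = int(lines[1].strip())
    fields = {}
    for line in lines[2:]:
        parts = line.strip().split(None, 2)
        if parts and parts[0] == "end_head":
            break
        if len(parts) != 3:
            continue  # skip blank or unrecognized lines
        name, type_code, value = parts
        if type_code == "-i":       # integer field
            fields[name] = int(value)
        elif type_code == "-r":     # real-valued field
            fields[name] = float(value)
        else:                       # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Made-up header for illustration; real headers are padded to header_size.
example = (
    b"NIST_1A\n   1024\n"
    b"database_id -s3 RM1\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"sample_n_bytes -i 2\n"
    b"end_head\n"
)
size, fields = parse_sphere_header(example)
print(size, fields["sample_rate"], fields["database_id"])
```

With a real file, the waveform samples begin at the byte offset given
by the returned header size.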

Id.   DARPA Extended Resource Management Continuous Speech Speaker-
      Dependent Corpus (RM2), September 1990
      NIST Speech Discs 3-1.2, 3-2.2 (2 discs)

This 2-disc set forms a speaker-dependent extension to the Resource
Management (RM1) corpus.  The corpus consists of a total of 10,608
sentence utterances (2 male and 2 female speakers each speaking 2,652
sentence texts).  These include the 600 "standard" Resource Management
speaker-dependent training sentences, 2 dialect calibration sentences,
10 rapid adaptation sentences, 1800 newly-generated extended training
sentences, 120 newly-generated development-test sentences, and 120
newly-generated evaluation-test sentences.  The evaluation-test
material on the discs was used as the test set for the June 1990 DARPA
SLS Resource Management Benchmark Tests (see the Proceedings.)

The RM2 corpus was recorded at Texas Instruments.  The NIST speech
recognition scoring software originally distributed on the RM1 "Test"
Disc (CD2-4.1) was adapted for the RM2 sentences used for the June
1990 tests, and is included on these discs as well as the SPHERE
speech file header manipulation software.


II.   DARPA Acoustic-Phonetic Continuous Speech Corpus (TIMIT), October
      1990
      NIST Speech Disc 1-1.1

The TIMIT corpus of read speech has been designed to provide speech
data for the acquisition of acoustic-phonetic knowledge and for the
development and evaluation of automatic speech recognition systems.
TIMIT contains speech from 630 speakers from 8 major dialects of
American English, each speaking 10 phonetically rich sentences.  The
TIMIT corpus includes time-aligned orthographic, phonetic, and word
transcriptions as well as speech waveform data for each utterance.
Text corpus design was a joint effort among the Massachusetts
Institute of Technology (MIT), SRI International (SRI), and Texas
Instruments, Inc. (TI).  The speech was: recorded at TI under
conditions similar to Resource Management (i.e., use of a Sennheiser
head-mounted microphone in a quiet environment, digitizing the speech
at a 20kHz sampling rate and downsampling to 16kHz for distribution);
transcribed at MIT; and verified and prepared for CD-ROM production by
the National Institute of Standards and Technology (NIST).

The TIMIT corpus transcriptions have been hand verified.  Test and
training subsets, balanced for phonetic and dialectal coverage, have
been selected and specified.  Tabular computer-searchable information
is included as well as written documentation.


III.  Texas Instruments-Developed Studio Quality Speaker-Independent
      Connected-Digit Corpus (TIDIGITS), February 1991
      NIST Speech Discs 4-1.1, 4-2.1, 4-3.1 (3 discs)

This three-disc set of CD-ROMs contains a corpus of speech which was
originally designed and collected at Texas Instruments, Inc. (TI) for
the purpose of "designing and evaluating algorithms for
speaker-independent recognition of connected digit sequences."  The
corpus contains read utterances from 326 speakers (111 men, 114 women,
50 boys, and 51 girls) each speaking 77 digit sequences, with each of
the speaker groups partitioned into test and training subsets.

The corpus was collected at Texas Instruments in 1982 in a quiet
acoustic enclosure using an Electro-Voice RE-16 Dynamic Cardioid
microphone, digitized at 20kHz.  The waveform files are in the NIST
SPHERE format.


IV.   Air Travel Information System (ATIS0) Corpus

During 1989 and 1990, the DARPA Spoken Language Program initiated
plans for development of a "common corpus" for both speech recognition
and natural language research, using "spontaneous goal-directed"
speech, rather than "read speech."  The common task domain that was
chosen is termed the "Air Travel Information System" (ATIS).

Users make spoken inquiries to simulated (or prototypical) ATIS speech
understanding systems to obtain air travel information.  An ATIS
system contains a "standardized" relational database derived from the
Official Airline Guide.  The initial ATIS relational database
contains information relevant to travel among 9 major airports serving
11 cities (Dallas and Fort Worth each being served by DFW, and
Baltimore and Washington, DC by BWI).

Answers to inquiries, expressed in a "canonical answer
specification" (CAS) language, are compared with "canonical" answers
to measure performance.

During 1990, Texas Instruments developed a pilot corpus in the ATIS
domain (ATIS0), using a "Wizard-of-Oz" (WOZ) or "PNAMBIC" ("Pay No
Attention to the Man BehInd the Curtain") system to simulate an ATIS
SLS. (See Hemphill, Godfrey and Doddington's paper "The ATIS Spoken
Language Systems Pilot Corpus" in the Proceedings of the June 1990
DARPA Speech and Natural Language Workshop.)  There are a number of
auxiliary files associated with each utterance in the ATIS corpora,
including an orthographic transcription and, for answerable queries, a
"reference answer".  The ATIS0 Corpus is now available on a total of 6
CD-ROMs in 3 releases.

Further information on the ATIS domain, on the test paradigm, and on
ATIS-domain benchmark tests can be found in the Proceedings of the
DARPA Speech and Natural Language Workshops held in October 1989, June
1990 and February 1991. (Morgan Kaufmann Publishers, Inc., 2929 Campus
Drive, San Mateo, CA 94403.  ISBN numbers: 1-55860-112-0,
1-55860-157-0, and 1-55860-207-0.)


IVa.  DARPA Air Travel Information System (ATIS0)
      Spontaneous Speech Pilot Corpus and Relational Database, 
      November, 1990
      NIST Speech Disc 5-1.1 (1 disc)

The first ATIS0 release contains spontaneous utterances elicited in a
"Wizard-of-Oz" simulation of a spoken language system capable of
providing air travel information derived from the initial simplified
version of the Official Airline Guide, along with the relational
database containing the travel information (excluding connecting
flights). Thirty-six speakers participated in the ATIS0 data
collection effort, yielding a total of 912 utterances.  Waveform data
(at 16kHz sample rate, 16 bit quantization) are provided for both the
close-talking (Sennheiser) and desk-top (Crown PCC-160) microphones.

IVb.  DARPA Air Travel Information System (ATIS0)
      Read Versions of Spontaneous Data and 
      Adaptation Data, November, 1990
      NIST Speech Disc 5-2.1 (1 disc)

The second ATIS0 release contains "read" versions of the spontaneous
utterances for 20 of the speakers included in the first ATIS0 release
(NIST Speech Disc 5-1.1).  There are a total of 478 "read" productions
of the spontaneous utterances.  A set of 40 read "adaptation"
sentences is also included for each of the 20 speakers.  Waveform data
(at 16kHz sample rate, 16 bit quantization) are provided for both the
close-talking (Sennheiser) and desk-top (Crown PCC-160) microphones.
 
IVc.  DARPA Air Travel Information System (ATIS0)
      Speaker-Dependent Training Data, 
      December, 1990 and September 1991
      NIST Speech Discs 5-3.1, 5-4.1, 5-5.1 and 5-6.1 (4 discs)

The third ATIS0 release contains "read" speech in the ATIS domain for
10 of the speakers included in the first release (NIST Speech Disc
5-1.1).  The ten speakers read a total of 3171 utterances, or
approximately 317 utterances per speaker.  It was collected for the
purpose of training speaker-dependent speech recognition systems for
the ATIS0 domain.  Two of the four discs in this release contain data
for the close-talking (Sennheiser) microphone, and the other two
contain corresponding data for the desk-top (Crown PCC-160)
microphone. The total number of utterance waveform files (for both
microphones) on the four discs is 6342.

IVd.  DARPA Multi-Site Air Travel Information System (ATIS2)
      October 1993
      NIST Speech Discs 12-1.1 to 12-4.1 (4 discs)

Early in 1991, the DARPA SLS research community's rate of collection
of ATIS-domain data was accelerated by pooling data collected at five
sites: AT&T, BBN, CMU, MIT's Laboratory for Computer Science, and SRI.
The resultant ATIS2 corpus contains approximately 15,000 utterances
from approximately 450 subjects.  All of the utterances have been
transcribed and almost 10,000 of the utterances have been annotated
with categorizations and canonical reference answers.  Unlike the
ATIS0 corpus, much of the data in ATIS2 was collected using partially
or fully-automated data collection systems.  The fully-automated data
collection systems were, in fact, working ATIS prototypes.

For ATIS2, the 10-city relational database that was used as the
knowledge base for the Pilot (ATIS0) was revised to accommodate
connecting flights and fares and some table headings were renamed.

In addition to training data, the February and November '92 ATIS
Benchmark Tests are included.  Each contains approximately
1,000 utterances from the pool of data collected by the five sites.

IVe.   ARPA Multi-Site Air Travel Information System (ATIS3)
       Spring 1994
       NIST Speech Discs 17-1.1 - 17-3.1 (3 discs)

In 1992 the relational database included flight information between 46
cities/52 airports while the schema remained identical to that used in
the ATIS2 10-city database. A new multi-site data collection cycle
ensued early in 1993 and data was collected at BBN, CMU, MIT, SRI, and
NIST.  The NIST data was collected using systems provided by BBN and
SRI.

The ATIS3 training corpus includes over 774 scenarios completed by
137 subjects yielding a total of over 7,300 utterances.  All
utterances have been transcribed and 2,900 of them have been
categorized and annotated with canonical reference answers.

Two 1000-utterance test sets were drawn from the data pooled by the
collection sites.  The first test set was used in the December '93
ARPA ATIS benchmark tests; the second has been reserved for the
December '94 benchmark tests.  The first test set accompanies this
release.


Va.  ARPA Continuous Speech Recognition Pilot Corpus (WSJ0)
     August 1993
     NIST Speech Discs 11-1.1 - 11-12.1 (12 discs)

During 1991, the DARPA Spoken Language Program initiated efforts
to build a new corpus to support research on large-vocabulary
Continuous Speech Recognition (CSR) systems.

The initial portion of the CSR Corpora consists primarily of read
speech with texts drawn from a machine-readable corpus of Wall Street
Journal (WSJ) text.  Some spontaneous dictation is included in this
pilot corpus, in addition to the read speech.  The read portion of the
data was collected using 5,000-word and 20,000-word subsets of the WSJ
text corpus.  The spontaneous dictation portion of the corpus was
collected using journalists who dictated hypothetical news articles.

Data collection at MIT's Laboratory for Computer Science, SRI
International and Texas Instruments yielded approximately 40 hours of
speech and over 31,000 utterances.  Two microphones were used (a
close-talking Sennheiser microphone, and secondary microphones of
varying types), for a total of approximately 80 hours of speech.

Vb.  ARPA Continuous Speech Recognition Corpus (WSJ1)
     Spring 1994
     NIST Speech Discs  13-1.1 - 13-34.1 (34 discs)

The complete WSJ1 corpus contains approximately 78,000 training
utterances (~73 hours of speech), 4,000 of which are the result of
spontaneous dictation by journalists with varying degrees of
experience in dictation.  The corpus contains approximately 8,200
"conventional" development test utterances (~8 hours of speech), 6,800
of which are from spontaneous dictation.  As with the pilot corpus,
the entire corpus was collected using 2 microphones, so the amount of
speech in the entire corpus is about 162 hours.
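
The 162-hour figure follows from the approximate durations quoted
above, as this short check shows (variable names are illustrative):

```python
training_hours = 73    # approximate, training utterances
dev_test_hours = 8     # approximate, development test utterances
microphones = 2        # every utterance was recorded on two microphones

total_hours = (training_hours + dev_test_hours) * microphones
print(total_hours)     # 162
```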

In early 1993, a "Hub and Spoke" test paradigm was designed, calling
for eleven test sets, each a specific variation of the basic or
"hub" condition.  The eleven Hub and Spoke Development and
Evaluation Test sets each contain approximately 7500 waveforms (~11
hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the
SPHERE-embedded "Shorten" compression algorithm developed at
Cambridge University.


VI.   Road Rally Conversational Speech Corpora (RDRALLY1)
      September 1991
      NIST Speech Disc 6-1.1 (1 disc)

The "Road Rally" corpora were designed for the development and testing
of word-spotting systems.  The corpora consist of two sub-corpora: (1)
the "Stonehenge" corpus and (2) the "Waterloo" corpus.  Stonehenge
was collected using telephone handsets modified to contain a high
quality microphone.  To gather conversational data, two talkers were
located in separate rooms, given a road map, and asked to participate
in a road rally planning task.  The digitized speech was filtered
using a 300 Hz to 3300 Hz PCM FIR bandpass filter to simulate telephone
quality.
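
To give a feel for what such a telephone-band filter does, the sketch
below designs a generic windowed-sinc FIR bandpass filter for the
300 Hz to 3300 Hz band and checks its gain at a few frequencies.  It
is not the filter used to produce this release: the 8 kHz sample rate,
the 101-tap length, and the Hamming window are all assumptions chosen
for illustration.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def bandpass_fir(low_hz, high_hz, sample_rate_hz, num_taps=101):
    """Windowed-sinc bandpass design (Hamming window, odd tap count)."""
    f1 = low_hz / sample_rate_hz     # band edges in cycles per sample
    f2 = high_hz / sample_rate_hz
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        # Ideal bandpass = low-pass at f2 minus low-pass at f1.
        ideal = 2 * f2 * sinc(2 * f2 * k) - 2 * f1 * sinc(2 * f1 * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def gain_at(taps, freq_hz, sample_rate_hz):
    """Magnitude of the filter's frequency response at one frequency."""
    w = 2 * math.pi * freq_hz / sample_rate_hz
    re = sum(t * math.cos(w * n) for n, t in enumerate(taps))
    im = sum(t * math.sin(w * n) for n, t in enumerate(taps))
    return math.hypot(re, im)


SAMPLE_RATE = 8000  # assumed; not stated in this description
taps = bandpass_fir(300, 3300, SAMPLE_RATE)

for freq in (50, 1000, 3900):
    print(freq, round(gain_at(taps, freq, SAMPLE_RATE), 3))
```

Frequencies well inside the band pass with a gain close to 1, while
those well below 300 Hz or above 3300 Hz are strongly attenuated.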

Twenty words were selected as keywords, and text files ("Key Word
Marking Files") were developed which mark key word occurrences and
locations.

The Stonehenge corpus contains 3 "styles" of speech data: (1) the
spontaneous conversations, (2) a read paragraph, containing at least
one occurrence of each of the key words, and (3) a set of read
"carrier" sentences. There are 80 speakers (52 males and 28 females).

The Waterloo corpus was collected as an extension of Stonehenge,
providing similar domain material, but collected under different
conditions, and is intended for use in training models of keywords in
the conversational portion of the Stonehenge corpus.  The Waterloo
material was collected from 56 speakers (28M, 28F) using conventional
telephone handsets and dialed-up telephone lines in the Massachusetts
area, and consists of a read passage, only.  (The "read" passage for
Waterloo is not the same as that in Stonehenge.)  For this release,
the naturally band-limited telephone handset and line speech data were
subsequently filtered with the same 300 Hz to 3300 Hz PCM FIR bandpass
filter that was used for this release's Stonehenge data.

Suggested wordspotting training and test procedures are outlined for
use with these corpora.


VII.   Texas Instruments-Developed 46-Word Speaker-Dependent Isolated Word
      Corpus (TI46), September 1991
      NIST Speech Disc 7-1.1 (1 disc)

This CD-ROM contains a corpus of speech which was originally designed
and collected at Texas Instruments, Inc. (TI) in 1980, and used
initially in performance assessment tests of isolated-word
speaker-dependent technology. (See "Speech Recognition: Turning Theory
to Practice" by G. R. Doddington and T. B. Schalk, in IEEE Spectrum,
Vol. 18, No. 9, September 1981.)

The 46-word vocabulary consists of two sub-vocabularies: (1) the TI
20-word vocabulary (consisting of the digits zero through nine plus
the words "enter", "erase", "go", "help", "no", "rubout", "repeat",
"stop", "start", and "yes"), and (2) the TI 26-word "alphabet set"
(consisting of the letters "a" through "z").
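
The vocabulary described above is small enough to write out in full,
for example when building a label set for a recognizer.  The sketch
below simply assembles the two sub-vocabularies; the variable names
are illustrative and the spellings of the digit words are assumed.

```python
import string

digits = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
commands = ["enter", "erase", "go", "help", "no",
            "rubout", "repeat", "stop", "start", "yes"]

ti20 = digits + commands                    # TI 20-word vocabulary
alphabet = list(string.ascii_lowercase)     # TI 26-word "alphabet set"
ti46 = ti20 + alphabet

print(len(ti20), len(alphabet), len(ti46))  # 20 26 46
```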

The corpus contains read utterances from 16 speakers (8 males and 8
females) each speaking 26 utterances of the 46-word vocabulary: 16
tokens designated as training and 10 as testing tokens.

The corpus was collected at Texas Instruments in a quiet acoustic
enclosure using an Electro-Voice RE-16 Dynamic Cardioid microphone at
12.5kHz sample rate with 12-bit quantization.  The files are in NIST
SPHERE format, and have a ".wav" filename extension.


VIII.    Switchboard Corpus Excerpts, Credit Card Conversations,
	May, 1992
	NIST Speech Disc 8-1.2

This CD-ROM contains 35 conversations on the topic of "Credit Card
Use". Most but not all can also be found in the Switchboard Corpus
(see below).  The conversations can be used in training and testing
wordspotting systems.  In addition to 2-channel mu-law encoded audio
waveform files, the disc contains transcriptions, time-alignments, and
wordspotting targets.

The Switchboard Corpus was collected at Texas Instruments and produced
on CD-ROM at the National Institute of Standards and Technology.


IX.	Switchboard Corpus
	Recorded Telephone Conversations, October 1992
	NIST Speech Discs 9-1.1, 9-3.1 to 9-27.1

SWITCHBOARD is a collection of about 2400 two-sided telephone
conversations among 543 speakers (302 male, 241 female) from all areas
of the United States. A computer-driven "robot operator" system
handled the calls, giving the caller appropriate recorded prompts,
selecting and dialing another person to take part in a conversation,
introducing a topic for discussion, and recording the speech from the
two subjects into separate channels until the conversation was
finished.  About 70 topics were provided, of which about 50 were used
frequently.  Selection of a topic and a second participant for a
given caller was based on the following two constraints: (1) no two
speakers would converse together more than once, and (2) no one spoke
more than once on a given topic.  In other words, every conversation
represents a unique combination of two persons, and a new topic for
each.
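
The two selection constraints above are easy to state as a check over
a list of conversations.  The following sketch is not the "robot
operator" software; the (speaker, speaker, topic) record layout and
the speaker IDs and topics are hypothetical.

```python
def violations(conversations):
    """Return the conversations that break either pairing constraint.

    Each conversation is a (speaker_a, speaker_b, topic) tuple.
    """
    seen_pairs = set()            # unordered speaker pairs already used
    seen_speaker_topics = set()   # (speaker, topic) combinations already used
    bad = []
    for a, b, topic in conversations:
        pair = frozenset((a, b))
        speaker_topics = {(a, topic), (b, topic)}
        if pair in seen_pairs or speaker_topics & seen_speaker_topics:
            bad.append((a, b, topic))
        seen_pairs.add(pair)
        seen_speaker_topics |= speaker_topics
    return bad


# Hypothetical speaker IDs and topics, for illustration only.
calls = [
    ("s1", "s2", "credit cards"),
    ("s1", "s3", "gardening"),
    ("s2", "s1", "music"),        # same pair as the first call
    ("s3", "s4", "gardening"),    # s3 already spoke on gardening
]
print(violations(calls))
```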

The waveform files were recorded into two channels directly from the
T1 digital telephone circuits, at an 8kHz sample rate with 8-bit
mu-law quantization.  Complete orthographic transcriptions
were made for each conversation, with codes to identify overlapping
portions (both speakers talking at the same time), certain non-speech
events (laughter, coughs, etc.), and interruptions/hesitations.  Each
conversation was also rated by transcribers for various quality
factors (amount of cross-talk between channels, static and background
noise, topicality, etc.).  In addition, each transcription was
verified, and then used in a forced speech-recognition algorithm
to establish timing marks for word and utterance boundaries;
transcriptions are provided in the corpus in both "plain text" and
"time-aligned" forms.

The corpus is distributed in a notebook-style binder with 28 CD-ROMs
(27 containing speech data, and one containing all transcription
data).  Preparation of the data for CD-ROM production was done by
NIST.  The waveform files use the NIST SPHERE format.
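
Software that expects linear samples must first expand the 8-bit
mu-law values described above.  The sketch below shows the standard
G.711 mu-law expansion to 16-bit linear values; it is a generic
illustration rather than code distributed with the corpus, and it
decodes a plain byte string without handling the SPHERE header or the
two-channel layout.

```python
BIAS = 0x84  # bias added before compression in G.711 mu-law


def mulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed linear sample."""
    byte = ~byte & 0xFF               # mu-law codes are stored inverted
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(data):
    """Expand a bytes object of mu-law codes to a list of linear samples."""
    return [mulaw_to_linear(b) for b in data]


print(decode_mulaw(bytes([0xFF, 0x80, 0x00])))
```

The three example codes are the extremes of the scale: silence, the
largest positive value, and the largest negative value.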
 

X.	NYNEX NTIMIT Corpus
	Telephone Network Acoustic-Phonetic Continuous Speech Corpus,
	August 1992
	NIST Speech Disc 10-1.1, 10-2.1

The NYNEX Science and Technology Laboratories have produced a
"telephonized" version of the TIMIT corpus, by transmitting all 6300
TIMIT utterances through a handset and across various NYNEX telephone
channels in a controlled manner.  The data have been prepared for
CD-ROM production by NIST.  Waveform files use the NIST SPHERE format.
For more information about NTIMIT, see "NTIMIT: A Phonetically
Balanced, Continuous Speech, Telephone Bandwidth Speech Database", by
C. Jankowski, et al. in Volume 1 of Proceedings of ICASSP-90, pp.
109-112.


XI.	Association for Computational Linguistics Data Collection
	Initiative (ACL/DCI)
	September, 1991

The ACL Data Collection Initiative disk contains text from: Wall Street
Journal, copyright 1987, 1988, 1989, provided by Dow Jones, Inc.; the
Collins English Dictionary, Copyright 1979, William Collins Sons &
Co., Ltd.; scientific abstracts provided by the U.S. Department of
Energy; and a variety of grammatically tagged and parsed materials from
the Treebank project at the University of Pennsylvania, copyright
1990, 1991, University of Pennsylvania. The total amount of
uncompressed text is 620 Mbytes.

The many formats in which the originals of these texts came have all,
to one extent or another, been mapped into a markup language
consistent with the SGML standard (ISO 8879).

The format of the material from the Wall Street Journal uses a
labelled bracketing, expressed in the style of SGML, although no
formal SGML DTD is provided. The tag set has been modified by turning
the Dow Jones header categories into tags and by creating ad hoc tags
such as "<dateline>". The original datelines are presented as
separate text units; the text is divided and tagged into paragraphs
and sentences with each sentence presented on a single line. Nothing
has been done to modify the typographical methods used to subdivide
headlines and stories into sections, nor are any of the text features
within sentences (quotes, ellipsis, etc.) normalized.

The Collins English Dictionary is present in two forms. One form was
approximately parsed into fielded records as an exercise in learning a
language called "FIT", by a student working under the direction of
Lloyd Nakatani at AT&T Bell Laboratories during the summer of 1990.
The original digital image of the typographer's tape that the database
version was prepared from had serious flaws that were not detected and
corrected until later; the corrected version, a clean typographer's
tape, is presented in a separate directory. A properly-analyzed
database version will be provided in the future.  The documentation
includes notes developed during the new attempt to analyze the tape
from scratch.

The Department of Energy abstracts reside in files that are
approximately one megabyte each. The original 950 separators have
been replaced with newlines, and space padding between articles was
removed.  An acronym dictionary that was extracted from the database
as an indication of the material's topic areas has been included in a
separate directory.

Provisional material from the Penn Treebank project is divided into
two subdirectories on this disk. The subdirectory "postext" contains
text with part-of-speech annotations; "parstext" contains text with
syntactic bracketing.


XII.	TIPSTER Information Retrieval Text Research Collection

The TIPSTER project is sponsored by the Software and Intelligent
Systems Technology Office of the Advanced Research Projects Agency
(ARPA/SISTO) in an effort to significantly advance the state of the
art in effective document detection (information retrieval) and data
extraction from large, real-world data collections.

The detection data is comprised of a new test collection built at NIST
to be used both for the TIPSTER project and the related TREC project.
The TREC project has many other participating information retrieval
research groups, working on the same task as the TIPSTER groups, but
meeting once a year in a workshop to compare results (similar to MUC).
The test collection built at NIST consists of 3 disks (gigabytes) of
documents, 150 topics, and the answers (relevant documents) for those
topics.

The documents in the test collection are varied in style, size, and
subject domain.  The first disk contains material from the Wall Street
Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal
Register (1989), information from Computer Select disks (Ziff-Davis
Publishing), and short abstracts from the Department of Energy.  The
second disk contains information from the same sources, but from
different years.  The third disk contains more information from the
Computer Select disks, plus material from the San Jose Mercury News
(1991), more AP newswire (1990), and about 250 megabytes of formatted
U.S. Patents.  The format of all the documents is relatively clean and
easy to use, with SGML-like tags separating documents and document
fields.  There is no part-of-speech tagging or breakdown into
individual sentences or paragraphs as the purpose of this collection
is to test retrieval against real-world data.
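
As a rough illustration of how SGML-like tags that separate documents
and document fields can be handled, the following Python sketch splits
a text stream into documents and extracts one field from each.  The tag
names (DOC, DOCNO, TEXT) and the sample text are assumptions made for
this example only; the actual tags are defined in the documentation
that accompanies the collection.

```python
import re

# Hypothetical tag names, chosen for illustration.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def split_documents(stream_text):
    """Return the body of every <DOC>...</DOC> element, in order."""
    return DOC_RE.findall(stream_text)

def field(doc, tag):
    """Return the stripped contents of the first <tag>...</tag>, or None."""
    m = re.search(r"<%s>(.*?)</%s>" % (tag, tag), doc, re.DOTALL)
    return m.group(1).strip() if m else None

# Invented sample text, not taken from the collection.
sample = """<DOC>
<DOCNO> X-0001 </DOCNO>
<TEXT>
First document body.
</TEXT>
</DOC>
<DOC>
<DOCNO> X-0002 </DOCNO>
<TEXT>
Second document body.
</TEXT>
</DOC>"""

docs = split_documents(sample)
doc_ids = [field(d, "DOCNO") for d in docs]
```

Regular expressions are enough here only because the markup is flat; a
real SGML parser would be needed for nested or abbreviated tags.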

A preliminary version of the test collection is available from LDC,
with a final version to be ready in the fall.


XIa.	TIPSTER Information Retrieval Text Research Collection
	Vol. 1, March 1992

           /ap         Associated Press Newswire material, copyright 1989
           /fr         Federal Register material, 1989
           /wsj        Wall Street Journal, copyright 1987, 1988, 1989
           /doe        Department of Energy abstracts


XIb.	TIPSTER Information Retrieval Text Research Collection
	Vol. 2, July 1992

           /ap          Associated Press Newswire material, copyright 1988
           /fr          Federal Register, 1988
           /wsj         Wall Street Journal, copyright 1990, 1991, 1992
           /ziff        Ziff-Davis Publishing, copyright 1989, 1990
           /doe         Department of Energy abstracts

XIc.    TIPSTER Information Retrieval Text Research Collection
        Vol. 3, April 1993

            /ap          Associated Press material, copyright 1990
            /patents     U.S. Patent documents, 1983-1991
            /sjm         San Jose Mercury News, copyright 1991



XIII.	The Penn Treebank Project
  	Preliminary Release, Version 0.5

This CD-ROM contains over 1.6 million words of hand-parsed material
from the Dow Jones News Service, plus an additional 1 million words
tagged for part-of-speech. This material is a subset of the corpus for
the current DARPA large-vocabulary speech recognition project.

It also contains the first fully parsed version of the Brown Corpus,
which has also been completely retagged using the Penn Treebank tag
set. Also included are tagged and parsed data from Department of
Energy abstracts, IBM computer manuals, MUC-3, and ATIS.

In addition, the CD-ROM includes source code for several software
packages, including tgrep, which permits the user to search for
specific constituents in tree structures.

XIV.	The HCRC Map Task Corpus
	Disc No. 1-4, 5-8
	1992

The Map Task Corpus is a set of 8 CD-ROMs containing a total of about
18 hours of spontaneous speech that was recorded from 128 two-person
conversations, involving 64 different speakers (32 female, 32 male,
all adults, each taking part in four conversations).  The 64 speakers
were all students at the University of Glasgow, 61 of them being
native Scots.  The conversations were carried out in an experimental
setting, in which each participant has a schematic map in front of
them, not visible to the other. Each map consists of an outline
and roughly a dozen labelled features (e.g. "a white cottage", "an oak
forest", "Green Bay", etc). Most features are common to the two maps,
but not all. One map has a route drawn in, the other does not. The
task is for the participant without the route to draw one on the basis
of discussion with the participant with the route. In addition to the
conversations, each speaker provides a wordlist reading, consisting of
the major vocabulary items contained in the conversations.

The experimental design allows a number of different phonemic,
syntactico-semantic and pragmatic contrasts to be explored in a
controlled way.  In particular, maps and feature names were designed
to allow for controlled exploration of phonological reductions of
various kinds in a number of different referential contexts, and to
provide, via varying patterns of matches and mis-matches between the
two maps, a range of different stimuli for referent negotiation.  Also
the conditions of the conversations were carefully balanced: In half
of them the talkers were strangers, in half friends; in half of them
the talkers could see each other's faces, in half they could not.

The waveform data are provided in "raw" (headerless) files (16-bit
samples, 20 kHz sample rate, 2 channels per conversation), and
alternative header files are provided for use with software based on
either the NIST "SPHERE" header structure or the European "SAM"
header structure.  Text transcriptions are provided for each
conversation, along with PostScript files of the map images used in
the experiments.  Additional materials include full documentation of
the experimental design and data collection protocol, resources for
using SGML tools on the transcriptions and other text materials, and
an extensive set of source code for performing basic signal
processing functions on the waveform data, such as down-sampling,
de-multiplexing, channel summation, and D/A conversion for Sun
workstations (including playback of segments selected via inspection
of transcripts in Emacs).  
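
The de-multiplexing step mentioned above can be sketched in Python as
follows.  This is not the source code supplied on the discs.  The
sample width, sample rate and channel count come from the description
above; the interleaved channel layout and the byte order are
assumptions that should be checked against the corpus documentation.

```python
import array
import struct
import sys

SAMPLE_RATE_HZ = 20000   # from the corpus description
NUM_CHANNELS = 2         # from the corpus description

def demultiplex(raw_bytes, big_endian=False):
    """Split interleaved 16-bit samples into one list per channel.

    Assumes samples alternate channel 0, channel 1, and so on.
    """
    samples = array.array("h")
    samples.frombytes(raw_bytes)
    # Swap bytes only when the file order differs from the host order.
    if big_endian != (sys.byteorder == "big"):
        samples.byteswap()
    return [samples[ch::NUM_CHANNELS].tolist() for ch in range(NUM_CHANNELS)]

def duration_seconds(raw_bytes):
    """Length of the recording implied by the byte count."""
    bytes_per_frame = 2 * NUM_CHANNELS
    return len(raw_bytes) / bytes_per_frame / SAMPLE_RATE_HZ

# Three invented stereo frames, packed little-endian.
demo = struct.pack("<6h", 1, -1, 2, -2, 3, -3)
first, second = demultiplex(demo)
```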


XV.	Celex Lexical Database
        October 1993
        Max Planck Institute for Psycholinguistics, 
        Center for Lexical Information

This corpus contains ASCII versions of the CELEX lexical databases of
English (version 2.5), Dutch (version 3.1) and German (version 2.0).
CELEX was developed as a joint enterprise of the University of
Nijmegen, the Institute for Dutch Lexicology in Leiden, the Max Planck
Institute for Psycholinguistics in Nijmegen, and the Institute for
Perception Research in Eindhoven.  Pre-mastering and CD-ROM production
were done by the LDC.

        For each language, this CD-ROM contains detailed information on:

          1.)  the orthography (variations in spelling, hyphenation),

          2.)  the phonology (phonetic transcriptions, variations in
               pronunciation, syllable structure, primary stress),

          3.)  the morphology (derivational and compositional structure,
               inflectional paradigms),

          4.)  the syntax (word class, word-class specific
               subcategorizations, argument structures) and

          5.)  word frequency (summed word and lemma counts, based on 
               recent and representative text corpora).


        The databases have not been tailored to fit any particular
database management program.  Instead, the information is in ASCII
files in a UNIX directory tree that can be queried with tools such as
AWK or ICON.  Unique identity numbers allow the linking of information
from different files. Some kinds of information have to be computed
on-line; wherever necessary, AWK functions have been provided to
recover this information.  README files specify the details of their
use.
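
The idea of linking information from different files through identity
numbers can be sketched in Python as an alternative to AWK.  The field
separator, the column layout and the sample records below are all
assumptions made for illustration; the README files on the disc
describe the real layout.

```python
def read_table(lines, sep="\\"):
    """Map each record's identity number (first field) to its other fields."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(sep)
        table[fields[0]] = fields[1:]
    return table

def link(left, right):
    """Join two tables on identity number, keeping IDs present in both."""
    return {key: left[key] + right[key] for key in left if key in right}

# Invented records, not taken from the database.
orthography = ["1\\abandon", "2\\abbey", "3\\able"]
frequency = ["1\\120", "3\\950"]

linked = link(read_table(orthography), read_table(frequency))
```

Keying a dictionary on the identity number keeps the join independent
of record order in either file.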

        A detailed User Guide describing the various kinds of lexical
information available is supplied.  All sections of this guide are
POSTSCRIPT files, except for some additional notes on the German
lexicon in plain ASCII.

XVI.  United Nations Parallel Text Corpus  (English, French, Spanish)
      Version 1.0

This set of three compact discs contains documents provided
to the LDC by the United Nations, for use in research on machine
translation technology.  The documents come from the Office of
Conference Services at the UN in New York, and are drawn from
archives that span the period between 1988 and 1993.  

This publication contains the English, French and Spanish archives,
with data from each language stored on a separate disc in the set.
Care has been taken to arrange the document files in a parallel
directory structure for each language, so that corresponding
translations of a document are found directly by means of the
directory paths and file names.

All parallel files in this corpus are English-based: for every file on
the English disc, there will be a corresponding file on either the
French or Spanish disc, or both.  Tables are included on all discs to
assist in determining which parallels are present.  Due to the nature
and organization of UN translation services and the original
electronic text archives, the process of finding and sorting out
parallel documents yielded numerous gaps, with many files in each
language having no parallel in other languages.

In preparing the text for publication, we have applied a
fully-compliant SGML format (Standard Generalized Markup Language).
For those researchers who use SGML, a working DTD (Document Type
Definition) is provided on each disc.  For those who do not need SGML
markup, a simple script is included that can be used to filter out the
SGML-specific material, and leave only the plain text.  The character
set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and
some other non-ASCII characters occupy the upper 128 entries of the
character table.
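
The kind of filtering described above might look like the following
Python sketch.  It is not the script supplied on the discs: it simply
decodes the bytes as ISO 8859-1 and removes anything that looks like a
tag.  The sample markup is invented, and SGML entity references are
not handled.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")

def plain_text(raw_bytes):
    """Decode ISO 8859-1 bytes and drop anything that looks like a tag."""
    text = raw_bytes.decode("iso-8859-1")
    text = TAG_RE.sub("", text)
    # Drop empty lines, including those left behind by tag-only lines.
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)

# Invented sample; byte 0xE9 is e-acute in ISO 8859-1.
sample = b"<doc>\n<p>\nLe Conseil de s\xe9curit\xe9\n</p>\n</doc>\n"
result = plain_text(sample)
```

Decoding explicitly as ISO 8859-1 matters because bytes in the upper
128 entries are not valid ASCII.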


OGI Spelled and Spoken Telephone Corpus


        The OGI Spelled and Spoken Telephone Corpus consists of speech
recordings from over 3650 telephone calls, each made by a different
speaker to an automated prompting/recording system installed at the
Oregon Graduate Institute. Speakers were asked to say their name,
where they were calling from, and where they grew up; they were asked
to answer a couple of yes/no questions, and to spell their first and
last names; many were also asked to repeat a few specific words, and
to recite the letters of the alphabet.

Each response to a prompt is stored as a separate waveform file, and
the files are organized according to prompt (response type); all
responses from a given call have a unique caller-index number as part
of the file name, so that responses can easily be sorted by speaker.
Waveform data are stored in compressed form, using the NIST SPHERE 2.0
software package, which is available separately at no charge to users.
SPHERE 2.0 provides the decompression software needed to extract the
waveform data, as well as tools for accessing and modifying file
headers.

Time-aligned phonetic transcriptions are provided for a subset of
responses, and a complete log of each call (giving speaker sex, quality
judgments, and orthographic transcriptions of all responses) is
included in a form suitable for use as a relational data base.


OGI Multi-Language Corpus 

The corpus consists of responses to prompts spoken over commercial
telephone lines by speakers of English, Farsi (Persian), French,
German, Hindi, Japanese, Korean, Mandarin Chinese, Spanish, Tamil and
Vietnamese.  It contains a total of 1927 calls, an average of 175
calls per language.

Speech was collected using an automated system that answered the
telephone, played digitized prompts in the appropriate language to
request the speech samples, and digitized the callers' responses for a
designated period of time.

Log files are included that provide a set of automatic measurements
made on each utterance. In addition, some utterances were
automatically segmented into broad phonetic categories. The speech
data are compressed, with NIST SPHERE headers.


SPIDRE Speaker Identification Corpus (April '94)

This is a 2-CD subset of the SWITCHBOARD collection (see above),
selected for speaker ID research, and with special attention to
telephone instrument variation.  It contains training and testing data
for experiments in closed or open set recognition or verification.
Combining the two sides of the conversations also permits speaker
change detection, or speaker monitoring, experiments.

There are 45 ``target'' speakers; four conversations from each target
are included, of which two are from the same handset. There are also
100 calls in which no target appears.  Since all conversations are
two-sided, this results in 180 target sides and 180 + 200 = 380
nontarget sides.
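
The side counts can be reproduced with a few lines of Python.  The
sketch assumes that each target conversation contains exactly one
target speaker, which is what the totals above imply.

```python
TARGET_SPEAKERS = 45
CONVERSATIONS_PER_TARGET = 4
NON_TARGET_CALLS = 100
SIDES_PER_CALL = 2

target_calls = TARGET_SPEAKERS * CONVERSATIONS_PER_TARGET
# One side of every target call belongs to the target speaker.
target_sides = target_calls
# The remaining side of each target call is a nontarget side.
other_sides_in_target_calls = target_calls * (SIDES_PER_CALL - 1)
nontarget_sides = other_sides_in_target_calls + NON_TARGET_CALLS * SIDES_PER_CALL
```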

Except for truncations of a few longer calls at 5 minutes, the calls
themselves are as described under SWITCHBOARD.


Air Traffic Control Corpus (ATC0)

The Air Traffic Control Corpus (ATC0) is an eight-disc set of recorded
speech for use in supporting research and development activities in
the area of robust speech recognition in domains similar to air
traffic control (several speakers, noisy channels, relatively small
vocabulary, constrained language, etc.).  The audio data on these
discs is composed of voice communication traffic between various
controllers and pilots.

The audio files are 8 kHz, 16-bit linear sampled data, representing
continuous monitoring, without squelch or silence elimination, of a
single FAA frequency for one to two hours.
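
For a sense of scale, the raw size of such a recording follows
directly from the sampling parameters.  The Python sketch below
assumes a single channel and counts sample data only, ignoring any
file header or compression.

```python
SAMPLE_RATE_HZ = 8000      # from the corpus description
BYTES_PER_SAMPLE = 2       # 16-bit linear samples

def uncompressed_bytes(hours, channels=1):
    """Bytes of raw sample data for a recording of the given length."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels)

one_hour = uncompressed_bytes(1)     # 57,600,000 bytes
two_hours = uncompressed_bytes(2)    # 115,200,000 bytes
```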

Full transcripts, including the start and end times of each
transmission, are provided for each audio file.

ATC0 consists of three subcorpora, one for each airport in which the
transmissions were collected -- Dallas Fort Worth (DFW), Logan
International (BOS), and Washington National (DCA). The subcorpora are
also available separately.

The complete disc-set contains approximately 70 hours of controller
and pilot transmissions collected via antennas and radio receivers
which were located in the vicinity of the Dallas Fort Worth, Logan
International and Washington National airports.

Detailed information regarding the collection process and the
equipment used can be found on each disc in the file "atc.doc" in
the "/doc" directory.

The ATC0 Corpus was collected by Texas Instruments under contract to
ARPA.  It was produced on CD-ROM by the National Institute of
Standards and Technology for distribution by the Linguistic Data
Consortium.


YOHO

The YOHO database is the only large scale, scientifically controlled
and collected, high-quality speech database for speaker authentication
testing at high confidence levels. The YOHO Database contains:

  * "Combination lock" phrases (e.g., 36-24-36)
  * Collected over a 3-month period in a real-world office environment
  * 4 enrollment sessions per subject with 24 phrases per session
  * ~10 test sessions per subject with 4 phrases per session
  * 8 kHz sampling with 3.8 kHz analog bandwidth
  * 1.5 gigabytes of data

The primary database for this research was collected by ITT under a
U.S. Government contract administered by the author, Joseph Campbell.
This database is already in digital form, so the first signal
processing block of the verification system, signal conditioning and
acquisition, is taken care of.


XVII. Future Corpora to be Produced


KING Speaker Identification Corpus (Summer '94)

KING-92 is a new version of the KING Corpus prepared for publication
on CD-ROM.  The corpus was created for research in the area of
free-text speaker identification and verification, and was collected
by ITT in both Nutley, New Jersey and San Diego.  There are twenty-five
New Jersey speakers and twenty-six San Diego speakers, all male. There
are ten sessions for each speaker, each recorded in both a wideband
and a narrowband version.  Each session has thirty to sixty seconds of
compacted speech (long silences removed).  Sessions were recorded a
week to a month apart.

The collection method used in KING was to establish a connection over
long distance lines between the test subject and an interlocutor, each
at an ITT laboratory location.  The phones used by the test subjects
were equipped with a high quality microphone; consequently two parallel
recordings were made of that side of the conversation, while the
interlocutor's side was not recorded.  The two parties carried out a
variety of tasks designed to elicit natural-sounding speech from the
recorded subject: interpreting a picture, drawing a problem,
describing a picture, etc. The results have been downsampled to 8 kHz,
with 16-bit linear samples.

A peculiar anomaly of the narrowband San Diego data is the phenomenon
known as "The Great Divide". There is an apparent change in the
channel characteristics between sessions 1-5 and sessions 6-10,
causing identification algorithms to perform generally poorly across
the divide.  (For the New Jersey data, the composite transfer
functions resemble those for San Diego sessions 1-5.)

Speech-to-noise ratios average about 10 dB worse for the New Jersey
narrowband data than for the San Diego data.  The ratio is less than 20
dB for over half the New Jersey narrowband files.  Straight text
transcriptions of the conversations are included.  Though phonetic
markings were to be made for the corpus, they were found to have
serious inconsistencies and have been discarded.  The corpus consists
of two CDs, one containing the wideband data and the other the
narrowband.


ECI

The ECI Corpus has 48 subcorpora. The total size of these is roughly
92 million (lexical) words.  The languages are sorted by size. Numbers
in brackets are the corpus numbers.

Language        Thousands of Words

German          (70) 34291 (09)  191 (65)   20 (28) 187
                (29)    59 (30)   76 (47)   24 (59)  50
                (71)    21 (70A) 999                    = 35918
French          (31)  4775 (04) 4121 (28)  187 (29)  59
                (30)    76 (47)   24 (51)    6 (59)  50
                (71)    21 (32) 1667                    = 10986
Spanish         (31)  4500 (13)  830 (14) 1041 (15) 447
                (47)    24 (32) 1667 (59)   50 (71)  21 =  8580
English         (31)  4222 (36) 1141 (74)   95 (28) 187
                (47)    24 (51)    6 (56)   97 (59)  50
                (71)    21 (32) 1667                    =  7510
Dutch           (03)  5500 (02)  600 (47)   24 (71)  21 =  6145
Czech           (44)  4726                              =  4726
Italian         (11)  3518 (42)  303 (58)   13 (29)  59
                (30)    76 (47)   24 (71)   21          =  4014
Chinese         (78)  2895                              =  2895
Greek           (10)  2515 (47)   24 (59)   50 (71)  21 =  2610
Norwegian       (41)  2226                              =  2226
Swedish         (37)  1718                              =  1718
Serb/Croat/Slov (24)   700 (56)  289                    =   989
Tibetan         (76)   834                              =   834
Portuguese      (60)   675 (47)   24 (71)   21          =   720
Malay           (80)   563                              =   563
Russian         (73)   364                              =   364
Japanese        (57)   203                              =   203
Turkish         (20)   173 (20A) 110                    =   283
Albanian        (82)   205                              =   205
Gaelic          (55)   141                              =   141
Estonian        (39)   100                              =   100
Usbek           (81)    88                              =    88
Latin           (74)    75                              =    75
Danish          (47)    24 (71)   21                    =    45
Lithuanian      (89)    20                              =    20
Bulgarian       (84)     5                              =     5

Total                                                   = 91969
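
Each per-language figure in the table is the sum of the subcorpus
counts listed on that row; several corpus numbers, such as (47) and
(71), appear under more than one language.  The following Python
sketch shows how a row can be checked.  The regular expression and
function name are illustrative only, and the two rows are copied from
the table above.

```python
import re

# Matches one "(corpus number) count" pair, e.g. "(70A) 999".
PAIR_RE = re.compile(r"\((\w+)\)\s*(\d+)")

def row_sum(row_text):
    """Add up the counts (thousands of words) of every pair in a row."""
    return sum(int(count) for _corpus, count in PAIR_RE.findall(row_text))

german = """(70) 34291 (09)  191 (65)   20 (28) 187
            (29)    59 (30)   76 (47)   24 (59)  50
            (71)    21 (70A) 999"""
turkish = "(20)   173 (20A) 110"

german_total = row_sum(german)     # 35918, as listed
turkish_total = row_sum(turkish)   # 283, as listed
```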



MACROPHONE

MACROPHONE, the American English contribution to the international
POLYPHONE database of telephone speech corpora in various languages,
consists of approximately 200,000 utterances by 5000 speakers. It is
designed to provide material sufficient and suitable for research,
development, and evaluation of automatic speech recognition technology
for common telephone applications, such as shopping, transportation,
database access, and autodialing.  In addition to application-oriented
phrases and numerous digit strings, seven sentences are spoken by each
talker to provide ensemble phoneme, diphone and triphone coverage of
the language.  The spoken material also refers to times, locations,
monetary amounts, spellings, and interactive operations.

The utterances were collected automatically over the telephone network
by recording directly from a T1 connection in 8 kHz, 8-bit mu-law
format.  The participants, roughly equal numbers of males and females,
were solicited by a marketing firm from all regions of the United
States.  They ranged in age from the teens to the seventies, and
represented a broad range of educations and incomes as well.  Each
recorded utterance is accompanied by an orthographic transcription
which also notes any unusual acoustic events or anomalies.
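
Software that expects linear samples has to expand mu-law data first.
The Python sketch below follows the standard G.711 mu-law expansion.
Whether the distributed files hold bare mu-law bytes or carry a header
is not stated above, so the input here is assumed to be a plain
sequence of 8-bit codes.

```python
BIAS = 0x84  # 132, the bias added by the G.711 mu-law encoder

def mulaw_to_linear(code):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample."""
    code = ~code & 0xFF              # mu-law codes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude

def decode(mulaw_bytes):
    """Expand a bytes object of mu-law codes to a list of linear samples."""
    return [mulaw_to_linear(b) for b in mulaw_bytes]

samples = decode(bytes([0xFF, 0x80, 0x00]))   # [0, 32124, -32124]
```

The expanded values fit in 16 bits, so the result can be written out
as 16-bit linear samples at the same 8 kHz rate.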








