Newsgroups: comp.speech
Path: pavo.csi.cam.ac.uk!doc.ic.ac.uk!agate!howland.reston.ans.net!zaphod.mps.ohio-state.edu!cs.utexas.edu!csc.ti.com!tilde.csc.ti.com!trdc000.trdc.ti.com!picone
From: picone@trdc001.trdc.ti.com (Joe Picone)
Subject: Re: Linguistic Data Consortium
In-Reply-To: cig@duke.cs.duke.edu's message of 15 Mar 93 18:48:49 GMT
Message-ID: <PICONE.93Mar16055908@trdc001.trdc.ti.com>
Sender: usenet@trdc.ti.com
Nntp-Posting-Host: trdc001
Organization: Tsukuba Research and Development Center
References: <732221328@tigris.cs.duke.edu>
Date: Mon, 15 Mar 1993 20:59:08 GMT
Lines: 100

From: graff@chestnut.ling.upenn.edu (David Graff)
Newsgroups: comp.speech
Subject: Here's info on speech corpora from LDC
Keywords: LDC, speech data bases
Date: 9 Oct 92 20:54:03 GMT
Organization: Linguistic Data Consortium
Nntp-Posting-Host: chestnut.ling.upenn.edu

Information about the Linguistic Data Consortium is now available via
anonymous ftp from:
			ftp.cis.upenn.edu	(130.91.6.8)
in the directory:
			/pub/ldc

(Note that the numeric site ID may change at some unspecified point in
the future, but the name will be kept constant.)
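For anyone who would rather script the retrieval, here is a minimal sketch of listing that directory over anonymous ftp with Python's standard ftplib. The host name and directory are the ones given above; the function name and the ftp_factory parameter are illustrative assumptions only (the latter is there so the function can be tried without a network connection), and the server may no longer answer at this address.

```python
from ftplib import FTP

LDC_HOST = "ftp.cis.upenn.edu"  # host name from the announcement
LDC_DIR = "/pub/ldc"            # directory from the announcement


def list_ldc_files(host=LDC_HOST, directory=LDC_DIR, ftp_factory=FTP):
    """Return the file names found in `directory` on `host`.

    `ftp_factory` is a hypothetical hook: any callable that takes a host
    name and returns an object with login(), cwd(), nlst() and close().
    """
    ftp = ftp_factory(host)   # ftplib.FTP connects when given a host
    try:
        ftp.login()           # no arguments means an anonymous login
        ftp.cwd(directory)    # move to the LDC directory
        return ftp.nlst()     # plain list of names in that directory
    finally:
        ftp.close()           # always release the connection
```

Calling list_ldc_files() with no arguments would attempt a real connection; connecting by name rather than by the numeric address follows the note above that only the name is meant to stay constant.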

Here are some excerpts from the README file in that directory:

Briefly stated, the LDC has been established to broaden the collection
and distribution of speech and natural language data bases for the
purposes of research and technology development in automatic speech
recognition, natural language processing, and other areas where large
amounts of linguistic data are needed.

The documents currently available in this directory are: a paper that
explains the background, rationale and goals for the LDC, a brief list
of the various data bases that are currently or soon to be available,
and a couple of tables summarizing these corpora.  Each document is
available in one or both of two forms: a compressed PostScript file
(*.ps.Z) and an uncompressed ASCII file (*.txt).  If you would like
hard copy of the PostScript versions mailed to you, please contact me
or:
	Elizabeth Hodas
	441 Williams Hall
	University of Pennsylvania
	Philadelphia, PA 19104-6305
	Phone:   (215) 898-0464
	Fax:     (215) 573-2175
	e-mail:  ehodas@walnut.ling.upenn.edu
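Since each document may exist in either or both of the two forms described above (*.ps.Z and *.txt), a short sketch like the following can show which forms are present for each document in a directory listing. The function name and the file names in the usage note are hypothetical; the README does not give the actual names.

```python
from collections import defaultdict

PS_SUFFIX = ".ps.Z"    # compressed PostScript form
TXT_SUFFIX = ".txt"    # uncompressed ASCII form


def group_by_document(names):
    """Map each document stem to the set of forms it is available in.

    Names matching neither suffix (a README, for instance) are skipped.
    """
    forms = defaultdict(set)
    for name in names:
        if name.endswith(PS_SUFFIX):
            forms[name[:-len(PS_SUFFIX)]].add("postscript")
        elif name.endswith(TXT_SUFFIX):
            forms[name[:-len(TXT_SUFFIX)]].add("ascii")
    return dict(forms)
```

Given a made-up listing such as ["paper.ps.Z", "paper.txt", "list.txt"], this reports both forms for "paper" and only the ASCII form for "list".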

Here is the brief list of corpora:

	     Resources of the Linguistic Data Consortium

			     October 1, 1992

Resources now available to LDC members include:

	* The TIMIT and NTIMIT speech corpora
	* The Resource Management speech corpus (RM1, RM2)
	* The Air Travel Information System (ATIS0) speech corpus
	* The Association for Computational Linguistics - Data
		Collection Initiative text corpus (ACL-DCI)
	* The TI Connected Digits speech corpus (TIDIGITS)
	* The TI 46-word Isolated Word speech corpus (TI-46)
	* The Road Rally conversational speech corpora (including
		"Stonehenge" and "Waterloo" corpora)
	* The Tipster Information Retrieval Test Collection
	* The Switchboard speech corpus ("Credit Card" excerpts and
		portions of the complete Switchboard collection)

Further resources to be made available within the first year (or two)
include:

	* The Machine-Readable Spoken English speech corpus (MARSEC)
	* The Edinburgh Map Task speech corpus
	* The Message Understanding Conference (MUC) text corpus of
		FBI terrorist reports
	* The Continuous Speech Recognition - Wall Street Journal
		speech corpus (WSJ-CSR)
	* The Penn Treebank parsed/tagged text corpus
	* The Multi-site ATIS speech corpus (ATIS2)
	* The Air Traffic Control (ATC) speech corpus
	* The Hansard English/French parallel text corpus
	* The European Corpus Initiative multi-language text corpus
		(ECI) 
	* The Int'l Labor Organization/Int'l Trade Union
		multi-language text corpus (ILO/ITU)
	* Machine-readable dictionaries/lexical data bases (COMLEX,
		CELEX)

For the period from July 1992 to July 1993, the LDC Board has
allocated $2.3 million for data collection and preparation on behalf
of LDC members.  Some of this money will be used to continue on-going
efforts, such as ATIS and CSR. The rest of it will be spent on new
material, including plain and annotated speech and text corpora,
lexicons, grammatical resources, and software.  Members will be kept
informed about the progress of this work, and will receive new
material as soon as it becomes available.  Please send inquiries about
Calls for Proposals or submission of data for distribution to:

	Mark Liberman		or	Jack Godfrey
	myl@unagi.cis.upenn.edu		jgodfrey@unagi.cis.upenn.edu
-- 
David Graff			Linguistic Data Consortium
graff@chestnut.ling.upenn.edu	441 Williams Hall
voice: (215) 898-0887		University of Pennsylvania
fax:   (215) 573-2175		Philadelphia, PA 19104

