Language Text At CMU
We have the following language text and resources here at CMU. We are
also able to request text from the LDC and may be able to obtain some
text from ELRA. See their web pages for what is available.
- LDC:
Home page.
Alex Rudnicky (air@cs)
is the CMU contact for ordering text. Barb Sandling
(WeH 3204, x-8860) keeps the actual discs and maintains
a sign-out sheet.
- ELRA:
Home page.
Contact Maxine Eskenazi (max@cs) for information. We are
not members of ELRA, but Maxine receives the newsletters
and can order text, speech or whatever else for us.
-
CSR 1996 Language Model Broadcast News Archive
- Location: 2 cdroms
- Contact: Kristie Seymore
- Description: Broadcast News text covering the period January 1992 - April 1996.
Includes .vp and raw text forms, as well as conditioning tools.
- More details
-
NAB 1995 (North American Business News)
- Location: 5 cdroms
- Contact: Roni Rosenfeld and Kristie Seymore
- Description: Business text, 305MW.
Contains Wall Street Journal,
1987 - 1995 (1987 - 89 as .wtric counts only).
New York Times, 1995.
Associated Press, 1988-1990.
Los Angeles Times and Washington Post, 1995.
Reuters Financial, 1995.
San Jose Mercury News, 1991.
(Great overlap with Tipster.)
Roni has some conditioning tools.
LDC will release an official version at some point.
The following forms of data are included:
.st - sentence-tagged text, not conditioned for speech,
.vp - conditioned for speech, all punctuation retained,
.wtric - vocabulary-independent trigram counts, based on a
.svp3 view of the .vp data (most punctuation removed),
plus tools for converting .vp to .svp1, .svp3, etc.
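As a rough illustration of what vocabulary-independent trigram counts over a punctuation-stripped view of the text amount to, here is a minimal Python sketch. It is not the tool that produced the .wtric files: the function names and the PUNCT_TOKENS set are hypothetical, and the actual .vp and .svp3 conventions are not described on this page.

```python
from collections import Counter

# Hypothetical set of punctuation tokens to drop; the real .svp3
# conventions are not specified here, so this is only an assumption.
PUNCT_TOKENS = {",", ".", "?", "!", ";", ":", '"', "(", ")"}


def strip_punctuation(tokens, punct=PUNCT_TOKENS):
    """Return the tokens with punctuation tokens removed."""
    return [t for t in tokens if t not in punct]


def trigram_counts(sentences):
    """Count every word trigram in whitespace-tokenized sentences.

    No vocabulary is applied: every token is kept as-is, so the counts
    can later be mapped onto any vocabulary of choice.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = strip_punctuation(sentence.split())
        # Slide a window of three tokens across the sentence.
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


if __name__ == "__main__":
    sample = ["the market fell , the market rose .", "the market fell"]
    for trigram, n in trigram_counts(sample).most_common():
        print(n, " ".join(trigram))
```

Keeping the counts free of any fixed vocabulary means the same count file can be reused for language models built with different word lists.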
-
CSR NAB94 (also known as CSRNAB1)
- Location: 2 cdroms (22-1.1, 22-2.1)
- Contact: Roni Rosenfeld
- Description: 227MW, all contained in NAB 1995.
AP89-90, SJM91, WSJ87-94.
Also, WSJ87-89 is available in .vp form.
(Great overlap with Tipster.)
-
CSR95 LM Text Data, HUB 4 ONLY
- Location: cdrom
- Contact: Bob Weide
- Description:
-
CSR 1995 Language Model Disc 2
- Location: cdrom
- Contact: Bob Weide
- Description: Baseline language model file plus some
source texts
-
Switchboard
- Location: 1 language modeling cdrom, NIST Speech Disc 9-1.1
- Contact: Roni Rosenfeld
- Description: Conversational text
2.5MW of phone conversations between people who don't know each other,
each on a prespecified topic from a list of 70 topics.
This is the transcribed version. An annotated version
(annotated extensively for POS, brackets, ... disfluencies)
will be available in the future.
-
JEIDA Japanese word corpora
- Location:
- Contact: Barb Sandling
- Description:
-
Hansard's text corpus
- Location:
- Contact: Barb Sandling
- Description:
-
Switchboard Credit Card
- Location: cdrom
- Contact: Bob Weide
- Description: 35 conversations on the topic of "Credit Card Use".
This is a subset of the general Switchboard
corpus listed above. The transcription text
is unconditioned.
-
Broadcast News,
Research Publications Inc.
- Location: cdrom
- Contact: Roni Rosenfeld
- Description: Broadcast text from 1992 - 95.
About 40MW per year from CNN, NPR and some subset of the networks.
The same data is delivered to us daily
by Journal Graphics Inc.
An early version of the conditioning tools was received from BBN
(contact person: Kristie Seymore). A revised version should be made
available soon via LDC.
-
Clarinet
- Location: /tmp_mnt/net/cocorico/usr0/air.old
- Contact: Alex Rudnicky
- Description: Broadcast text from commercial wire services:
AP, Reuters and Dow-Jones.
Spooled for a few years, but spooling stopped just a few months ago.
You might be able to convince Alex to start spooling
the data again if you need it.
Look in /net/alpha3/usr0/yuzong/clarinet/src/ for various scripts.
-
Closed-captioned text
- Location: /net/speech1/usr/alex/jgi*
- Contact: Alex Hauptmann
- Description: A selection of JGI Broadcast text which arrives daily.
No commercials are included in those transcripts,
and it seems that segmentation (how to break things up) varies
by source of the show. Chengxiang Lu
has some conditioning scripts
in /net/alf16/usr/lu/cctools/, but he warns that it's
kind of a mess.
-
IRCs, bboards, etc
- Location: /net/alf7/rspeech-1/general/irc-data
- Contact: Kevin Lenzo
- Description: Conversational text - more can be collected. Kevin
would be able to point you in the right direction.
-
The Universal Library Project
-
Voice mail text
- Location:
- Contact: Bob Weide
- Description: Chengxiang Lu has some conditioning scripts
in /net/alf16/usr/lu/lm_mbox/example/.
-
Brown corpus
- Location: /afs/cs/project/fgdata-2/brown/brown_corpus
- Contact: Bob Weide
- Description: 1 MW (64 MB) consisting of 500 articles
of approximately 2000
words each, spanning many genres of English. Texts date from 1961.
-
The Penn Treebank Project - Release 2
- Location: 1 cdrom
- Contact: Roni Rosenfeld and Bob Weide each have a copy
- Description: This CD-ROM contains over 1.6 million words of
hand-parsed material from the Dow Jones
News Service, plus an additional 1 million words
tagged for part-of-speech. It also contains the
first fully parsed version of the Brown Corpus,
which has also been completely
retagged using the Penn Treebank tag set.
Also included are tagged and parsed data from
Department of Energy abstracts, IBM computer manuals, MUC-3, and ATIS.
1 million words of 1989 Wall Street Journal material
annotated in Treebank II style.
Tools for processing Treebank data, including a
new version of tgrep (a tree-searching and
manipulation package).
-
Latino-40 Spanish Speech Corpus from Entropic Research Laboratory
- Location: 1 cdrom
- Contact: Bob Weide
- Description: 13,000 sentences selected from Latin American newspaper text
-
Ricardo Corpus of Telephone Speech - Spanish
- Location: 1 cdrom
- Contact: Bob Weide
- Description:
-
Tipster
- Location: 3 cdroms
- Contact: Bob Weide
- Description: Associated Press Newswire material, 1988 - 90.
Federal Register material, 1988 - 89.
Wall Street Journal, 1987 - 92.
Department of Energy abstracts.
U.S. Patent documents, 1983 - 1991.
San Jose Mercury News, 1991.
(Great overlap with NAB94 and NAB95.)
-
ACL-DCI
- Location: 1 cdrom
- Contact: Roni Rosenfeld
- Description:
ACL-DCI = Association for Computational Linguistics Data
Collection Initiative.
Wall Street Journal Materials (1987, 1988, 1989).
Collins English Dictionary.
Scientific abstracts provided by the
U.S. Department of Energy.
A variety of grammatically tagged and parsed materials
from the Treebank project at the University of
Pennsylvania.
-
HCRC Map Task Corpus
- Location: 2 cdroms
- Contact: Bob Weide
- Description: 64 speakers (mostly native Scots)
describing items on a map.
128 two-person conversations,
spontaneous speech.
-
ATIS
- Location:
/net/beaver/usr2/NIST/madcow/initial-data/*/*/atis3/*/*/*/*.sro and
/net/beaver/usr2/NIST/madcow/initial-data/old/*/*/atis3/*/*/*/*.sro
- Contact: Sunil Issar
- Description: about 24K sentences
Lexicons
-
24k General English
- Location: /afs/cs/project/fgdata/DICT/
- Contact: Bob Weide
- Description: newall.lex.Z
Includes the words for all the
tasks we had engaged in, including
the general English data collection,
until 1990 or so; it should also cover
all of TIMIT and Resource Management,
plus other tasks we did before 1990.
-
CELEX Lexical Database
- Location: cdrom
- Contact: Bob Weide
- Description: ASCII versions of lexical data for German, English and Dutch.
For each language, this CD-ROM contains detailed information on
the orthography (variations in spelling, hyphenation),
the phonology (phonetic transcriptions, variations in
pronunciation, syllable structure, primary stress),
the morphology (derivational and compositional structure,
inflectional paradigms),
the syntax (word class, word-class specific subcategorizations,
argument structures) and
word frequency (summed word and lemma counts, based on recent
and representative text corpora).
- Incomplete... Still growing.
Updates? Corrections? Send mail to kseymore@cs
Last updated on 12/10/97.