Broadcast News Pointers
The broadcast news language model training data exists in several forms and in several
places. The main options are outlined below:
- LDC cdroms: The LDC released a conditioned version of the Broadcast News data
covering the period from January 1992 through April 1996. There are also some files from
May and June of 1996 that are in a slightly different format. The data comes in raw and
verbalized punctuation versions, segmented into one-month chunks.
I (Kristie Seymore) have these cdroms.
Contact me (kseymore@cs.cmu.edu) if you need them. Or if you're lucky, one of them may
be mounted at /net/alf7/cdrom.
- Online version: derived from the LDC cdrom version. The data from
the LDC cdroms was processed so that each story (or document) is in its own
file, with the keywords and show ids extracted. All of this data is online. Here are the
details:
- All of the following files are in /net/processe.inf.cs.cmu.edu/usr3/trekkies/TREC/BN/
- 92-96.idx.filtered: a list of all of the story ids
- Story ids have the form bn{yy}{mm}{dd}-{number}, where
yy = year, mm = month, dd = day, and number = article id number.
An example story id is bn920101-358.
- 92-96.wfreq: a word frequency list, giving each word and
the number of times it occurs in the training data
- text/bn{yy}/bn{yy}{mm}{dd}/bn{yy}{mm}{dd}-{number}.text:
all of the story text files
- 92-96.Keywords.filtered: topic labels for each story
- 92-96.Programs.filtered: show ids for each story
- 92-96.text.gz: all the text concatenated into one file
- 92-96.wc: number of words per story
- Sample vocabulary file (51KW) is at
/net/alf11/usr7/kseymore/eval96/vocab/bn92-96+62_all_tr-51k.vocab
- Evaluation data: The language model training text used to build the language models
for the 96 and 97 Hub4 evaluations is at /net/alf11/usr7/kseymore/eval96/text/. This data was also
taken from the LDC cdroms, but was kept in month-sized chunks. Here are some
files you might find useful:
- Text: /net/alf11/usr7/kseymore/eval96/text
- Eval 96 51k vocabs:
- /net/alf11/usr7/kseymore/eval96/vocab/51k+208ap+top-ac.vocab (phrased)
- /net/alf11/usr7/kseymore/eval96/vocab/bn92-96+62_all_tr-51k.vocab
- Eval 96 language models: /net/alf11/usr7/kseymore/eval96/LMs/
- Eval 97 vocabularies: /net/alf11/usr7/kseymore/eval97/vocab/*.vocab
- Eval 97 language models: /net/alf11/usr7/kseymore/eval97/LMs
- Commercial cdroms from Primary Source Media: These cdroms hold the raw
version of the data and cover the period from 1992 to 1995. You will have to do some significant conditioning to clean
up the text. However, if you'd like access to things like keyword, program and
summary information and want the chance to format it yourself, you may want to start here.
I would first make sure that the LDC cdroms don't already have what you need.
Contact Roni Rosenfeld (roni@cs.cmu.edu) for the cdroms.
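
For the online version, the story id format and the text/ directory layout described above can be turned into file paths mechanically. The following is only a rough sketch, not a script that exists in the directories listed here. The function names are made up for illustration, and mapping the two-digit year onto 19yy is an assumption based on the 1992-1996 coverage of the data.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple

# Root of the online version, as given in the list above.
BN_ROOT = PurePosixPath("/net/processe.inf.cs.cmu.edu/usr3/trekkies/TREC/BN")

# bn{yy}{mm}{dd}-{number}, e.g. bn920101-358
STORY_ID_RE = re.compile(r"bn(\d{2})(\d{2})(\d{2})-(\d+)")


class StoryId(NamedTuple):
    year: int
    month: int
    day: int
    number: int


def parse_story_id(story_id: str) -> StoryId:
    """Split a story id into its date fields and article number.

    Hypothetical helper. Assumes every year is 19yy, since the data
    covers 1992 through 1996.
    """
    match = STORY_ID_RE.fullmatch(story_id.strip())
    if match is None:
        raise ValueError(f"not a story id: {story_id!r}")
    yy, mm, dd, number = (int(group) for group in match.groups())
    return StoryId(1900 + yy, mm, dd, number)


def story_text_path(story_id: str, root: PurePosixPath = BN_ROOT) -> PurePosixPath:
    """Build text/bn{yy}/bn{yy}{mm}{dd}/bn{yy}{mm}{dd}-{number}.text for a story id."""
    sid = story_id.strip()
    parse_story_id(sid)  # validate the format before slicing
    # sid[:4] is "bn{yy}" and sid[:8] is "bn{yy}{mm}{dd}"; the id itself is
    # kept verbatim so that the article number is not reformatted.
    return root / "text" / sid[:4] / sid[:8] / f"{sid}.text"
```

For example, story_text_path("bn920101-358") gives the path ending in text/bn92/bn920101/bn920101-358.text under the directory above. Feeding it the ids from 92-96.idx.filtered one per line should work if that file holds one bare id per line, but I have not checked that here.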
Last updated on 5/27/98