********************************************************************************
      __      ____     ______          __   ___   ______   __    __   ______  
     / /     / __ \   / ____/         /  | /  /  / ____/  / /   / /  / ____/ 
    / /     / / / /  / /             /   |/  /  / /_     / /   / /  / /___
   / /     / / / /  / /             /       /  / __/    / /   / /  /___  /
  / /___  / /_/ /  / /____         /  /|   /  / /____  /   /|  /  ____/ /
 /_____/ /_____/   \_____/        /__/ |__/  /______/ /___/ |_/  /_____/

********************************************************************************

February, 1993							Volume 1/Issue 2

--------------------------------------------------------------------------------

Table of Contents

*1*	Dear Readers
*2*	The Penn Treebank Project
*3*	SWITCHBOARD
*4*	Multilingual Parallel Corpora
*5*	Map Task
*6*	COMLEX
*7*	COMLEX Syntax
*8*	Current LDC Members

( To jump ahead to a particular article, search for *Number* )

--------------------------------------------------------------------------------

*1*

Dear Readers, 

Welcome to the second issue of the LDC Newsletter.  Having introduced
the LDC and its staff in the first issue of the newsletter, in this
issue we present some of the data collection projects that the LDC is
involved in. Our lead article is by Mitch Marcus. Dr. Marcus is
Professor of Computer Science at the University of Pennsylvania and is
the Director of the Penn Treebank Project. The LDC was very happy to
announce the release and distribution through the LDC of the first
Treebank CD-ROM this month.

Also in this issue are articles on the SWITCHBOARD corpus by John
Godfrey and on the Map Task corpus by Henry Thompson. SWITCHBOARD is
one of the largest speech corpora available. Dr. Godfrey managed the
production of SWITCHBOARD and also of the Air Traffic Control corpus.
He is the Executive Director of the LDC, on loan from Texas
Instruments. We hope to announce the release of SWITCHBOARD in
February. Dr. Henry Thompson is a Reader in the Department of
Artificial Intelligence and the Centre for Cognitive Science at the
University of Edinburgh, where he is also Deputy Director of the Human
Communication Research Centre.  The Map Task corpus has just been
published and is currently available through the LDC.  

We also have articles on multilingual parallel corpora by Susan
Armstrong-Warwick and on the COMLEX project by Ralph Grishman and Mark
Liberman. Dr.  Armstrong-Warwick is a researcher at ISSCO (Istituto
Dalle Molle per gli Studi Semantici e Cognitivi) in Geneva,
Switzerland. Dr. Grishman is Professor of Computer Science and
Director of the Proteus Project, a computational linguistics research
group, at New York University.  Dr. Liberman is Trustee Professor of
Phonetics at the University of Pennsylvania and is the Director of the
LDC.  

We hope you enjoy this issue of the newsletter and welcome any
comments or suggestions for future issues.
	
Elizabeth Hodas, Editor

--------------------------------------------------------------------------------

*2*			The Penn Treebank Project
		    Mitch Marcus, U. of Pennsylvania

The Penn Treebank project has just completed its first phase,
after three years of DARPA funding. During this period, 4.5
million words of text were tagged for part-of-speech, with about
two-thirds of this material also annotated with a skeletal syntactic
bracketing. All of this material, now available in preliminary form on
CD-ROM through the LDC, has been hand corrected after processing by
automatic tools. The largest component of the corpus consists of
materials from the Dow-Jones News Service; over 1.6 million words of
this material have been hand parsed, with an additional 1 million
words tagged for part-of-speech. This material is a subset of the
corpus for the current DARPA large-vocabulary speech recognition
project.  

The second largest component of the corpus, now released for the first
time, consists of a skeletally parsed version of the Brown corpus, the
classic million word balanced corpus of American English.  As part of
the parsing process, this corpus has also been completely retagged
using the Penn Treebank tag set. Smaller tagged and parsed subcorpora
include 100,000 words of sentences from an earlier DARPA Message
Understanding Conference and 10,000 words of DARPA ATIS sentences.

The error rate of the part-of-speech tagged materials is estimated at
approximately 3%. About 300,000 words of text have been corrected
twice (each by a different annotator), and the corrected files were
then carefully adjudicated, with a resulting estimated error rate of
well under 1%. All the skeletally parsed materials have been corrected
once, except for the Brown materials, which have been quickly
proofread an additional time for gross parsing errors.  
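
The double-annotation step lends itself to a simple arithmetic check:
the rate at which two independent annotators disagree on a token bounds
how much adjudication stands to gain. A minimal sketch of that
comparison (the tag names are merely illustrative of a Penn-style tag
set; this is not the project's actual correction tooling):

```python
def disagreement_rate(tags_a, tags_b):
    """Fraction of tokens on which two independent annotators
    assign different part-of-speech tags to the same tokens."""
    if len(tags_a) != len(tags_b):
        raise ValueError("annotations must cover the same tokens")
    differing = sum(1 for a, b in zip(tags_a, tags_b) if a != b)
    return differing / len(tags_a)

# Two annotators tagging the same four tokens, differing on one:
rate = disagreement_rate(["DT", "NN", "VBZ", "RB"],
                         ["DT", "NN", "VBD", "RB"])
```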

Earlier material, released through the ACL Data Collection Initiative,
has been used for purposes ranging from serving as a "gold standard"
for parser testing to serving as a basis for the induction of
stochastic grammars (including work by groups at IBM, and a
collaboration between Penn, AT&T Bell Labs and Harvard University) to
serving as a basis for quick lexicon induction for the MUC task (in
unpublished work at BBN).  

The Penn Treebank Project, now in its second phase, is working towards
providing a 3 million word bank of predicate-argument structures. This
is being done by first producing a corpus annotated with an
appropriately rich syntactic structure, and then automatically
extracting predicate-argument structure, at the level of
distinguishing logical subjects and objects, and distinguishing
arguments from adjuncts for clear cases. This syntactic corpus will be
annotated by automatically transforming the current Penn Treebank into
a level of structure close to the intended target, and then completing
the conversion by hand. The preliminary version of the corpus is being
substantially cleaned up at the same time. This second release of the
Penn Treebank should be available through the LDC in August, 1993.

--------------------------------------------------------------------------------

*3*				SWITCHBOARD
			   John J. Godfrey, LDC

SWITCHBOARD, a large corpus of conversational speech by many talkers
over long distance telephone lines, was collected at Texas Instruments
and produced on CD-ROMs at the National Institute of Standards and
Technology (NIST). It is due for release by LDC this month. For those
members who are not familiar with its characteristics and dimensions,
we present a brief description here.  

The entire corpus consists of 2,430 conversations, averaging about six
minutes in length, by 523 speakers from around the United States. In
round numbers, this amounts to about 240 hours of speech and 3 million
spoken words. Apart from sheer volume, however, SWITCHBOARD has a
number of unique features designed to support basic research or
technology development for telephone-based applications. Among these
features are automatic, all-digital collection; detailed transcription
and time alignment of all conversations; documentation of several
important speech research variables; and an underlying relational
database system.  

The conversations in SWITCHBOARD were collected under computer
control, without human intervention, using a protocol which was
developed through extensive pilot testing. Automation guards against
human error and experimenter bias, and provides a degree of uniformity
over the long period of collection which would be difficult to achieve
otherwise.  

The hardware platform was an InterVoice "Robotoperator" system,
consisting of an IBM Model 80 computer, 700MB disk drive, a
programmable T1 interface, and a switching system for connecting among
the channels of the T1 span. Through special arrangements with MCI,
the toll-free 800 numbers assigned to this T1 line ensured all-digital
service from the local point of insertion to the computer. The
application program and resident database controlled the entire
interaction with both participants and recorded the two sides of a
conversation in separate 8 kHz mu-law encoded data files.  On the
CD-ROMs, these pairs are combined into one file in NIST's standard
SPHERE format, with the two sides interleaved. Software routines which
permit users to access either side, or the sum of the two sides, are
included.  
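
As a rough sketch of what such access routines must do, the two
interleaved 8-bit mu-law sides can be separated and expanded to linear
samples as follows. This assumes simple sample-by-sample interleaving
of the raw payload and standard G.711 mu-law expansion; handling of the
SPHERE header, which the distributed routines take care of, is omitted:

```python
def mulaw_decode(code):
    """Expand one 8-bit G.711 mu-law code to a linear sample."""
    code = ~code & 0xFF                  # codes are stored complemented
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def split_sides(payload):
    """Separate a sample-interleaved two-side payload into its sides."""
    return payload[0::2], payload[1::2]

def summed_signal(payload):
    """Linear sum of the two sides, as the included routines allow."""
    side_a, side_b = split_sides(payload)
    return [mulaw_decode(a) + mulaw_decode(b)
            for a, b in zip(side_a, side_b)]
```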

Each conversation is fully transcribed, with special conventions to
show speakers' turns, simultaneous talking, interrupted sentences,
partial words, and other phenomena common in spontaneous
conversational speech. There is a set of terms describing nonspeech
acoustic events, and a provision for comments. The transcribers also
rated each conversation on a number of properties, such as the amount
of background noise or static, difficulty in understanding the
talkers, and degree to which the conversants stayed on one subject.

The SWITCHBOARD conversations are also time aligned at the word level.
The time alignment, which was performed automatically using supervised
recognition, has been spot checked by comparison with hand markings,
and found to be accurate within a tenth of a second in the majority of
cases. Needless to say, this greatly enhances the value of the corpus
for acoustic phonetic studies or for building acoustic models of words
or segments.  
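
To see why word-level alignment matters for acoustic work, consider the
arithmetic of pulling one word's waveform out of a side: at 8,000
samples per second, a word aligned from 1.50 s to 1.75 s occupies
samples 12,000 through 14,000. A minimal sketch (the interface below is
hypothetical, not the corpus's actual alignment file layout):

```python
SWITCHBOARD_RATE = 8000  # samples per second in the 8 kHz corpus

def word_span_samples(start_sec, end_sec, rate=SWITCHBOARD_RATE):
    """Map a time-aligned word span (in seconds) to sample indices."""
    return int(round(start_sec * rate)), int(round(end_sec * rate))

def extract_word(samples, start_sec, end_sec, rate=SWITCHBOARD_RATE):
    """Slice one word's samples out of a decoded single-side signal."""
    lo, hi = word_span_samples(start_sec, end_sec, rate)
    return samples[lo:hi]
```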

In a project like SWITCHBOARD, with its reliance on cooperation from
hundreds of casual participants, it is difficult to control every
factor that might be of interest to speech researchers, particularly
with limited resources.  Nevertheless, a serious attempt was made to
sample and document many expected sources of variation in speech data.
Registration forms asked the participants' age, sex, education,
where they spent their formative years, and what subjects they would
like to converse about.  The area codes and phone numbers for each
party were also recorded.  About 50 of the callers made as many as 25
or 30 calls, enough for research on speaker characteristics; others
made fewer, some only one or two. The computer prompted each
conversation with one of 70 suggested "topics", so the 27,000
word vocabulary should have some interesting distributional features.

All this information would be of limited use unless it was organized
and accessible. Thus a SWITCHBOARD database is included in the
SWITCHBOARD corpus, putting all the information about the speakers,
calls and topics in tables ready to load into a relational database
management system.  

The SWITCHBOARD corpus occupies 26 CD-ROMs, of which 25 are sampled
speech data. Putting the transcripts, time alignment files, database
tables, and documentation on one disk will enable LDC to issue
corrected and enhanced versions as warranted.  For example, a
pronunciation dictionary and tools for doing precise local time
marking are planned for this year. Members who do specialized
annotation of SWITCHBOARD as part of their research are encouraged to
contact LDC about incorporating these into later versions.

--------------------------------------------------------------------------------

*4*		Multilingual Parallel Corpora
	       Susan Armstrong-Warwick, ISSCO

The Hansard corpus, a large set of parallel texts of Canadian
Parliamentary debates in French and English, has provided the basis of
a number of new and interesting corpus-based projects. Areas of
application have ranged from building machine translation (MT) systems
to automatic word sense disambiguation. The availability of this one
resource has clearly demonstrated the potential for exploiting
parallel texts. As other large parallel corpora become available,
progress will only accelerate. Fortunately, we are well on our way to
making an initial set of such resources available. This report
provides an overview of current activities to acquire parallel
corpora.  

At ISSCO, a research institute in Switzerland, the acquisition of
parallel text corpora has been under way for a number of years^1. The
first efforts were rather small and directed towards specific
projects: a very small corpus of federal administrative job
advertisements published in the three national languages of French,
German and Italian, the civil code for Switzerland in the same three
languages, reports on snow conditions in French and German published
by the Federal Institute for the Study of Snow and Avalanches, and a
collection of financial reports from the Union Bank of Switzerland in
French and German (plus a few in English and Italian). These
collections have served as a basis for projects in MT and more
recently for developing corpus access tools for researchers and
translators.  

The founding of the Association for Computational Linguistics Data
Collection Initiative (ACL/DCI) in 1989 reflected the growing interest
in corpus-based studies and the need to organize efforts in text
collection and distribution in order to assure that all researchers
would have access to these basic resources. Inspired by this
initiative and the multilingual concerns of the European research
centers, the text acquisition work at ISSCO was redirected towards
finding larger sets of multilingual texts and negotiating
redistribution rights for the entire community. First as a member of
the ACL/DCI, and then as a co-founder of the European Corpus
Initiative (ECI)^2, ISSCO has continued to identify potential
resources, to negotiate for and acquire the texts, and to prepare them
for redistribution.  

An initial set of parallel texts is currently in preparation and will
soon be available for redistribution. The texts were acquired under
the auspices of the ECI and are being prepared with support from the
LDC. The accompanying tables provide an overview of the monolingual
and multilingual texts collected to date. The two major collections
currently in hand are the International Telecommunications Union's
CCITT Blue Book Series and portions of the International Labor
Organization series of Official Bulletins.  These texts represent
parallel versions of English, French and Spanish. ISSCO is also hoping
to provide translations of these documents prepared in other
countries^3. 

Several bilingual texts have also been acquired. IBM Germany has
donated a set of technical manuals in German and English.  Two major
Swiss Banks, Union Bank of Switzerland and Credit Suisse, have donated
texts in French and German. All of this material will be made
available on CD-ROM from the ECI, with distribution in Europe by
ELSNET and in North America by the LDC. The donations by the ITU and
ILO have helped to open up communication with major multilingual text
producers. Negotiations are in progress for two other sets of parallel
texts with the United Nations and with the European Community Office
of Publications for texts produced in the nine official languages.

This is only the beginning. With support from the LDC, negotiations
for more multilingual and parallel texts will be pursued at ISSCO.
Situated in Geneva, with its concentration of international
organizations and its commitment to multilingualism, the institute is
well placed to continue its efforts to acquire an ever larger set of
multilingual textual resources.  

1. This work has been sponsored by SWISSTRA.  
2. The initiative was founded in January, 1991, and is sponsored by
the European Chapter of the Association for Computational Linguistics
(EACL), the European Network in Language and Speech (ELSNET), the
Network for European Reference Corpora (NERC) and the Linguistic Data
Consortium. Technical assistance in preparing the texts is being given
by David McKelvie, Edinburgh, Dominique Petitpierre, ISSCO and Dave
Graff, LDC. For information on the ECI contact Henry Thompson
(eucorp@cogsci.ed.ac.uk) or Susan Armstrong-Warwick
(susan@divsun.unige.ch).  
3. A portion of the Japanese texts (not all are translated) has been
identified and is being negotiated for by Prof. M. Nagao, Univ. of
Kyoto; Dr. R. Kaese, Cap-Debis, Germany, is negotiating similarly for
the German translations.

	Monolingual Text Collections in Preparation by the ECI

	Language	Text Type		K Words

	Dutch		Newspaper		  600
	Czech		Newspaper		5,000
	English		Novels/Stories		1,000
	French		Newspaper		3,000
	German		Newspaper		1,000
	Greek		Mixed			2,000
	Italian		Newspaper		3,500
	Serbo-Croat	Short Stories		  700
	Spanish		Newspaper		2,000
	Spanish		Transcribed Speech	  450
	Swedish		Novels/Stories		1,700

	Multilingual Text Collections in Preparation by the ECI

Languages	Text Type		Est. Words	Donor

Fr/Ger		Financial Reports	4 Million	Swiss Banks (CS, UBS)
Fr/Ger/Ital	Legal			227 Thousand	Swiss Government
Ger/Eng		Technical Manuals	5 Million	IBM, Germany
Eng/Fr/Span	Official Bulletins	3 Million	Int'l Labor Organization
Eng/Fr/Span	Technical Documents	5 Million	Int'l Telecomm. Union

--------------------------------------------------------------------------------

*5*				Map Task
			Henry Thompson, Edinburgh

The Human Communication Research Centre Map Task corpus has recently
been collected and transcribed in Edinburgh, and has just been
published on CD-ROM and distributed by the LDC. This effort was made
possible by funding from the British Economic and Social Research
Council.  

Using an elaboration of a method developed over a number of years, we
recorded 128 two-person conversations, employing 64 talkers (32 male,
32 female), with each talker participating in four conversations.  High
quality recordings were made using Shure SM10A close-talking
microphones in a recording booth, one talker per channel on stereo DAT
(Sony DTC1000ES).  

Each participant has a schematic map in front of them, not visible to
the other. Each map comprises an outline and roughly a dozen
labelled features (e.g. "white cottage", "Green Bay", "oak forest").
Most features are common to the two maps, but not all. One map has a
route drawn in, the other does not. The task is for the participant
without the route to draw one on the basis of discussion with the
participant with the route.

The experimental design is quite detailed and complex, allowing a
number of different phonemic, syntactico-semantic, and pragmatic
contrasts to be explored in a controlled way. In particular, maps and
feature names were designed to allow for controlled exploration of
phonological reductions of various kinds in a number of different
referential contexts, and to provide a range of different stimuli to
referent negotiation, based on matches and mis-matches between the two
maps.  

Subjects adjusted easily to the task and experimental setting, and
produced evidently unselfconscious and fluent speech. The syntax is
largely clausal rather than sentential, with good turn-taking and
relatively little overlap or interruption.  The total corpus runs about
18 hours of speech, yielding 150,000 word tokens drawn from 2,000 word
form types. Word lists containing all the feature names were also
elicited from all speakers, along with a number of `dialect
diagnosis' utterances.  

The transcriptions are, at the orthographic level, quite detailed,
including filled pauses, false starts and repetitions, broken words,
etc. Considerable care has been taken to ensure consistency of
notation, which is thoroughly documented.  Although the full
complexity of overlapped regions has not been reflected in the
transcriptions, such regions are clearly set off from the rest of the
transcripts. Transcripts are connected to the acoustic sampled data by
sample numbers marked every few turns.
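
Because the transcripts carry sample numbers rather than timestamps,
converting a mark to a time, or finding the marked turn containing a
given sample, is simple arithmetic at the 20 kHz rate. A sketch (the
turn-mark list and labels below are invented for illustration):

```python
import bisect

MAP_TASK_RATE = 20000  # samples per second in the 20 kHz published data

def sample_to_seconds(sample_number, rate=MAP_TASK_RATE):
    """Convert a transcript sample mark to a time offset in seconds."""
    return sample_number / rate

def turn_at_sample(turn_marks, sample_number):
    """Given (sample_number, turn_label) marks in increasing order,
    return the label of the last mark at or before the sample."""
    positions = [s for s, _ in turn_marks]
    i = bisect.bisect_right(positions, sample_number) - 1
    return turn_marks[i][1] if i >= 0 else None
```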

The published version of the corpus contains both a complete set of
transcripts and 20 kHz sampled versions of both channels of the
associated speech.  We envision this corpus as providing a uniquely
valuable resource for researchers with a wide range of interests in
spoken language, and our goal in publishing it on CD-ROM is to make it
as widely available as possible at a low cost.
 
--------------------------------------------------------------------------------

*6*				COMLEX
			Mark Liberman, LDC

A language is made up of words, lots of them: ordinary people are
familiar with tens of thousands of ordinary words, tens of thousands
of proper names, and tens of thousands of idiosyncratic fixed
expressions. Language technologies need lexicons to provide
information about these words: their pronunciations, their
morphological and syntactic characteristics, their semantic
relationships, their collocational peculiarities. Such lexicons are
expensive and time-consuming to produce, and the need for each
research project to create its own lexical databases adds costs, slows
progress and lowers quality.

COMLEX (for COMmon LEXicon) is an LDC-sponsored attempt to improve
this situation by creating a set of lexical databases focused on the
needs of particular technologies.  Plans are being developed in
consultation with researchers in all areas, beginning with two
meetings at NYU and at Penn in October and November of last year. The
October meeting, as described by Ralph Grishman elsewhere in this
issue, developed plans for a lexicon to be used in text parsing
applications. The original concept is being refined through an
iterative process that will soon include user trials of early drafts
of the actual lexicon.  

At the November meeting, a larger group discussed a wide range of
possible lexical databases. On the pronunciation side, the ideas that
emerged included a pronouncing dictionary, a "pronounced
dictionary" (in which a large list of words, in isolation and in
context, is recorded by each of a small number of suitable speakers),
and a "phonetic concordance" (in which variation in word
pronunciation is quantified with respect to a large corpus such as
SWITCHBOARD). We are asking the DARPA Spoken Language Coordinating
Committee for feedback on the value of these proposals. In the area of
semantics, ideas included lexicons for the categorization of proper
nouns in unrestricted text, lexicons for sense disambiguation based on
annotation of corpora with WordNet categories, and more ambitious
projects as well. As concrete plans are developed in these other
areas, each effort will address the needs of a particular group of
users, but in a way that permits the various individual databases to
fit together into a consistent overall scheme.  

COMLEX is based on three key ideas. First, we will provide useful
lexicons quickly and incrementally, with the design and development
controlled by researchers in the targeted applications. Second, we
will ground our lexicons in LDC-distributed corpora of speech and
text, quantifying variation by counts in documented samples. Third, we
will keep our lexicons legally unencumbered for LDC members, so that
members' (and sponsors') investment in pre-competitive research
can eventually be put to commercial use, without being held hostage to
the need for additional licenses of unknown availability and cost.  

We expect that COMLEX will help our members do better pre-competitive
research more quickly and at lower cost. Initial plans for a COMLEX
syntactic database for English are discussed in this issue, and plans
in other areas will be described as they develop. We invite comments
and suggestions from members.

--------------------------------------------------------------------------------

*7*				COMLEX Syntax
			     Ralph Grishman, NYU

COMLEX, the COMmon LEXicon, will provide a shared lexical resource for
a variety of language technology applications.  This article describes
COMLEX syntax, which is intended for use in a broad range of English
text analysis systems.  

The initial impetus for its creation came from one of the DARPA/SISTO
program managers, Charles Wayne, in early 1992.  A meeting in July of
that year brought together James Pustejovsky (Brandeis), Ralph
Grishman (NYU), Charles Wayne, and Mark Liberman (LDC); discussions
there led to a proposal by Pustejovsky and Grishman for an LDC-funded
COMLEX.  

A number of guiding principles were defined early in the process.  

1.  Broad coverage: for COMLEX to be large enough to be useful for
text analysis, yet small enough to be constructible within a year,
plans called for a dictionary of 30,000 to 40,000 base forms.  

2.  Syntactic focus: the desire for a rapid start-up led to an initial
focus on syntactic information, for which it was believed that
relatively broad agreement among system developers could be obtained.
Semantic information may be added in later versions of COMLEX.  

3.  Rapid start-up: specifications for a first version of COMLEX by
the beginning of 1993 and an initial usable dictionary by the end of
1993.  In view of this rapid timetable, it was recognized that the
specifications would probably change to some degree during the
dictionary creation.  

In addition, as with other LDC products, the intention was to create a
resource which could be used for both research and commercial purposes
with minimal restrictions.  

During the fall of 1992, Catherine Macleod and Ralph Grishman at NYU
developed a detailed specification of the syntactic features to be
included in the lexicon. Particular attention was given to including a
rich set of subcategorization features, a set sufficiently detailed
that the features required for most current syntactic analyzers could
be derived from them. Comparisons were made against several other
computer lexica, including the Brandeis verb lexicon, the NYU
Linguistic String Project lexicon, and the codes used by ACQUILEX and
the Oxford Advanced Learner's Dictionary.  
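
The idea of deriving the features a particular analyzer needs from a
richer common set can be illustrated with a toy entry. The field names
and frame labels below are invented for illustration; they are not
actual COMLEX notation:

```python
# A hypothetical verb entry carrying detailed subcategorization frames.
entry = {
    "orth": "give",
    "pos": "verb",
    "subcat": ["NP", "NP-NP", "NP-PP:to"],  # give X; give X Y; give X to Y
}

# An analyzer that only distinguishes transitive from ditransitive
# verbs can derive its coarser features from the detailed frames.
def derive_coarse_features(entry):
    frames = entry["subcat"]
    return {
        "transitive": any(f.startswith("NP") for f in frames),
        "ditransitive": "NP-NP" in frames,
    }
```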

An initial meeting to review these specifications was held at NYU on
October 22, attended by Macleod and Grishman (NYU), Mark Liberman and
Jack Godfrey (LDC), Mitch Marcus (University of Pennsylvania), Bran
Boguraev (IBM), George Miller (Princeton), James Pustejovsky
(Brandeis), and Yorick Wilks and Louise Guthrie (New Mexico State).
The meeting reviewed the preliminary COMLEX specifications, discussed
alternative dictionary sources, and considered additional entries and
features which might be required.  

An initial specification for COMLEX syntax is now complete.  A
menu-based dictionary entry program is under development, and
systematic dictionary creation at NYU, primarily by manual entry, is
to begin in early spring.

--------------------------------------------------------------------------------

*8*			Current LDC Members

NYNEX (Senior Member)
Texas Instruments (Senior Member)
Apple Computer, Inc.
AT&T Bell Laboratories
BBN Systems & Technologies
Bellcore
Boston University
Cambridge University
Canon Research Centre Europe
Carnegie Mellon University
Centre de Recherche Informatique de Montreal
Columbia University
Dragon Systems
IBM T. J. Watson Research Center
INRS-Telecommunications
Institute for Perception Research
Instituto de Engenharia de Sistemas e Computadores
International Computer Science Institute
Kurzweil Applied Intelligence
Lernout & Hauspie Speech Products
LIMSI-CNRS
MIT
MIT Lincoln Laboratory
MITRE Corporation
Oregon Graduate Institute
Philips Research Lab Aachen
Princeton University
Purdue University
Rutgers University
Southwestern Bell Technology Resources, Inc.
Speech Processing Expertise Centre
SRI International
Sun Microsystems
Telecom Paris
UNISYS
University of Rochester
University of Southern California/Information Sciences Institute
University of Sydney
Xerox PARC

--------------------------------------------------------------------------------
