********************************************************************************
      __      ____     ______          __   ___   ______   __    __   ______  
     / /     / __ \   / ____/         /  | /  /  / ____/  / /   / /  / ____/ 
    / /     / / / /  / /             /   |/  /  / /_     / /   / /  / /___
   / /     / / / /  / /             /       /  / __/    / /   / /  /___  /
  / /___  / /_/ /  / /____         /  /|   /  / /____  /   /|  /  ____/ /
 /_____/ /_____/   \_____/        /__/ |__/  /______/ /___/ |_/  /_____/

********************************************************************************

October, 1992							Volume 1/Issue 1

--------------------------------------------------------------------------------

Table of Contents

*1*	What is the LDC?
*2*	LDC Corpora: An Update
*3*	Meet the LDC Staff
*4*	Workshops

(To jump ahead to a particular article, search for *Number*.)

--------------------------------------------------------------------------------

*1*			What is the LDC?

There is increasing interest in computer-based linguistic
technologies, including speech recognition and understanding, optical
and pen-based character recognition, text retrieval and understanding,
and machine translation. In each area, we have useful present-day
systems and realistic expectations of progress.

However, because human language is so complex and information-rich,
computer programs for processing it must be fed enormous amounts of
varied linguistic data (speech, text, lexicons, and grammars) to be
robust and effective. Not even large companies can easily afford
enough of this data to satisfy their research and development needs.
Researchers at smaller companies and universities risk being frozen
out of the process entirely.

Therefore, the Defense Advanced Research Projects Agency (DARPA) has
helped form a broadly based consortium of companies, universities, and
government agencies. The Linguistic Data Consortium (LDC) will make it
possible to share pre-competitive development costs widely. An initial
two-year grant of $5 million from DARPA will amplify the effect of
members' contributions, so that members are guaranteed access to far
more data than any one of them could afford individually. Additional
government investment in the technologies supported by the LDC is
likely in the future. A significant fraction of the consortium's
ongoing budget will also come from membership fees, which are set at
$20,000 per year for for-profit institutions and $2,000 per year for
non-profit institutions. Senior Members contribute $200,000 per year
and are represented on the LDC Board.

DARPA has also contributed a number of speech and text corpora to the
LDC: the TIMIT, Resource Management (RM), Air Travel Information
System (ATIS), SWITCHBOARD, and Air Traffic Control (ATC) speech
corpora; the Penn Treebank annotated text corpus; the MUC corpus of
Foreign Broadcast Information Service (FBIS) reports on terrorism;
and the TIPSTER/TREC text corpus. In addition, DARPA has agreed to
contribute all additional relevant data that it sponsors.

The operations of the LDC will be closely tied to the evolving needs
of the research and development community that it supports. The LDC
will undertake a vigorous, ongoing campaign to acquire and create
resources in areas of interest to its members.

To join the LDC, or to get more information about its activities,
please contact: 

The Linguistic Data Consortium 
441 Williams Hall
University of Pennsylvania 
Philadelphia, PA 19104-6305 
Tel.:	(215) 898-0464 
Fax:	(215) 573-2175 
email:	ldc@unagi.cis.upenn.edu

--------------------------------------------------------------------------------

*2*			LDC Corpora: An Update

The tables on the following pages summarize the speech and text
corpora that are free to LDC members.  In addition to the "classics,"
such as TIMIT and ATIS, there are several new or soon-to-be-released
items. Some are sponsored by DARPA; in other cases, LDC is cooperating
in the production and distribution of data from particular projects;
and still others are the first fruits of LDC-funded efforts.

NTIMIT is a single volume containing all of the speech material from
TIMIT as processed through the public telephone network by NYNEX
researchers.  It was released in August.  

SWITCHBOARD will occupy about 40 CD-ROMs; it consists of 2500
telephone conversations, each lasting 5 to 10 minutes, among over 500
men and women.  The two sides of each conversation can be either split
or combined.  Transcripts and word-by-word time alignments are
provided, along with tables describing callers, topics of
conversation, etc.  It will be released in December.

ATC will be a set of about 8 disks with 70 hours of Air Traffic
Control communications between controllers and pilots at three major
airports, transcribed and time-aligned.  Release is expected by
January 1993.  

MAPTASK consists of 128 task-driven conversations between subjects
interacting as they determine routes on a map.  The speech is mostly
Scots English, collected at Edinburgh University in an office-like
environment at 20 kHz.  The complete set of 8 volumes, which LDC is
helping to produce, will be available to members in December.  

MARSEC, or "Machine-Readable Spoken English Corpus," is another
British product, from the Universities of Leeds and Lancaster.  LDC
has the data and will produce a version with NIST standard file
formats; further information will be sent by mail soon.  

The TREEBANK is an ongoing project at the University of Pennsylvania,
under the direction of Mitch Marcus, providing part-of-speech tags and
parse trees for text corpora. Some early Treebank data has been
available on the ACL/DCI disk. The LDC now plans to publish a CD-ROM
in December with a complete tagged and parsed version of the widely
used Brown corpus, as well as other material.

			LDC-Funded Projects

In addition to the material in the tables, the LDC will be funding or
co-funding a limited number of new data collection projects.  In
general, the criteria for LDC funding will be:
 
1. A perceived need for data or resources by LDC members; 

2. Anticipated high impact on research for dollars spent; 

3. Data not likely to be available when needed from already funded
sources.  

Basic research or projects with high technical risk will not be
funded.  Within legal and practical bounds, we hope to make the LDC
funding process simple and efficient.  Unsolicited "preproposals" will
be accepted at any time, preferably in the form of email.  A reply
indicating LDC's level of interest will follow as soon as possible,
certainly within 30 days, and guidelines for a formal proposal will
normally be offered at that time.  

After technical review and recommendations by LDC staff and outside
experts, final decisions on funding are made by the LDC Board.
Contracts are issued and monitored through the University of
Pennsylvania Office of Research Administration.  

Occasionally LDC will issue calls for proposals by email, either to
fill a specific need expressed by members or to encourage ideas and
competition where one or more preproposals have been received.  The
email list for these calls will include LDC member institutions and
prospective members.  If you wish to be included, send email to LDC.

To date some 17 preproposals and four full proposals have been
received.  Three calls for proposals will be issued shortly. Funding
decisions on several proposals are awaiting a Board meeting in
October.

--------------------------------------------------------------------------------

*3*			Meet the LDC Staff

Dr. Mark Liberman, the Director of the LDC, is Trustee Professor of
Phonetics at the University of Pennsylvania. Before coming to the
University of Pennsylvania, he was Head of the Linguistics Research
Department at AT&T Bell Laboratories. Dr. Liberman received his Ph.D.
degree from the Massachusetts Institute of Technology.  

Dr. John Godfrey, the Executive Director for the LDC, is on loan to
the LDC from Texas Instruments, where he was Project Manager and
Principal Investigator for the Speech Corpus Collection Contract. Dr.
Godfrey managed the production of the two largest speech corpora of
their kind: SWITCHBOARD and Air Traffic Control (ATC). He did his
doctoral work at Georgetown University.  

Dave Graff is the Programmer Analyst for the LDC. He received a B.A.
in linguistics from Pitzer College in Claremont, California and is
currently working on his doctoral thesis in linguistics at the
University of Pennsylvania under Dr. William Labov. Before joining the
LDC, Mr. Graff was on the engineering staff at RCA/GE Advanced
Technology Laboratories in Moorestown, New Jersey.  

Elizabeth Hodas did her undergraduate work at the University of
Pennsylvania, where she majored in French and German.  She has an MS in
Education from the Graduate School of Education and recently completed
an MS in Computer Science from the School of Engineering and Applied
Science, both at the University of Pennsylvania. Ms. Hodas is the
Administrative Assistant for the LDC and the editor of the LDC
Newsletter.

--------------------------------------------------------------------------------

*4*			Workshops

In order to promote work in areas of interest to the Linguistic Data
Consortium, the LDC expects to sponsor a variety of occasional
workshops. The first such event was the recent meeting of the Grammar
Evaluation Interest Group, which was held at the Institute for
Research in Cognitive Science at the University of Pennsylvania on
September 20-21, 1992.

The meeting featured a workshop dubbed "Parseval," for which
participants submitted the results of running their parsers on a
common set of sentences selected at random from the Penn Treebank.
This "dry run" was used to fine-tune the Parseval metric for scoring
the accuracy of parser bracketing. A session was also devoted to
suggestions on how to extend or improve the Treebank annotation
system.  
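
For readers unfamiliar with bracket scoring, here is a rough sketch of
the idea. It is not the group's actual metric, whose details are not
given in this newsletter. It assumes one common formulation: each
parse is reduced to a list of (start, end) word spans, one per
bracket, and a parser's output is compared with the Treebank version
by precision and recall. The function name bracket_scores and the
sample spans are made up for illustration.

```python
from collections import Counter


def bracket_scores(gold, test):
    """Return (precision, recall) over unlabeled constituent spans.

    Each argument is a list of (start, end) word-index pairs,
    one pair per bracket in the parse.
    """
    gold_counts = Counter(gold)
    test_counts = Counter(test)
    # Multiset intersection: a span is matched at most as many
    # times as it occurs in both parses.
    matched = sum((gold_counts & test_counts).values())
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical brackets for a five-word sentence (illustrative only).
gold = [(0, 5), (0, 2), (2, 5), (3, 5)]   # Treebank bracketing
test = [(0, 5), (0, 2), (2, 5), (2, 4)]   # a parser's bracketing
precision, recall = bracket_scores(gold, test)
print(precision, recall)
```

In this made-up case three of the four brackets agree, so both scores
are 0.75. A fuller scorer would likely also count brackets that cross
Treebank brackets, which this sketch leaves out.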

The LDC will sponsor a workshop in October to lay the groundwork for
COMLEX, a common lexical database project. The goal of the project is
the creation of a flexible, scalable lexical resource of value to
researchers in many disciplines. This meeting grew out of Ralph
Grishman and James Pustejovsky's proposal for a common lexicon for
message understanding applications.

Early in the spring of 1993 the LDC will sponsor a workshop on the
annotation of speech corpora. The expected outcome of that workshop is
the first edition of a manual proposing transcription and annotation
standards, at least for LDC-sponsored projects.

--------------------------------------------------------------------------------
