********************************************************************************
      __      ____     ______          __   ___   ______   __    __   ______  
     / /     / __ \   / ____/         /  | /  /  / ____/  / /   / /  / ____/ 
    / /     / / / /  / /             /   |/  /  / /_     / /   / /  / /___
   / /     / / / /  / /             /       /  / __/    / /   / /  /___  /
  / /___  / /_/ /  / /____         /  /|   /  / /____  /   /|  /  ____/ /
 /_____/ /_____/   \_____/        /__/ |__/  /______/ /___/ |_/  /_____/

********************************************************************************

July, 1993							Volume 1/Issue 3

--------------------------------------------------------------------------------

Table of Contents

*1*	The UN Multilingual Text Corpus
*2*	The TIPSTER Project
*3*	Development of Speech Data Collection Infrastructure
*4*	Changes in LDC Membership Policy and Fee Structure

( To jump ahead to a particular article, search for *Number* )

--------------------------------------------------------------------------------

*1*		   The UN Multilingual Text Corpus
			   David Graff, LDC

One of the Linguistic Data Consortium's major goals for next year is
the acquisition of multilingual text to support research in machine
translation and other areas. Parallel multilingual texts are
especially valuable, but are extremely difficult to find (see
Multilingual Parallel Text Corpora, Susan Armstrong-Warwick, LDC
Newsletter, Vol. 1, No. 2, pp. 3-4).  A significant step forward in
this effort has resulted from negotiations with the United Nations in
New York. The UN has agreed to make its electronic text archives
available for language research, and the LDC has taken on the task of
making these archives accessible to the research community.  Initial
negotiations with the UN were made by Dragon Systems Inc. beginning in
1990, and were continued by the LDC in December 1992.

The electronic archives consist of all UN documents of public record
dating from 1988 to the present.  The documents include the
proceedings, resolutions and reports of the General Assembly, the
Security Council, UNICEF, the Economic and Social Council, and
numerous other committees, commissions and councils within the UN.
The majority of archival material represents parallel text in the six
official languages of the UN: English, French, Spanish, Russian,
Arabic and Chinese.  

So far, the LDC has received copies on tape of only the English,
French and Spanish archives.  The amount of text delivered to date is
in the neighborhood of 2.5 gigabytes.  The full extent of parallelism
in these texts is not clear at present; it appears that some portion of the
archives is made up of material that exists in only one or a subset of
the three languages.

Obtaining data for the other three official languages will likely
require somewhat greater effort, because the UN's archiving practices
were not consistent across all languages.  While the English, French
and Spanish archives exist on removable 80 megabyte disk packs, the
Chinese, Arabic and Russian data are only found on tape cartridges and/or
5.25-inch floppy disks.  Since no data from the latter three
languages have been sent to the LDC yet, it is uncertain what
additional effort will be required to transform the text to an
accessible format, and how much data (and parallel text) actually
exist in these languages.

The UN texts were created and archived on WANG VS computer systems,
using the Wang WP word processing program.  The tapes delivered to the
LDC were copied from the archived disk packs by means of WANG BACKUP.
Each of these Wang programs uses its own file formatting scheme, which
had to be reverse-engineered at the LDC so that programs could be
written to extract the actual text data from the tapes.  The LDC's
efforts to make sense of the WP character encoding, format control
codes and file structure were helped substantially by Dominique
Petitpierre of ISSCO, as well as by the technical support staff at
Wang Office Systems.

The English, French and Spanish texts are being transliterated to the
ISO 8859-1 (Latin1) character set, an 8-bit encoding system in which
accented characters of European languages (and some other specialized
symbols) are provided in the upper half of the 256-character table.
Common 7-bit ASCII, or ISO-646, occupies the lower half of the table.
In addition, the various WP text formatting control codes (such as
line-centering, underlining, indentation, tab-stop settings, etc.) are
being preserved in the form of SGML (Standard Generalized Markup
Language) tags.  Considerable care is being taken to ensure that the
resulting text files are fully SGML compatible and parsable.  In this
regard, we are especially grateful to David McKelvie of the HCRC in
Edinburgh, Scotland, for providing a critique and verification of some
extracted samples, and for creating a complete SGML Document Type
Definition (DTD) and character set specification, which will be
distributed with the data when it is published.
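The control-code conversion can be pictured with a short sketch. The byte values below are invented placeholders (the real Wang WP codes were determined by reverse engineering and are not reproduced here); only the paired open/close tagging scheme reflects the process described above.

```python
# Sketch of converting word-processor control codes to SGML tags.
# The byte values below are hypothetical placeholders, not the real
# Wang WP codes; only the paired open/close scheme is illustrative.
WP_TO_SGML = {
    0x02: ("<center>", "</center>"),   # hypothetical line-centering code
    0x03: ("<ul>", "</ul>"),           # hypothetical underlining code
}

def convert(items):
    """items: list of (is_control, value) pairs; returns an SGML string."""
    out, open_codes = [], []
    for is_control, value in items:
        if not is_control:
            out.append(value)
        elif value in WP_TO_SGML and value not in open_codes:
            out.append(WP_TO_SGML[value][0])    # open the tag
            open_codes.append(value)
        elif value in open_codes:
            out.append(WP_TO_SGML[value][1])    # close the tag
            open_codes.remove(value)
    while open_codes:                           # keep output parsable
        out.append(WP_TO_SGML[open_codes.pop()][1])
    return "".join(out)

print(convert([(True, 0x02), (False, "CHAPTER I"), (True, 0x02)]))
# <center>CHAPTER I</center>
```

Closing any tags left open at end of input is one way to keep the extracted files parsable, as described above.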

It is not clear at present what solution will be adopted for character
encoding in the other three languages, once these become available
from the UN.  We would welcome suggestions from LDC members as to the
most accessible methods for encoding Arabic, Russian and Chinese,
bearing in mind that the resulting files must be able to accommodate
some Roman alphabetic and decimal numeric strings interspersed with
the running text, as well as, presumably, SGML tags.  The LDC would
like to develop (or approximate) a consensus on this issue.

The initial publication of parallel text data from the UN is expected
to be ready for release in the fall of 1993.  It will consist of one
CD-ROM each for English, French and Spanish (approximately 650
megabytes per disc). The directory and file structure on each disc
will directly reflect the parallel relations among texts (i.e. a given
document will have the same path and file name on each disc, with the
exception of a single component that describes the language of each
file).  While there will be no attempt to insert tags specifically to
mark alignment points within parallel documents, there will be an
abundance of cues for alignment within the text itself, owing to the
fairly consistent use of typographic formatting (chapter headings,
etc.), and frequent use of sequence numbers assigned to each paragraph.
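The parallel layout can be sketched in a few lines; the directory names below are invented for illustration, and only the principle (identical paths apart from a single language component) comes from the plan above.

```python
import posixpath

# Sketch of the parallel directory scheme: the same document keeps the
# same path on every disc, differing only in the language component.
# The path names here are invented examples, not the actual layout.
def parallel_paths(doc_path, languages=("english", "french", "spanish")):
    """Map each language to the path of the same document on its disc."""
    return {lang: posixpath.join(lang, doc_path) for lang in languages}

paths = parallel_paths("ga/session47/res0001")
print(paths["french"])    # french/ga/session47/res0001
```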

The UN corpus promises to be an invaluable resource for researchers in
machine translation. The UN will benefit from this work as well, in
that the archives will be returned to them in a converted form that
will be portable to their new PC-based word processing systems.


*2*			 The TIPSTER Project
     Donna Harman, National Institute of Standards and Technology

The TIPSTER project is sponsored by the Software and Intelligent 
Systems Technology Office of the Advanced Research Projects Agency 
(ARPA/SISTO) in an effort to significantly advance the state of the art in

effective document detection (information retrieval) and data extraction
from large, real-world data collections.

There are two separate but connected parts of TIPSTER.  The first part of
the project, document detection, is concerned with retrieving relevant
documents from a very large (3 gigabyte) collection of documents, both in
a routing environment, and in an ad hoc retrieval environment.  The routing
environment is similar to the document filtering or profile searches
currently done in libraries, where a query topic is constant, and the
documents are viewed as the incoming stream of publications.  The ad hoc
part of the project is similar to the standard search done against static
collections.

The second part of the TIPSTER project is concerned with data extraction.  
Here it is assumed that there is a much smaller set of documents, presumed 
to be mostly relevant to a topic, and the goal is to extract information 
to fill a database.  This database could then be used for many applications, 
such as question-answering systems, report writing, or data analysis.  
The rest of this report concerns only the first part of the TIPSTER project, 
the detection part, as at present this is the only data available through 
the LDC.  We hope to make the extraction data available in the future.

The detection data consist of a new test collection built at NIST to be
used both for the TIPSTER project and the related TREC project.  The TREC
project has many other participating information retrieval research groups,
working on the same task as the TIPSTER groups, but meeting once a year 
in a workshop to compare results (similar to MUC).  The test collection built
at NIST consists of 3 disks (about one gigabyte each) of documents,
150 topics, and the answers 
(relevant documents) for those topics. 

The documents in the test collection are varied in style, size, and
subject domain.  The first disk contains material from the Wall Street 
Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal 
Register (1989), information from Computer Select disks (Ziff-Davis 
Publishing), and short abstracts from the Department of Energy.
The second disk contains information from the same sources, but from 
different years.  The third disk contains more information from the 
Computer Select disks, plus material from the San Jose Mercury News (1991), 
more AP newswire (1990), and about 250 megabytes of formatted U.S. Patents.
The format of all the documents is relatively clean and easy to use,
with SGML-like tags separating documents and document fields.  There is no 
part-of-speech tagging or breakdown into individual sentences or paragraphs 
as the purpose of this collection is to test retrieval against real-world 
data.  
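Reading such a collection might look like the sketch below; the sample document is invented, and the field names simply follow the SGML-like tagging convention described above.

```python
import re

# Sketch of splitting a collection file into documents, assuming each
# document is bracketed by <DOC>...</DOC> tags as described above.
def split_docs(text):
    """Return the contents of each <DOC>...</DOC> region."""
    return re.findall(r"<DOC>(.*?)</DOC>", text, flags=re.DOTALL)

sample = ("<DOC>\n<DOCNO> WSJ860102-0001 </DOCNO>\n"
          "<TEXT>\nSample article text.\n</TEXT>\n</DOC>\n")
docs = split_docs(sample)
print(len(docs))    # 1
```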

The 150 topics (user need statements) were built by real users of a retrieval 
system and were constructed to be relatively narrow (retrieving an average
of 300 relevant documents), although there is a range of broad and narrow 
topics.  These statements are about a page long and highly structured, 
including special fields such as related concepts, definitions, and the 
narrative, which is the description of what constitutes a relevant document.
The relevance judgments consist of lists of documents considered to be 
relevant to each topic.  They were made using a sampling method, with the 
sets of documents retrieved at high ranks (most likely relevant) by
both the TIPSTER contractors and the TREC participants forming a pool of 
likely relevant documents for each topic.  This pool was then judged by 
professional relevance assessors at NIST.

A preliminary version of the test collection is available from LDC,
with a final version to be ready in the fall.


*3*	 Development of Speech Data Collection Infrastructure
		     by Jim Glass and Victor Zue

One of the stated goals of the LDC is to collect speech data in
quantities far exceeding what is currently available. This can best be
achieved with multiple sites participating in concurrent collection
efforts and combining the resulting data.  To facilitate this effort
and to provide data collection standards, the LDC has initiated an
effort with the Spoken Language Systems group at MIT's Laboratory for
Computer Science to develop a software environment which can be used
for data collection purposes. 

In the first stage of the four-part project, a speech prompting and
collection interface program will be developed to provide an
environment for the collection of prompted speech, with or without
supervision.  The program will enable a user to select and display
words, phrases, or sentences.  The program will permit both "push to
talk" and "open mic" recording conditions.  It will give graphic
status cues wherever appropriate (e.g., ready, recording, amplitude
levels), and allow rapid playback.  The program will also be able to
display and produce printouts of waveforms and spectrograms. 

The second part of the project will produce a postprocessing
verification program and simple alignment tools.  This program will
feature interactive verification of speech content in order to permit
a human transcriber to authenticate an utterance, view the putative
transcription (if the utterance was prompted), confirm or alter the
transcription, trim or subdivide the utterance, and create new speech
and text files as needed.

The third part of the project will feature automatic transcription and
alignment.  This will take as input a speech file and its orthographic
transcription, and produce as output the start and end times of each
word.  Similarly, a hypothesized phonetic transcription for the entire
utterance can be aligned with the waveform indicating the start and
end times for each phoneme or other phonetic segment.

The fourth and final part of the project will consist of analytical
tools for corpus development.  This interactive program will
facilitate distributional analysis of corpora according to linguistic
factors such as phonetic content, stress patterns, minimal pairs, etc.
It will handle corpora in normal orthographic form, and will also
include word-level tools for determining frequency of occurrence of
words, word parts, or word sequences, and a facility for computing
statistical N-gram language models from a given corpus.
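The word-frequency and N-gram counting described above can be sketched in a few lines; this is an illustrative minimum, not the planned tool itself.

```python
from collections import Counter

# Sketch of N-gram counting over whitespace-tokenized text; shown for
# bigrams, the simplest case of the statistics described above.
def ngram_counts(tokens, n=2):
    """Count N-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)
print(counts[("the", "cat")])    # 1
```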

All of the software developed in this effort will be available for LDC
members.  Early releases will be made to those members willing to
serve as beta test sites.


*4*	  Changes in LDC Membership Policy and Fee Structure
			 John J. Godfrey, LDC

As LDC nears the end of its first membership year, the Board of
Directors has reviewed membership policies, fees, and other operational
matters in the light of experience gained from dealing with our
members, sponsors, and contractors.  They asked such questions as:
Does this practice (or rule, or fee) help us serve our members better?
Does it increase membership?  Does it contribute to the eventual
self-sufficiency of LDC?

The first year has been very successful in most respects.  Nearly 100
CDs will be available by August 31; a dozen contracts have been let,
with more than 100 CDs already expected for the next membership year;
the number of members is in the 60s and growing; and feedback about
the quality and selection of corpora has been very positive.
Notwithstanding this general success, the first year's experience
suggests that some course corrections be made in membership policy and
in fee structure.  LDC is intended to be self-sufficient in a few
years' time, but to achieve this it will need more senior members
and/or more reinvestment income from sales or memberships.

Some of the changes will be welcome to all parties:

1.  NO CUMULATIVE FEES. The original grant authorizes the LDC to
require organizations joining in later years to pay membership fees
cumulatively, back to the founding of LDC.  This will be changed.  Instead, each
corpus will have a release date corresponding to a membership year
(e.g., 1993 for everything released before 1 September 1993), and will
be available to members of record for that year at member rates, and
to others at nonmember rates.  For example, an organization that joins
in 1994 and wants to acquire only, say, Vol. 1 of TIPSTER can simply
pay the nonmember rate of $1000 rather than join for 1993.  On the
other hand, if they want corpora whose nonmember prices add up to more
than the price of membership, it would make sense to pay the 1993
membership fee and then request the corpora as a benefit of
membership, since 1993 corpora are free to members.

2.  60-DAY NOTICE.  The membership agreement calls for members who do
not wish to renew for the following year to give notice at least 60
days in advance (i.e., in the first week of July).  While we certainly
encourage everyone to renew memberships promptly, this rule will not
be enforced, and will be removed from the membership agreement in the
future.  Invoices will be sent to all current members in July or
August for the 1994 membership year.

Some of the changes will NOT be welcome:

3.  PER CORPUS CHARGES.  Beginning in 1994, regular membership dues
will cover only the license fees for LDC corpora, plus an allotment of
10 speech or 2 text disks; there will be a charge for all disks after
that.  Unless otherwise specified, this charge will be $100 per disk
for speech corpora, and $500 per disk for text (including lexica), or
the nonmember catalog price, whichever is less.
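The per-disk rule amounts to a simple minimum; the sketch below merely restates the arithmetic of the paragraph above.

```python
# Sketch of the per-disk charge rule: the standard rate by media type,
# or the nonmember catalog price, whichever is less (prices in dollars).
def disk_charge(kind, catalog_price):
    standard = 100 if kind == "speech" else 500
    return min(standard, catalog_price)

print(disk_charge("text", 350))     # 350 (catalog price is lower)
print(disk_charge("speech", 350))   # 100 (standard speech rate)
```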

The differential reflects the (normally) much greater development
costs per volume of textual data.  The per-disk charges are expected
to generate enough income to fund the collection of one more corpus
per year, without further invading the principal of LDC's initial
grant.  We sincerely hope that you will appreciate the need for this
increase.  Although it falls proportionately more on the nonprofits,
their cost is still heavily subsidized, and LDC remains committed to
the principle that no researcher should lack access to data because of
a true inability to pay.

4.  NONMEMBER PRICES.  For this reason also, we will continue to offer
some corpora that are especially important to smaller-scale academic
research at very low single-copy (nonmember) prices.  Enclosed below
is a revised version of the LDC price list, which includes some
estimated release dates and prices for future corpora.

