Date: 18 December 1993
From: Jane Edwards (edwards@cogsci.berkeley.edu) via ftp from 128.32.211.5
Subject:  Survey of Electronic Corpora and Related Resources

Appended below is the electronic version of Chapter 10 (pp. 263-310) from
the following book, reproduced by permission of the publisher:

   Edwards, Jane A. & Martin D. Lampert (eds). TALKING DATA: TRANSCRIPTION AND
        CODING IN DISCOURSE RESEARCH.  London and Hillsdale, NJ: Erlbaum. 
        336 pp. 0-8058-0349-1 [ppr] US $27.50; 0-8058-0348-3 [hdbk] US $59.95; 
        (Prepaid: $24.75 & $53.95) Discourse, spoken language corpora.
   Transcription and coding systems from contrasting approaches to spoken
   language situated in their theoretical frameworks with sample analyses.
   Overview chapters present global design principles. Includes a large
   compendium of computerized corpora and related resources.   To order in
   US: 1-800-926-6579

I would greatly appreciate knowing of inaccuracies or additional
resources which should be mentioned in an update to be submitted at a
future date if needed to the ICAME fileserver in Bergen (see below).

Best Wishes,

-Jane Edwards (edwards@cogsci.berkeley.edu)

--------------------------------------------------------------------------


                               Chapter 10:
             Survey of Electronic Corpora and Related Resources 
                        for Language Researchers 
                             Jane A. Edwards
                   University of California at Berkeley
 
CONTENTS
1.  INTRODUCTION . . . 267
2.  INFORMATION SOURCES . . . 269
    A.  Centers and Associations . . . 269
        (1) NCCH (Norwegian Computing Centre for Humanities) . . . 269
        (2) CTI (Computers in Teaching Initiative Centre for 
            Textual Studies) . . . 269
        (3) CETH (Center for Electronic Texts in the Humanities) . . . 270
        (4) ACH (Association for Computers and the Humanities) . . . 270
        (5) ALLC (Association for Literary and Linguistic Computing) . . . 271
        (6) ACL (Association for Computational Linguistics) . . . 271
    B.  Electronic Mail Distribution Lists and Discussion Lists . . . 272
        (1) HUMBUL . . . 272
        (2) CORPORA . . . 272
        (3) HUMANIST . . . 272
        (4) LINGUIST . . . 273
        (5) LN, Langage Naturel, . . . 273
        (6) PROSODY . . . 274
        (7) Comserve  . . . 274
        (8) Applied linguistics (TESL-L, SLART-L, MULTI-L, LTEST-L) . . . 274
        (9) FUNKNET . . . 275
       (10) info-childes and info-psyling . . . 275
       (11) ASLING-Linguistics of Signed Languages . . . 275
       (12) List of lists . . . 275
    C.  Email Addresses . . . 276
3.  TEXT ENCODING STANDARDS (TEI, IPA, SAM, TOBI)  . . . 276
4.  DATA SOURCES . . . 278
    A.  Electronic Data Archives and Repositories . . . 278
        (1) OTA (Oxford Text Archive) . . . 278
        (2) ICAME (International Computer Archive of Modern English) . . . 278
        (3) CHILDES (The Child Language Exchange System) . . . 279
        (4) CETH (Center for Electronic Texts in the Humanities) . . . 279
        (5) The AIATSIS Aboriginal Studies Electronic Data Archive . . . 280
        (6) Project Gutenberg . . . 280
        (7) Library of the Future . . . 280
    B.  Surveys of Electronic Language Data . . . 280
        (1) Oxford Text Archive (OTA) catalogue . . . 280
        (2) University of Lancaster Survey . . . 280
        (3) Georgetown University Catalog of Archives and Projects . . . 281
        (4) Walker and Zampolli survey . . . 281
        (5) List of Electronic Texts in Philosophy . . . 281
        (6) List of Electronic Dictionaries . . . 281
        (7) Catalog of the University of Cambridge Literature
            and Linguistics Computing Centre . . . 282
        (8) Linguistic Society of America List . . . 282
        (9) The Marchand list of CD-ROM Projects . . . 282
       (10) ARL Directory of Electronic Publications . . . 282
5.  CORPORA AND TEXTBANKS . . . 282
    A.  Running text:  English Language . . . 283
        (1) Brown Corpus . . . 283
        (2) Lancaster-Oslo/Bergen (LOB) . . . 284
        (3) London-Lund Corpus . . . 285
        (4) Lancaster Spoken English Corpus (SEC) . . . 285
        (5) PIXI Corpora . . . 285
        (6) Helsinki Corpus of Historical English . . . 286
        (7) Macquarie (University) Corpus . . . 286
        (8) Kolhapur Corpus of Indian English . . . 286
        (9) American Heritage Intermediate Corpus . . . 286
       (10) Birmingham Collection of English Text (BCET) . . . 286
       (11) Longman/Lancaster English Language Corpus . . . 287
       (12) Corpus of Spoken American English (CSAE) . . . 287
       (13) International Corpus of English (ICE) . . . 287
       (14) British National Corpus Initiative (BNC) . . . 287
       (15) Bellcore Lexical Research Corpora . . . 288
       (16) Association for Computational Linguistics Data Collection 
            Initiative (ACL/DCI) . . . 288
       (17) European Corpus Initiative (ACL/ECI) . . . 289
       (18) Cambridge Language Survey (CLS) . . . 289
       (19) Linguistic Data Consortium (LDC) . . . 289
       (20) American News Stories . . . 290
       (21) Nijmegen TOSCA Corpus . . . 290
       (22) Melbourne-Surrey Corpus . . . 290
       (23) Corpus of English-Canadian Writing . . . 290
       (24) Warwick Corpus . . . 290
       (25) Cornell corpus . . . 290
       (26) NEXIS, LEXIS, MEDIS (Mead Data Central)
            and WESTLAW (West Corporation) . . . 291
    B.  Running text:  French Language . . . 291
        (1) OTA holdings . . . 291
        (2) Hansard Canadian Parliamentary Sessions . . . 291
        (3) Ottawa-Hull Corpus of Spoken French . . . 291
        (4) Tresor de la Langue Francaise (TLF or ARTFL) . . . 291
    C.  Running text:  German Language . . . 292
        (1) Mannheim Corpus . . . 292
        (2) Bonner Zeitungskorpus . . . 292
        (3) Freiburger Corpus . . . 292
        (4) LIMAS Corpus . . . 292
        (5) Pfeffer Spoken German Corpus . . . 292
        (6) Ulm Textbank . . . 292
        (7) Muenster Textbank . . . 292
    D.  Running text:  Italian Language . . . 292
        (1) PIXI corpora . . . 292
        (2) Pisa corpus . . . 292
    E.  Running text:  Other Languages . . . 293
        (1) Native American Languages . . . 293
        (2) Australian Indigenous Languages . . . 293
        (3) Danish . . . 293
        (4) Estonian . . . 293
        (5) Finnish . . . 293
        (6) Spanish . . . 293
        (7) Swedish . . . 293
        (8) Yugoslavian . . . 293
    F.  Running text:  Language Acquisition . . . 294
        (1) Child Language Acquisition (CHILDES, PoW) . . . 294
        (2) Adult Second Language Acquisition (ESFSLDB, Montreal) . . . 294
    G.  Phonetic Databases . . . 295
        (1) DARPA Speech Recognition Research Databases . . . 295
        (2) Phonetic Database (PDB) . . . 295
        (3) Multi-Language Speech Database . . . 295
    H.  Electronic Dictionaries . . . 296
        (1) See the Wooldridge list . . . 296
        (2) Oxford Text Archive (OTA) holdings . . . 296
        (3) Oxford English Dictionary (OED) . . . 296
        (4) Le Robert Electronique . . . 296
    I.  Lexical Databanks . . . 296
        (1) MRC Psycholinguistic Database . . . 296
        (2) Consortium for Lexical Research (CLR) . . . 297
        (3) Centre for Lexical Information (CELEX) . . . 297
        (4) Acquisition of Lexical Knowledge (ACQUILEX) . . . 298
        (5) Cambridge Language Survey (CLS) . . . 298
        (6) Japanese Electronic Dictionary Research Project . . . 298
    J.  Treebanks . . . 298
        (1) Lancaster-Leeds Treebank . . . 298
        (2) Lancaster Parsed Corpus . . . 298
        (3) Linguistic DataBase System (LDB) . . . 298
        (4) Penn Treebank Project . . . 299
        (5) Treebank of Written and Spoken American English . . . 299
    K.  Translation into English . . . 299
6.  LITERATURE PERTAINING TO ELECTRONIC CORPORA . . . 300
ACKNOWLEDGMENTS . . . 300
REFERENCES . . . 301
APPENDIX . . . 307

                                                                 [267]
                       1. INTRODUCTION

Corpora and textbanks of natural language sentences or utterances are
becoming increasingly widely used in linguistics, lexicography, and
computer science research, in part due to facilitatory technological
advances but also due to a broadening of focus in these three fields to
include a greater interest in produced language (vs. introspective
knowledge), structured interdependencies involving larger stretches of
text (vs. individual utterances or sentences), and contrasts across
language varieties, genres, and modalities (e.g., British vs.  American
English; narratives vs. interviews; spoken vs. written language).  For
further discussion, see Chafe (1992), Church (1991), Fillmore (1992),
Francis (1982), Halliday (1992), Leech (1991, 1992), Sinclair (1992),
and Svartvik (1992a).
   It is significant that a corpus often contains utterances or sentences
which would seem implausible from introspection but are perfectly
natural and acceptable in context (such as "It'll've been going to've
been being tested every day for about a fortnight soon!" from Halliday,
1992), and conversely, that sentences invented to illustrate
grammatical points may seem implausible as actual utterances because
they violate discourse constraints or expectations reflected in
definiteness of referents, aspectual perspectives taken on events, or
other properties (see Chafe, 1992, for examples and discussion).
Corpus-based approaches can bring to light aspects of linguistic
structure and process which are not illuminated in introspectively
generated data or psycholinguistic experiments and are needed for
comprehensive understanding of language phenomena (see Chafe, 1992;
Leech, 1991, 1992; Svartvik, 1992a concerning the particular
contributions of different approaches).
   In lexicography, corpora and textbanks enable a more efficient and
exhaustive cataloging of word senses and collocations than is possible
with introspection alone (see Kjellmer, 1984; Sinclair, 1982; Sinclair
& Kirby, 1990).  In addition, they enable systematic attention to
contrasts between spoken and written uses of words, contrasts in
meaning as a function of position in the utterance or prosodic
features, and the relative frequencies of word senses (see Altenberg,
1990, for a comparison of corpus-based dictionaries).
   Corpora of increasing size are also being used in probabilistic sense
disambiguation, speech recognition, automatic syntactic analysis,
automatic assignment of intonation to written texts, and other types of
models and applications (to name but a few: Bachenko & Fitzpatrick,
1990; Bindi, Calzolari, Monachini, & Pirrelli, 1991; Brill, Magerman,
Marcus, & Santorini, 1990; Church & Hanks, 1990; Hindle & Rooth, 1991;
Knowles & Lawrence, 1987; Leech & Garside, 1991; Liberman, 1989; Morris
& Hirst, 1991; Sampson, 1992; Svartvik, 1990).
   Where one million words was once considered large, some of the projects
summarized below seek to gather 100 million words.  For written
                                                                 [268]
language, this is facilitated by the increasing availability of text
already on computer media (such as from typesetter tapes).  Spoken
language is less frequently available in this way, and therefore must
be specially gathered and prepared for electronic use.  In both cases,
data sharing and reuse are increasingly important both within and across
disciplinary boundaries, and a single (large) corpus community seems to
be emerging.
   This survey is intended in a modest way to help with this development.
Its focus is electronic corpora and textbanks, and related information
of primary interest to linguistic, computer science, and humanities
research.  The information summarized here was garnered from standard
published sources and the email discussion lists described below.  For
accuracy, the wording of the individual descriptions is as close as
possible to the original source, which is typically the person cited as
the contact person in the entry.  In addition, the descriptions of
completed corpora owe a debt to the following:  Chafe, Du Bois, and
Thompson (1992); Svartvik (1990); Taylor, Leech, and Fligelstone (1989);
and the catalogs of the Oxford Text Archive, the ICAME archive, and the
Georgetown University archives project, all described below.
   What is unique to the current survey is its inclusion of a number of
projects and corpora that have sprung up during the past two years, a
heavier representation of projects in computational linguistics than in
available surveys to date, and the inclusion of electronic discussion
lists and public lists of email addresses, few of which were available
at the time of the earlier surveys.
   The first version of this compilation was completed in 1991, and was
updated and expanded to include new developments through October 1992.
Although I have attempted to make this survey as complete as possible,
this is a rapidly growing area.  Any update of this survey will be
submitted to the ICAME fileserver (see below), possibly for access via
anonymous ftp (file transfer software available on many mainframes).
   Concerning corpora developed before computers, readers are referred to
Francis (1992).  Lexicographical resources are treated here only
briefly, in Sections 5H and 5I.  For further information, readers are
referred to Altenberg (1990), Atkins, Clear, and Ostler (1992),
Boguraev and Briscoe (1988), Gellerstam (1988), Sinclair (1987),
Sinclair and Kirby (1990), and Walker (1992).
   The materials surveyed below are organized with respect to five main
headings:
- information sources (associations, email addresses and discussion lists);
- encoding standards;
- data sources (archives and repositories, surveys of electronic language data);
- descriptions of selected corpora and textbanks; and
- bibliographies of related research.
                                                                 [269]
   The Appendix contains Susan Hockey's summary of resources relevant to
humanities computing.


                     2. INFORMATION SOURCES

A. Centers and Associations

The following organizations encourage corpus-related research and the
exchange of corpus-related information by publishing journals,
sponsoring conferences and workshops, and various other professional
activities.  (Organizations concerned with the gathering and
distribution of electronic data are summarized under "Data Sources,"
later in this chapter.)
   1. The Norwegian Computing Centre for the Humanities (NCCH) was
established in 1972 as a center for research and development to help
individual researchers and academic institutions in the use of
computers in the humanities.  To this end, it develops computing
methods and software for application in humanistic research and
provides information and teaching services to demonstrate how computer
technology can be utilized in the field.  This work is carried out in
cooperation with humanities research institutions and the Norwegian
universities' computing departments.  NCCH houses the ICAME archive
(described later), which contains the most widely used linguistic
corpora of English, and distributes these data at low cost to
researchers.  Its ICAME CD-ROM contains the Brown Corpus (written
American English), the LOB Corpus (written British English), the
London-Lund Corpus (spoken British English), the Helsinki Corpus
(diachronic English) and the Kolhapur Corpus (Indian English), and
costs roughly $500 US.  Further information on the CD-ROM can be
obtained by emailing the message "send icame info.cd" to
fileserv@nora.hd.uib.no, or via anonymous ftp to nora.hd.uib.no
(129.177.24.42) (filename: pub/icame/info.cd). NCCH sponsors the
electronic bulletin board, "CORPORA" (described below), and serves as a
clearinghouse for information concerning corpora, corpus availability,
and corpus research.  For more information:  NCCH, Humanistisk
Datasenter, Harald Haarfagres gt. 31, N-5007 Bergen, Norway; Tel: +47
(5) 212954; FAX:  +47 (5) 322656; email: adm@nora.hd.uib.no or
knut@x400.hd.uib.no.
   2. The Computers in Teaching Initiative Centre for Textual Studies
(CTI) was established in 1990 to promote and support the use of
computers in teaching literature, linguistics and related disciplines
in all British universities.  Begun under the direction of Susan
Hockey, the CTI produces a newsletter, called Computers in Literature,
                                                                 [270]
and a software guide and holds periodic training workshops concerned
with the use of computers in humanities training and research.  It also
sponsors the Humanities Bulletin Board (HUMBUL) described in a later
section.  For more information:  CTI Centre for Textual Studies,
University of Oxford Computing Services, 13 Banbury Road, Oxford, OX2
6NN, UK; Tel: +44 (865) 273 221; FAX: +44 (865) 273 275; email:
ctitext@vax.oxford.ac.uk.
   3. The Center for Electronic Texts in the Humanities (CETH), directed
by Susan Hockey, was established in 1991 by Rutgers and Princeton
Universities with external support from the Mellon Foundation and the
National Endowment for the Humanities.  It is intended to become a
national focus of interest in the United States for those who are
involved in the creation, dissemination and use of electronic texts in
the humanities, and it will act as a national node on an international
network of centers and projects which are actively involved in the
handling of electronic texts.  Developed from the international
inventory of machine-readable texts which was begun at Rutgers in 1983
and is held on RLIN, the Center is now reviewing the records in the
inventory and continues to catalog new texts.  The acquisition and
dissemination of text files to the community is another important
activity, concentrating on a selection of good quality texts which can
be made available over Internet with suitable retrieval software and
with appropriate copyright permission.  The Center also acts as a
clearinghouse on information related to electronic texts, directing
inquirers to other sources of information.  Susan Hockey's useful list
of resources for humanities computing is included below in the
Appendix.  For further information:  Center for Electronic Texts in the
Humanities, 169 College Avenue, New Brunswick, NJ 08903, USA; email
ceth@zodiac.rutgers.edu or ceth@zodiac.bitnet or hockey@zodiac.bitnet;
Tel:  +1 (908) 932-1384; FAX: +1 (908) 932-1386.
   4. The Association for Computers and the Humanities (ACH) is an
international organization devoted to computer-aided research in
literature and language studies, history, philosophy, anthropology, and
related social sciences, especially research involving the manipulation
and analysis of textual materials. The ACH encourages development and
dissemination of significant textual and linguistic resources and
software for scholarly research.  Its official journal, Computers and
the Humanities, is published six times a year.  It also publishes Bits
and Bytes Review, a review of software in the humanities and social
sciences, nine times each year.  Jointly with the ALLC (see next
entry), it sponsors an annual meeting held in North America in
odd-numbered years and in Europe in even-numbered years, which brings
together scholars from around the world to report on research
activities and software and hardware developments in the field.  ACH
                                                                 [271]
initiated the Text Encoding Initiative (TEI), an international effort
to develop guidelines for the encoding of machine-readable literary
and linguistic data.  The ACH also sponsors the Rutgers/Princeton
National Text Archive, the HUMANIST Electronic Discussion Group, and
the LN Electronic Bulletin Board for Natural Language Studies in French
and English. For further information:  Joseph Rudman, Association for
Computers and the Humanities, Department of English, Carnegie-Mellon
University, Pittsburgh, PA 15213, USA; email:  rudman@cmphys.bitnet.
   5. The Association for Literary and Linguistic Computing (ALLC) has
representatives in over 30 countries, including advisors in the
following areas:  Machine Translation, Computer-Assisted Learning,
Lexicography, Software, Structured Databases.  Its journal, Literary
and Linguistic Computing, is published four times per year, containing
papers on all aspects of computing applied to literature and language,
ranging from computing techniques to results of research projects.  To
join ALLC and obtain the journal: Journals Marketing, Oxford University
Press, Pinkhill House, Southfield Road, Eynsham, Oxford, OX8 1JJ, UK,
or Journals Marketing, Oxford University Press, 2001 Evans Road, Cary,
NC  27513, USA.
   6. The Association for Computational Linguistics (ACL) promotes
research on computational linguistics and natural language processing.
It publishes the journal Computational Linguistics and sponsors annual
meetings (usually in North America), biennial European meetings, and
biennial meetings on applied natural language processing, and supports
the international conferences on Computational Linguistics (COLING).
Proceedings of past meetings are available through the ACL Office.  The
ACL also sponsors the Text Encoding Initiative (TEI), for standardizing
the encoding and interchange of machine-readable text, and two data
collection initiatives, the Data Collection Initiative (DCI) and the
European Corpus Initiative (ECI), both described later under Data Sources,
to assemble massive text corpora in English and other languages, and
make them available for scientific research at cost and without
royalties.  Recently, the ACL established a series of Special Interest
Groups (SIGs) on the Mathematics of Language, the Lexicon, Parsing,
Generation, Computational Phonetics, and Multimedia Language
Processing.  Others are likely.  The SIGs organize workshops, prepare
bibliographies, and provide specialized communication channels.  For
more information: Donald E. Walker (ACL), Bellcore, MRE 2A379, 445
South Street, Box 1910, Morristown, NJ 07960-1910, USA; FAX: +1 (201)
829-5981; email: walker@bellcore.com.

                                                                 [272]
B. Electronic Mail Distribution Lists and Discussion Lists

Electronic distribution lists and discussion lists distribute messages
contributed by subscribers to all other subscribers on that list.  They
are a good forum for queries and current information, are easy to join
and unjoin, and often cost nothing beyond what the user's institution
is already paying for email service.
   1.  HUMBUL (Humanities Bulletin Board) is a long-running service aimed
at providing academics and interested parties with news and information
on Humanities Computing.  This service is an on-line bulletin board,
edited by Stuart Lee at the CTI (described earlier) at Oxford
University.  Information is collected from all applicable electronic
networks plus periodicals, leaflets, and also direct requests to the
editor.  At regular intervals, HUMBUL indicates its most recent
acquisitions, and these can be accessed via ftp, telnet, or other
means.  To subscribe, send the following one-line command to
listserv@UKACRL.bitnet:

SUB HUMBUL <John Doe>

where <John Doe> is your name.  If you do not then receive an automatic
message saying you have been added to the list, send email to:
humbul@vax.oxford.ac.uk.
   2.  Begun in 1992, CORPORA is an international email discussion list
for information and questions about text corpora, such as availability,
aspects of compiling and using corpora, software, tagging, parsing,
bibliography, and related matters.  To join the list, send a message
to:

CORPORA-REQUEST@nora.hd.uib.no

To submit a contribution to the list, send it to:

CORPORA@nora.hd.uib.no

The list administrator is Knut Hofland, NCCH, Humanistisk Datasenter,
Harald Haarfagres gt. 31, N-5007 Bergen, Norway; Tel: +47 (5) 212954;
FAX:  +47 (5) 322656; email:  knut@x400.hd.uib.no.
   3.  HUMANIST is an international email discussion list for issues
relating to the application of computers to scholarship in the
humanities.  This includes linguistics, comparative literature,
philosophy, Biblical studies, and several other fields.  Begun in 1987
under joint sponsorship of the ACH, the ALLC and the University of
Toronto's Centre for Computing in the Humanities, it is currently
                                                                 [273]
housed at Brown University and moderated by Elaine Brennan and Allen
Renear.  It has over 600 members in 24 countries. To subscribe, mail
"SUB <your email address>" to listserv@brownvm.brown.edu; to post
articles, mail them to humanist@brownvm.brown.edu.  Articles submitted
to HUMANIST are archived on a file server and can be searched remotely
by means of one-line listserv commands.
   4.  LINGUIST is an international list intended as a place for
discussion of issues of concern to the academic discipline of
linguistics and related fields.  It is moderated by Anthony Aristar
(University of Western Australia) and Helen Dry (University of Texas at
San Antonio).  It explicitly welcomes discussion of any linguistic
subfield.  To subscribe to LINGUIST, send email to the LINGUIST
listserver (listserv@TAMVM1.bitnet or listserv@TAMVM1.tamu.edu),
containing the following one-line message:

SUBSCRIBE LINGUIST <Your Name>

for example, "subscribe linguist Jane Smith."  To submit a posting to
the list, mail it to linguist@TAMVM1.tamu.edu.
   The LINGUIST fileserver may contain contributed files of interest to
language researchers, such as the LSA or Georgetown lists of corpora,
and linguists' email addresses, and these are similarly obtainable by
one-line commands.  For more information, send the one-line command
"help linguist" via email to linguist-request@TAMVM1.tamu.edu.  For
questions requiring human attention, send a message to:
linguist-editors@TAMVM1.tamu.edu.
   5.  LN, Langage Naturel, is an international list for computational
linguistics, sponsored by the Association for Computational Linguistics
(ACL) and the Association for Computers and the Humanities (ACH).  Its
goal is to disseminate calls for papers; conference and seminar
announcements; requests for software, corpora, and various types of
data; project descriptions; and discussions on technical topics.  The
list is primarily French-speaking, but many items are circulated in
English.  The list is moderated by Jean Veronis (Vassar College) and
Pierre Zweigenbaum (France). To subscribe to LN, send the following
one-line message to  listserv@FRMOP11.bitnet:

SUBSCRIBE LN your name

To post a message to the list as a whole, email it to
LN@FRMOP11.bitnet. In case of problems, send a message to one of the
editors:  veronis@vassar.bitnet or zweig@FRSIM51.bitnet.
                                                                 [274]
   6.  PROSODY is an international list with members representing a broad
spectrum of approaches including linguistics, psycholinguistics, and
computer science.  It serves the vital function of disseminating
information concerning available resources in an area whose technology
is expanding rapidly.  To subscribe, send: "subscribe prosody <your name>" to
LISTSERV@msu.bitnet.  Send postings to PROSODY@msu.bitnet.  The list is
managed by George Allen, Michigan State University (email:
alleng@msu.bitnet) who also owns the list, "HYPERCARD."
   7.  Comserve is an electronic information service for professionals and
students interested in human communication studies.  It is located at
Rensselaer Polytechnic Institute and coordinated by Timothy Stephen and
Teresa Harrison, both of whom are professors in communication studies.
Comserve keeps archives of bibliographies, course materials, job
announcements, text transcripts, and other materials, with the author
retaining the rights and the copyright.  It coordinates a number of
hotlines on communication, which can be subscribed to via the
listserver.  To subscribe to the Ethnomethodology hotline, send the
following one-line message to comserve@rpiecs.bitnet:

Join Ethno Your_name

To obtain a long list of useful bibliographic information, send the
following one-line message to comserve@rpiecs.bitnet:

send compunet biblio

Send materials to be posted to the net to ethno@rpiecs.bitnet and
materials to be archived to support@rpiecs.bitnet.
   8.  Applied linguistics lists.  From Ken Willing at Macquarie
University, I learned of the following four lists and their listserver
addresses:
TESL-L  (Teaching English as a Second Language)
	Listserver address:  listserv@cunyvm.bitnet 
SLART-L  (Second Language Acquisition Research and Teaching)
	Listserver address:  listserv@psuvm.bitnet 
MULTI-L  (Language and Education in Multicultural Settings)
	Listserver address:  listserv@barilvm.bitnet 
LTEST-L  (Language Testing Research and Practice)
	Listserver address:  listserv@UCLACN1.bitnet
                                                                 [275]
To subscribe, send a one-line email message to the indicated address,
containing:

subscribe XXXXXX John Doe

where XXXXXX is the list-name (e.g. TESL-L), and John Doe is your
name.
   9.  FUNKNET, headed by Talmy Givon and Paul Hopper, is a discussion
list concerned with various aspects of human language, communication,
cognition, socioculture, neuropsychology, and other facets of cognitive
and communicative behavior, viewed from what might loosely be called
the functionalist perspective, that is, language viewed as an
instrument of communication, coding experience, an evolved
neurobiological phenomenon, a sociocultural phenomenon, or a
combination of these, with an emphasis on empirical language study,
including especially corpus data.  For further information, contact
Talmy Givon at: funknet-request@oregon.uoregon.edu
   10. info-childes and info-psyling are international email distribution
lists, moderated by Julia Evans and Brian MacWhinney, Psychology
Department, Carnegie Mellon University.  Info-childes circulates
information concerning corpus-related child language research, and
info-psyling circulates information on psycholinguistics.  To
subscribe, send email to brian+@andrew.cmu.edu.
   11.  ASLING-L is a list for the linguistic study of signed languages,
covering all linguistic areas, including syntax, acquisition,
phonology, morphology, psycholinguistics, and cognition.  To subscribe,
send:

SUB ASLING-L <your name>

to listserv@yalevm.bitnet.  The listowner is Christine Romano
(cromano@uconnvm.bitnet).
   12.  List of lists.  A very lengthy list of Bitnet and Internet
discussion lists (presently over one megabyte long) can be obtained via
anonymous ftp to ftp.nisc.sri.com (192.33.33.22) in the directory
netinfo as "interest-groups.Z" or by sending the following one-line
message to mail-server@nisc.sri.com, making sure in advance that your
system has sufficient space to receive it:

SEND NETINFO/INTEREST-GROUPS

                                                                 [276]
A related list can be obtained by sending email to
listserv@ndsuvm1.bitnet with the following one-line message:

sendme interest package

For further information concerning electronic discussion lists, see the
ARL Directory of Electronic Publications (below).
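   The one-line subscription commands quoted in this section share one
pattern: a command word, the list name, and the subscriber's name,
mailed as the message text to the list's server address.  As a minimal
sketch only, the following Python fragment composes (but does not send)
such a request.  The function name build_subscription and the sender
address jdoe@example.edu are hypothetical, introduced here for
illustration; the list names and server addresses are those quoted
above, and the sketch does not assume that those servers are still in
operation.

```python
from email.message import EmailMessage

# Listserver addresses as quoted in the entries above.
LISTSERVERS = {
    "LINGUIST": "listserv@TAMVM1.tamu.edu",
    "TESL-L": "listserv@cunyvm.bitnet",
    "SLART-L": "listserv@psuvm.bitnet",
    "MULTI-L": "listserv@barilvm.bitnet",
    "LTEST-L": "listserv@UCLACN1.bitnet",
}


def build_subscription(list_name, full_name, sender):
    """Compose (but do not send) a one-line subscription request.

    build_subscription is a hypothetical helper, not part of any
    listserver software.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = LISTSERVERS[list_name]
    # The entries above describe the command as the one-line message
    # itself, so it goes in the body and no subject is set.
    msg.set_content(f"SUBSCRIBE {list_name} {full_name}")
    return msg


if __name__ == "__main__":
    # jdoe@example.edu is a placeholder sender address.
    request = build_subscription("TESL-L", "John Doe", "jdoe@example.edu")
    print(request["To"])
    print(request.get_content().strip())
```

Handing the composed message to a mail program is left out
deliberately, since the appropriate gateway to Bitnet addresses depends
on the local site.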

C. Email Addresses

There are now several periodically updated lists of email addresses for
researchers engaged in language-related research.  One of them is
compiled by Norval Smith and associates at the University of Amsterdam
and accessible for retrieval and modification via the name server
linguists@alf.let.uva.nl.  For information, send the word "HELP" as a
one-line command to this address.  To receive the full list of email
addresses, send "list *" (with a space between list and *).  For a list
of FAX addresses, send "list fax."
   The other main list is the one compiled by John Moyne for the
Linguistic Society of America (LSA).   It can be obtained
electronically via anonymous ftp to csli.stanford.edu or by sending the
following one-line message to the LINGUIST listserver,
listserv@tamvm1.tamu.edu:

GET LSA LST LINGUIST

It can be obtained in hard copy from: LSA, 1325 18th St. NW, Suite 211,
Washington D.C. 20036, USA; email:  moygc@cunyvm.bitnet or
ZZLSA@GALLUA.bitnet.


                   3. TEXT ENCODING STANDARDS

The sources listed in this section are not exhaustive, but they are
useful starting points, in part because they serve as clearinghouses for
information on related projects in addition to their own proposals.
   1.  The Text Encoding Initiative (TEI) (Burnard, 1991; Hockey, 1991;
Sperberg-McQueen & Burnard, 1992; Walker, 1992; Walker & Hockey, 1991)
is an international and interdisciplinary project undertaken jointly by
the ALLC, ACH, and ACL to define text encoding guidelines and establish
a common interchange format for machine-readable literary and linguistic
data.  Fifteen other scholarly organizations including the Linguistic Society
of America are represented on its advisory board.  The project has
received major funding from the National Endowment for the Humanities,
the European Economic Community, and The Andrew W. Mellon Foundation
and has a number of subcommittees specializing in particular aspects of
this enormous task.  These include working groups on spoken language
encoding, encoding for lexicons, and phonetic encoding.
   TEI working papers and reports, including a copy of the Guidelines for
the Encoding and Interchange of Machine-readable Texts, can be obtained
in hard copy from Wendy Plotkin (U49127@UICVM.bitnet) or electronically
from LISTSERV@UICVM.bitnet.  For a list of available documents, send
the following line to LISTSERV@UICVM.bitnet:

GET TEI-L FILELIST

For further information:  C. Michael Sperberg-McQueen, Editor of TEI,
Computer Center (M/C 135), University of Illinois at Chicago, Box 6998,
Chicago, IL  60680, USA; Tel: +1 (312) 996-2477; FAX: +1 (312)
996-6834; email: u35395@uicvm.cc.uic.edu or u35395@uicvm.bitnet.
   2.  In 1989 in Kiel, Germany, the IPA Working Group on Suprasegmental
Categories initiated an IPA Number scheme that facilitates transmission
of data by code (if correspondents set up their systems to refer to the
common IPA Number).  Their proposal also includes encoding of
suprasegmental categories (see Bruce, 1989, 1992; Bruce & Touati,
1990).  For further information:  Gosta Bruce, Professor of Phonetics,
Lund University, Sweden; email: linglund@seldc52.bitnet, or John
Esling, Linguistics Department, University of Victoria, British
Columbia, Canada; email:  VQPLOT@uvvm.bitnet.
   3.  The Speech Assessment Methodology (SAM) project is developing a
prosodic labeling system to facilitate computer readable prosodic
transcriptions, representation of prosodic properties in the lexicon,
and tools for prosodic labelling.  Their system is intended to be
uncommitted with respect to prosodic theories, and is being developed
in conjunction with the ASL (Architecture for Speech Language Systems)
project.  For more information:  Dafydd Gibbon, Linguistik und
Literaturwissenschaft, University of Bielefeld, P-8640, D-4800
Bielefeld 1; FAX +49 (521) 1065844; email:
gibbon@LILI11.UNI-BIELEFELD.DE.
   4.  The TOnes and Break Indices (TOBI) is a prosodic labeling system
(Silverman et al., 1992).  In 1991 and 1992, Victor Zue (MIT) and Kim
Silverman (Nynex) sponsored two prosodic transcription workshops for
the development of a prosodic labeling system, to facilitate the
sharing of corpora in a manner compatible with WAVES(tm) format, and to
accompany speech files and time-aligned analysis records for sets of
utterances.  TOBI focuses especially on word groupings and prominences,
in a manner loosely tied to Pierrehumbert (1980) and Pierrehumbert and
Hirschberg (1990).  The description of the TOBI system, sample
WAVES(tm) scripts and supporting materials will be announced on the
Prosody discussion list, and made available via anonymous ftp at
kiwi.nmt.edu (129.138.1.82), or cassette tape, with an invitation for
feedback from potential users.
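
The idea behind the IPA Number scheme in item 2 above, that
correspondents exchange numeric codes and each side maps them back to
symbols locally, can be sketched as follows.  This is an illustration
only: the handful of number-to-symbol pairs is a small subset that
should be verified against the published IPA Number chart, and the
space-separated transmission form and the function names encode and
decode are assumptions made for this example, not part of the Working
Group's proposal.

```python
# Illustrative subset only; verify against the published IPA Number chart.
IPA_NUMBER_TO_SYMBOL = {
    101: "p", 102: "b", 103: "t", 104: "d", 109: "k",
    301: "i", 304: "a", 308: "u",
}
SYMBOL_TO_IPA_NUMBER = {sym: num for num, sym in IPA_NUMBER_TO_SYMBOL.items()}


def encode(symbols):
    """Turn a sequence of phonetic symbols into space-separated IPA Numbers."""
    return " ".join(str(SYMBOL_TO_IPA_NUMBER[s]) for s in symbols)


def decode(numbers):
    """Turn space-separated IPA Numbers back into a string of symbols."""
    return "".join(IPA_NUMBER_TO_SYMBOL[int(n)] for n in numbers.split())


wire_form = encode("pat")   # digits and spaces only, safe for plain-ASCII mail
print(wire_form)
print(decode(wire_form))
```

Because only digits travel over the network, the two correspondents can
use different fonts or character sets, provided both tables refer to
the same numbering.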


                         4. DATA SOURCES

A. Electronic Data Archives and Repositories
   1.  The Oxford Text Archive (OTA), directed by Lou Burnard, is by far
the largest archive of computerized language texts and corpora on this
list.  Its catalog lists nearly 2000 titles, including over 450
separate collections of written or spoken language in nearly three
dozen languages.  It is a deposit archive for textbanks from private
scholarly research, and welcomes for inclusion collections of any
specialization and in any format for reuse within the scholarly
community.  Its facilities are free and secure and provided as a
service to the world's academic community.  Access to the archive is
possible by anonymous ftp, online, by tape (9-track; Density 800, 1600
or 6250 bpi; ASCII or EBCDIC; fixed, variable, or formatted), by
diskette (MS-DOS or Macintosh; HD or DD; 3.5" or 5.25"), by cartridge
(DC300, TAR format only), or over networks.  Costs to users are kept
low to enable wide access.
   Its catalogue, now over 60 pages long, is available in hard copy from
the address given below, or electronically, in either SGML
(international mark-up standard for written texts) or non-SGML format.
The catalog and some of its texts are available via anonymous ftp to
black.ox.ac.uk (or 129.67.1.165).
   For more information:  Alan Morrison or Lou Burnard, Oxford Text
Archive, Oxford University Computing Services, 13 Banbury Road, Oxford
OX2 6NN, UK; Tel: +44 (865) 273238 [direct line] or 273200
[switchboard]; FAX: +44 (865) 273275; email: archive@vax.oxford.ac.uk.
   2.  The International Computer Archive of Modern English (ICAME) was
established in 1977 with the aims of (a) collecting and distributing
information on electronically available English language materials and
on linguistic research involving these materials, (b) compiling an
archive of English text corpora in machine-readable form, and (c)
making material available to research institutions.  Its holdings
include the three most widely used electronic corpora of spoken and
written language (the Brown, LOB, and London-Lund corpora, described
later) and several other large corpora, some with grammatical
annotations, together with corpus-related software, and are distributed
through the NCCH in Bergen, Norway (described earlier).  The ICAME
CD-ROM contains the Brown, LOB, London-Lund, Helsinki and Kolhapur
corpora together with software and a summary of discussion lists,
networks, surveys, and corpora, and is available for approximately $500
US.  Their survey is independent of the current one and should be
consulted as an important resource, as it may contain information not
covered here, especially with respect to European projects.  Further
information concerning the CD-ROM can be obtained by sending the
command "send icame info.cd" to fileserv@nora.hd.uib.no or via
anonymous ftp to nora.hd.uib.no (129.177.24.42).  Its catalog of
holdings and related document files can be obtained via anonymous ftp
to nora.hd.uib.no (129.177.24.42) or by fileserver commands sent to
fileserv@nora.hd.uib.no.  For more information regarding the
fileserver, email the command "send icame file.servers" to
fileserv@nora.hd.uib.no.
   ICAME holds an annual conference (with some proceedings available from
Rodopi Publishers, Amsterdam) and produces a journal once a year,
edited by Stig Johansson at the University of Oslo, containing analyses
of corpus data, surveys of archives, and book reviews.
   For more information:  ICAME, Norwegian Computing Centre for the
Humanities, Harald Haarfagres gt. 31, N-5007 Bergen, Norway; Tel: +47
(5) 212954 or 212955 or 212956; FAX: +47 (5) 322656; email:
adm@nora.hd.uib.no or knut@x400.hd.uib.no.
   3.  The Child Language Data Exchange System (CHILDES) (MacWhinney, 1991;
MacWhinney & Snow, 1985) contains child language data in several
languages, including a number of the major child language corpora in
English.  It also contains some corpora of adult language (e.g., the
Cornell Corpus described later).  Data contributions are welcomed and
kept secure.  The data are made available free of charge to researchers
who contact Brian MacWhinney to become members of CHILDES (membership
is also free of charge).  The data are accessible via anonymous ftp to
poppy.psy.cmu.edu, on CD-ROM, or on magnetic media.  The archive also
offers a free software package (CLAN) for use on PCs, Macs, and
mainframes and manages the info-psyling
and info-childes electronic discussion groups.  For more information:
Brian MacWhinney, Department of Psychology, Carnegie Mellon University,
Pittsburgh, PA, 15213 USA; email:  brian+@andrew.cmu.edu; Tel:  +1
(412) 268-2782.
   4.  The Center for Electronic Texts in the Humanities (CETH) is
described earlier.
   5.  The Aboriginal Studies Electronic Data Archive, housed by the
Australian Institute of Aboriginal and Torres Strait Islander Studies
(AIATSIS), includes over 150 Australian indigenous languages.  It is
available to researchers, subject to deposit and access conditions.
The catalog of holdings is available by sending the following one-line
message to listserv@tamvm1.tamu.bitnet:

get aboriginal-cat

For further information:  Aboriginal Studies Electronic Data Archive,
AIATSIS, GPO Box 553, Canberra, ACT 2601, Australia; Tel: +61 (6) 246
1170; FAX:  +61 (6) 249 7310; email:  aiatsis@peg.apc.org.
   6.  Project Gutenberg makes available literary works on electronic
media.  These are available via anonymous ftp from mrcnext.cso.uiuc.edu
(or 128.174.73.105).  For more information:  Michael Hart, email:
hart@vmd.cso.uiuc.edu.
   7.  Library of the Future is a set of CD-ROMs sold by DAK Industries,
containing the complete unabridged text of 453 novels, stories, plays
and historical documents.  For more information:  DAK Industries, 8200
Remmet Ave., Canoga Park, CA, 91304, USA; Tel: +1 (800) 888-6703.

B. Surveys of Electronic Language Data

Three long lists (items #1 through 3 below) cover the major language
research corpora in the common domain (plus a couple which are not).
These lists are best obtained from their sources (given below) rather
than in static printed sources, since some of them are updated
periodically.  Some further data sources may be found in Levelt, Mills,
and Karmiloff (1981), though it is difficult to know which of these
may have become computerized in the meantime.  Two sources for
humanities texts beyond those included below are Raben and Gaunt
(forthcoming) and, from the Appendix, Hughes (1987) and Lancashire and
McCarty (1989).
   1. The OTA catalogue, mentioned earlier, provides 60 pages of corpus
descriptions.
   2. The University of Lancaster Survey describes 56 language archive
projects intended mainly for linguistic research.  This includes
non-English corpora and several varieties of English (Indian, Canadian,
and Australian), some of which contain rich grammatical and semantic
tags for individual words in the corpus.  Taylor, Leech, and
Fligelstone (1989) is available from the HUMANIST file server by
sending the following one-line command to listserv@brownvm.bitnet:

GET SURVEY CORPORA HUMANIST

or via anonymous ftp to NCCH at nora.hd.uib.no (129.177.24.42)
(filename:  pub/icame/survey.corpora).  The parts concerning English
texts are published in Taylor, Leech, and Fligelstone (1991).
   3.  The Georgetown University Catalog of Projects in Electronic Text
(CPET), begun in 1989, contains highly informative descriptions and
access information for over 312 electronic corpus projects in 27
countries and is continually updated.  It can be accessed via telnet to
guvax3.georgetown.edu.  For further information:  Paul Mangiafico,
Center for Text and Technology, Reiss Science Building, Room 238,
Georgetown University, Washington, DC 20057, USA; Tel: +1 (202)
687-6096; email: pmangiafico@guvax.georgetown.edu.
   4.  The Walker and Zampolli Survey of Written and Spoken Language in
Machine-Readable Form (in progress), directed by Don Walker (Bellcore,
Morristown, NJ, USA; walker@bellcore.com) and Antonio Zampolli
(Institute for Computational Linguistics, Pisa, Italy;
glottolo@icnucevm.cnuce.cnr.it), is being conducted to provide a
comprehensive inventory of such materials.  It is sponsored by several
associations discussed elsewhere in this chapter (including the ACH,
the ACL and its Data Collection Initiative, the ALLC, the CETH, and the
TEI), and also the Modern Language Association, the European Science
Foundation, the Commission of the European Communities, the Network of
European Reference Corpora, and the Linguistic Data Consortium (LDC),
among others.  For more information about the textual component:  Textual
Data Survey, Center for Electronic Texts in the Humanities, 169 College
Avenue, New Brunswick, NJ 08903, USA; FAX +1 (908) 932-1386;
email: ceth@zodiac.rutgers.edu.
   5.  The list of Electronic Texts in Philosophy was compiled by Leslie
Burkholder (CDEC, Carnegie Mellon University) in December 1991 for the
American Philosophical Association.  It can be obtained from the
HUMANIST file server by sending an email message to
listserv@brownvm.bitnet containing only the following line:
GET PHILOSFY ETEXTS HUMANIST
   6.  List of Electronic Dictionaries.  In a posting to HUMANIST (Vol. 4,
No.  1137. Thursday, 7 Mar 1991), Russ Wooldridge
(wulfric@vm.epas.utoronto.ca) listed 58 electronic dictionaries, mostly
in English but also including several European languages and Hebrew,
Greek, and Latin.  This list is available from the HUMANIST file
server.
   7.  The Catalog of the University of Cambridge Literature and
Linguistics Computing Centre is a published catalog (see Dawson,
1977).
   8.  The Linguistic Society of America List, compiled in 1987 by Lise
Menn, turned up numerous data sets but only relatively few of them on
computer.  For more information, contact the LSA office (at the address
provided above concerning the list of linguists' email addresses).
   9.  The Marchand list of CD-ROM projects was compiled by James Marchand
at the University of Illinois and is available via the Humanist
fileserver by mailing the following one-line command to
listserv@brownvm.bitnet: GET CDROM PROJECTS HUMANIST
   10.  ARL Directory of Electronic Publications.  Although many journals,
newsletters and scholarly lists may be accessed free of charge through
Bitnet, Internet and affiliated networks, it is not always simple to
know what is available.  Compiled and published by the Association of
Research Libraries (ISSN 1057-1337), this directory provides access
information to 500 scholarly lists, 30 journals, and 60 newsletters.
It is available in either hard copy or on 3.5 inch diskette, at a cost
of $20 to nonmembers of the ARL.  For more information:  Office of
Scientific and Academic Publishing, Association of Research Libraries,
1527 New Hampshire Ave., NW, Washington, DC 20036, USA; email:
ARLHQ@umdc.umd.edu or ARLHQ@umdc.bitnet; FAX: +1 (202) 462-7849.  The
"Directory of Electronic Journals and Newsletters," compiled by Michael
Strangelove in 1991, can be obtained at no charge by sending an email
message to listserv@uottawa.bitnet containing the following two lines:
GET  EJOURNL1  DIRECTRY
GET  EJOURNL2  DIRECTRY


                     5. CORPORA AND TEXTBANKS

It is common to distinguish between corpora and textbanks.  These
differ in size and composition, and serve somewhat different analytic
aims.  Corpora are intended to be representative of some specified
population or genre.  Textbanks tend to be collections of available
data with looser connection to each other, or focus on a restricted
number of genres (including perhaps only one).  Corpora are needed for
large scale, systematic contrasts of, for example, language varieties,
genres, and modalities (e.g., American vs. British English, informative
vs. imaginative prose, or spoken vs. written language).  Other research
requires enormous amounts of data, even if from fewer genres, as for
example, in lexicography, in order to detect words and collocations
which occur only rarely.  (For systematic discussion of corpus design,
size, and sampling issues, see Atkins, Clear, & Ostler, 1992; Church,
1991; Carroll, Davies, & Richman, 1971; Fillmore, 1992; Francis, 1982;
Kucera & Francis, 1967; Leech, 1991, 1992; Poplack, 1989; Sinclair,
1982, 1992; Walker, 1991.) A particularly interesting concept is that
of a "monitor corpus," intended to be not finite or temporally bounded
but rather gaining and losing texts over time in parallel with the
fluidity of the language itself (Sinclair, 1982; 1992).
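
The need for very large textbanks when hunting rare items can be made
concrete with a little arithmetic.  The sketch below is an
idealization, not a method proposed by any of the works cited above:
it assumes occurrences follow a Poisson process with a fixed rate,
whereas real words and collocations cluster within texts and topics.
The function name chance_of_seeing and the rate used are hypothetical;
the corpus sizes are those quoted for the Brown Corpus, the BCET, and
the planned BNC later in this chapter.

```python
import math


def chance_of_seeing(rate_per_million, corpus_words):
    """Probability of at least one occurrence under a Poisson idealization."""
    expected = rate_per_million * corpus_words / 1_000_000
    return 1.0 - math.exp(-expected)


# Corpus sizes as quoted in this chapter.
corpus_sizes = {
    "Brown (500 texts x 2,000 words)": 500 * 2_000,
    "BCET": 20_000_000,
    "BNC (planned)": 100_000_000,
}

rate = 0.1  # hypothetical item occurring about once per ten million words
for name, size in corpus_sizes.items():
    print(f"{name}: {chance_of_seeing(rate, size):.2f}")
```

Under these assumptions, an item this rare is more likely than not to
be absent from a one-million-word corpus, which is why lexicographic
projects favor size over careful balance across genres.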
   Listed below are collections of running prose, followed by some
phonetic databases, lexical databases, and treebanks (that is,
databases of bracketed and syntactically labeled structures, such as
noun phrase, verb phrase, etc.).  The survey is probably less
exhaustive for the phonetic, lexical, and treebank sections than for
the sections on corpora and textbanks of running prose, which were the
dominant focus in compiling it.

A. Running Text:  English Language

The three most widely used corpora to date are the Brown corpus, the
Lancaster-Oslo/Bergen (LOB) corpus, and the London-Lund corpus.  These
are described first, followed by descriptions of 23 others that are
well-known within one or another subdomain of corpus-based language
research (i.e., linguistics, psycholinguistics, computational
linguistics, lexicology and lexicography), ordered in thematically
related clusters and roughly chronologically within each cluster.
   1.  The Brown Corpus (The Standard Corpus of Present-Day Edited
American English) (Francis, 1982; Francis & Kucera, 1979, 1982; Kucera,
1992; Kucera & Francis, 1967) is a corpus of 1 million words of written
American English printed in the year 1961.  It was the first corpus to
be put on computer medium and is the most analyzed corpus of English to
date.  It consists of 500 written American English texts of 2,000 words
apiece, selected to represent diverse genres of written American
language.  There are two main sections:  Informative Prose and
Imaginative Prose.  Genres represented include newspaper reportage,
press editorials, memoirs,  religion, science fiction, detective
fiction, and romance novels (excluding drama and fiction with more than
50% dialog).  This corpus of running text is available for academic
research for the cost of materials from both the Oxford Text Archive
and the ICAME archive and is contained on the ICAME CD-ROM available
through NCCH (see above).
   A "tagged" version of the Brown Corpus (i.e., supplemented by labeling
of individual words for 82 part-of-speech designations) was produced at
Brown University during the period 1970-1978 with assistance from the
TAGGIT program, written by B. B. Greene and G. M. Rubin (for additional
details, see Francis, 1980; Garside, Leech, & Sampson, 1987; Svartvik,
1990).  The tagged version is protected by its own copyright, and is
available for $1000 to academic institutions.  For more information:
Text Research, 196 Bowen Street, Providence RI 02906, USA. FAX: +1
(401) 751-8958 or Nelson Francis or Henry Kucera, Department of
Linguistics, Brown University, Providence RI 02906, USA;  email:
henry@brownvm.bitnet or henry_kucera@brown.edu.
   For a parsed (as opposed to part-of-speech tagged) version of part of
the Brown corpus, known as the Gothenburg Corpus, contact:  Gudrun
Magnusdottir, Sprakdata, University of Göteborg, S-412 98 Göteborg,
Sweden.  The Susanne Corpus (Surface and Underlying Structural Analyses
of Naturalistic English), which uses more transparent codes for easier
research use, is currently in preparation.  For information:  G.
Sampson, Department of Linguistics and Phonetics, University of Leeds,
Leeds LS2 9JT, UK.
   2.  The Lancaster-Oslo/Bergen Corpus (LOB) is 1 million words of
written British English from 1961.  It was compiled in the 1970's under
the direction of Geoffrey Leech, University of Lancaster, and Stig
Johansson, University of Oslo.  It is the British counterpart of the
Brown corpus, and contains 500 texts of roughly 2,000 words each.  The
texts range across the same types of published written language as
those of the Brown corpus, and the number of texts of each type is
almost identical to that in the Brown corpus.  A tagged version of the
LOB corpus was produced between 1978 and 1983, using the CLAWS1
automatic tagging system, which uses text-based probabilities.
Garside, Leech, and Sampson (1987) and Leech and Garside (1991) provide
details of their methods and a survey of methods for automatic tagging
and parsing of language corpora more generally.
   Both the tagged and untagged versions of the LOB corpus are available
for academic use from the ICAME archive, and are contained on the ICAME
CD-ROM described above.  Their manuals (Johansson, Leech, & Goodluck,
1978, and Johansson, Atwell, Garside, & Leech, 1986, respectively) are
also available from ICAME.  A hand-parsed version of 45,000 words from
the LOB is available as the Lancaster-Leeds Treebank; an automatically
parsed version of 140,000 words from the LOB is available as the
Lancaster Parsed Corpus (both described below, under "Treebanks").  A
larger treebank is being prepared by Steve Fligelstone.  For further
information:  Steve Fligelstone, UCREL, Linguistics Department, Bowland
College, Lancaster University, Lancaster LA1 4XZ, UK; email:
eia002@lancaster.ac.uk.
   3.  The London-Lund Corpus (LLC) is 500,000 words of spoken educated
British English, collected during the 1960's and early 1970's from
speakers of various ages, representing a range of discourse types.
The texts were transcribed with markings of tone unit boundaries,
nucleus (points of pitch prominence), direction of nuclear tones,
pauses, degrees of stress, and other features.  The data were
originally gathered as the spoken half of the Survey of English Usage,
used in several major reference grammars of English (Leech & Svartvik,
1975; Quirk, Greenbaum, Leech, & Svartvik, 1972, 1985).  The first 87
texts to be computerized are published in Svartvik and Quirk (1980).
The remaining 13 texts have now been added to the computerized corpus.
The full 100 texts can be obtained for academic use from the ICAME and
OTA archives, and are contained on the ICAME CD-ROM (described above).
They are available as either running text or supplemented by semantic
and syntactic tags associated with all words in the texts.  The manual
for the LLC (Svartvik, 1992b) is distributed through ICAME/NCCH.  A
bibliography of 200 studies using this corpus is found in Svartvik
(1990).  A parsed version of a part of the data is described in that
source.
   4.  The Lancaster Spoken English Corpus (SEC) (Knowles & Lawrence,
1987) consists of 52,000 words of contemporary spoken British English,
gathered between 1984 and 1987, from radio broadcasts, university
lectures and several other types of speech.  It is available from the
ICAME archive in orthographic and prosodic transcription, with
word-class tags (generated by CLAWS2) and accompanying manual.  For
more information, contact NCCH or Peter Roach, Linguistics Department,
Leeds University; email:  p.j.roach@cmsl.leeds.ac.uk; or Gerry Knowles,
Linguistics Department, Bowland College, Lancaster University,
Lancaster LA1 4XZ, UK; email:  eia008@central1.lancaster.ac.uk.
   5.  The PIXI Corpora consist of 450 naturally occurring conversations
recorded in bookshops in England and Italy, for the purpose of
cross-cultural comparisons of discourse structure.  They are available
in electronic form from the Oxford Text Archive, and in book form in
Gavioli & Mansfield (1990), together with careful details of the data
gathering, discourse contexts, analytic approach and bibliography of
related publications.  For further information, contact the Oxford Text
Archive or Guy Aston (VK1A@ICINECA.bitnet).
   6.  The Helsinki Corpus of Historical English (Rissanen, 1992) is a
textbank of 1.5 million written words from law, handbooks, science,
trials, sermons, diaries, documents, plays, and private and official
correspondence from periods at roughly 100-year intervals beginning in
850.  It is used for variational study of the development of English.
The manual for this corpus is Kytö (1991), distributed through
ICAME/NCCH.  For more information contact:  Matti Rissanen, or Merja
Kytö (mkyto@cc.helsinki.fi), Department of English, University of
Helsinki, Porthania 311, 00100 Helsinki, Finland.  A corpus of
dialectal English is underway (Ihalainen, 1987).  For information,
contact Ossi Ihalainen at the same address.  The Helsinki Corpus is
contained on the ICAME CD-ROM (see above).
   7.  The Macquarie (University) Corpus (Peters, 1987) is nearing
completion.  It consists of 1 million words of Australian English and
is intended to be comparable to the Brown Corpus.  For more
information:  Pam Peters, David Blair, Peter Collins, or Alison
Brierley, School of English and Linguistics, Macquarie University, 2109
New South Wales, Australia.
   8.  The Kolhapur Corpus of Indian English (Shastri, 1985, 1988)
contains 1 million words of written Indian English from the year 1978.
Its texts were selected from the same text categories as those of the
Brown Corpus, and it is available from ICAME.
   9.  The American Heritage Intermediate Corpus (Carroll, Davies, &
Richman, 1971) consists of over 5 million words of written American
English from the most widely used books in grades 3 through 9.  It was
compiled as a database for the American Heritage School Dictionary.
   10.  The Birmingham Collection of English Text (BCET) (Renouf, 1984,
1987; Sinclair & Kirby, 1990), compiled from 1980-1985 by J. Sinclair,
A. Renouf, and J. Clear, contains 20 million words of written (18.5
million) and spoken (1.5 million) language (mostly British) used in
producing a series of Collins COBUILD reference and teaching works.  It
also contains 20 million words of speech from a public inquiry,
including the complete transcripts of the 18-month-long inquiry into
the plan for
constructing the Sizewell nuclear power station.  It is intended to be
representative of modern British English and therefore consists of
samples of current and general usage (rather than technical use), from
adult speakers without regional dialects, and excludes poetry and
drama.  For more information:  A. J. Renouf,  Research and Development
Unit for English Language Studies, 50 Edgbaston Park Road, Birmingham
B15 2RX, UK; Tel: +44 (21) 414 3935; FAX: +44 (21) 414 6203; email:
renoufaj@bham.ac.uk.
   11. The Longman/Lancaster English Language Corpus (Summers, 1991)
consists of 30 million words of mainly British and American English
texts.  Begun in 1985, it contains varied stylistic levels and text
types, and is intended for lexicographic and academic research.  For
more information:  Longman/Lancaster English Language Corpus, Longman
Group Ltd., Longman House, Burnt Mill, Harlow, Essex CM20 2JE, UK.
   12. The Corpus of Spoken American English (CSAE) (in progress) will be
a database of one million words of spoken American English,
encompassing a wide range of spoken language types (Chafe, Du Bois, &
Thompson, 1992).  The corpus will be disseminated as widely as possible
in several formats, including a printed book and an interactive
computer format that will allow simultaneous access to transcription
and sound.  The creation of the Corpus of Spoken American English will
be coordinated with the ICE project (described next), of which the CSAE
is the officially designated representative for the United States.  For
information:  Wallace Chafe, John Du Bois, or Sandra Thompson,
Department of Linguistics, University of California, Santa Barbara, CA
93106, USA; Tel: +1 (805) 961-3776.
   13. The International Corpus of English (ICE) (Greenbaum, 1988, 1990,
1992) (in progress) was begun in 1988 for the purpose of providing
comparable data for studies of national varieties of English
worldwide.  Under the coordination of Sidney Greenbaum,
Department of English, University College London, parallel corpora of
spoken and written texts will be compiled for a number of regions,
including the United States, Australia, the United Kingdom, Wales,
Canada, New Zealand, India, East Africa, Nigeria, Jamaica and others,
using uniform classification and encoding schemes.  The American
English component of this project is the CSAE, described above.  Each
regional corpus will contain one million running words, half from
spoken and half from written language.  The material in each regional
corpus must date from no earlier than 1990 and no later than the end of
1993 and will come from speakers 18 years or older with education
through the medium of English.  In addition, there are plans for
nonregional supplementary corpora of written translations into English,
international spoken communication, and EFL (English as a foreign
language) teaching texts (see Francis, 1989).  The ICE data will
ultimately be made available together with original sound recordings
and possibly also digitized recordings for a concordance format.
   14. The British National Corpus (BNC) (Quirk, 1992) (in progress) is to
be an electronic corpus of 100 million words of contemporary spoken and
written British English.  Texts will represent a cross-section of a
wide range of styles of current written and spoken English.  A uniform
target encoding scheme will be defined, conforming to the international
Standard Generalised Markup Language (SGML), in which all texts in the
corpus will be stored and distributed.  The corpus is to be
automatically tagged with word-class labels to enhance its value for
linguistic research.  Special purpose tools developed for manipulation
and processing of the corpus will be distributed together with it.  The
BNC is intended to provide the UK research and industrial communities
with state-of-the-art corpus and lexical resources, as a solid basis
for the development and exploitation of new products in the rapidly
expanding field of natural language processing as applied to British
English.  These resources will be made widely available under
appropriate licensing conditions and at minimum cost to the academic
research community and also to the wider industrial research
community.  Begun in 1991, this 3-year project is managed by Jeremy
Clear, with major participation from Oxford University Press (OUP),
Longman Group UK Ltd, the British Library, and the Universities of
Oxford and Lancaster.  For more information:  Jeremy Clear, Oxford
University Press, Walton Street, Oxford OX2 6DP, UK; Tel: +44 (865)
56767; FAX: +44 (865) 56646; email: JHCLEAR@vax.oxford.ac.uk.
   15.  The Bellcore Lexical Research Corpora (Walker, 1987) were compiled
to support corpus linguistics and computational lexicography research.
They include textbases of 200 million words of newswire text (New York
Times, Associated Press), 50 million words of magazine and journal
articles, a collection of English machine-readable dictionaries and
other machine-readable reference books, electronic-mail digests, and
assorted smaller texts.  For more information: Donald E. Walker,
Language and Knowledge Resources Research, Bellcore, MRE 2A-379, 445
South Street, Morristown, NJ 07960-1910, USA; FAX: +1 (201) 829-5981;
email:  walker@bellcore.com.
   16.  Established in 1989, the Association for Computational Linguistics
Data Collection Initiative (ACL/DCI) (Church & Liberman, 1991;
Liberman, 1989; Walker, 1991, 1992) is an activity which collects
machine readable text to support scientific and humanistic research,
and distributes it at cost and without royalties.  Its first CD-ROM,
available for only $25, contains about 300 Mb of Wall Street Journal
text, about 180 Mb of scientific abstracts, the full text of the 1979
edition of the Collins English Dictionary in the form of a
typographer's tape, and some samples of tagged and parsed text from the
Penn Treebank project.  Its second CD-ROM will contain most or all of
six years of the Hansard corpus, that is, Canadian parliamentary
sessions, in bilingual French/English aligned format.  For more
                                                                 [289]
information: Mark Liberman, Department of Linguistics, University of
Pennsylvania, Philadelphia, PA 19104, USA; FAX: +1 (215) 573-2091;
email: myl@unagi.cis.upenn.edu.
   17.  The European Corpus Initiative (ACL/ECI) (in progress), which is
patterned after the ACL/DCI, was established in 1992 to bring together
existing materials in as many major European languages as possible, and
to make these available in digital form and in a consistent format to
the research community at cost and without royalties.  The ECI welcomes
contributions from all researchers and will distribute the data on
CD-ROMs, the number depending on the ultimate size of the archive.  For
more information (to contribute or obtain data):  Henry Thompson, HCRC,
University of Edinburgh, 2 Buccleuch Place, Edinburgh, EH8 9LW,
Scotland; FAX: +44 (31) 650-4587; email: eucorp@cogsci.ed.ac.uk.
   18.  The Cambridge Language Survey (CLS) (in progress) is an
international multilingual survey of language.  Under sponsorship from
industry and government sources, and in cooperation with other
projects, the CLS is bringing together existing data from a variety of
languages, starting with English, French, German, Dutch, Italian,
Spanish and Japanese, with the intent to code this data semantically
and to prepare concordances and multilingual corpora, parallel and
aligned, for educational and such publishing uses as the preparation of
multilingual dictionaries and other reference books.  The data will be
made as available as possible, perhaps including distribution via
CD-ROM.  For more information:  Paul Procter, Cambridge University
Press, Edinburgh Building, Shaftesbury Rd., Cambridge CB2 2RU, UK;
Tel:  +44 (223) 325052; FAX:  +44 (223) 315052; email:
psp10@phx.cam.ac.uk.
   19.  The DARPA-funded Linguistic Data Consortium (LDC) (in progress)
was inaugurated in the Spring of 1992.  Its formation was stimulated by
the establishment of the Data Collection Initiative (DCI) of the
Association for Computational Linguistics (ACL), but also strongly
influenced by cooperative work in the speech community that led to the
development of corpora consisting of digits and of acoustic-phonetic
data pronounced by multiple speakers.  The LDC is intended to develop
and distribute large amounts of linguistic data (e.g., speech, text,
lexicons, and grammars) to assist the development of speech- and
text-processing systems.  The data will include large quantities of raw
and annotated (i.e., syntactically and/or semantically tagged) text and
speech (billions of words of text and thousands of hours of speech), a
large lexicon, and a broad coverage grammar of English. The data will
also include whatever additional materials (including foreign language
materials) the Consortium can obtain by exchange or on other reasonable
terms.  Data are to be provided on CD-ROM on a subscription basis to
                                                                 [290]
universities and corporations.  Although the Consortium does not need
exclusive rights to donated data, DARPA does intend to make its growing
holdings available exclusively through the Consortium.  General
membership fees will be set at affordable levels, and foreign members
will be considered if access to foreign data can be assured. The
Consortium may be established as a separate legal entity, such as a
nonprofit corporation or other form of association.  For further
information:  Mark Liberman, Department of Linguistics, University of
Pennsylvania, Philadelphia, PA 19104, USA; email:
myl@unagi.cis.upenn.edu.
   20.  American News Stories consists of approximately 250,000 words of
written American English drawn from Associated Press news stories from
December 1979 (available from the Oxford Text Archive).
   21.  The Nijmegen TOSCA Corpus (Oostdijk, 1988) is a textbank of 75
works (1.5 million words) of educated written British English drawn
from a variety of genres meant to be read rather than spoken (i.e.,
excluding poetry, plays and speeches), compiled for studies of
linguistic variation.  For more information:  Dr. Jan Aarts and Prof.
C. Koster, Directors, The Nijmegen Research Group for Corpus
Linguistics, Department of English, University of Nijmegen,
Erasmusplein 1, NL-6525 HT Nijmegen, The Netherlands; Tel:  +31 (80)
512836; email: cor_hvh@hnykun52.
   22.  The Melbourne-Surrey Corpus (Ahmad & Corbett, 1987) consists of
100,000 words of Australian newspaper texts and is available from
ICAME.
   23.  The Corpus of English-Canadian Writing is a textbank of 3 million
words of Canadian English from magazines, books, and newspapers,
gathered beginning in 1984, and representing a wide variety of genre
categories in common with the LOB and Brown corpora, plus "Feminism"
and "Computing."  For more information:  Margery Fee, Strathy Language
Unit, 207 Stuart Street, Room 316, Rideau Building, Queen's University,
Kingston, Ontario, Canada K7L 3N6; email:  feem@qucdn.bitnet.
   24.  The Warwick Corpus is approximately 2.5 million words of written
British English (letters, fiction and other genres) compiled by J. M.
Gill for use in research aimed at the automatic generation of Braille
by computer (available from the OTA).
   25. The Cornell corpus (Hayes, 1988; Hayes & Ahrens, 1988) is a 1.6
million word corpus, consisting of 1151 written or spoken British and
American English texts, representing a wide variety of language types.
It was compiled in the 1980s for a study on lexical adaptation of
parents to children.  The spoken samples range from abortion debates to
                                                                 [291]
the Patty Hearst trial to television situation comedies.  It is
available from the CHILDES archive (described above).
   26.  NEXIS, LEXIS, and MEDIS (owned by Mead Data Central) and WESTLAW
(run by the West Corporation) are commercial archives.  These are used
by newswriters, lawyers, and doctors, but they tend to be very
expensive.  NEXIS contains newspapers (New York Times, Reuters,
Business Week), newsletters, and other periodicals from the 1980s to
the present and is used by columnists such as William Safire.  LEXIS
and WESTLAW contain legal codes and almost all legal decisions at the
federal and state level in the United States and several European
countries, from early cases to very recent ones.  MEDIS is a medical
literature database.

B. Running Text:  French Language
   1.  The Oxford Text Archive (OTA) has a number of literary holdings in
the French language.
   2.  The Hansard corpus contains six years of Canadian Parliamentary
sessions, in English/French bilingual aligned format, and is available
from the ACL/DCI.
   3.  The Ottawa-Hull Corpus of Spoken French (Poplack, 1989) is 3.5
million words, compiled in 1985 to address issues of sociolinguistic
variation and language contact.  Respondents were selected from two
contiguous cities on the border between Ontario and Quebec, in an
unbiased manner to reflect a carefully balanced sampling grid of
occupational, age, sex and other variables.  To avoid Labov's "observer's
paradox," the data were recorded by trained community members.  For
more information:  Shana Poplack, Linguistics Department, University of
Ottawa, Ottawa, Ontario, Canada; email:  sxpaf@uottawa.bitnet.
   4.  The Trésor de la Langue Française (TLF) (Treasury of the French
Language) contains about 2,000 texts (150 million words) of a variety
of types of written French, from novels and poetry to biology and
mathematics, stretching from the 17th to the 20th centuries, the result
of a cooperative project between the Centre National de la Recherche
Scientifique and the University of Chicago.  Access to the ARTFL
database is organized through a consortium of user-institutions, in
most cases universities and colleges, each of which pays an annual
subscription fee.  The data will soon also be available on CD-ROM
together with access software for UNIX systems.  For more information,
contact:  Mark Olsen, ARTFL Project, American and French Research on
                                                                 [292]
the Treasury of the French Language, Department of Romance Languages,
University of Chicago, 1050 East 59th Street, Chicago, IL  60637, USA;
Tel:  (312) 702-8488; email: artfl@artfl.uchicago.edu or
mark@gide.uchicago.edu.

C. Running Text:  German Language

The Mannheim Corpus (Teubert, 1984) is a textbank of 8 million words of
modern literary prose and nonfiction, available from the Oxford Text
Archive and also from the Institut für Deutsche Sprache, University of
Mannheim, Friedrich-Karl-Strasse 12, Postfach 5409, D-6800 Mannheim,
Germany.  The Institut für Deutsche Sprache also houses the Bonner
Zeitungskorpus, a three million word collection of representative
samples from German newspapers between 1949 and 1974, and the
Freiburger Corpus, a textbank of one-half million words from 224 texts
and documents, including discussions, interviews, speeches, reports,
narrations, and documentary.  The LIMAS Corpus of modern German is 1.1
million words, constructed by the same rules as the Brown Corpus.  It
is available from the Institut für Deutsche Sprache.  It is also
available together with software on HD floppies for 1000 DM from Gerd
Willee, email:  upk000@dbnrhrz1.bitnet or upk000@ibm.rhrz.uni-bonn.
The Pfeffer Spoken German Corpus, collected in 1961, contains 400
12-minute spontaneous interviews covering 25 different topics, recorded in
60 locations in Germany (including both former East and West), Austria,
and Switzerland.  The speakers represent diverse demographic
characteristics with regard to gender, age, education, and geography.
For information: the Oxford Text Archive or Randall L. Jones,
Department of German, 4096 JKHB, Brigham Young University, Provo, UT
84602, USA; Tel: +1 (801) 378-3513; email: jones@byuvm.bitnet.
The Ulm Textbank is mainly a textbank of psychiatric
interviews, together with a very powerful text retrieval and
concordance package (Mergenthaler, 1985).  For more information:
Erhard Mergenthaler, University of Ulm, Germany; email:
lu07@dmarum8.bitnet.  Finally, the Muenster Textbank contains 94
million words of newspaper text.  For more information:  Lothar
Lemnitzer, email:  lothar@hendrix.uni-muenster.de.

D. Running Text:  Italian Language

The PIXI Corpora are transcripts of service encounters in comparable
bookshops in Italy and England and are available through the Oxford
Text Archive (described in fuller detail above with the English
Language Corpora).  The Pisa Corpus consists of 3.5 million words of
Italian.  For more information:  Antonio Zampolli, Istituto di
                                                                 [293]
Linguistica Computazionale, Via Della Faggiola 32, University of Pisa,
I-56100 Pisa, Italy; email:  glottolo@icnucevm.bitnet.

E. Running Text:  Other Languages

Besides English, French, German and Italian, electronic corpora are
increasingly available also in other languages.  The Oxford Text
Archive contains a diverse sampling of languages, best surveyed in the
OTA catalog itself.  The resources listed in this section are from
other locations.  (See also the other entries under "Data Sources:
Surveys" above.)
   The Center for Native (American) Languages of the Plains and the
Southwest, at University of Colorado, has electronic versions of the
Dorsey Omaha-Ponca texts in its Siouan Archives, and has several
dictionary projects (Winnebago, Siouan, and Lakhota).
   For Australian indigenous languages, please see the entry for the
AIATSIS Aboriginal Studies Electronic Data Archive under "Electronic
Data Archives and Repositories" above.
   For Danish there are two corpora of written Danish from fiction,
newspapers and professional texts:  the DANWORD corpus is 1.25 million
words (see Maegaard and Ruus, 1987), housed at the University of
Copenhagen; DK87 and DK88 are one million words apiece, from work
published in 1987 and 1988, respectively, and are available from:
Henning Bergenholtz, The Aarhus School of Business, Fuglesange Alle 4,
DK-8210 Aarhus V.
   Regarding Estonian, a corpus is in progress at the Laboratory of the
Estonian Language, Tartu University, EE2400 Tartu, Estonia.
   Regarding Finnish corpora, contact: Fred Karlsson, Department of
General Linguistics, University of Helsinki, Hallituskatu 11, SF-00100
Helsinki, Finland; email: fkarlsso@ling.helsinki.fi
   For Spanish, the Archivo Digital de Manuscritos y Textos Españoles is
available on CD-ROM.  For more information:  Charles Faulhaber,
Department of Spanish and Portuguese, University of California,
Berkeley, CA 94720; Tel: +1 (510) 642-0471; email:
cbf@athena.berkeley.edu.
   Swedish language corpora are surveyed and summarized in Gellerstam
(1992).
   Regarding Yugoslavian, there is the YU-CORPUS.  It consists of mainly
contemporary fiction prose in Serbo-Croatian, with the main areas
represented:  Serbia, Croatia, Montenegro, and Bosnia-Hercegovina.  The
corpus consists of 15 files for a total of approximately 700,000
words.  These files are available via anonymous ftp at aau.dk
(129.142.17.240) in the directory /home/ftp/pub/slav.  For more
information:  Henning Moerk, Slavisk Institut, Aarhus Universitet, Ny
                                                                 [294]
Munkegade 116, 8000 Aarhus C, Denmark; Tel: +45 (86) 136555; FAX: +45
(86) 192155; email:  slavhenn@aau.dk.

F. Language Acquisition
   1.  Child Language Acquisition.  The main archive for child language
data is the Child Language Data Exchange System (CHILDES), described
earlier.
   The Polytechnic of Wales Corpus (Fawcett, 1980), compiled by R. Fawcett
and M. Perkins between 1978 and 1984, consists of 100,000 words of
children's English (ages 6 to 12), gathered in Pontypridd, South
Wales.  The data are from 120 children (balanced by age, sex, and
socioeconomic status and screened to exclude those with strong Welsh or
other second language influence), recorded at play and in interview
with an adult.  The computer files contain detailed grammatical tagging
and have been fully hand-parsed using an extension of Systemic
Functional Grammar developed by Fawcett which includes functional and
formal categories.  These are available from the ICAME Archive.  The
recorded tapes and four volumes of transcripts with intonation contours
are available for the cost of materials from:  Robin Fawcett,
Department of Behavioral and Communication Studies, Polytechnic of
Wales, Treforest, Cardiff CF37 1DL, UK.
   2.  Adult or Second Language Acquisition.  The European Science
Foundation Second Language Data Bank (ESFSLDB) consists of longitudinal
data obtained systematically over a 3-year period from adult migrant
workers in five nations in Europe with a focus on language learning in
the absence of formal instruction (see Perdue, 1984, in press).  This
very large database contains texts of interviews, narratives, role
plays, picture descriptions, and other data gathered mostly on a
roughly monthly basis from the same informants during the course of
this time period.  The informants were chosen to be comparable in terms
of age, recency of arrival, level of education, and other factors, and
represented 10 combinations of source language (Moroccan Arabic,
Italian, Spanish, Finnish, and Punjabi) and target language (French,
English, Dutch, German, Swedish).  The data and Word Cruncher software
are accessible for noncommercial research with signed agreement,
available through file server (psyli@hnympi51.bitnet), tapes,
diskettes, or CD-ROM.  For more information:  Kees v.d.Veer, Technical
Group, Max-Planck-Institut fuer Psycholinguistik, Postbus 310, NL-6500
AH Nijmegen, The Netherlands; Tel: +31 (80) 521-911; email:
kees@mpi.nl.
   The Montreal Corpus was gathered for a project headed by Prof. K.
Connors concerning the acquisition of French as a second language by
                                                                 [295]
anglophones and lusophones in Montreal. The data consist of three sets
of interviews each from anglophones, lusophones, and a control group of
French speakers.  The corpus is available for research in magnetic
form.  For more information:  Michel Lenoble, Litterature Comparee,
Universite de Montreal, Montreal, Canada; email:
lenoblem@umtlvr.bitnet.

G. Phonetic Databases
   1.  The DARPA Speech Recognition Research Databases consist of phonetic
transcriptions of sentences read aloud by American adults from various
parts of the country.  These databases include both a
speaker-independent part (a few sentences from many speakers) and a
speaker-dependent part (a lot of speech from a few speakers), designed
for use in training and testing both speaker-independent and
speaker-dependent recognition systems.  Digitized versions are also
available.  For more information see Fisher, Doddington, and
Goudie-Marshall (1986), Lamel, Kassel, and Seneff (1986), and Price,
Fisher, Bernstein, and Pallet (1988).
   2.  The Phonetic Database (PDB) at the University of Victoria consists
of language files in MS-DOS format that run with Micro Speech
Lab/KayLab hardware/software and illustrate speech sounds of some less
frequently encountered languages.  Each language has about 40-50 words
and a few to several sentences of text encoded.  It is intended to
provide illustrative and archival samples of different languages from
field data and lab recordings.  Some languages represented are Egyptian
Arabic, Cantonese, Modern Standard Chinese, Scots Gaelic, Inuktitut,
Korean, Miriam, Ditidaht, Nyangumarta, Rutooro, Runyoro, Skagit,
Spokane, Turkish, Umpila, Xhosa, Yoruba, Sinhala, and Japanese.  Files
are being converted to 20K sampled data for use with CSL (KayLab) and
ASL programs (on the IBM).  Concordance material is in written text
format.  For more information contact John Esling, Linguistics
Department, University of Victoria, British Columbia, Canada; email:
VQPLOT@UVVM.bitnet.
   3.  The Multi-Language Speech Database (in progress) is to be a large
10-language database of digitized speech recordings over the
telephone.  Plans are to gather five minutes of speech from each of 100
native speakers in each of 10 languages.  This database is scheduled
for completion in mid-1992, and will be made available to researchers
at nominal cost together with software (developed for X Windows under
UNIX) to display and interactively modify the speech files, and signal
processing functions that compute different parameters of the speech
waveform.  For more information:  Ronald Cole, Center for Spoken
Language Understanding, Oregon Graduate Institute of Science and
                                                                 [296]
Technology, 19600 NW Von Neumann Dr., Beaverton, OR  97006-1999, USA;
Tel: +1 (503) 690-1159; email:  cole@cse.ogi.edu.

H. Electronic Dictionaries
   1.  See the Wooldridge list of machine-readable dictionaries mentioned
under "Data Sources: Surveys."
   2.  The Oxford Text Archive (OTA) distributes several machine-readable
dictionaries, including some in languages other than English.  These
are listed and described, together with illustrative examples of the
more widely used, in the file ota/dicts/info, available via anonymous
ftp from black.ox.ac.uk (or 129.67.1.165).
   3.  The second edition of the Oxford English Dictionary is available on
CD-ROM from:  Electronic Publishing Division, Oxford University Press,
200 Madison Avenue, New York NY 10016; Tel: +1 (212) 679-7300, ext. 7370;
or Electronic Publishing Division, Oxford University Press, Walton
Street, Oxford OX2 6DP; Tel: +44 (865) 267979; email:
OUPJSC@VAX.OXFORD.AC.UK.
   4.  Le Robert Electronique is the electronic version of the nine-volume
French dictionary Le Grand Robert de la Langue Française (1985
edition).  It is available on CD-ROM for $995 (U.S.) from
Chadwyck-Healey Inc., 1101 King Street, Alexandria, VA 22314, USA; Tel:
+1 (703) 683-4890 or +1 (800) 752-0515; FAX: +1 (703) 683-7589; or
Chadwyck-Healey Ltd., Cambridge Place, Cambridge CB2 1NR, UK; Tel: +44
(223) 311479; FAX:  +44 (223) 66440.

I. Lexical Databanks

As noted in the introduction, this sampling of resources is probably
less complete than for the corpora and textbanks of running texts.
References relating to corpus-based lexicography include:  Altenberg
(1990), Atkins, Clear, and Ostler (1992), Boguraev and Briscoe (1988),
Gellerstam (1988), Sinclair (1987, 1992), and Walker (1992).
   1.  The MRC Psycholinguistic Database, described in Coltheart (1981),
consists of 150,837 entries from the Shorter OED with various forms of
additional information (including part of speech, the British
pronunciation, rating of concreteness, familiarity, and frequencies
from Kucera-Francis and Thorndike-Lorge) for various subsets of words.
It is available together with computer programs for efficient access,
written in C for UNIX systems via anonymous ftp from
laurel.ocs.mq.edu.au (or 137.111.3.11) as the file
                                                                 [297]
pub/wrec/incoming/mrc.tar.Z (binary) or from the Oxford Text Archive
(black.ox.ac.uk or 129.67.1.165) in the directory ota/dicts/1054.  The
Macintosh version has been produced by Philip Quinlan and is marketed
by the Oxford University Press.
   2.  The DARPA-funded Consortium for Lexical Research (CLR) (under
development), begun in 1989 and modeled partly after large data
projects such as the British National Corpus, is an
organization for sharing lexical data and tools used to perform
research on natural language dictionaries and lexicons, and for
communicating the results of that research.  It is intended to make
available to the whole natural language processing community certain
resources now held by only a few groups that have special relationships
with companies or dictionary publishers.  The CLR would as far as is
practically possible accept contributions from any source, regardless
of theoretical orientation, and make them available as widely as
possible for research.  It will be located at the Computing Research
Laboratory, Box 30001, Las Cruces, New Mexico, USA, under the direction
of Yorick Wilks and an ACL committee consisting of Roy Byrd, Ralph
Grishman, Mark Liberman and Don Walker.  An annual fee will be charged
for membership.  For information on participating in the CLR as a
provider or consumer of data, tools, or services, or on joining the
lexical information list: Natural Language Research, Consortium for
Lexical Research, Computing Research Lab, New Mexico State University,
Las Cruces, NM 88003, USA; Tel: +1 (505) 646-5466; FAX:  +1 (505)
646-6218; email: lexical@nmsu.edu or lexical@nmsu.bitnet.
   3. The Centre for Lexical Information (CELEX) has a relational database
containing lexical data on present-day Dutch (400,000 word forms),
English (150,000 word forms), and German (51,000 word forms) that it
makes available to institutes and companies for language and speech
research and for the development of language- and speech-oriented
technological systems.  It contains detailed information on
orthography, phonology, morphology, and syntax, as well as word
frequencies based on the COBUILD corpus (described above). New
information on translation equivalency is currently being developed,
along with additional syntactic and semantic subcategorizations to
establish semantic links among the three databases.  The CELEX user
interface was specially designed to make it easy for nontechnical
people to use the databases.  Researchers from several countries can
log onto CELEX remotely and use it interactively.  Costs for
noncommercial use are modest; for commercial use, somewhat more
expensive.  If the network connections are not sufficient, then CELEX
can prepare the information you require and send it on tape.  For more
information:  CELEX, Centre for Lexical Information, University of
                                                                 [298]
Nijmegen, Wundtlaan 1, 6525 XD NIJMEGEN, The Netherlands; email:
celex@celex.kun.nl or celex@hnympi52.bitnet.
   4.  ACQUILEX (Boguraev, Briscoe, Calzolari, Cater, Meijs, & Zampolli,
1988) is a project funded by the European Community to draw on and
extend current work on extracting data from published machine-readable
databases in multiple languages and formalizing the data to facilitate
the algorithmic processing of language.  It is described also in Walker
(1992).
   5.  The Cambridge Language Survey (CLS) is described above under
"English Language" Corpora.
   6.  Japanese Electronic Dictionary Research Project (in progress) is a
corpus-based project described in Walker (1991).  Details of the corpus
itself were not mentioned in this source but will no doubt become
widely known as the project continues.

J. Treebanks

These are databanks containing not only part of speech tags but also
labeled constituent structures (e.g., noun phrase, adverbial phrase,
coordinate clause).  Some treebanks were mentioned briefly above in the
descriptions of the Brown and LOB corpora (under "Running text:
English Language").  Bracketed structures have also been added to some
texts in the LLC (see Svartvik, 1990).  For parsed child language data,
see 5F above.  For a discussion of treebanks and methods used in
compiling them, see Leech & Garside (1991).
   1.  The Lancaster-Leeds Treebank, compiled by G. Sampson and G. N.
Leech, is a treebank of hand-parsed phrase structure analyses of 45,000
words from the LOB (written British English) representing all 15 of the
LOB categories of text types.  For more information:  Carol Lockhart,
CCALAS Secretary, Department of Linguistics and Phonetics, University
of Leeds, Leeds LS2 9JT, UK.
   2. The Lancaster Parsed Corpus (Garside, Leech & Sampson, 1987) is a
treebank of approximately 140,000 words from the LOB Corpus (written
British English) from all 15 LOB text types.  The sentences were all
automatically parsed with the UCREL parsing systems, using statistics
derived from the Lancaster-Leeds Treebank.  It is available for limited
distribution.  For more information:  UCREL Secretary, Department of
Linguistics and Modern English Language, University of Lancaster,
Lancaster LA1 4YT, UK.
   3. The Linguistic DataBase System (LDB) (de Haan, 1987; Lancashire,
1991; van Halteren & Oostdijk, 1988; van Halteren & van den Heuvel,
                                                                 [299]
1990) was developed by the TOSCA group at Nijmegen University.  It is a
software package which is distributed together with "syntactic analysis
trees" of all utterances from the 130,000-word Nijmegen Corpus of
modern British English.  The LDB was designed to be easy to use even
for computing novices and is independent of both formalism and
language, so it is possible to use it for any other kind of analyzed
corpus.  It can be used on VAX VMS systems, IBM PCs (AT preferred), and
UNIX systems, and in 1991 cost about $60 US for academic institutions
($3000 US for others).  It can be used to examine trees, search for
utterances with given properties, and handle database-wide queries
about constructs in the utterances.  A fully functional demonstration
version is available for any MS-DOS machine with hard disk.  For more
information:  Hans van Halteren, TOSCA Group, Department of English,
University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The
Netherlands; Tel: +31 (80) 512836; email: cor_hvh@kunrc1.urc.kun.nl.
   4.  The Penn Treebank is a databank of labeled bracketed structures,
for samples of written language (the Wall Street Journal) (98%) and
spoken language (Mari Ostendorf's WBUR radio transcripts) (2%).  For
more information:  The Penn Treebank Project, Department of Computer
and Information Science, School of Engineering and Applied Science,
University of Pennsylvania, Philadelphia, PA 19104, USA; email:
khanr@unagi.cis.upenn.edu or maryann@unagi.cis.upenn.edu.
   5.  Treebank of Written and Spoken American English (in progress) (as
mentioned in Walker, 1992) is to contain potentially millions of
sentences together with part of speech tags, skeletal syntactic
parsings and intonational boundaries for spoken language.  The data
themselves will be derived at least in part from the ACL/DCI collection
and are to be available through it.  For more information:  Mitch Marcus,
Department of Linguistics, University of Pennsylvania, Philadelphia,
PA, USA; email:  mitch@linc.cis.upenn.edu.
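
   As a rough illustration of what "labeled constituent structures" look
like in practice, the following minimal Python sketch reads one parse
written in a nested-bracket notation and recovers the words and their
part of speech tags.  The notation, the tag names, the sample sentence,
and the function names are all assumptions made for illustration; each
treebank listed above defines its own format, which should be taken
from its own documentation.

```python
def parse_brackets(text):
    """Parse a string such as '(NP (DT the) (NN corpus))' into nested
    (label, children) tuples.  Children are either tuples or word strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # constituent or part of speech label
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                     # consume the closing ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the first tree")
    return tree


def tagged_words(tree):
    """Return (word, tag) pairs from nodes that dominate a single word."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical sample sentence, not drawn from any corpus listed here.
sample = "(S (NP (DT the) (NN corpus)) (VP (VBZ contains) (NP (NNS texts))))"
tree = parse_brackets(sample)
words = [word for word, tag in tagged_words(tree)]
```

The nested tuples keep the phrase labels (here S, NP, and VP) available
for searches over constituents, while tagged_words flattens the same
structure back into the tagged running text from which a treebank is
built.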

K. Translation into English
   1.  English/French parallel texts are provided in the Hansards material
of the ACL/DCI, already described.
   2.  English/Italian parallel texts are part of the Italian Reference
Corpus in Pisa (see Bindi, Calzolari, Monachini, & Pirrelli, 1991).
   3.  Parallel texts in various combinations of languages are also one of
the goals of the Cambridge Language Survey (CLS), described above under
"English Language" Corpora.
                                                                 [300]
   4.  English translations of Pravda 1986-1987 on a CD-ROM disk for IBM
PC or compatible for $249 U.S. (Product #CD-1505, Description: PRAVDA)
are available from:  Bureau of Electronic Publishing, P. O. Box 779,
Upper Montclair, NJ  07043, USA; Tech. Support: Tel: +1 (201) 746-3033;
Orders:  +1 (201) 857-4300; FAX: +1 (201) 857-3031.


         6. LITERATURE PERTAINING TO ELECTRONIC CORPORA 

As sources for further information and bibliographies in corpus
linguistics, lexicography, computational linguistics, and humanities,
there are:  (1) the 200-work bibliography of research involving the
London-Lund Corpus in Svartvik (1990, Chapter 1), (2) the Altenberg
(1991) bibliography of corpus research on written and spoken language,
which is available also via the ICAME fileserver
(fileserv@nora.hd.uib.no), together with annual updates, and (3) Susan
Hockey's survey of resources for computer-assisted research in
literature and other humanities (included as the Appendix below).


                     ACKNOWLEDGEMENTS

This compilation is indebted to all of the sources cited above, but I
wish to thank especially the following people, for their help in
providing information, corrections, and suggestions concerning earlier
versions:  Lou Burnard, Helmut Feldweg, Stig Johansson, Knut Hofland,
Henry Kucera, Laura Proctor, and Don Walker.  As already noted, this
survey has benefited from several other corpus surveys concerning the
earlier corpora:  Chafe et al. (1992), Taylor, Leech & Fligelstone
(1989), the Georgetown University Catalog of Projects in Electronic
Text, and the catalogs of the OTA and ICAME archives.  Any errors that
remain are my own.
   Given the rapid growth in this area, I have no doubt inadvertently
overlooked some relevant projects.  To them, my apologies.  Similarly,
mention of any resource is not intended as endorsement.
   This work was made possible financially by the Institute of Cognitive
Studies, University of California at Berkeley, which, however, bears no
responsibility for opinions expressed in these pages.
   Finally, I wish to thank Susan Hockey for her generosity in
contributing the materials in the Appendix.


                                                                 [301]
                                REFERENCES

Ahmad, K., & Corbett, G.  (1987).  The Melbourne-Surrey Corpus.  ICAME
     Journal, 11, 39-43.
Altenberg, B.  (1990).  Spoken English and the dictionary.  In J.  Svartvik
     (Ed.), The London-Lund Corpus of Spoken English:  Description and
     Research (pp. 177-191).  Lund, Sweden: Lund University Press.
Altenberg, B.  (1991).  A bibliography of publications relating to English
     computer corpora.  In S.  Johansson & A. B. Stenstro"m (Eds.), English
     computer corpora: Selected papers and research guide.  New York: Mou-
     ton de Gruyter.
Atkins, B. H., Clear, J., & Ostler, N.  (1992).  Corpus design criteria.
     Literary and Linguistic Computing, 7, 1-16.
Bachenko, J., & Fitzpatrick, E.  (1990).  A computational grammar of
     discourse-neutral prosodic phrasing in English.  Computational
     Linguistics, 16, 155-170.
Bindi, R., Calzolari, N., Monachini, M., & Pirrelli, V.  (1991).  Lexical
     knowledge acquisition from textual corpora: A multivariate statistic
     approach as an integration to traditional methodologies.  In Using
     Corpora: Proceedings of the Seventh Annual New OED Conference (pp.
     170-196).  Waterloo, Ontario: UW Centre for the New OED and Text
     Research.
Boguraev, B., & Briscoe, T.  (Eds.).  (1988).  Computational lexicography
     for natural language processing.  London: Longman.
Boguraev, B., Briscoe, T., Calzolari, N., Cater, A., Meijs, W., & Zampolli,
     A.  (1988).  Acquisition of lexical knowledge for natural language
     processing systems.  Proposal for ESPRIT basic research activities.
     Cambridge: Cambridge University Press.
Brill, E., Magerman, D., Marcus, M., & Santorini, B.  (1990).  Deducing
     linguistic structure from the statistics of large corpora.  Proceed-
     ings of the DARPA Speech and Natural Language Workshop, June 1990 (pp.
     275-282).  Arlington, VA: Defense Advanced Research Projects Agency.
Bruce, G.  (1989).  Report from the IPA Working Group on Suprasegmental
     Categories.  Lund University, Department of Linguistics Working
     Papers, 35, 25-40.
Bruce, G.  (1992).  Comments.  In J. Svartvik (Ed.), Directions in corpus
     linguistics: Proceedings of the Nobel Symposium 82, Stockholm, August
     4-8, 1991 (pp. 145-147).  New York: Mouton de Gruyter.
Bruce, G., & Touati, P.  (1990).  On the analysis of prosody in spontaneous
     dialogue.  Lund University, Department of Linguistics Working Papers,
     36, 37-55.
Burnard, L.  (1991).  What is SGML and how does it help? (Document No.  TEI
     EDW 25).  Text Encoding Initiative listserver (listserv@uicvm.bitnet).
Carroll, J. B., Davies, P., & Richman, B.  (1971).  The American Heritage
     word frequency book.  Boston: Houghton Mifflin.
Chafe, W.  (1992).  The importance of corpus linguistics to understanding
     the nature of language.  In J. Svartvik (Ed.), Directions in corpus
     linguistics: Proceedings of the Nobel Symposium 82 (pp.  79-97).  New
     York: Mouton de Gruyter.
Chafe, W., Du Bois, J. W., & Thompson, S. A.  (1992).  Corpus of spoken
     American English.  Unpublished manuscript, Linguistics Department,
     University of California, Santa Barbara.
Church, K. W.  (1991).  [Review of J. Aarts & W. Meijs (Eds.), Theory and
     practice in corpus linguistics].  Computational Linguistics, 17, 99-
     103.
Church, K. W., & Hanks, P.  (1990).  Word association norms, mutual infor-
     mation, and lexicography.  Computational Linguistics, 16, 22-29.
Church, K. W., & Liberman, M.  (1991).  A status report on the ACL/DCI.
     In Using corpora:  Proceedings from the New OED Conference (pp.  84-91).
     Waterloo, Ontario: The University of Waterloo Centre for the New OED
     and Text Research.
Coltheart, M.  (1981).  The MRC psycholinguistic database.  Quarterly Jour-
     nal of Experimental Psychology, 33A, 497-505.
Dawson, J. L.  (1977).  Texts in machine-readable form and the University
     of Cambridge Literary and Linguistics Computing Centre.  CAMDAP, 7,
     25-30.
de Haan, P.  (1987).  Exploring the linguistic database: Noun phrase com-
     plexity and language variation.  In W. Meijs (Ed.), Corpus linguistics
     and beyond.  Amsterdam: Rodopi.
Fawcett, R. P.  (1980).  Language development in children 6-12: Interim
     report.  Linguistics, 18, 953-958.
Fillmore, C. J.  (1992).  "Corpus linguistics" or "Computer-aided armchair
     linguistics."  In J.  Svartvik (Ed.), Directions in corpus linguis-
     tics: Proceedings of the Nobel Symposium 82 (pp.  35-60).  New York:
     Mouton de Gruyter.
Fisher, W. M., Doddington, G. R., & Goudie-Marshall, K. M.  (1986).
     Proceedings of the Speech Recognition Workshop (Defense Advanced
     Research Projects Agency, Information Processing Techniques Office
     Report No. AD-A165 977).
Francis, W. N.  (1980).  A tagged corpus: Problems and prospects.  In S.
     Greenbaum, G. Leech, & J. Svartvik (Eds.), Studies in English linguis-
     tics for Randolph Quirk (pp. 192-209).  New York: Longman.
Francis, W. N.  (1982).  Problems of assembling and computerizing large
     corpora.  In S. Johansson (Ed.), Computer corpora in English language
     research (pp. 7-24).  Bergen: Norwegian Computing Centre for the
     Humanities.
Francis, W. N.  (1992).  Language corpora B. C.  In J. Svartvik (Ed.),
     Directions in corpus linguistics: Proceedings of the Nobel Symposium
     82 (pp. 18-32).  New York: Mouton de Gruyter.
Francis, W. N., & Kucera, H.  (Eds.).  (1979).  Manual of information to
     accompany a Standard Corpus of Present-Day Edited American English for
     use with digital computers (rev. ed.).  Providence, RI:  Brown Univer-
     sity, Department of Linguistics.
Francis, W. N., & Kucera, H.  (1982).  Frequency analysis of English usage:
     Lexicon and grammar.  Boston: Houghton Mifflin.
Garside, R., Leech, G., & Sampson, G.  (Eds.).  (1987).  The computational
     analysis of English: A corpus-based approach.  New York: Longman.
Gavioli, L., & Mansfield, G.  (1990).  The PIXI Corpora:  Bookshop
     encounters in English and Italian.  Bologna, Italy:  CLUEB.
Gellerstam, M. (Ed.).  (1988).  Studies in computer-aided lexicology.
     Stockholm: Almqvist & Wiksell International.
Gellerstam, M.  (1992).  Modern Swedish corpora.  In J. Svartvik (Ed.),
     Directions in corpus linguistics (pp. 149-163).  New York:  Mouton de
     Gruyter.
Greenbaum, S.  (1988).  A proposal for an international computerized corpus
     of English.  World Englishes, 7, 315.
Greenbaum, S.  (1990).  Standard English and the international corpus of
     English.  World Englishes, 9, 79-83.
Greenbaum, S.  (1992).  A new corpus of English: ICE.  In J. Svartvik
     (Ed.), Directions in corpus linguistics (pp. 171-179).  New York:
     Mouton de Gruyter.
Halliday, M. A. K.  (1992).  Language as system and language as instance:
     The corpus as a theoretical construct.  In J. Svartvik (Ed.), Direc-
     tions in corpus linguistics (pp. 61-77).  New York: Mouton de Gruyter.
Hayes, D. P.  (1988).  Speaking and writing: Distinct patterns of word
     choice.  Journal of Memory and Language, 27, 572-585.
Hayes, D. P., & Ahrens, M. G.  (1988).  Vocabulary simplification for chil-
     dren: A special case of motherese? Journal of Child Language, 15,
     395-410.
Hindle, D., & Rooth, M.  (1991).  Structural ambiguity and lexical rela-
     tions.  In Proceedings of the 29th Annual Meeting of the Association
     for Computational Linguistics (pp. 229-236).
Hockey, S.  (1991).  The ACH-ACL-ALLC Text Encoding Initiative: An overview
     (Document No.  TEI J16).  Text Encoding Initiative listserver
     (listserv@uicvm.bitnet).
Hughes, J. J.  (1987).  Bits, bytes and Biblical studies:  A resource guide
     for the use of computers in Biblical and Classical studies.  Grand
     Rapids, MI: Academie Books.
Ihalainen, O.  (1987).  The Helsinki Corpus of English Texts: Diachronic
     and dialectal.  Report on work in progress.  ICAME Journal, 11, 58-60.
Johansson, S., Atwell, E., Garside, R., & Leech, G.  (1986).  The tagged
     LOB corpus: Users' manual.  Bergen: Norwegian Computing Centre for the
     Humanities.
Johansson, S., Leech, G., & Goodluck, H.  (1978).  Manual of information to
     accompany the Lancaster-Oslo/Bergen corpus of British English for use
     with digital computers.  Oslo:  Department of English, University of
     Oslo.
Kjellmer, G.  (1984).  Some thoughts on collocational distinctiveness.  In
     J. Aarts & W. Meijs (Eds.), Corpus linguistics: Recent developments in
     the use of computer corpora in English language research (pp.  163-
     171).  Amsterdam: Rodopi.
Knowles, G., & Lawrence, L.  (1987).  Automatic intonation assignment.  In
     R. Garside, G. Leech, & G. Sampson (Eds.),  The computational analysis
     of English:  A corpus-based approach.  London: Longman.
Kucera, H.  (1992).  Brown corpus.  In S. C. Shapiro (Ed.), Encyclopedia of
     artificial intelligence (Vol. 1, pp. 128-130).  New York: John Wiley &
     Sons.
Kucera, H., & Francis, W. N.  (1967).  Computational analysis of present-
     day American English.  Providence, RI: Brown University Press.
Kyto", M.  (Ed.).  (1991).  Manual to the Diachronic part of the Helsinki
     Corpus of English Texts:  Coding conventions and lists of source
     texts.  Helsinki: University of Helsinki, Department of English.
     [Distributed by the Norwegian Computing Centre for the Humanities,
     Bergen].
Lamel, L. F., Kassel, R. H., & Seneff, S.  (1986).  Speech database
     development: Design and analysis of the acoustic-phonetic corpus.  In
     Proceedings of the DARPA Speech Recognition Workshop (pp. 100-109).
Lancashire, I.  (1991).  [Review of H. van Halteren & T. van den Heuvel,
     Linguistic exploitation of syntactic databases: The use of the
     Nijmegen Linguistic DataBase program].  Computational Linguistics, 17,
     457-461.
Lancashire, I., & McCarty, W.  (Eds.).  (1988).  Humanities computing year-
     book 1988.  Oxford:  Oxford University Press.
Leech, G.  (1991).  The state of the art in corpus linguistics.  In K.
     Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in
     honour of Jan Svartvik (pp. 8-29).  London: Longman.
Leech, G.  (1992).  Corpora and theories of linguistic performance.  In J.
     Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the
     Nobel Symposium 82 (pp. 105-122).  New York:  Mouton de Gruyter.
Leech, G., & Garside, R.  (1991).  Running a grammar factory: The produc-
     tion of syntactically analysed corpora or treebanks.  In S.  Johansson
     & A. B. Stenstro"m (Eds.), English computer corpora: Selected papers
     and research guide (pp. 15-32).  New York: Mouton de Gruyter.
Leech, G., & Svartvik, J.  (1975).  A communicative grammar of English.
     London: Longman.
Levelt, W. J. M., Mills, A., & Karmiloff, A.  (1981).  Child language
     research in ESF countries:  An inventory.  Strasbourg: ESF.
Liberman, M.  (1989).  Text on tap: The ACL/DCI.  In Proceedings of the
     DARPA Speech and Natural Language Workshop, Oct. 1989.  San Mateo, CA:
     Morgan Kaufmann.
MacWhinney, B.  (1991).  The CHILDES project: Tools for analyzing talk.
     Hillsdale, NJ: Lawrence Erlbaum Associates.
MacWhinney, B., & Snow, C.  (1985).  The child language data exchange sys-
     tem.  Journal of Child Language, 12, 271-296.
Maegaard, B., & Ruus, H.  (1987).  The compilation and use of a text
     corpus.  In A. Cappelli, L.  Cignoni, & C. Peters (Eds.), Studies in
     honour of Roberto Busa SJ (pp. 103-122).  Pisa:  Giardini.
Mergenthaler, E.  (1985).  Textbank systems: Computer science applied in
     the field of psychoanalysis.  New York: Springer-Verlag.
Morris, J., & Hirst, G.  (1991).  Lexical cohesion computed by thesaural
     relations as an indicator of the structure of text.  Computational
     Linguistics, 17, 21-48.
Oostdijk, N. A.  (1988).  Corpus for Studying Linguistic Variation.  ICAME
     Journal, 12.
Perdue, C. (Ed.).  (1984).  Second language acquisition by adult immi-
     grants.  A field manual.  Rowley, MA: Newbury House.
Perdue, C.  (Ed.).  (in press).  The crosslinguistic study of second
     languages.  Cambridge: Cambridge University Press.
Peters, P. H.  (1987).  Toward a corpus of Australian English.  ICAME Jour-
     nal, 11, 27-28.
Pierrehumbert, J.  (1980).  The phonology and phonetics of English intona-
     tion.  Bloomington, IN:  Indiana University Linguistics Club.
Pierrehumbert, J., & Hirschberg, J.  (1990).  The meaning of intonational
     contours in the interpretation of discourse.  In P. Cohen, J. Morgan,
     & M. Pollack (Eds.), Intentions in Communication.  Cambridge, MA: MIT
     Press.
Poplack, S.  (1989).  The care and handling of a mega-corpus: The Ottawa-
     Hull French Project.  In R. W. Fasold & D. Schiffrin (Eds.), Language
     change and variation (pp. 411-444).  Philadelphia: John Benjamins.
Price, P. J., Fisher, W. M., Bernstein, J., & Pallett, D. S.  (1988).  The
     DARPA 1000-word resource management database for continuous speech
     recognition.  In Proceedings of the 1988 IEEE International Conference
     on Acoustics, Speech, and Signal Processing (pp. 651-654).
Quirk, R.  (1974).  The linguist and the English language.  London: Long-
     man.
Quirk, R.  (1992).  On corpus principles and design.  In J. Svartvik (Ed.),
     Directions in corpus linguistics (pp. 457-469).  New York: Mouton de
     Gruyter.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J.  (1972).  A grammar of
     contemporary English.  London: Longman.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J.  (1985).  A comprehen-
     sive grammar of the English language.  London: Longman.
Raben, J., & Gaunt, M.  (forthcoming).  Electronic scholars research guide.
Renouf, A. J.  (1984).  Corpus development at Birmingham University.  In J.
     Aarts & W. Meijs (Eds.), Corpus linguistics: Recent developments in
     the use of computer corpora in English language research.  Amsterdam:
     Rodopi.
Renouf, A. J.  (1987).  Corpus development.  In J. M. Sinclair (Ed.), Look-
     ing up: An account of the Cobuild Project in lexical computing.  Lon-
     don: Collins ELT.
Rissanen, M.  (1992).  The diachronic corpus as a window to the history of
     English.  In J. Svartvik (Ed.), Directions in corpus linguistics:
     Proceedings of the Nobel Symposium 82 (pp. 185-205).  New York: Mouton
     de Gruyter.
Sampson, G.  (1992).  Probabilistic parsing.  In J. Svartvik (Ed.), Direc-
     tions in corpus linguistics:  Proceedings of the Nobel Symposium 82
     (pp. 425-447).  New York:  Mouton de Gruyter.
Shastri, S. V.  (1985).  A computer corpus of present-day Indian English: A
     preliminary report.  ICAME Journal, 9, 9-10.
Shastri, S. V.  (1988).  The Kolhapur Corpus of Indian English and work
     done on its basis so far.  ICAME Journal, 12, 15-26.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C.,
     Price, P., Pierrehumbert, J., & Hirschberg, J.  (1992, October).
     ToBI: A standard for labeling English prosody.  Paper presented at the
     International Conference on Spoken Language Processing, Banff,
     Alberta, Canada.
Sinclair, J. M.  (1982).  Reflections on computer corpora in English
     language research.  In S.  Johansson (Ed.), Computer corpora in
     English language research (pp. 1-6).  Bergen:  Norwegian Computing
     Centre for the Humanities.
Sinclair, J. M.  (Ed.).  (1987).  Looking up: An account of the COBUILD
     project in lexical computing.  London: Collins ELT.
Sinclair, J. M.  (1992).  The automatic analysis of corpora.  In J.  Svart-
     vik (Ed.),  Directions in corpus linguistics: Proceedings of the Nobel
     Symposium 82 (pp. 379-397).  New York: Mouton de Gruyter.
Sinclair, J. M., & Kirby, D. M.  (1990).  Progress in English computational
     lexicography.  World Englishes, 9, 21-36.
Sperberg-McQueen, C. M., & Burnard, L.  (Eds.).  (1992).  Guidelines for
     electronic text encoding and interchange (Document No. TEI P2, Chapter
     34).  Text Encoding Initiative listserver (listserv@uicvm.bitnet).
Summers, D.  (1991).  Longman computerization initiatives, corpus building,
     semantic analysis and Prolog version of LDOCE by Cheng-ming Guo.
     Proceedings of the International Workshop on Electronic Dictionaries
     (Document No. EDR TR-031, pp. 141-152).  Tokyo: Japan Electronic Dic-
     tionary Research Institute.
Svartvik, J.  (Ed.).  (1990).  The London-Lund Corpus of Spoken English:
     Description and research.  Lund, Sweden:  Lund University Press.
Svartvik, J.  (1992a).  Corpus linguistics comes of age.  In J. Svartvik
     (Ed.), Directions in corpus linguistics:  Proceedings of the Nobel
     Symposium 82 (pp. 7-13).  New York: Mouton de Gruyter.
Svartvik, J.  (Ed.).  (1992b).  The London-Lund corpus of spoken English:
     Users' manual.  Lund, Sweden: Lund University, Department of English.
     [Distributed by the Norwegian Computing Centre for the Humanities,
     Bergen].
Svartvik, J., & Quirk, R.  (Eds.).  (1980).  A corpus of spoken English.
     Lund, Sweden: Lund University Press.
Taylor, L., Leech, G., & Fligelstone, S.  (1989).  Lancaster preliminary
     survey of machine-readable language corpora.  Lancaster, England:
     University of Lancaster, Linguistics Department.  [available from the
     Humanist and NCCH fileservers, see text]
Taylor, L., Leech, G., & Fligelstone, S.  (1991).  A survey of English
     machine-readable corpora.  In S. Johansson & A. B. Stenstro"m (Eds.),
     English computer corpora: Selected papers and research guide (pp.
     319-354).  New York: Mouton de Gruyter.
Teubert, W.  (1984).  Setting up a lexicographical data-base for German.
     In R. R. K. Hartmann (Ed.), LEXeter 83 Proceedings: Papers from the
     International Conference on Lexicography at Exeter (pp.  425-429).
     Tuebingen: Max Niemeyer.
van Halteren, H., & Oostdijk, N.  (1988).  Using an analyzed corpus as a
     linguistic database.  In J.  Roper (Ed.), Computers in literary and
     linguistic research: Proceedings of the XIIIth ALLC Conference
     (Norwich 1986).  Geneva: Slatkine.
van Halteren, H., & van den Heuvel, T.  (1990).  Linguistic exploitation of
     syntactic databases.  Amsterdam: Rodopi.
Walker, D. E.  (1987).  Knowledge resource tools for accessing large text
     files.  In A. Cappelli, L.  Cignoni, & C. Peters (Eds.), Studies in
     honour of Roberto Busa SJ (pp. 279-300).  Pisa:  Giardini.
Walker, D. E.  (1991).  The ecology of language.  Proceedings of the
     International Workshop on Electronic Dictionaries (Document No. EDR
     TR-031, pp. 10-22).  Tokyo, Japan: Japan Electronic Dictionary Research
     Institute.
Walker, D. E.  (1992).  Developing computational lexical resources.  In E.
     F. Kittay & A. Lehrer (Eds.), Frames, fields, and contrasts: New
     essays in semantic and lexical organization.  Hillsdale, NJ: Lawrence
     Erlbaum Associates.
Walker, D. E., & Hockey, S.  (1991).  The Text Encoding Initiative.  Bul-
     letin du CID.  Paris: Centre des Hautes Etudes Internationales
     d'Informatique Documentaire.


                                                                 [307]
                                    APPENDIX

                 HUMANITIES COMPUTING BIBLIOGRAPHY (APRIL 1990)
                      SUSAN HOCKEY (HOCKEY@ZODIAC.BITNET)
                          (REPRODUCED WITH PERMISSION)

The following bibliography was distributed at a tutorial on Text Analysis 
Computing given by the CTI Centre for Literature and Linguistic Studies at 
the Conference on Computers and Teaching in the Humanities held in St. Andrews,
Scotland in April 1990. The CTI Centre for Literature and Linguistic Studies 
is based at Oxford University.  While some of these items date back over 10 
years, they do cover all the basic techniques for text-based humanities 
computing, some of which are not so easy to find in more recent publications.
All these items except the very latest, and of course many more, can be found 
in Ian Lancashire and Willard McCarty (Eds.), Humanities Computing Yearbook, 
Oxford University Press, 1988, which is an excellent starting point.  The CTI 
promotes and supports the use of computers in teaching text-based subjects and 
is part of the Centre for Humanities Computing at Oxford, which supports 
several research projects in text analysis computing.  The bibliography has 
been compiled over several years and is used in a course taught by Susan 
Hockey, the Director of the CTI Centre, at Oxford and in lectures and seminars 
given elsewhere by staff of the Centre.  The CTI Centre has a mailing list, 
which can be contacted at CTITEXT@VAX.OX.AC.UK.

Books-Monographs

Butler, C.  (1985).  Computers in linguistics.  New York: Blackwell.
Hockey, S.  (1980).  A guide to computer applications in the humanities.
     London:  Duckworth.
Oakman, R. L.  (1980).  Computer methods for literary research (1st ed.).
     Columbia:  University of South Carolina Press.
Oakman, R. L.  (1984).  Computer methods for literary research (rev.  ed.).
     Athens:  University of Georgia Press.
Rudall, B. H., & Corns, T. N.  (1987).  Computers and literature: A
     practical guide.  Tunbridge Wells, Kent, and Cambridge, MA: Abacus
     Press.


Books-Resource Guides

Hughes, J. J.  (1987).  Bits, bytes and Biblical studies: A resource guide
     for the use of computers in Biblical and Classical studies.  Grand
     Rapids, MI: Academie Books.
Lancashire, I., & McCarty, W.  (Eds.).  (1988).  Humanities computing year-
     book 1988.  Oxford: Oxford University Press.


Conference Proceedings

Ager, D. E., Knowles, F. E., & Smith, J. M. (Eds.).  (1978).  Advances in
     computer-aided literary and linguistic research.  Birmingham, England:
     University of Aston, Department of Modern Languages.  (ALLC, 1978)
Aitken, A. J., Bailey, R. W., & Hamilton-Smith, N.  (Eds.).  (1973).  The
     computer and literary studies.   Edinburgh: Edinburgh University
     Press.  (Edinburgh conference, 1972)
Allen, R. F. (Ed.).  (1986).  Data bases in the humanities and social sci-
     ences.  Osprey, FL: Paradigm.
Bailey, R. W.  (Ed.).  (1982).  Computing in the humanities: Papers from
     the Fifth International Conference on Computing in the Humanities.
     Amsterdam: North Holland.
Burton, S. K., & Short, D. D.  (Eds.).  (1983).  Sixth International
     Conference on Computers and the Humanities.  Rockville, MD: Computer
     Science Press.
Cameron, K. C.,  Dodd, W. S., & Rahtz, S. P. Q.  (Eds.).  (1986).  Comput-
     ers and modern language studies.  Chichester, England: Ellis Horwood;
     New York:  Halsted.
Charpentier, C., & David, J.  (Eds.).  (1985).  La recherche franc,aise par
     ordinateur en langue et litterature.  Geneva: Slatkine.
Choueka, Y.  (Ed.).  (1990).  Computers in literary and linguistic
     research:  Proceedings of the Fifteenth International ALLC Conference.
     Geneva: Slatkine.
Cignoni, L., & Peters, C.  (Eds.).  (1983).  Computers in literary and
     linguistic research: Proceedings of the Seventh International Sympo-
     sium of the Association for Literary and Linguistic Computing, Pisa
     1982.  Pisa: Giardini.
Hamesse, J., & Zampolli, A.  (Eds.).  (1985).  Computers in literary and
     linguistic computing: Proceedings of the Eleventh International ALLC
     Conference.  Geneva:  Slatkine.
Jones, A., & Churchhouse, R. F.  (Eds.).  (1977).  The computer in literary
     and linguistic studies: Proceedings of the Third International Sympo-
     sium.  Cardiff:  University of Wales Press.  (ALLC, 1974)
Lusignan, S., & North, J. S.  (Eds.).  (1977).  Computing in the humani-
     ties:  Proceedings of the Third International Conference on Computing
     in the Humanities.  Waterloo, Ontario: University of Waterloo Press.
Miall, D. S.  (Ed.).  (1990).  Humanities and the computer: New directions.
     Oxford: Oxford University Press.  (Conference on computers and teach-
     ing in the humanities, 1988)
Mitchell, J. L.  (Ed.).  (1974).  Computers in the humanities.  Edinburgh:
     Edinburgh University Press. (ICCH, 1973)
Patton, P. C., & Holoien, R. A.  (Eds.).  (1981).  Computing in the humani-
     ties.  Lexington, MA: Heath.
Raben, J., & Marks, G.  (Eds.).  (1980).  Databases in the humanities and
     social sciences.  Amsterdam: North Holland.
Rahtz, S.  (Ed.).  (1987).  Information technology in the humanities:
     Tools, techniques and applications.  Chichester: Ellis Horwood; New
     York: Halsted.
Roper, J. P. G.  (Ed.).  (1988).  Computers in literary and linguistic
     research:  Proceedings of the Thirteenth International ALLC Confer-
     ence.  Geneva: Slatkine.
Wisbey, R. A.  (Ed.).  (1971).  The computer in literary and linguistic
     research.  Cambridge: Cambridge University Press.  (Cambridge confer-
     ence, 1970)


Periodicals

Bulletin of the Association for Literary and Linguistic Computing ("ALLC
     Bulletin") (1973-1985).  Three issues per year.
Computational Linguistics, formerly American Journal of Computational
     Linguistics.  Now in volume 16 (1990).  Published quarterly by ACL.
Computers and the Humanities (1966- ).  Has had several publishers.  Now
     published by Kluwer.  Four issues per year (six from 1989).  Covers
     language, literature, history, archaeology, music, and education.
     Sponsored by ACH.
ICAME Journal, formerly ICAME News, International Computer Archive of
     Modern English, Norwegian Computing Centre for the Humanities, PO Box
     53, Bergen, Norway.
Journal of the Association for Literary and Linguistic Computing ("ALLC
     Journal") (1980-1985). Was also published by the ALLC.  Two issues per
     year.
Linguistica Computazionale, Giardini, Pisa.
Literary and Linguistic Computing (1986- ).  In 1986, the ALLC publications
     were merged into a single journal, Literary and Linguistic Computing,
     published by Oxford University Press.  It covers all aspects of com-
     puter usage in literary and linguistic research.
Revue: Informatique et Statistique dans les Sciences Humaines.


Newsletters

Bits and Bytes Review (1986- ).  Bits and Bytes Computer Resources, 623
     North Iowa Avenue, Whitefish, MT 59937, USA.  Reviews of software,
     hardware, and new publications.
Computers in Literature (1990).  Newsletter of the CTI Centre for Litera-
     ture and Linguistic Studies, OUCS, 13 Banbury Road, Oxford, UK.
There are also a number of newsletters for specific subjects, some of
which, for example CALCULI (Classics) and CAMDAP (Medieval Studies),
are now defunct but contain useful information.  The Humanities
Computing Newsletter, Office for Humanities Communication, Bath, UK, and
Ontario Humanities Computing, obtainable from CCH, Toronto, are two of
the best general ones.


English for Language Research-Corpus Linguistics

Garside, R., Leech, G., & Sampson, G.  (Eds.).  (1987).  The computational
     analysis of English: A corpus-based approach.  New York: Longman.
Sinclair, J. M.  (Ed.).  (1987).  Looking up: An account of the COBUILD
     project in lexical computing. London: Collins.


Stylistic Analysis

Burrows, J. F.  (1987).  Computation into criticism:  A study of Jane
     Austen's novels and an experiment in method.  Oxford: Oxford University
     Press.
Dolezel, L., & Bailey, R. W.  (1969).  Statistics and style.  New York:
     Elsevier.
Ellegard, A.  (1962).  Who was Junius?  Stockholm: Almqvist and Wiksell.
Kenny, A.  (1978).  The Aristotelian ethics.  Oxford: Clarendon.
Kenny, A.  (1982).  The computation of style: An introduction to statistics
     for students of literature and humanities.  New York: Pergamon.
Morton, A. Q.  (1978).  Literary detection: How to prove authorship and
     fraud in literature and documents.  Epping, England: Bowker;  New
     York: Scribner.
Morton, A. Q., & Winspear, A. D.  (1971).  It's Greek to the computer.
     Montreal:  Harvest House.
Mosteller, F., & Wallace, D. L.  (1964).  Inference and disputed author-
     ship: The Federalist.  Reading, MA: Addison Wesley.
Muller, C.  (1973). Initiation aux methodes de la statistique linguistique.
     Paris:  Hachette.

-------------------------- end of text ------------------------------