Date: 18 December 1993
From: Jane Edwards (edwards@cogsci.berkeley.edu) via ftp from 128.32.211.5
Subject: Survey of Electronic Corpora and Related Resources

Appended below is the electronic version of Chapter 10 (pp. 263-310) from the following book, reproduced by permission of the publisher:

Edwards, Jane A. & Martin D. Lampert (eds). TALKING DATA: TRANSCRIPTION AND CODING IN DISCOURSE RESEARCH. London and Hillsdale, NJ: Erlbaum. 336 pp. 0-8058-0349-1 [ppr] US $27.50; 0-8058-0348-3 [hdbk] US $59.95; (Prepaid: $24.75 & $53.95)

Discourse, spoken language corpora. Transcription and coding systems from contrasting approaches to spoken language situated in their theoretical frameworks with sample analyses. Overview chapters present global design principles. Includes a large compendium of computerized corpora and related resources. To order in US: 1-800-926-6579

I would greatly appreciate knowing of inaccuracies or additional resources which should be mentioned in an update to be submitted at a future date if needed to the ICAME fileserver in Bergen (see below).

Best Wishes,
-Jane Edwards (edwards@cogsci.berkeley.edu)

--------------------------------------------------------------------------

Chapter 10: Survey of Electronic Corpora and Related Resources for Language Researchers

Jane A. Edwards
University of California at Berkeley

CONTENTS

1. INTRODUCTION . . . 267
2. INFORMATION SOURCES . . . 269
   A. Centers and Associations . . . 269
      (1) NCCH (Norwegian Computing Centre for Humanities) . . . 269
      (2) CTI (Computers in Teaching Initiative Centre for Textual Studies) . . . 269
      (3) CETH (Center for Electronic Texts in the Humanities) . . . 270
      (4) ACH (Association for Computers and the Humanities) . . . 270
      (5) ALLC (Association for Literary and Linguistic Computing) . . . 271
      (6) ACL (Association for Computational Linguistics) . . . 271
   B. Electronic Mail Distribution Lists and Discussion Lists . . . 272
      (1) HUMBUL . . . 272
      (2) CORPORA . . . 272
      (3) HUMANIST . . . 272
      (4) LINGUIST . . . 273
      (5) LN, Langage Naturel, . . . 273
      (6) PROSODY . . . 274
      (7) Comserve . . . 274
      (8) Applied linguistics (TESL-L, SLART-L, MULTI-L, LTEST-L) . . . 274
      (9) FUNKNET . . . 275
      (10) info-childes and info-psyling . . . 275
      (11) ASLING-Linguistics of Signed Languages . . . 275
      (12) List of lists . . . 275
   C. Email Addresses . . . 276
3. TEXT ENCODING STANDARDS (TEI, IPA, SAM, TOBI) . . . 276
4. DATA SOURCES . . . 278
   A. Electronic Data Archives and Repositories . . . 278
      (1) OTA (Oxford Text Archive) . . . 278
      (2) ICAME (International Computer Archive of Modern English) . . . 278
      (3) CHILDES (The Child Language Exchange System) . . . 279
      (4) CETH (Center for Electronic Texts in the Humanities) . . . 279
      (5) The AIATSIS Aboriginal Studies Electronic Data Archive . . . 280
      (6) Project Gutenberg . . . 280
      (7) Library of the Future . . . 280
   B. Surveys of Electronic Language Data . . . 280
      (1) Oxford Text Archive (OTA) catalogue . . . 280
      (2) University of Lancaster Survey . . . 280
      (3) Georgetown University Catalog of Archives and Projects . . . 281
      (4) Walker and Zampolli survey . . . 281
      (5) List of Electronic Texts in Philosophy . . . 281
      (6) List of Electronic Dictionaries . . . 281
      (7) Catalog of the University of Cambridge Literature and Linguistics Computing Centre . . . 282
      (8) Linguistic Society of America List . . . 282
      (9) The Marchand list of CD-ROM Projects . . . 282
      (10) ARL Directory of Electronic Publications . . . 282
5. CORPORA AND TEXTBANKS . . . 282
   A. Running text: English Language . . . 283
      (1) Brown Corpus . . . 283
      (2) Lancaster-Oslo/Bergen (LOB) . . . 284
      (3) London-Lund Corpus . . . 285
      (4) Lancaster Spoken English Corpus (SEC) . . . 285
      (5) PIXI Corpora . . . 285
      (6) Helsinki Corpus of Historical English . . . 286
      (7) Macquarie (University) Corpus . . . 286
      (8) Kolhapur Corpus of Indian English . . . 286
      (9) American Heritage Intermediate Corpus . . . 286
      (10) Birmingham Collection of English Text (BCET) . . . 286
      (11) Longman/Lancaster English Language Corpus . . . 287
      (12) Corpus of Spoken American English (CSAE) . . . 287
      (13) International Corpus of English (ICE) . . . 287
      (14) British National Corpus Initiative (BNC) . . . 287
      (15) Bellcore Lexical Research Corpora . . . 288
      (16) Association for Computational Linguistics Data Collection Initiative (ACL/DCI) . . . 288
      (17) European Corpus Initiative (ACL/ECI) . . . 289
      (18) Cambridge Language Survey (CLS) . . . 289
      (19) Linguistic Data Consortium (LDC) . . . 289
      (20) American News Stories . . . 290
      (21) Nijmegen TOSCA Corpus . . . 290
      (22) Melbourne-Surrey Corpus . . . 290
      (23) Corpus of English-Canadian Writing . . . 290
      (24) Warwick Corpus . . . 290
      (25) Cornell corpus . . . 290
      (26) NEXIS, LEXIS, MEDIS (Mead Data Central) and WESTLAW (West Corporation) . . . 291
   B. Running text: French Language . . . 291
      (1) OTA holdings . . . 291
      (2) Hansard Canadian Parliamentary Sessions . . . 291
      (3) Ottawa-Hull Corpus of Spoken French . . . 291
      (4) Tresor de la Langue Francaise (TLF or ARTFL) . . . 291
   C. Running text: German Language . . . 292
      (1) Mannheim Corpus . . . 292
      (2) Bonner Zeitungskorpus . . . 292
      (3) Freiburger Corpus . . . 292
      (4) LIMAS Corpus . . . 292
      (5) Pfeffer Spoken German Corpus . . . 292
      (6) Ulm Textbank . . . 292
      (7) Muenster Textbank . . . 292
   D. Running text: Italian Language . . . 292
      (1) PIXI corpora . . . 292
      (2) Pisa corpus . . . 292
   E. Running text: Other Languages . . . 293
      (1) Native American Languages . . . 293
      (2) Australian Indigenous Languages . . . 293
      (3) Danish . . . 293
      (4) Estonian . . . 293
      (5) Finnish . . . 293
      (6) Spanish . . . 293
      (7) Swedish . . . 293
      (8) Yugoslavian . . . 293
   F. Running text: Language Acquisition . . . 294
      (1) Child Language Acquisition (CHILDES, PoW) . . . 294
      (2) Adult Second Language Acquisition (ESFSLDB, Montreal) . . . 294
   G. Phonetic Databases . . . 295
      (1) DARPA Speech Recognition Research Databases . . . 295
      (2) Phonetic Database (PDB) . . . 295
      (3) Multi-Language Speech Database . . . 295
   H. Electronic Dictionaries . . . 296
      (1) See the Wooldridge list . . . 296
      (2) Oxford Text Archive (OTA) holdings . . . 296
      (3) Oxford English Dictionary (OED) . . . 296
      (4) Le Robert Electronique . . . 296
   I. Lexical Databanks . . . 296
      (1) MRC Psycholinguistic Database . . . 296
      (2) Consortium for Lexical Research (CLR) . . . 297
      (3) Centre for Lexical Information (CELEX) . . . 297
      (4) Acquisition of Lexical Knowledge (ACQUILEX) . . . 298
      (5) Cambridge Language Survey (CLS) . . . 298
      (6) Japanese Electronic Dictionary Research Project . . . 298
   J. Treebanks . . . 298
      (1) Lancaster-Leeds Treebank . . . 298
      (2) Lancaster Parsed Corpus . . . 298
      (3) Linguistic DataBase System (LDB) . . . 298
      (4) Penn Treebank Project . . . 299
      (5) Treebank of Written and Spoken American English . . . 299
   K. Translation into English . . . 299
6. LITERATURE PERTAINING TO ELECTRONIC CORPORA . . . 300
ACKNOWLEDGMENTS . . . 300
REFERENCES . . . 301
APPENDIX . . . 307

[267]

1. INTRODUCTION

Corpora and textbanks of natural language sentences or utterances are becoming increasingly widely used in linguistics, lexicography, and computer science research, in part due to facilitatory technological advances but also due to a broadening of focus in these three fields to include a greater interest in produced language (vs. introspective knowledge), structured interdependencies involving larger stretches of text (vs. individual utterances or sentences), and contrasts across language varieties, genres, and modalities (e.g., British vs. American English; narratives vs. interviews; spoken vs. written language). For further discussion, see Chafe (1992), Church (1991), Fillmore (1992), Francis (1982), Halliday (1992), Leech (1991, 1992), Sinclair (1992), and Svartvik (1992a).
It is significant that a corpus often contains utterances or sentences which would seem implausible from introspection but are perfectly natural and acceptable in context (such as "It'll've been going to've been being tested every day for about a fortnight soon!" from Halliday, 1992), and conversely, that sentences invented to illustrate grammatical points may seem implausible as actual utterances because they violate discourse constraints or expectations reflected in definiteness of referents, aspectual perspectives taken on events, or other properties (see Chafe, 1992, for examples and discussion). Corpus-based approaches can bring to light aspects of linguistic structure and process which are not illuminated in introspectively generated data or psycholinguistic experiments and are needed for comprehensive understanding of language phenomena (see Chafe, 1992; Leech, 1991, 1992; Svartvik, 1992a concerning the particular contributions of different approaches). In lexicography, corpora and textbanks enable a more efficient exhaustive cataloging of word senses and collocations than is possible with introspection alone (see Kjellmer, 1984; Sinclair, 1982; Sinclair & Kirby, 1990). In addition, they enable systematic attention to contrasts between spoken and written uses of words, contrasts in meaning as a function of position in the utterance or prosodic features, and the relative frequencies of word senses (see Altenberg, 1990, for a comparison of corpus-based dictionaries). 
Corpora of increasing size are also being used in probabilistic sense disambiguation, speech recognition, automatic syntactic analysis, automatic assignment of intonation to written texts, and other types of models and applications (to name but a few: Bachenko & Fitzpatrick, 1990; Bindi, Calzolari, Monachini, & Pirrelli, 1991; Brill, Magerman, Marcus, & Santorini, 1990; Church & Hanks, 1990; Hindle & Rooth, 1991; Knowles & Lawrence, 1987; Leech & Garside, 1991; Liberman, 1989; Morris & Hirst, 1991; Sampson, 1992; Svartvik, 1990). Where one million words was once considered large, some of the projects summarized below seek to gather 100 million words. For written [268] language, this is facilitated by the increasing availability of text already on computer media (such as from typesetter tapes). Spoken language is less frequently available in this way, and therefore must be specially gathered and prepared for electronic use. In both cases, data sharing and reuse are increasingly important both within and across disciplinary boundaries, and a single (large) corpus community seems to be emerging.

This survey is intended in a modest way to help with this development. Its focus is electronic corpora and textbanks, and related information of primary interest to linguistic, computer science, and humanities research. The information summarized here was garnered from standard published sources and the email discussion lists described below. For accuracy, the wording of the individual descriptions is as close as possible to the original source, which is typically the person cited as the contact person in the entry. In addition, the descriptions of completed corpora owe a debt to the following: Chafe, Du Bois, and Thompson (1992); Svartvik (1990); Taylor, Leech, and Fligelstone (1989); and the catalogs of the Oxford Text Archive, the ICAME archive, and the Georgetown University archives project, all described below.
What is unique to the current survey is its inclusion of a number of projects and corpora that have sprung up during the past two years, a heavier representation of projects in computational linguistics than in available surveys to date, and the inclusion of electronic discussion lists and public lists of email addresses, few of which were available at the time of the earlier surveys. The first version of this compilation was completed in 1991, and was updated and expanded to include new developments through October 1992. Although I have attempted to make this survey as complete as possible, this is a rapidly growing area. Any update of this survey will be submitted to the ICAME fileserver (see below), possibly for access via anonymous ftp (file transfer software available on many mainframes).

Concerning corpora developed before computers, readers are referred to Francis (1992). Lexicographical resources are treated here only briefly, in Sections 5H and 5I. For further information, readers are referred to Altenberg (1990), Atkins, Clear, and Ostler (1992), Boguraev and Briscoe (1988), Gellerstam (1988), Sinclair (1987), Sinclair and Kirby (1990), and Walker (1992).

The materials surveyed below are organized with respect to five main headings:

- information sources (associations, email addresses and discussion lists);
- encoding standards;
- data sources (archives and repositories, surveys of electronic language data);
- descriptions of selected corpora and textbanks; and
- bibliographies of related research.

[269] The Appendix contains Susan Hockey's summary of resources relevant to humanities computing.

2. INFORMATION SOURCES

A. Centers and Associations

The following organizations encourage corpus-related research and the exchange of corpus-related information by publishing journals, sponsoring conferences and workshops, and engaging in various other professional activities.
(Organizations concerned with the gathering and distribution of electronic data are summarized under "Data Sources," later in this chapter.)

1. The Norwegian Computing Centre for the Humanities (NCCH) was established in 1972 as a center for research and development to help individual researchers and academic institutions in the use of computers in the humanities. To this end, it develops computing methods and software for application in humanistic research and provides information and teaching services to demonstrate how computer technology can be utilized in the field. This work is carried out in cooperation with humanities research institutions and the Norwegian universities' computing departments. NCCH houses the ICAME archive (described later), which contains the most widely used linguistic corpora of English, and distributes these data at low cost to researchers. Its ICAME CD-ROM contains the Brown Corpus (written American English), the LOB Corpus (written British English), the London-Lund Corpus (spoken British English), the Helsinki Corpus (diachronic English) and the Kolhapur Corpus (Indian English), and costs roughly $500 US. Further information on the CD-ROM can be obtained by emailing the message "send icame info.cd" to fileserv@nora.hd.uib.no, or via anonymous ftp to nora.hd.uib.no (129.177.24.42) (filename: pub/icame/info.cd). NCCH sponsors the electronic bulletin board, "CORPORA" (described below), and serves as a clearinghouse for information concerning corpora, corpus availability, and corpus research. For more information: NCCH, Humanistisk Datasenter, Harald Haarfagres gt. 31, N-5007 Bergen, Norway; Tel: +47 (5) 212954; FAX: +47 (5) 322656; email: adm@nora.hd.uib.no or knut@x400.hd.uib.no.

2. The Computers in Teaching Initiative Centre for Textual Studies (CTI) was established in 1990 to promote and support the use of computers in teaching literature, linguistics and related disciplines in all British universities.
Begun under the direction of Susan Hockey, the CTI produces a newsletter, called Computers in Literature, [270] and a software guide and holds periodic training workshops concerned with the use of computers in humanities training and research. It also sponsors the Humanities Bulletin Board (HUMBUL) described in a later section. For more information: CTI Centre for Textual Studies, University of Oxford Computing Services, 13 Banbury Road, Oxford, OX2 6NN, UK; Tel: +44 (865) 273 221; FAX: +44 (865) 273 275; email: ctitext@vax.oxford.ac.uk.

3. The Center for Electronic Texts in the Humanities (CETH), directed by Susan Hockey, was established in 1991 by Rutgers and Princeton Universities with external support from the Mellon Foundation and the National Endowment for the Humanities. It is intended to become a national focus of interest in the United States for those who are involved in the creation, dissemination and use of electronic texts in the humanities, and it will act as a national node on an international network of centers and projects which are actively involved in the handling of electronic texts. Developed from the international inventory of machine-readable texts which was begun at Rutgers in 1983 and is held on RLIN, the Center is now reviewing the records in the inventory and continues to catalog new texts. The acquisition and dissemination of text files to the community is another important activity, concentrating on a selection of good quality texts which can be made available over the Internet with suitable retrieval software and with appropriate copyright permission. The Center also acts as a clearinghouse on information related to electronic texts, directing inquirers to other sources of information. Susan Hockey's useful list of resources for humanities computing is included below in the Appendix.
For further information: Center for Electronic Texts in the Humanities, 169 College Avenue, New Brunswick, NJ 08903, USA; email: ceth@zodiac.rutgers.edu or ceth@zodiac.bitnet or hockey@zodiac.bitnet; Tel: +1 (908) 932-1384; FAX: +1 (908) 932-1386.

4. The Association for Computers and the Humanities (ACH) is an international organization devoted to computer-aided research in literature and language studies, history, philosophy, anthropology, and related social sciences, especially research involving the manipulation and analysis of textual materials. The ACH encourages development and dissemination of significant textual and linguistic resources and software for scholarly research. Its official journal, Computers and the Humanities, is published six times a year. It also publishes Bits and Bytes Review, a review of software in the humanities and social sciences, nine times each year. Jointly with the ALLC (see next entry), it sponsors an annual meeting held in North America in odd-numbered years and in Europe in even-numbered years, which brings together scholars from around the world to report on research activities and software and hardware developments in the field. ACH [271] initiated the Text Encoding Initiative (TEI), an international effort to develop guidelines for the encoding of machine-readable literary and linguistic data. The ACH also sponsors the Rutgers/Princeton National Text Archive, the HUMANIST Electronic Discussion Group, and the LN Electronic Bulletin Board for Natural Language Studies in French and English. For further information: Joseph Rudman, Association for Computers and the Humanities, Department of English, Carnegie Mellon University, Pittsburgh, PA 15213, USA; email: rudman@cmphys.bitnet.

5. The Association for Literary and Linguistic Computing (ALLC) has representatives in over 30 countries, including advisors in the following areas: Machine Translation, Computer-Assisted Learning, Lexicography, Software, Structured Databases.
Its journal, Literary and Linguistic Computing, is published four times per year, containing papers on all aspects of computing applied to literature and language, ranging from computing techniques to results of research projects. To join ALLC and obtain the journal: Journals Marketing, Oxford University Press, Pinkhill House, Southfield Road, Eynsham, Oxford, OX8 1JJ, UK, or Journals Marketing, Oxford University Press, 2001 Evans Road, Cary, NC 27513, USA.

6. The Association for Computational Linguistics (ACL) promotes research on computational linguistics and natural language processing. It publishes the journal Computational Linguistics and sponsors annual meetings (usually in North America), biennial European meetings, and biennial meetings on applied natural language processing, and supports the international conferences on Computational Linguistics (COLING). Proceedings of past meetings are available through the ACL Office. The ACL also sponsors the Text Encoding Initiative (TEI), for standardizing the encoding and interchange of machine-readable text, and two data collection initiatives, the Data Collection Initiative (DCI) and the European Corpus Initiative (ECI) (both described later, under "Corpora and Textbanks"), which aim to assemble massive text corpora in English and other languages and to make them available for scientific research at cost and without royalties. Recently, the ACL established a series of Special Interest Groups (SIGs) on the Mathematics of Language, the Lexicon, Parsing, Generation, Computational Phonetics, and Multimedia Language Processing. Others are likely. The SIGs organize workshops, prepare bibliographies, and provide specialized communication channels. For more information: Donald E. Walker (ACL), Bellcore, MRE 2A379, 445 South Street, Box 1910, Morristown, NJ 07960-1910, USA; FAX: +1 (201) 829-5981; email: walker@bellcore.com.

[272] B.
Electronic Mail Distribution Lists and Discussion Lists

Electronic distribution lists and discussion lists distribute messages contributed by subscribers to all other subscribers on that list. They are a good forum for queries and current information, are easy to join and leave, and often cost nothing beyond what the user's institution is already paying for email service.

1. HUMBUL (Humanities Bulletin Board) is a long-running service aimed at providing academics and interested parties with news and information on Humanities Computing. This service is an on-line bulletin board, edited by Stuart Lee at the CTI (described earlier) at Oxford University. Information is collected from all applicable electronic networks plus periodicals, leaflets, and also direct requests to the editor. At regular intervals, HUMBUL indicates its most recent acquisitions, and these can be accessed via ftp, telnet, or other means. To subscribe, send the following one-line command to listserv@UKACRL.bitnet:

    SUB HUMBUL Your_name

where Your_name is your name. If you do not then receive an automatic message saying you have been added to the list, send email to: humbul@vax.oxford.ac.uk.

2. Begun in 1992, CORPORA is an international email discussion list for information and questions about text corpora, such as availability, aspects of compiling and using corpora, software, tagging, parsing, bibliography, and related matters. To join the list, send a message to: CORPORA-REQUEST@nora.hd.uib.no. To submit a contribution to the list, send it to: CORPORA@nora.hd.uib.no. The list administrator is Knut Hofland, NCCH, Humanistisk Datasenter, Harald Haarfagres gt. 31, N-5007 Bergen, Norway; Tel: +47 (5) 212954; FAX: +47 (5) 322656; email: knut@x400.hd.uib.no.

3. HUMANIST is an international email discussion list for issues relating to the application of computers to scholarship in the humanities. This includes linguistics, comparative literature, philosophy, Biblical studies, and several other fields.
Begun in 1987 under joint sponsorship of the ACH, the ALLC and the University of Toronto's Centre for Computing in the Humanities, it is currently [273] housed at Brown University and moderated by Elaine Brennan and Allen Renear. It has over 600 members in 24 countries. To subscribe, mail "SUB HUMANIST Your_name" (where Your_name is your name) to listserv@brownvm.brown.edu; to post articles, mail them to humanist@brownvm.brown.edu. Articles submitted to HUMANIST are archived on a file server and can be searched remotely by means of one-line listserv commands.

4. LINGUIST is an international list intended as a place for discussion of issues of concern to the academic discipline of linguistics and related fields. It is moderated by Anthony Aristar (University of Western Australia) and Helen Dry (University of Texas at San Antonio). It explicitly welcomes discussion of any linguistic subfield. To subscribe to LINGUIST, send email to the LINGUIST listserver (listserv@TAMVM1.bitnet or listserv@TAMVM1.tamu.edu), containing the following one-line message:

    SUBSCRIBE LINGUIST Your_name

for example, "subscribe linguist Jane Smith." To submit a posting to the list, mail it to linguist@TAMVM1.tamu.edu. The LINGUIST fileserver may contain contributed files of interest to language researchers, such as the LSA or Georgetown lists of corpora and linguists' email addresses, and these are similarly obtainable by one-line commands. For more information, send the one-line command "help linguist" via email to linguist-request@TAMVM1.tamu.edu. For questions requiring human attention, send a message to: linguist-editors@TAMVM1.tamu.edu.

5. LN, Langage Naturel, is an international list for computational linguistics, sponsored by the Association for Computational Linguistics (ACL) and the Association for Computers and the Humanities (ACH). Its goal is to disseminate calls for papers; conference and seminar announcements; requests for software, corpora, and various types of data; project descriptions; and discussions on technical topics.
The list is primarily French-speaking, but many items are circulated in English. The list is moderated by Jean Veronis (Vassar College) and Pierre Zweigenbaum (France). To subscribe to LN, send the following one-line message to listserv@FRMOP11.bitnet:

    SUBSCRIBE LN your name

To post a message to the list as a whole, email it to LN@FRMOP11.bitnet. In case of problems, send a message to one of the editors: veronis@vassar.bitnet or zweig@FRSIM51.bitnet.

[274] 6. PROSODY is an international list with members representing a broad spectrum of approaches including linguistics, psycholinguistics, and computer science. It serves a vital function of disseminating information concerning available resources in a technologically rapidly expanding area. To subscribe, send "subscribe prosody Your_name" (where Your_name is your name) to LISTSERV@msu.bitnet. Send postings to PROSODY@msu.bitnet. The list is managed by George Allen, Michigan State University (email: alleng@msu.bitnet), who also owns the list "HYPERCARD."

7. Comserve is an electronic information service for professionals and students interested in human communication studies. It is located at Rensselaer Polytechnic Institute and coordinated by Timothy Stephen and Teresa Harrison, both of whom are professors in communication studies. Comserve keeps archives of bibliographies, course materials, job announcements, text transcripts, and other materials, with the author retaining the rights and the copyright. It coordinates a number of hotlines on communication, which can be subscribed to via the listserver. To subscribe to the Ethnomethodology hotline, send the following one-line message to comserve@rpiecs.bitnet:

    Join Ethno Your_name

To obtain a long list of useful bibliographic information, send the following one-line message to comserve@rpiecs.bitnet:

    send compunet biblio

Send materials to be posted to the net to ethno@rpiecs.bitnet and materials to be archived to support@rpiecs.bitnet.

8. Applied linguistics lists.
From Ken Willing at Macquarie University, I learned of the following four lists and their listserver addresses:

    TESL-L (Teaching English as a Second Language)
        Listserver address: listserv@cunyvm.bitnet
    SLART-L (Second Language Acquisition Research and Teaching)
        Listserver address: listserv@psuvm.bitnet
    MULTI-L (Language and Education in Multicultural Settings)
        Listserver address: listserv@barilvm.bitnet
    LTEST-L (Language Testing Research and Practice)
        Listserver address: listserv@UCLACN1.bitnet

[275] To subscribe, send a one-line email message to the indicated address, containing:

    subscribe XXXXXX John Doe

where XXXXXX is the list-name (e.g. TESL-L), and John Doe is your name.

9. FUNKNET, headed by Talmy Givon and Paul Hopper, is a discussion list concerned with various aspects of human language, communication, cognition, socioculture, neuropsychology, and other facets of cognitive and communicative behavior, viewed from what might loosely be called the functionalist perspective, that is, language viewed as an instrument of communication, coding experience, an evolved neurobiological phenomenon, a sociocultural phenomenon, or a combination of these, with an emphasis on empirical language study, including especially corpus data. For further information, contact Talmy Givon at: funknet-request@oregon.uoregon.edu.

10. info-childes and info-psyling are international email distribution lists, moderated by Julia Evans and Brian MacWhinney, Psychology Department, Carnegie Mellon University. Info-childes circulates information concerning corpus-related child language research, and info-psyling circulates information on psycholinguistics. To subscribe, send email to brian+@andrew.cmu.edu.

11. ASLING-L is a list for linguistic study of signed languages, covering all linguistic areas, including syntax, acquisition, phonology, morphology, psycholinguistics, and cognition. To subscribe, send "SUB ASLING-L Your_name" (where Your_name is your name) to listserv@yalevm.bitnet.
The listowner is Christine Romano (cromano@uconnvm.bitnet).

12. List of lists. A very lengthy list of Bitnet and Internet discussion lists (presently over one megabyte long) can be obtained via anonymous ftp to ftp.nisc.sri.com (192.33.33.22) in the directory netinfo as "interest-groups.Z" or by sending the following one-line message to mail-server@nisc.sri.com, making sure in advance that your system has sufficient space to receive it:

    SEND NETINFO/INTEREST-GROUPS

[276] A related list can be obtained by sending email to listserv@ndsuvm1.bitnet with the following one-line message:

    sendme interest package

For further information concerning electronic discussion lists, see the ARL Directory of Electronic Publications (below).

C. Email Addresses

There are now several periodically updated lists of email addresses for researchers engaged in language-related research. One of them is compiled by Norval Smith and associates at the University of Amsterdam and accessible for retrieval and modification via the name server linguists@alf.let.uva.nl. For information, send the word "HELP" as a one-line command to this address. To receive the full list of email addresses, send "list *" (with a space between list and *). For a list of FAX addresses, send "list fax."

The other main list is the one compiled by John Moyne for the Linguistic Society of America (LSA). It can be obtained electronically via anonymous ftp to csli.stanford.edu or by sending the following one-line message to the LINGUIST listserver, listserv@tamvm1.tamu.edu:

    GET LSA LST LINGUIST

It can be obtained in hard copy from: LSA, 1325 18th St. NW, Suite 211, Washington D.C. 20036, USA; email: moygc@cunyvm.bitnet or ZZLSA@GALLUA.bitnet.

3. TEXT ENCODING STANDARDS

The sources listed in this section are not exhaustive, but are useful starting points, in part because they serve as clearinghouses for information on related projects in addition to their own proposals.

1.
The Text Encoding Initiative (TEI) (Burnard, 1991; Hockey, 1991; Sperberg-McQueen & Burnard, 1992; Walker, 1992; Walker & Hockey, 1991) is an international and interdisciplinary project of the ALLC, ACH, and ACL in collaboration to define text encoding guidelines and establish a common interchange format for machine-readable literary and linguistic data. Fifteen other scholarly organizations including the Linguistic Society [277] of America are represented on its advisory board. The project has received major funding from the National Endowment for the Humanities, the European Economic Community, and The Andrew W. Mellon Foundation and has a number of subcommittees specializing in particular aspects of this enormous task. These include working groups on spoken language encoding, encoding for lexicons, and phonetic encoding. TEI working papers and reports, including a copy of the Guidelines for the Encoding and Interchange of Machine-readable Texts, can be obtained in hard copy from Wendy Plotkin (U49127@UICVM.bitnet) or electronically from LISTSERV@UICVM.bitnet. For a list of available documents, send the following line to LISTSERV@UICVM.bitnet:

    GET TEI-L FILELIST

For further information: C. Michael Sperberg-McQueen, Editor of TEI, Computer Center (M/C 135), University of Illinois at Chicago, Box 6998, Chicago, IL 60680, USA; Tel: +1 (312) 996-2477; FAX: +1 (312) 996-6834; email: u35395@uicvm.cc.uic.edu or u35395@uicvm.bitnet.

2. In 1989 in Kiel, Germany, the IPA Working Group on Suprasegmental Categories initiated an IPA Number scheme that facilitates transmission of data by code (if correspondents set up their systems to refer to the common IPA Number). Their proposal also includes encoding of suprasegmental categories (see Bruce, 1989, 1992; Bruce & Touati, 1990).
For further information: Gosta Bruce, Professor of Phonetics, Lund University, Sweden; email: linglund@seldc52.bitnet, or John Esling, Linguistics Department, University of Victoria, British Columbia, Canada; email: VQPLOT@uvvm.bitnet. 3. The Speech Assessment Methodology (SAM) project is developing a prosodic labeling system to facilitate computer readable prosodic transcriptions, representation of prosodic properties in the lexicon, and tools for prosodic labelling. Their system is intended to be uncommitted with respect to prosodic theories, and is being developed in conjunction with the ASL (Architecture for Speech Language Systems) project. For more information: Dafydd Gibbon, Linguistik und Literaturwissenschaft, University of Bielefeld, P-8640, D-4800 Bielefeld 1; FAX +49 (521) 1065844; email: gibbon@LILI11.UNI-BIELEFELD.DE. 4. The TOnes and Break Indices (TOBI) is a prosodic labeling system (Silverman et al., 1992). In 1991 and 1992, Victor Zue (MIT) and Kim Silverman (Nynex) sponsored two prosodic transcription workshops for the development of a prosodic labelling system, to facilitate the [278] sharing of corpora in a manner compatible with WAVES(tm) format, and to accompany speech files and time-aligned analysis records for sets of utterances. TOBI focuses especially on word groupings and prominences, in a manner loosely tied to Pierrehumbert (1980) and Pierrehumbert and Hirschberg (1990). The description of the TOBI system, sample WAVES(tm) scripts and supporting materials will be announced on the Prosody discussion list, and made available via anonymous ftp at kiwi.nmt.edu (129.138.1.82), or on cassette tape, with an invitation for feedback from potential users. 4. DATA SOURCES A. Electronic Data Archives and Repositories 1. The Oxford Text Archive (OTA), directed by Lou Burnard, is by far the largest archive of computerized language texts and corpora on this list. 
Its catalog lists nearly 2000 titles, including over 450 separate collections of written or spoken language in nearly three dozen languages. It is a deposit archive for textbanks from private scholarly research, and welcomes for inclusion collections of any specialization and in any format for reuse within the scholarly community. Its facilities are free and secure and provided as a service to the world's academic community. Access to the archive is possible by anonymous ftp, online, by tape (9-track; Density 800, 1600 or 6250 bpi; ASCII or EBCDIC; fixed, variable, or formatted), by diskette (MS-DOS or Macintosh; HD or DD; 3.5" or 5.25"), by cartridge (DC300, TAR format only), or over networks. Costs to users are kept low to enable wide access. Its catalogue, now over 60 pages long, is available in hard copy from the address given below, or electronically, in either SGML (international mark-up standard for written texts) or non-SGML format. The catalog and some of its texts are available via anonymous ftp to black.ox.ac.uk (or 129.67.1.165). For more information: Alan Morrison or Lou Burnard, Oxford Text Archive, Oxford University Computing Services, 13 Banbury Road, Oxford OX2 6NN, UK; Tel: +44 (865) 273238 [direct line] or 273200 [switchboard]; FAX: +44 (865) 273275; archive@vax.oxford.ac.uk. 2. The International Computer Archive of Modern English (ICAME) was established in 1977 with the aims of (a) collecting and distributing information on electronically available English language materials and on linguistic research involving these materials, (b) compiling an archive of English text corpora in machine-readable form, and (c) making material available to research institutions. 
Its holdings include the three most widely used electronic corpora of spoken and [279] written language (the Brown, LOB, and London-Lund corpora, described later) and several other large corpora, some with grammatical annotations, together with corpus-related software, and are distributed through the NCCH in Bergen, Norway (described earlier). The ICAME CD-ROM contains the Brown, LOB, London-Lund, Helsinki and Kolhapur corpora together with software and a summary of discussion lists, networks, surveys, and corpora, and is available for approximately $500 US. Their survey is independent of the current one and should be consulted as an important resource, as it may contain information not covered here, especially with respect to European projects. Further information concerning the CD-ROM can be obtained by sending the command "send icame info.cd" to fileserv@nora.hd.uib.no or via anonymous ftp to nora.hd.uib.no (129.177.24.42). Its catalog of holdings and related document files can be obtained via anonymous ftp to nora.hd.uib.no (129.177.24.42) or by fileserver commands sent to fileserv@nora.hd.uib.no. For more information regarding the fileserver, email the following command to fileserv@nora.hd.uib.no: send icame file.servers. ICAME holds an annual conference (with some proceedings available from Rodopi Publishers, Amsterdam) and produces a journal once a year, edited by Stig Johansson at the University of Oslo, containing analyses of corpus data, surveys of archives, and book reviews. For more information: ICAME, Norwegian Computing Centre for the Humanities, Harald Haarfagres gt. 31, N-5007 Bergen, Norway; Tel: +47 (5) 212954 or 212955 or 212956; FAX: +47 (5) 322656; email: adm@nora.hd.uib.no or knut@x400.hd.uib.no. 3. The Child Language Data Exchange System (CHILDES) (MacWhinney, 1991; MacWhinney & Snow, 1985) contains child language data in several languages, including a number of the major child language corpora in English. 
It also contains some corpora of adult language (e.g., the Cornell Corpus described later). Data contributions are welcomed and secure and are made available free of charge after contacting Brian MacWhinney to become a member of CHILDES (also free of charge). The data are accessible via anonymous ftp to poppy.psy.cmu.edu or CD-ROM or other magnetic media. The archive also offers a free software package (CLAN) for use on PCs, MACs and mainframes and manages the info-psyling and info-childes electronic discussion groups. For more information: Brian MacWhinney, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, 15213 USA; email: brian+@andrew.cmu.edu; Tel: +1 (412) 268-2782. 4. The Center for Electronic Texts in the Humanities (CETH) is described earlier. [280] 5. The Aboriginal Studies Electronic Data Archive, housed by the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS), includes over 150 Australian indigenous languages. It is available to researchers, subject to deposit and access conditions. The catalog of holdings is available by sending the following one-line message to listserv@tamvm1.tamu.bitnet: get aboriginal-cat For further information: Aboriginal Studies Electronic Data Archive, AIATSIS, GPO Box 553, Canberra, ACT 2601, Australia; Tel: +61 (6) 246 1170; FAX: +61 (6) 249 7310; email: aiatsis@peg.apc.org. 6. Project Gutenberg makes available literary works on electronic media. These are available via anonymous ftp from mrcnext.cso.uiuc.edu (or 128.174.73.105). For more information: Michael Hart, email: hart@vmd.cso.uiuc.edu. 7. Library of the Future is a set of CD-ROMs sold by DAK Industries, containing the complete unabridged text of 453 novels, stories, plays and historical documents. For more information: DAK Industries, 8200 Remmet Ave., Canoga Park, CA, 91304, USA; Tel: +1 (800) 888-6703. B. 
Surveys of Electronic Language Data Three long lists (items #1 through 3 below) cover the major language research corpora in the common domain (plus a couple which are not). These lists are best obtained from their sources (given below) rather than in static printed sources, since some of them are updated periodically. Some further data sources may be found in Levelt, Mills, and Karmiloff (1981), though it is difficult to know which of these may have become computerized in the meantime. Two sources for humanities texts beyond those included below are Raben and Gaunt (forthcoming) and, from the Appendix, Hughes (1987) and Lancashire and McCarty (1989). 1. The OTA catalogue, mentioned earlier, provides 60 pages of corpus descriptions. 2. The University of Lancaster Survey describes 56 language archive projects intended mainly for linguistic research. These include non-English corpora and several varieties of English (Indian, Canadian, and Australian), some of which contain rich grammatical and semantic tags for individual words in the corpus. Taylor, Leech, and [281] Fligelstone (1989) is available from the HUMANIST file server by sending the following one-line command to listserv@brownvm.bitnet: GET SURVEY CORPORA HUMANIST or via anonymous ftp to NCCH at nora.hd.uib.no (129.177.24.42) (filename: pub/icame/survey.corpora). The parts concerning English texts are published in Taylor, Leech, and Fligelstone (1991). 3. The Georgetown University Catalog of Projects in Electronic Text (CPET), begun in 1989, contains highly informative descriptions and access information for over 312 electronic corpus projects in 27 countries and is continually updated. It can be accessed via telnet to guvax3.georgetown.edu. For further information: Paul Mangiafico, Center for Text and Technology, Reiss Science Building, Room 238, Georgetown University, Washington, DC 20057, USA; Tel: +1 (202) 687-6096; pmangiafico@guvax.georgetown.edu. 4. 
The Walker and Zampolli Survey of Written and Spoken Language in Machine-Readable Form (in progress), directed by Don Walker (Bellcore, Morristown, NJ, USA; walker@bellcore.com) and Antonio Zampolli (Institute for Computational Linguistics, Pisa, Italy; glottolo@icnucevm.cnuce.cnr.it), is being conducted to provide a comprehensive inventory of such materials. It is sponsored by several associations discussed elsewhere in this chapter (including the ACH, the ACL and its Data Collection Initiative, the ALLC, the CETH, and the TEI), and also the Modern Language Association, the European Science Foundation, the Commission of the European Communities, the Network of European Reference Corpora, and the Linguistic Data Consortium (LDC), among others. For more information about the textual component: Textual Data Survey, Center for Electronic Texts in the Humanities, 169 College Avenue, New Brunswick, NJ 08903, USA; FAX +1 (908) 932-1386; ceth@zodiac.rutgers.edu. 5. The list of Electronic Texts in Philosophy was compiled by Leslie Burkholder (CDEC, Carnegie Mellon University) in December 1991 for the American Philosophical Association. It can be obtained from the HUMANIST file server by sending an email message to listserv@brownvm.bitnet containing only the following line: GET PHILOSFY ETEXTS HUMANIST 6. List of Electronic Dictionaries. In a posting to HUMANIST (Vol. 4, No. 1137. Thursday, 7 Mar 1991), Russ Wooldridge [282] (wulfric@vm.epas.utoronto.ca) listed 58 electronic dictionaries, mostly in English but also including several European languages and Hebrew, Greek, and Latin. This list is available from the HUMANIST file server. 7. The Catalog of the University of Cambridge Literature and Linguistics Computing Centre is a published catalog (see Dawson, 1977). 8. The Linguistic Society of America List, compiled in 1987 by Lise Menn, turned up numerous data sets but only relatively few of them on computer. 
For more information, contact the LSA office (at the address provided above concerning the list of linguists' email addresses). 9. The Marchand list of CD-ROM projects was compiled by James Marchand at the University of Illinois and is available via the Humanist fileserver by mailing the following one-line command to listserv@brownvm.bitnet: GET CDROM PROJECTS HUMANIST 10. ARL Directory of Electronic Publications. Although many journals, newsletters and scholarly lists may be accessed free of charge through Bitnet, Internet and affiliated networks, it is not always simple to know what is available. Compiled and published by the Association of Research Libraries (ISSN 1057-1337), this directory provides access information to 500 scholarly lists, 30 journals, and 60 newsletters. It is available in either hard copy or on 3.5 inch diskette, at a cost of $20 to nonmembers of the ARL. For more information: Office of Scientific and Academic Publishing, Association of Research Libraries, 1527 New Hampshire Ave., NW, Washington, DC 20036, USA; email: ARLHQ@umdc.umd.edu or ARLHQ@umdc.bitnet; FAX: 202-462-7849. The "Directory of Electronic Journals and Newsletters," compiled by Michael Strangelove in 1991, can be obtained at no charge by sending an email message to listserv@uottawa.bitnet containing the following two lines: GET EJOURNL1 DIRECTRY GET EJOURNL2 DIRECTRY 5. CORPORA AND TEXTBANKS It is common to distinguish between corpora and textbanks. These differ in size and composition, and serve somewhat different analytic aims. Corpora are intended to be representative of some specified population or genre. Textbanks tend to be collections of available [283] data with looser connection to each other, or focus on a restricted number of genres (including perhaps only one). Corpora are needed for large scale, systematic contrasts of, for example, language varieties, genres, and modalities (e.g., American vs. British English, informative vs. imaginative prose, or spoken vs. 
written language). Other research requires enormous amounts of data, even if from fewer genres, as for example, in lexicography, in order to detect words and collocations which occur only rarely. (For systematic discussion of corpus design, size, and sampling issues, see Atkins, Clear, & Ostler, 1992; Church, 1991; Carroll, Davies, & Richman, 1971; Fillmore, 1992; Francis, 1982; Kucera & Francis, 1967; Leech, 1991, 1992; Poplack, 1989; Sinclair, 1982, 1992; Walker, 1991.) A particularly interesting concept is that of a "monitor corpus," intended to be not finite or temporally bounded but rather gaining and losing texts over time in parallel with the fluidity of the language itself (Sinclair, 1982, 1992). Listed below are collections of running prose, followed by some phonetic databases, lexical databases, and treebanks (that is, databases of bracketed and syntactically labeled structures, such as noun phrase, verb phrase, etc.). The survey is probably less exhaustive for the phonetic, lexical, and treebank sections than for the sections on corpora and textbanks of running prose, which were the dominant focus in compiling it. A. Running Text: English Language The three most widely used corpora to date are the Brown corpus, the Lancaster-Oslo/Bergen (LOB) corpus, and the London-Lund corpus. These are described first, followed by descriptions of 23 others that are well-known within one or another subdomain of corpus-based language research (i.e., linguistics, psycholinguistics, computational linguistics, lexicology and lexicography), ordered in thematically related clusters and roughly chronologically within each cluster. 1. The Brown Corpus (The Standard Corpus of Present-Day Edited American English) (Francis, 1982; Francis & Kucera, 1979, 1982; Kucera, 1992; Kucera & Francis, 1967) is a corpus of 1 million words of written American English printed in the year 1961. 
It was the first corpus to be put on computer medium and is the most analyzed corpus of English to date. It consists of 500 written American English texts of 2,000 words apiece, selected to represent diverse genres of written American language. There are two main sections: Informative Prose and Imaginative Prose. Genres represented include newspaper reportage, press editorials, memoirs, religion, science fiction, detective fiction, and romance novels (excluding drama and fiction with more than [284] 50% dialog). This corpus of running text is available for academic research for the cost of materials from both the Oxford Text Archive and the ICAME archive and is contained on the ICAME CD-ROM available through NCCH (see above). A "tagged" version of the Brown Corpus (i.e., supplemented by labeling of individual words for 82 part-of-speech designations) was produced at Brown University during the period 1970-1978 with assistance from the TAGGIT program, written by B. B. Greene and G. M. Rubin (for additional details, see Francis, 1980; Garside, Leech, & Sampson, 1987; Svartvik, 1990). The tagged version is protected by its own copyright, and is available for $1000 to academic institutions. For more information: Text Research, 196 Bowen Street, Providence RI 02906, USA. FAX: +1 (401) 751-8958 or Nelson Francis or Henry Kucera, Department of Linguistics, Brown University, Providence RI 02906, USA; email: henry@brownvm.bitnet or henry_kucera@brown.edu. For a parsed (as opposed to part-of-speech tagged) version of part of the Brown corpus, known as the Gothenburg Corpus, contact: Gudrun Magnusdottir, Sprakdata, University of Göteborg, S-412 98 Göteborg, Sweden. The Susanne Corpus (Surface and Underlying Structural Analyses of Naturalistic English), which uses more transparent codes for easier research use, is currently in preparation. For information: G. Sampson, Department of Linguistics and Phonetics, University of Leeds, Leeds LS2 9JT, UK. 2. 
The Lancaster-Oslo/Bergen Corpus (LOB) is 1 million words of written British English from 1961. It was compiled in the 1970's under the direction of Geoffrey Leech, University of Lancaster, and Stig Johansson, University of Oslo. It is the British counterpart of the Brown corpus, and contains 500 texts of roughly 2,000 words each. The texts range across the same types of published written language as those of the Brown corpus, and the number of texts of each type is almost identical to that of the Brown corpus. A tagged version of the LOB corpus was produced between 1978 and 1983, using the CLAWS1 automatic tagging system, which uses text-based probabilities. Garside, Leech, and Sampson (1987) and Leech and Garside (1991) provide details of their methods and a survey of methods for automatic tagging and parsing of language corpora more generally. Both the tagged and untagged versions of the LOB corpus are available for academic use from the ICAME archive, and are contained on the ICAME CD-ROM described above. Their manuals (Johansson, Leech, & Goodluck, 1978, and Johansson, Atwell, Garside, & Leech, 1986, respectively) are also available from ICAME. A hand-parsed version of 45,000 words from the LOB is available as the Lancaster-Leeds Treebank; an automatically parsed version of 140,000 words from the LOB is available as the Lancaster Parsed Corpus (both described below, under "Treebanks"). A [285] larger treebank is being prepared by Steve Fligelstone. For further information: Steve Fligelstone, UCREL, Linguistics Department, Bowland College, Lancaster University, Lancaster LA1 4XZ, UK; email: eia002@lancaster.ac.uk. 3. The London-Lund Corpus (LLC) is 500,000 words of spoken educated British English, collected during the 1960's and early 1970's from speakers of various ages, representing a range of discourse types. 
They were transcribed to include markings of tone unit boundaries, nucleus (points of pitch prominence), direction of nuclear tones, pauses, degrees of stress, and other features. The data were originally gathered as the spoken half of the Survey of English Usage, used in several major reference grammars of English (Leech & Svartvik, 1975; Quirk, Greenbaum, Leech, & Svartvik, 1972, 1985). The first 87 texts to be computerized are published in Svartvik and Quirk (1980). The remaining 13 texts have now been added to the computerized corpus. The full 100 texts can be obtained for academic use from the ICAME and OTA archives, and are contained on the ICAME CD-ROM (described above). They are available as either running text or supplemented by semantic and syntactic tags associated with all words in the texts. The manual for the LLC (Svartvik, 1992b) is distributed through ICAME/NCCH. A bibliography of 200 studies using this corpus is found in Svartvik (1990). A parsed version of a part of the data is described in that source. 4. The Lancaster Spoken English Corpus (SEC) (Knowles & Lawrence, 1987) consists of 52,000 words of contemporary spoken British English, gathered between 1984 and 1987, from radio broadcasts, university lectures and several other types of speech. It is available from the ICAME archive in orthographic and prosodic transcription, with word-class tags (generated by CLAWS2) and accompanying manual. For more information, contact NCCH or Peter Roach, Linguistics Department, Leeds University; email: p.j.roach@cmsl.leeds.ac.uk; or Gerry Knowles, Linguistics Department, Bowland College, Lancaster University, Lancaster LA1 4XZ, UK; email: eia008@central1.lancaster.ac.uk. 5. The PIXI Corpora consist of 450 naturally occurring conversations recorded in bookshops in England and Italy, for the purpose of cross-cultural comparisons of discourse structure. 
They are available in electronic form from the Oxford Text Archive, and in book form in Gavioli & Mansfield (1990), together with careful details of the data gathering, discourse contexts, analytic approach and bibliography of related publications. For further information, contact the Oxford Text Archive or Guy Aston (VK1A@ICINECA.bitnet). [286] 6. The Helsinki Corpus of Historical English (Rissanen, 1992) is a textbank of 1.5 million written words from law, handbooks, science, trials, sermons, diaries, documents, plays, and private and official correspondence from periods at roughly 100-year intervals beginning in 850. It is used for variational study of the development of English. The manual for this corpus is Kytö (1991), distributed through ICAME/NCCH. For more information contact: Matti Rissanen, or Merja Kytö (mkyto@cc.helsinki.fi), Department of English, University of Helsinki, Porthania 311, 00100 Helsinki, Finland. A corpus of dialectal English is underway (Ihalainen, 1987). For information, contact Ossi Ihalainen at the same address. The Helsinki Corpus is contained on the ICAME CD-ROM (see above). 7. The Macquarie (University) Corpus (Peters, 1987) is nearing completion. It consists of 1 million words of Australian English and is intended to be comparable to the Brown Corpus. For more information: Pam Peters, David Blair, Peter Collins, or Alison Brierley, School of English and Linguistics, Macquarie University, 2109 New South Wales, Australia. 8. The Kolhapur Corpus of Indian English (Shastri, 1985, 1988) contains 1 million words of written Indian English from the year 1978. Its texts were selected from the same text categories as those of the Brown Corpus, and the corpus is available from ICAME. 9. The American Heritage Intermediate Corpus (Carroll, Davies, & Richman, 1971) consists of over 5 million words of written American English from the most widely used books in grades 3 through 9. It was compiled as a database for the American Heritage School Dictionary. 10. 
The Birmingham Collection of English Text (BCET) (Renouf, 1984, 1987; Sinclair & Kirby, 1990), compiled from 1980-1985 by J. Sinclair, A. Renouf, and J. Clear, contains 20 million words of written (18.5) and spoken (1.5) language (mostly British) used in producing a series of Collins COBUILD reference and teaching works. It also contains 20 million words of speech from a public inquiry, including the complete transcripts of the 18-month-long inquiry into the plan for constructing the Sizewell nuclear power station. It is intended to be representative of modern British English and therefore consists of samples of current and general usage (rather than technical use), from adult speakers without regional dialects, and excludes poetry and drama. For more information: A. J. Renouf, Research and Development Unit for English Language Studies, 50 Edgbaston Park Road, Birmingham B15 2RX, UK; Tel: +44 (21) 414 3935; FAX: +44 (21) 414 6203; email: renoufaj@bham.ac.uk. [287] 11. The Longman/Lancaster English Language Corpus (Summers, 1991) consists of 30 million words of mainly British and American English texts. Begun in 1985, it contains varied stylistic levels and text types, and is intended for lexicographic and academic research. For more information: Longman/Lancaster English Language Corpus, Longman Group Ltd., Longman House, Burnt Mill, Harlow, Essex CM20 2JE, UK. 12. The Corpus of Spoken American English (CSAE) (in progress) will be a database of one million words of spoken American English, encompassing a wide range of spoken language types (Chafe, Du Bois, & Thompson, 1992). The corpus will be disseminated as widely as possible in several formats, including a printed book and an interactive computer format that will allow simultaneous access to transcription and sound. The creation of the Corpus of Spoken American English will be coordinated with the ICE project (described next), of which the CSAE is the officially designated representative for the United States. 
For information: Wallace Chafe, John Du Bois, or Sandra Thompson, Department of Linguistics, University of California, Santa Barbara, CA 93106, USA; Tel: +1 (805) 961-3776. 13. The International Corpus of English (ICE) (Greenbaum, 1988, 1990, 1992) (in progress) was begun in 1988 to provide comparable data for comparative studies of national varieties of English. Under the coordination of Sidney Greenbaum, Department of English, University College London, parallel corpora of spoken and written texts will be compiled for a number of regions, including the United States, Australia, the United Kingdom, Wales, Canada, New Zealand, India, East Africa, Nigeria, Jamaica and others, using uniform classification and encoding schemes. The American English component of this project is the CSAE, described above. Each regional corpus will contain one million running words, half from spoken and half from written language. The material in each regional corpus must date from no earlier than 1990 and no later than the end of 1993 and will come from speakers 18 years or older with education through the medium of English. In addition, there are plans for nonregional supplementary corpora of written translations into English, international spoken communication, and EFL (English as a foreign language) teaching texts (see Francis, 1989). The ICE data will ultimately be made available together with original sound recordings and possibly also digitized recordings for a concordance format. 14. The British National Corpus (BNC) (Quirk, 1992) (in progress) is to be an electronic corpus of 100 million words of contemporary spoken and written British English. Texts will represent a cross-section of a [288] wide range of styles of current written and spoken English. A uniform target encoding scheme will be defined, conforming to the international Standard Generalised Markup Language (SGML), in which all texts in the corpus will be stored and distributed. 
The corpus is to be automatically tagged with word-class labels to enhance its value for linguistic research. Special purpose tools developed for manipulation and processing of the corpus will be distributed together with it. The BNC is intended to provide the UK research and industrial communities with state-of-the-art corpus and lexical resources, as a solid basis for the development and exploitation of new products in the rapidly expanding field of natural language processing as applied to British English. These resources will be made widely available under appropriate licensing conditions and at minimum cost to the academic research community and also to the wider industrial research community. Begun in 1991, this 3-year project is managed by Jeremy Clear, with major participation from Oxford University Press (OUP), Longman Group UK Ltd, the British Library, and the Universities of Oxford and Lancaster. For more information: Jeremy Clear, Oxford University Press, Walton Street, Oxford OX2 6DP, UK; Tel: +44 (865) 56767; FAX: +44 (865) 56646; email: JHCLEAR@vax.oxford.ac.uk. 15. The Bellcore Lexical Research Corpora (Walker, 1987) were compiled to support corpus linguistics and computational lexicography research. They include textbases of 200 million words of newswire text (New York Times, Associated Press), 50 million words of magazine and journal articles, a collection of English machine-readable dictionaries and other machine-readable reference books, electronic-mail digests, and assorted smaller texts. For more information: Donald E. Walker, Language and Knowledge Resources Research, Bellcore, MRE 2A-379, 445 South Street, Morristown, NJ 07960-1910, USA; FAX: +1 (201) 829-5981; email: walker@bellcore.com. 16. 
Established in 1989, the Association for Computational Linguistics Data Collection Initiative (ACL/DCI) (Church & Liberman, 1991; Liberman, 1989; Walker, 1991, 1992) is an activity which collects machine readable text to support scientific and humanistic research, and distributes it at cost and without royalties. Its first CD-ROM, available for only $25, contains about 300 Mb of Wall Street Journal text, about 180 Mb of scientific abstracts, the full text of the 1979 edition of the Collins English Dictionary in the form of a typographer's tape, and some samples of tagged and parsed text from the Penn Treebank project. Its second CD-ROM will contain most or all of six years of the Hansard corpus, that is, Canadian parliamentary sessions, in bilingual French/English aligned format. For more [289] information: Mark Liberman, Department of Linguistics, University of Pennsylvania, Philadelphia, PA 19104, USA; FAX: +1 (215) 573-2091; email: myl@unagi.cis.upenn.edu. 17. The European Corpus Initiative (ACL/ECI) (in progress), which is patterned after the ACL/DCI, was established in 1992 to bring together existing materials in as many major European languages as possible, and to make these available in digital form and in a consistent format to the research community at cost and without royalties. The ECI welcomes contributions from all researchers and will distribute the data on CD-ROMs, the number depending on the ultimate size of the archive. For more information (to contribute or obtain data): Henry Thompson, HCRC, University of Edinburgh, 2 Buccleuch Place, Edinburgh, EH8 9LW, Scotland; FAX: +44 (31) 650-4587; email: eucorp@cogsci.ed.ac.uk. 18. The Cambridge Language Survey (CLS) (in progress) is an international multilingual survey of language. 
Under sponsorship from industry and government sources, and in cooperation with other projects, the CLS is bringing together existing data from a variety of languages, starting with English, French, German, Dutch, Italian, Spanish and Japanese, with the intent to code this data semantically and to prepare concordances and multilingual corpora, parallel and aligned, for educational and such publishing uses as the preparation of multilingual dictionaries and other reference books. The data will be made as available as possible, perhaps including distribution via CD-ROM. For more information: Paul Procter, Cambridge University Press, Edinburgh Building, Shaftesbury Rd., Cambridge CB2 2RU, UK; Tel: +44 (223) 325052; FAX: +44 (223) 315052; email: psp10@phx.cam.ac.uk. 19. The DARPA-funded Linguistic Data Consortium (LDC) (in progress) was inaugurated in the Spring of 1992. Its formation was stimulated by the establishment of the Data Collection Initiative (DCI) of the Association for Computational Linguistics (ACL), but also strongly influenced by cooperative work in the speech community that led to the development of corpora consisting of digits and of acoustic-phonetic data pronounced by multiple speakers. The LDC is intended to develop and distribute large amounts of linguistic data (e.g., speech, text, lexicons, and grammars) to assist the development of speech- and text-processing systems. The data will include large quantities of raw and annotated (i.e., syntactically and/or semantically tagged) text and speech (billions of words of text and thousands of hours of speech), a large lexicon, and a broad coverage grammar of English. The data will also include whatever additional materials (including foreign language materials) the Consortium can obtain by exchange or on other reasonable terms. Data are to be provided on CD-ROM on a subscription basis to [290] universities and corporations. 
Although the Consortium does not need exclusive rights to donated data, DARPA does intend to make its growing holdings available exclusively through the Consortium. General membership fees will be set at affordable levels, and foreign members will be considered if access to foreign data can be assured. The Consortium may be established as a separate legal entity, such as a nonprofit corporation or other form of association. For further information: Mark Liberman, Department of Linguistics, University of Pennsylvania, Philadelphia, PA 19104, USA; email: myl@unagi.cis.upenn.edu. 20. American News Stories consists of approximately 250,000 words of written American English drawn from Associated Press news stories of December 1979 (available from the Oxford Text Archive). 21. The Nijmegen TOSCA Corpus (Oostdijk, 1988) is a textbank of 75 works (1.5 million words) of educated written British English drawn from a variety of genres meant to be read rather than spoken (i.e., excluding poetry, plays and speeches), compiled for studies of linguistic variation. For more information: Dr. Jan Aarts and Prof. C. Koster, Directors, The Nijmegen Research Group for Corpus Linguistics, Department of English, University of Nijmegen, Erasmusplein 1, NL-6525 HT Nijmegen, The Netherlands; Tel: +31 (80) 512836; email: cor_hvh@hnykun52. 22. The Melbourne-Surrey Corpus (Ahmad & Corbett, 1987) consists of 100,000 words of Australian newspaper texts and is available from ICAME. 23. The Corpus of English-Canadian Writing is a textbank of 3 million words of Canadian English from magazines, books, and newspapers, gathered beginning in 1984, and representing a wide variety of genre categories in common with the LOB and Brown corpora, plus "Feminism" and "Computing." For more information: Margery Fee, Strathy Language Unit, 207 Stuart Street, Room 316, Rideau Building, Queen's University, Kingston, Ontario, Canada K7L 3N6; email: feem@qucdn.bitnet. 24. 
The Warwick Corpus is approximately 2.5 million words of written British English (letters, fiction and other genres) compiled by J. M. Gill for use in research aimed at the automatic generation of Braille by computer (available from the OTA). 25. The Cornell corpus (Hayes, 1988; Hayes & Ahrens, 1988) is a 1.6 million word corpus, consisting of 1151 written or spoken British and American English texts, representing a wide variety of language types. It was compiled in the 1980s for a study on lexical adaptation of parents to children. The spoken samples range from abortion debates to [291] the Patty Hearst trial to television situation comedies. It is available from the CHILDES archive (described above). 26. NEXIS, LEXIS, and MEDIS (owned by Mead Data Central) and WESTLAW (run by the West Corporation) are commercial archives. These are used by newswriters, lawyers, and doctors, but they tend to be very expensive. NEXIS contains newspapers (New York Times, Reuters, Business Week), newsletters, and other periodicals from the 1980s to the present and is used by columnists such as William Safire. LEXIS and WESTLAW contain legal codes and almost all legal decisions at the federal and state level in the United States and several European countries, from early rulings to very recent ones. MEDIS is a medical literature database. B. Running Text: French Language 1. The Oxford Text Archive (OTA) has a number of literary holdings in the French language. 2. The Hansard corpus contains six years of Canadian Parliamentary sessions, in English/French bilingual aligned format, and is available from the ACL/DCI. 3. The Ottawa-Hull Corpus of Spoken French (Poplack, 1989) is 3.5 million words, compiled in 1985 to address issues of sociolinguistic variation and language contact. Respondents were selected from two contiguous cities on the border between Ontario and Quebec, in an unbiased manner to reflect a carefully balanced sampling grid of occupational, age, sex and other variables. 
To avoid Labov's "observer paradox," the data were recorded by trained community members. For more information: Shana Poplack, Linguistics Department, University of Ottawa, Ottawa, Ontario, Canada; email: sxpaf@uottawa.bitnet. 4. The Trésor de la Langue Française (TLF) (Treasury of the French Language) contains about 2,000 texts (150 million words) of a variety of types of written French, from novels and poetry to biology and mathematics, stretching from the 17th to the 20th centuries, the result of a cooperative project between the Centre National de la Recherche Scientifique and the University of Chicago. Access to the ARTFL database is organized through a consortium of user-institutions, in most cases universities and colleges, each of which pays an annual subscription fee. The data will soon also be available on CD-ROM together with access software for UNIX systems. For more information, contact: Mark Olsen, ARTFL Project, American and French Research on [292] the Treasury of the French Language, Department of Romance Languages, University of Chicago, 1050 East 59th Street, Chicago, IL 60637, USA; Tel: (312) 702-8488; email: artfl@artfl.uchicago.edu or mark@gide.uchicago.edu. C. Running Text: German language The Mannheim Corpus (Teubert, 1984) is a textbank of 8 million words of modern literary prose and nonfiction, available from the Oxford Text Archive and also from the Institut für Deutsche Sprache, University of Mannheim, Friedrich-Karl-Strasse 12, Postfach 5409, D-6800 Mannheim, Germany. The Institut für Deutsche Sprache also houses the Bonner Zeitungskorpus, a three million word collection of representative samples from German newspapers between 1949 and 1974, and the Freiburger Corpus, a textbank of one-half million words from 224 texts and documents, including discussions, interviews, speeches, reports, narrations, and documentary. The LIMAS Corpus of modern German is 1.1 million words, constructed by the same rules as the Brown Corpus. 
It is available from the Institut für Deutsche Sprache. It is also available together with software on HD floppies for 1000 DM from Gerd Willee, email: upk000@dbnrhrz1.bitnet or upk000@ibm.rhrz.uni-bonn. The Pfeffer Spoken German Corpus, collected in 1961, contains 400 12-minute spontaneous interviews covering 25 different topics, recorded in 60 locations in Germany (including both former East and West), Austria, and Switzerland. The speakers represent diverse demographic characteristics with regard to gender, age, education, and geography. For information: the Oxford Text Archive or Randall L. Jones, Department of German, 4096 JKHB, Brigham Young University, Provo, UT 84602, USA; Tel: +1 (801) 378-3513; email: jones@byuvm.bitnet. Finally, the Ulm Textbank is mainly a textbank of psychiatric interviews, together with a very powerful text retrieval and concordance package (Mergenthaler, 1985). For more information: Erhard Mergenthaler, University of Ulm, Germany; email: lu07@dmarum8.bitnet. The Muenster Textbank contains 94 million words of newspaper text. For more information: Lothar Lemnitzer; email: lothar@hendrix.uni-muenster.de. D. Running Text: Italian Language The PIXI Corpora are transcripts of service encounters in comparable bookshops in Italy and England and are available through the Oxford Text Archive (described in fuller detail above with the English Language Corpora). The Pisa Corpus consists of 3.5 million words of Italian. For more information: Antonio Zampolli, Istituto di [293] Linguistica Computazionale, Via Della Faggiola 32, University of Pisa, I-56100 Pisa, Italy; email: glottolo@icnucevm.bitnet. E. Running Text: Other Languages Besides English, French, German and Italian, electronic corpora are increasingly available also in other languages. The Oxford Text Archive contains a diverse sampling of languages, best surveyed in the OTA catalog itself. The resources listed in this section are from other locations. 
(See also the other entries under "Data Sources: Surveys" above.) The Center for Native (American) Languages of the Plains and the Southwest, at University of Colorado, has electronic versions of the Dorsey Omaha-Ponca texts in its Siouan Archives, and has several dictionary projects (Winnebago, Siouan, and Lakhota). For Australian indigenous languages, please see the entry for the AIATSIS Aboriginal Studies Electronic Data Archive under "Electronic Data Archives and Repositories" above. For Danish there are two corpora of written Danish from fiction, newspapers and professional texts: the DANWORD corpus is 1.25 million words (see Maegaard and Ruus, 1987), housed at the University of Copenhagen; DK87 and DK88 are one million words apiece, from work published in 1987 and 1988, respectively, and are available from: Henning Bergenholtz, The Aarhus School of Business, Fuglesange Alle 4, DK-8210 Aarhus V. Regarding Estonian, a corpus is in progress at the Laboratory of the Estonian Language, Tartu University, EE2400 Tartu, Estonia. Regarding Finnish corpora, contact: Fred Karlsson, Department of General Linguistics, University of Helsinki, Hallituskatu 11, SF-00100 Helsinki, Finland; email: fkarlsso@ling.helsinki.fi. For Spanish, the Archivo Digital de Manuscritos y Textos Españoles is available on CD-ROM. For more information: Charles Faulhaber, Department of Spanish and Portuguese, University of California, Berkeley, CA 94720; Tel: +1 (510) 642-0471; email: cbf@athena.berkeley.edu. Swedish language corpora are surveyed and summarized in Gellerstam (1992). Regarding Yugoslavian, there is the YU-CORPUS. It consists of mainly contemporary fiction prose in Serbo-Croatian, with the main areas represented: Serbia, Croatia, Montenegro, and Bosnia-Hercegovina. The corpus consists of 15 files for a total of approximately 700,000 words. These files are available via anonymous ftp at aau.dk (129.142.17.240) in the directory /home/ftp/pub/slav. 
For more information: Henning Moerk, Slavisk Institut, Aarhus Universitet, Ny [294] Munkegade 116, 8000 Aarhus C, Denmark; Tel: +45 (86) 136555; FAX: +45 (86) 192155; email: slavhenn@aau.dk. F. Language Acquisition 1. Child Language Acquisition. The main archive for child language data is the Child Language Data Exchange System (CHILDES), described earlier. The Polytechnic of Wales Corpus (Fawcett, 1980), compiled by R. Fawcett and M. Perkins between 1978 and 1984, consists of 100,000 words of children's English (ages 6 to 12), gathered in Pontypridd, South Wales. The data are from 120 children (balanced by age, sex, and socioeconomic status and screened to exclude those with strong Welsh or other second language influence), recorded at play and in interview with an adult. The computer files contain detailed grammatical tagging and have been fully hand-parsed using an extension of Systemic Functional Grammar developed by Fawcett which includes functional and formal categories. These are available from the ICAME Archive. The recorded tapes and four volumes of transcripts with intonation contours are available for the cost of materials from: Robin Fawcett, Department of Behavioral and Communication Studies, Polytechnic of Wales, Treforest, Cardiff CF 37 1DL, UK. 2. Adult or Second Language Acquisition. The European Science Foundation Second Language Data Bank (ESFSLDB) consists of longitudinal data obtained systematically over a 3-year period from adult migrant workers in five nations in Europe with a focus on language learning in the absence of formal instruction (see Perdue, 1984, in press). This very large database contains texts of interviews, narratives, role plays, picture descriptions, and other data gathered mostly on a roughly monthly basis from the same informants during the course of this time period. 
The informants were chosen to be comparable in terms of age, recency of arrival, level of education, and other factors, and represented 10 combinations of source language (Moroccan Arabic, Italian, Spanish, Finnish, and Punjabi) and target language (French, English, Dutch, German, Swedish). The data and Word Cruncher software are accessible for noncommercial research with signed agreement, available through file server (psyli@hnympi51.bitnet), tapes, diskettes, or CD-ROM. For more information: Kees v.d.Veer, Technical Group, Max-Planck-Institut fuer Psycholinguistik, Postbus 310, NL-6500 AH Nijmegen, The Netherlands; Tel: +31 (80) 521-911; email: kees@mpi.nl. The Montreal Corpus was gathered for a project headed by Prof. K. Connors concerning the acquisition of French as a second language by [295] anglophones and lusophones in Montreal. The data consist of three sets of interviews each from anglophones, lusophones, and a control group of French speakers. The corpus is available for research in magnetic form. For more information: Michel Lenoble, Litterature Comparee, Universite de Montreal, Montreal, Canada; email: lenoblem@umtlvr.bitnet. G. Phonetic Databases 1. The DARPA Speech Recognition Research Databases consist of phonetic transcriptions of sentences read aloud by American adults from various parts of the country. These databases include both a speaker-independent (a few sentences from many speakers) and a speaker-dependent (a lot of speech from a few speakers) part, designed for use in training and testing both speaker-independent and speaker-dependent recognition systems. Digitized versions are also available. For more information see Fisher, Doddington, and Goudie-Marshall (1986), Lamel, Kassel, and Seneff (1986), and Price, Fisher, Bernstein, and Pallet (1988). 2. 
The Phonetic Database (PDB) at the University of Victoria consists of language files in MS-DOS format that run with Micro Speech Lab/KayLab hardware/software and illustrate speech sounds of some less frequently encountered languages. Each language has about 40-50 words and a few to several sentences of text encoded. It is intended to provide illustrative and archival samples of different languages from field data and lab recordings. Some languages represented are Egyptian Arabic, Cantonese, Modern Standard Chinese, Scots Gaelic, Inuktitut, Korean, Miriam, Ditidaht, Nyangumarta, Rutooro, Runyoro, Skagit, Spokane, Turkish, Umpila, Xhosa, Yoruba, Sinhala, and Japanese. Files are being converted to 20K sampled data for use with CSL (KayLab) and ASL programs (on the IBM). Concordance material is in written text format. For more information contact John Esling, Linguistics Department, University of Victoria, British Columbia, Canada; email: VQPLOT@UVVM.bitnet. 3. The Multi-Language Speech Database (in progress) is to be a large 10-language database of digitized speech recordings over the telephone. Plans are to gather five minutes of speech from each of 100 native speakers in each of 10 languages. This database is scheduled for completion in mid-1992, and will be made available to researchers at nominal cost together with software (developed for UNIX xwindows) to display and interactively modify the speech files, and signal processing functions that compute different parameters of the speech waveform. For more information: Ronald Cole, Center for Spoken Language Understanding, Oregon Graduate Institute of Science and [296] Technology, 19600 NW Von Neumann Dr., Beaverton, OR 97006-1999, USA; Tel: +1 (503) 690-1159; email: cole@cse.ogi.edu. H. Electronic Dictionaries 1. See the Wooldridge list of machine-readable dictionaries mentioned under "Data Sources: Surveys." 2. 
The Oxford Text Archive (OTA) distributes several machine-readable dictionaries, including some in languages other than English. These are listed and described, together with illustrative examples of the more widely used, in the file ota/dicts/info, available via anonymous ftp from black.ox.ac.uk (or 129.67.1.165). 3. The second edition of the Oxford English Dictionary is available on CD-ROM from: Electronic Publishing Division, Oxford University Press, 200 Madison Avenue, New York NY 10016; Tel: +1 (212) 679-7300, ext. 7370; or Electronic Publishing Division, Oxford University Press, Walton Street, Oxford OX2 6DP; Tel: +44 (865) 267979; email: OUPJSC@VAX.OXFORD.AC.UK. 4. Le Robert Electronique is the electronic version of the nine-volume French dictionary Le Grand Robert de la Langue Française (1985 edition). It is available on CD-ROM for $995 (U.S.) from Chadwyck-Healey Inc., 1101 King Street, Alexandria, VA 22314, USA; Tel: +1 (703) 683-4890 or +1 (800) 752-0515; FAX: +1 (703) 683-7589; or Chadwyck-Healey Ltd., Cambridge Place, Cambridge CB2 1NR, UK; Tel: +44 (223) 311479; FAX: +44 (223) 66440. I. Lexical Databanks As noted in the introduction, this sampling of resources is probably less complete than for the corpora and textbanks of running texts. References relating to corpus-based lexicography include: Altenberg (1990), Atkins, Clear, and Ostler (1992), Boguraev and Briscoe (1988), Gellerstam (1988), Sinclair (1987, 1992), and Walker (1992). 1. The MRC Psycholinguistic Database, described in Coltheart (1981), consists of 150,837 entries from the Shorter OED with various forms of additional information (including part of speech, the British pronunciation, rating of concreteness, familiarity, and frequencies from Kucera-Francis and Thorndike-Lorge) for various subsets of words. 
It is available together with computer programs for efficient access, written in C for UNIX systems via anonymous ftp from laurel.ocs.mq.edu.au (or 137.111.3.11) as the file [297] pub/wrec/incoming/mrc.tar.Z (binary) or from the Oxford Text Archive (black.ox.ac.uk or 129.67.1.165) in the directory ota/dicts/1054. The Macintosh version has been produced by Philip Quinlan and is marketed by the Oxford University Press. 2. The DARPA-funded Consortium for Lexical Research (CLR) (under development). Begun in 1989 and modeled partly after large data projects such as the British National Corpus, the CLR is an organization for sharing lexical data and tools used to perform research on natural language dictionaries and lexicons, and for communicating the results of that research. It is intended to make available to the whole natural language processing community certain resources now held by only a few groups that have special relationships with companies or dictionary publishers. The CLR would as far as is practically possible accept contributions from any source, regardless of theoretical orientation, and make them available as widely as possible for research. It will be located at the Computing Research Laboratory, Box 30001, Las Cruces, New Mexico, USA, under the direction of Yorick Wilks and an ACL committee consisting of Roy Byrd, Ralph Grishman, Mark Liberman and Don Walker. An annual fee will be charged for membership. For information on participating in the CLR as a provider or consumer of data, tools, or services, or on joining the lexical information list: Natural Language Research, Consortium for Lexical Research, Computing Research Lab, New Mexico State University, Las Cruces, NM 88003, USA; Tel: +1 (505) 646-5466; FAX: +1 (505) 646-6218; email: lexical@nmsu.edu or lexical@nmsu.bitnet. 3. 
The Centre for Lexical Information (CELEX) has a relational database containing lexical data on present-day Dutch (400,000 word forms), English (150,000 word forms), and German (51,000 word forms) that it makes available to institutes and companies for language and speech research and for the development of language- and speech-oriented technological systems. It contains detailed information on orthography, phonology, morphology, and syntax, as well as word frequencies based on the COBUILD corpus (described above). New information on translation equivalency is currently being developed, along with additional syntactic and semantic subcategorizations to establish semantic links among the three databases. The CELEX user interface was specially designed to make it easy for nontechnical people to use the databases. Researchers from several countries can log onto CELEX remotely and use it interactively. Costs for noncommercial use are modest; for commercial use, somewhat more expensive. If the network connections are not sufficient, then CELEX can prepare the information you require and send it on tape. For more information: CELEX, Centre for Lexical Information, University of [298] Nijmegen, Wundtlaan 1, 6525 XD NIJMEGEN, The Netherlands; email: celex@celex.kun.nl or celex@hnympi52.bitnet. 4. ACQUILEX (Boguraev, Briscoe, Calzolari, Cater, Meijs, & Zampolli, 1988) is a project funded by the European Community to draw on and extend current work on extracting data from published machine-readable databases in multiple languages and formalizing the data to facilitate the algorithmic processing of language. It is described also in Walker (1992). 5. The Cambridge Language Survey (CLS) is described above under "English Language" Corpora. 6. Japanese Electronic Dictionary Research Project (in progress) is a corpus-based project described in Walker (1991). 
Details of the corpus itself were not mentioned in this source but will no doubt become widely known as the project continues. J. Treebanks These are databanks containing not only part of speech tags but also labeled constituent structures (e.g., noun phrase, adverbial phrase, coordinate clause). Some treebanks were mentioned briefly above in the descriptions of the Brown and LOB corpora (under "Running text: English Language"). Bracketed structures have also been added to some texts in the LLC (see Svartvik, 1990). For parsed child language data, see 5F above. For a discussion of treebanks and methods used in compiling them, see Leech & Garside (1991). 1. The Lancaster-Leeds Treebank, compiled by G. Sampson and G. N. Leech, is a treebank of hand-parsed phrase structure analyses of 45,000 words from the LOB (written British English) representing all 15 of the LOB categories of text types. For more information: Carol Lockhart, CCALAS Secretary, Department of Linguistics and Phonetics, University of Leeds, Leeds LS2 9JT, UK. 2. The Lancaster Parsed Corpus (Garside, Leech & Sampson, 1987) is a treebank of approximately 140,000 words from the LOB Corpus (written British English) from all 15 LOB text types. The sentences were all automatically parsed with the UCREL parsing systems, using statistics derived from the Lancaster-Leeds Treebank. It is available for limited distribution. For more information: UCREL Secretary, Department of Linguistics and Modern English Language, University of Lancaster, Lancaster LA1 4YT, UK. 3. The Linguistic DataBase System (LDB) (de Haan, 1987; Lancashire, 1991; van Halteren & Oostdijk, 1988; van Halteren & van den Heuvel, [299] 1990) was developed by the TOSCA group at Nijmegen University. It is a software package which is distributed together with "syntactic analysis trees" of all utterances from the 130,000-word Nijmegen Corpus of modern British English. 
The LDB was designed to be easy to use even for computing novices and is independent of both formalism and language, so it is possible to use it for any other kind of analyzed corpus. It can be used on VAX VMS systems, IBM PCs (AT preferred), and UNIX systems, and in 1991 cost about $60 US for academic institutions ($3000 US for others). It can be used to examine trees, search for utterances with given properties, and handle database-wide queries about constructs in the utterances. A fully functional demonstration version is available for any MS-DOS machine with hard disk. For more information: Hans van Halteren, TOSCA Group, Department of English, University of Nijmegen, P.O. Box 9103, 6500 HD Nijmegen, The Netherlands; Tel: +31 (80) 512836; email: cor_hvh@kunrc1.urc.kun.nl. 4. The Penn Treebank is a databank of labeled bracketed structures, for samples of written language (the Wall Street Journal) (98%) and spoken language (Mari Ostendorf's WBUR radio transcripts) (2%). For more information: The Penn Treebank Project, Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, PA 19104, USA; email: khanr@unagi.cis.upenn.edu or maryann@unagi.cis.upenn.edu. 5. Treebank of Written and Spoken American English (in progress) (as mentioned in Walker, 1992) is to contain potentially millions of sentences together with part of speech tags, skeletal syntactic parsings and intonational boundaries for spoken language. The data themselves will be derived at least in part from the ACL/DCI collection and are to be available through it. For more information: Mitch Marcus, Department of Linguistics, University of Pennsylvania, Philadelphia, PA, USA; email: mitch@linc.cis.upenn.edu. K. Translation into English 1. English/French parallel texts are provided in the Hansards material of the ACL/DCI, already described. 2. 
English/Italian parallel texts are part of the Italian Reference Corpus in Pisa (see Bindi, Calzolari, Monachini, & Pirrelli, 1991). 3. Parallel texts in various combinations of languages are also one of the goals of the Cambridge Language Survey (CLS), described above under "English Language" Corpora. [300] 4. English translations of Pravda 1986-1987 on a CD-ROM disk for IBM PC or compatible for $249 U.S. (Product #CD-1505, Description: PRAVDA) are available from: Bureau of Electronic Publishing, P. O. Box 779, Upper Montclair, NJ 07043, USA; Tech. Support: Tel: +1 (201) 746-3033; Orders: +1 (201) 857-4300; FAX: +1 (201) 857-3031. 6. LITERATURE PERTAINING TO ELECTRONIC CORPORA As sources for further information and bibliographies in corpus linguistics, lexicography, computational linguistics, and humanities, there are: (1) the 200-work bibliography of research involving the London-Lund Corpus in Svartvik (1990, Chapter 1), (2) the Altenberg (1991) bibliography of corpus research on written and spoken language, which is available also via the ICAME fileserver (fileserv@nora.hd.uib.no), together with annual updates, and (3) Susan Hockey's survey of resources for computer-assisted research in literature and other humanities (included as the Appendix below). ACKNOWLEDGEMENTS This compilation is indebted to all of the sources cited above, but I wish to thank especially the following people, for their help in providing information, corrections, and suggestions concerning earlier versions: Lou Burnard, Helmut Feldweg, Stig Johansson, Knut Hofland, Henry Kucera, Laura Proctor, and Don Walker. As already noted, this survey has benefited from several other corpus surveys concerning the earlier corpora: Chafe et al. (1992), Taylor, Leech & Fligelstone (1989), the Georgetown University Catalog of Projects in Electronic Text, and the catalogs of the OTA and ICAME archives. Any errors that remain are my own. 
Given the rapid growth in this area, I have no doubt inadvertently overlooked some relevant projects. To them, my apologies. Similarly, mention of any resource is not intended as endorsement. This work was made possible financially by the Institute of Cognitive Studies, University of California at Berkeley, which, however, bears no responsibility for opinions expressed in these pages. Finally, I wish to thank Susan Hockey for her generosity in contributing the materials in the Appendix. [301] REFERENCES Ahmad, K., & Corbett, G. (1987). The Melbourne-Surrey Corpus. ICAME Journal, 11, 39-43. Altenberg, B. (1990). Spoken English and the dictionary. In J. Svartvik (Ed.), The London-Lund Corpus of Spoken English: Description and Research (pp. 177-191). Lund, Sweden: Lund University Press. Altenberg, B. (1991). A bibliography of publications relating to English computer corpora. In S. Johansson & A. B. Stenström (Eds.), English computer corpora: Selected papers and research guide. New York: Mouton de Gruyter. Atkins, B. H., Clear, J., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7, 1-16. Bachenko, J., & Fitzpatrick, E. (1990). A computational grammar of discourse-neutral prosodic phrasing in English. Computational Linguistics, 16, 155-170. Bindi, R., Calzolari, N., Monachini, M., & Pirrelli, V. (1991). Lexical knowledge acquisition from textual corpora: A multivariate statistic approach as an integration to traditional methodologies. In Using Corpora: Proceedings of the Seventh Annual New OED Conference (pp. 170-196). Waterloo, Ontario: UW Centre for the New OED and Text Research. Brill, E., Magerman, D., Marcus, M., & Santorini, B. (1990). Deducing linguistic structure from the statistics of large corpora. Proceedings of the DARPA Speech and Natural Language Workshop, June 1990 (pp. 275-282). Arlington, VA: Defense Advanced Research Projects Agency. Boguraev, B., & Briscoe, T. (Eds.). (1988). 
Computational lexicography for natural language processing. London: Longman. Boguraev, B., Briscoe, T., Calzolari, N., Cater, A., Meijs, W., & Zampolli, A. (1988). Acquisition of lexical knowledge for natural language processing systems. Proposal for ESPRIT basic research activities. Cambridge: Cambridge University Press. Bruce, G. (1989). Report from the IPA Working Group on Suprasegmental Categories. Lund University, Department of Linguistics Working Papers, 35, 25-40. Bruce, G. (1992). Comments. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82, Stockholm, August 4-8, 1991 (pp. 145-147). New York: Mouton de Gruyter. Bruce, G., & Touati, P. (1990). On the analysis of prosody in spontaneous dialogue. Lund University, Department of Linguistics Working Papers, 36, 37-55. Burnard, L. (1991). What is SGML and how does it help? (Document No. TEI EDW 25). Text Encoding Initiative listserver (listserv@uicvm.bitnet). Carroll, J. B., Davies, P., & Richman, B. (1971). The American Heritage word frequency book. Boston: Houghton Mifflin. Chafe, W. (1992). The importance of corpus linguistics to understanding the nature of language. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 79-97). New York: Mouton de Gruyter. Chafe, W., Du Bois, J. W., & Thompson, S. A. (1992). Corpus of spoken American English. Unpublished manuscript, Linguistics Department, University of California, Santa Barbara. Church, K. W. (1991). [Review of J. Aarts & W. Meijs (Eds.), Theory and practice in corpus linguistics]. Computational Linguistics, 17, 99-103. Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22-29. Church, K. W., & Liberman, M. (1991). A status report on the ACL/DCI. Using corpora: Proceedings from the New OED Conference (pp. 
84-91). Waterloo, Ontario: The University of Waterloo Centre for the New OED and Text Research. Coltheart, M. (1981). The MRC psycholinguistic database. Quarterly Journal of Experimental Psychology, 33A, 497-505. Dawson, J. L. (1977). Texts in machine-readable form and the University of Cambridge Literary and Linguistics Computing Centre. CAMDAP, 7, 25-30. de Haan, P. (1987). Exploring the linguistic database: Noun phrase complexity and language variation. In W. Meijs (Ed.), Corpus linguistics and beyond. Amsterdam: Rodopi. Fawcett, R. P. (1980). Language development in children 6-12: Interim report. Linguistics, 18, 953-958. Fillmore, C. J. (1992). "Corpus linguistics" or "Computer-aided armchair linguistics." In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 35-60). New York: Mouton de Gruyter. Fisher, W. M., Doddington, G. R., & Goudie-Marshall, K. M. (1986). Proceedings of the Speech Recognition Workshop (Defense Advanced Research Projects Agency, Information Processing Techniques Office Report No. AD-A165 977). Francis, W. N. (1980). A tagged corpus: Problems and prospects. In S. Greenbaum, G. Leech, & J. Svartvik (Eds.), Studies in English linguistics for Randolph Quirk (pp. 192-209). New York: Longman. Francis, W. N. (1982). Problems of assembling and computerizing large corpora. In S. Johansson (Ed.), Computer corpora in English language research (pp. 7-24). Bergen: Norwegian Computing Centre for the Humanities. Francis, W. N. (1992). Language corpora B. C. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 18-32). New York: Mouton de Gruyter. Francis, W. N., & Kucera, H. (Eds.). (1979). Manual of information to accompany a Standard Corpus of Present-Day Edited American English for use with digital computers (rev. ed.). Providence, RI: Brown University, Department of Linguistics. Francis, W. N., & Kucera, H. (1982). 
Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.
Garside, R., Leech, G., & Sampson, G. (Eds.). (1987). The computational analysis of English: A corpus-based approach. New York: Longman.
Gavioli, L., & Mansfield, G. (1990). The PIXI Corpora: Bookshop encounters in English and Italian. Bologna, Italy: CLUEB.
Gellerstam, M. (Ed.). (1988). Studies in computer-aided lexicology. Stockholm: Almqvist & Wiksell International.
Gellerstam, M. (1992). Modern Swedish corpora. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 149-163). New York: Mouton de Gruyter.
Greenbaum, S. (1988). A proposal for an international computerized corpus of English. World Englishes, 7, 315.
Greenbaum, S. (1990). Standard English and the international corpus of English. World Englishes, 9, 79-83.
Greenbaum, S. (1992). A new corpus of English: ICE. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 171-179). New York: Mouton de Gruyter.
Halliday, M. A. K. (1992). Language as system and language as instance: The corpus as a theoretical construct. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 61-77). New York: Mouton de Gruyter.
Hayes, D. P. (1988). Speaking and writing: distinct patterns of word choice. Journal of Memory and Language, 27, 572-585.
Hayes, D. P., & Ahrens, M. G. (1988). Vocabulary simplification for children: A special case of motherese? Journal of Child Language, 15, 395-410.
Hindle, D., & Rooth, M. (1991). Structural ambiguity and lexical relations. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp. 229-236).
Hockey, S. (1991). The ACH-ACL-ALLC Text Encoding Initiative: An overview (Document No. TEI J16). Text Encoding Initiative listserver (listserv@uicvm.bitnet).
Hughes, J. J. (1987). Bits, bytes and Biblical studies: A resource guide for the use of computers in Biblical and Classical studies. Grand Rapids, MI: Academie Books.
Ihalainen, O. (1987).
The Helsinki Corpus of English Texts: Diachronic and dialectical-Report on work in progress. ICAME Journal, 11, 58-60.
Johansson, S., Atwell, E., Garside, R., & Leech, G. (1986). The tagged LOB corpus: Users' manual. Bergen: Norwegian Computing Centre for the Humanities.
Johansson, S., Leech, G., & Goodluck, H. (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English for use with digital computers. Oslo: Department of English, University of Oslo.
Kjellmer, G. (1984). Some thoughts on collocational distinctiveness. In J. Aarts & W. Meijs (Eds.), Corpus linguistics: Recent developments in the use of computer corpora in English language research (pp. 163-171). Amsterdam: Rodopi.
Knowles, G., & Lawrence, L. (1987). Automatic intonation assignment. In R. Garside, G. Leech, & G. Sampson (Eds.), The computational analysis of English: A corpus-based approach. London: Longman.
Kucera, H. (1992). Brown corpus. In S. C. Shapiro (Ed.), Encyclopedia of artificial intelligence (Vol. 1, pp. 128-130). New York: John Wiley & Sons.
Kucera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Kytö, M. (Ed.). (1991). Manual to the Diachronic part of the Helsinki Corpus of English Texts: Coding conventions and lists of source texts. Helsinki: University of Helsinki, Department of English. [Distributed by the Norwegian Computing Centre for the Humanities, Bergen].
Lamel, L. F., Kassel, R. H., & Seneff, S. (1986). Speech database development: Design and analysis of the acoustic-phonetic corpus. In Proceedings of the DARPA Speech Recognition Workshop (pp. 100-109).
Lancashire, I. (1991). [Review of H. van Halteren & T. van den Heuvel, Linguistic exploitation of syntactic databases: The use of the Nijmegen Linguistic DataBase program]. Computational Linguistics, 17, 457-461.
Lancashire, I., & McCarty, W. (Eds.). (1988). Humanities computing yearbook 1988.
Oxford: Oxford University Press.
Leech, G. (1991). The state of the art in corpus linguistics. In K. Aijmer & B. Altenberg (Eds.), English corpus linguistics: Studies in honour of Jan Svartvik (pp. 8-29). London: Longman.
Leech, G. (1992). Corpora and theories of linguistic performance. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 105-122). New York: Mouton de Gruyter.
Leech, G., & Garside, R. (1991). Running a grammar factory: The production of syntactically analysed corpora or treebanks. In S. Johansson & A. B. Stenström (Eds.), English computer corpora: Selected papers and research guide (pp. 15-32). New York: Mouton de Gruyter.
Leech, G., & Svartvik, J. (1975). A communicative grammar of English. London: Longman.
Levelt, W. J. M., Mills, A., & Karmiloff, A. (1981). Child language research in ESF countries: An inventory. Strasbourg: ESF.
Liberman, M. (1989). Text on tap: The ACL/DCI. In Proceedings of the DARPA Speech and Natural Language Workshop, Oct. 1989. San Mateo, CA: Morgan Kaufmann.
MacWhinney, B. (1991). The CHILDES project: Tools for analyzing talk. Hillsdale, NJ: Lawrence Erlbaum Associates.
MacWhinney, B., & Snow, C. (1985). The child language data exchange system. Journal of Child Language, 12, 271-296.
Maegaard, B., & Ruus, H. (1987). The compilation and use of a text corpus. In A. Cappelli, L. Cignoni, & C. Peters (Eds.), Studies in honour of Roberto Busa SJ (pp. 103-122). Pisa: Giardini.
Mergenthaler, E. (1985). Textbank systems: Computer science applied in the field of psychoanalysis. New York: Springer-Verlag.
Morris, J., & Hirst, G. (1991). Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17, 21-48.
Oostdijk, N. A. (1988). Corpus for Studying Linguistic Variation. ICAME Journal, 12.
Perdue, C. (Ed.). (1984). Second language acquisition by adult immigrants. A field manual. Rowley, MA: Newbury House.
Perdue, C. (Ed.).
(in press). The crosslinguistic study of second languages. Cambridge: Cambridge University Press.
Peters, P. H. (1987). Toward a corpus of Australian English. ICAME Journal, 11, 27-28.
Pierrehumbert, J. (1980). The phonology and phonetics of English intonation. Bloomington, IN: Indiana University Linguistics Club.
Pierrehumbert, J., & Hirschberg, J. (1990). The meaning of intonational contours in the interpretation of discourse. In P. Cohen, J. Morgan, & M. Pollack (Eds.), Intentions in Communication. Cambridge, MA: MIT Press.
Price, P. J., Fisher, W. M., Bernstein, J., & Pallet, D. S. (1988). The DARPA 1000-word resource management database for continuous speech recognition. In Proceedings of the 1988 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 651-654).
Poplack, S. (1989). The care and handling of a mega-corpus: The Ottawa-Hull French Project. In R. W. Fasold & D. Schiffrin (Eds.), Language change and variation (pp. 411-444). Philadelphia: John Benjamins.
Quirk, R. (1974). The linguist and the English language. London: Longman.
Quirk, R. (1992). On corpus principles and design. In J. Svartvik (Ed.), Directions in corpus linguistics (pp. 457-469). New York: Mouton de Gruyter.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1972). A grammar of contemporary English. London: Longman.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A comprehensive grammar of the English language. London: Longman.
Raben, J., & Gaunt, M. (forthcoming). Electronic scholars research guide.
Renouf, A. J. (1984). Corpus development at Birmingham University. In J. Aarts & W. Meijs (Eds.), Corpus linguistics: Recent developments in the use of computer corpora in English language research. Amsterdam: Rodopi.
Renouf, A. J. (1987). Corpus development. In J. M. Sinclair (Ed.), Looking up: An account of the Cobuild Project in lexical computing. London: Collins ELT.
Rissanen, M. (1992).
The diachronic corpus as a window to the history of English. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 185-205). New York: Mouton de Gruyter.
Sampson, G. (1992). Probabilistic parsing. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 425-447). New York: Mouton de Gruyter.
Shastri, S. V. (1985). A computer corpus of present-day Indian English: A preliminary report. ICAME Journal, 9, 9-10.
Shastri, S. V. (1988). The Kolhapur Corpus of Indian English and work done on its basis so far. ICAME Journal, 12, 15-26.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P., Pierrehumbert, J., & Hirschberg, J. (1992, October). TOBI: A standard for labeling English prosody. Paper presented at the International Conference on Spoken Language Processing, Banff, Alberta, Canada.
Sinclair, J. M. (1982). Reflections on computer corpora in English language research. In S. Johansson (Ed.), Computer corpora in English language research (pp. 1-6). Bergen: Norwegian Computing Centre for the Humanities.
Sinclair, J. M. (Ed.). (1987). Looking up: An account of the COBUILD project in lexical computing. London: Collins ELT.
Sinclair, J. M. (1992). The automatic analysis of corpora. In J. Svartvik (Ed.), Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 379-397). New York: Mouton de Gruyter.
Sinclair, J. M., & Kirby, D. M. (1990). Progress in English computational lexicography. World Englishes, 9, 21-36.
Sperberg-McQueen, C. M., & Burnard, L. (Eds.). (1992). Guidelines for electronic text encoding and interchange (Document No. TEI P2, Chapter 34). Text Encoding Initiative listserver (listserv@uicvm.bitnet).
Summers, D. (1991). Longman computerization initiatives, corpus building, semantic analysis and Prolog version of LDOCE by Cheng-ming Guo. Proceedings of the International Workshop on Electronic Dictionaries (Document No.
EDR TR-031, pp. 141-152). Tokyo: Japan Electronic Dictionary Research Institute.
Svartvik, J. (Ed.). (1990). The London-Lund Corpus of Spoken English: Description and research. Lund, Sweden: Lund University Press.
Svartvik, J. (Ed.). (1992a). Corpus linguistics comes of age. Directions in corpus linguistics: Proceedings of the Nobel Symposium 82 (pp. 7-13). New York: Mouton de Gruyter.
Svartvik, J. (Ed.). (1992b). The London-Lund corpus of spoken English: Users' manual. Lund, Sweden: Lund University, Department of English. [Distributed by the Norwegian Computing Centre for the Humanities, Bergen].
Svartvik, J., & Quirk, R. (Eds.). (1980). A corpus of spoken English. Lund, Sweden: Lund University Press.
Taylor, L., Leech, G., & Fligelstone, S. (1989). Lancaster preliminary survey of machine-readable language corpora. Lancaster, England: University of Lancaster, Linguistics Department. [available from the Humanist and NCCH fileservers, see text]
Taylor, L., Leech, G., & Fligelstone, S. (1991). A survey of English machine-readable corpora. In S. Johansson & A. B. Stenström (Eds.), English computer corpora: Selected papers and research guide (pp. 319-354). New York: Mouton de Gruyter.
Teubert, W. (1984). Setting up a lexicographical data-base for German. In R. R. K. Hartmann (Ed.), LEXeter 83 Proceedings: Papers from the International Conference on Lexicography at Exeter (pp. 425-429). Tuebingen: Max Niemeyer.
van Halteren, H., & Oostdijk, N. (1988). Using an analyzed corpus as a linguistic database. In J. Roper (Ed.), Computers in literary and linguistic computing: Proceedings of the XIIIth ALLC Conference (Norwich 1986). Geneva: Slatkine.
van Halteren, H., & van den Heuvel, T. (1990). Linguistic exploitation of syntactic databases. Amsterdam: Rodopi.
Walker, D. E. (1987). Knowledge resource tools for accessing large text files. In A. Cappelli, L. Cignoni, & C. Peters (Eds.), Studies in honour of Roberto Busa SJ (pp. 279-300). Pisa: Giardini.
Walker, D. E.
(1991). The ecology of language. Proceedings of the International Workshop on Electronic Dictionaries (Document No. EDR TR-031, pp. 10-22). Tokyo, Japan: Japan Electronic Dictionary Research Institute.
Walker, D. E. (1992). Developing computational lexical resources. In E. F. Kittay & A. Lehrer (Eds.), Frames, fields, and contrasts: New essays in semantic and lexical organization. Hillsdale, NJ: Lawrence Erlbaum Associates.
Walker, D. E., & Hockey, S. (1991). The Text Encoding Initiative. Bulletin du CID. Paris: Centre des Hautes Etudes Internationales d'Informatique Documentaire.

APPENDIX
HUMANITIES COMPUTING BIBLIOGRAPHY (APRIL 1990)
SUSAN HOCKEY (HOCKEY@ZODIAC.BITNET)
(REPRODUCED WITH PERMISSION)

The following bibliography was distributed at a tutorial on Text Analysis Computing given by the CTI Centre for Literature and Linguistic Studies at the Conference on Computers and Teaching in the Humanities held in St. Andrews, Scotland, in April 1990. The CTI Centre for Literature and Linguistic Studies is based at Oxford University.

While some of these items date back over 10 years, they cover all the basic techniques for text-based humanities computing, some of which are not easy to find in more recent publications. All these items except the very latest, and of course many more, can be found in Ian Lancashire and Willard McCarty (Eds.), Humanities Computing Yearbook, Oxford University Press, 1989, which is an excellent starting point.

The CTI promotes and supports the use of computers in teaching text-based subjects and is part of the Centre for Humanities Computing at Oxford, which supports several research projects in text analysis computing. The bibliography has been compiled over several years and is used in a course taught at Oxford by Susan Hockey, the Director of the CTI Centre, and in lectures and seminars given elsewhere by staff of the Centre. The CTI Centre has a mailing list, which can be contacted at CTITEXT@VAX.OX.AC.UK.
Books-Monographs

Butler, C. (1985). Computers in linguistics. New York: Blackwell.
Hockey, S. (1980). A guide to computer applications in the humanities. London: Duckworth.
Oakman, R. L. (1980). Computer methods for literary research (1st ed.). Columbia: University of South Carolina Press.
Oakman, R. L. (1984). Computer methods for literary research (rev. ed.). Athens: University of Georgia Press.
Rudall, B. H., & Corns, T. N. (1987). Computers and literature: A practical guide. Cambridge, MA: Tunbridge Wells; Kent: Abacus Press.

Books-Resources Guides

Hughes, J. J. (1987). Bits, bytes and Biblical studies: A resource guide for the use of computers in Biblical and Classical studies. Grand Rapids, MI: Academie Books.
Lancashire, I., & McCarty, W. (Eds.). (1988). Humanities computing yearbook 1988. Oxford: Oxford University Press.

Conference Proceedings

Ager, D. E., Knowles, F. E., & Smith, J. M. (Eds.). (1978). Advances in computer-aided literary and linguistic research. Birmingham, England: University of Aston, Department of Modern Languages. (ALLC, 1978)
Aitken, A. J., Bailey, R. W., & Hamilton-Smith, N. (Eds.). (1973). The computer and literary studies. Edinburgh: Edinburgh University Press. (Edinburgh conference, 1972)
Allen, R. F. (Ed.). (1986). Data bases in the humanities and social sciences. Osprey, FL: Paradigm.
Bailey, R. W. (Ed.). (1982). Computing in the humanities: Papers from the Fifth International Conference on Computing in the Humanities. Amsterdam: North Holland.
Burton, S. K., & Short, D. D. (Eds.). (1983). Sixth International Conference on Computers and the Humanities. Rockville, MD: Computer Science Press.
Cameron, K. C., Dodd, W. S., & Rahtz, S. P. Q. (Eds.). (1986). Computers and modern language studies. Chichester, England: Ellis Horwood; New York: Halsted.
Charpentier, C., & David, J. (Eds.). (1985). La recherche française par ordinateur en langue et littérature. Geneva: Slatkine.
Choueka, Y. (Ed.). (1990).
Computers in literary and linguistic research: Proceedings of the Fifteenth International ALLC Conference. Geneva: Slatkine.
Cignoni, L., & Peters, C. (Eds.). (1983). Computers in literary and linguistic research: Proceedings of the Seventh International Symposium of the Association for Literary and Linguistic Computing, Pisa 1982. Pisa: Giardini.
Hamesse, J., & Zampolli, A. (Eds.). (1985). Computers in literary and linguistic computing: Proceedings of the Eleventh International ALLC Conference. Geneva: Slatkine.
Jones, A., & Churchhouse, R. F. (Eds.). (1977). The computer in literary and linguistic studies: Proceedings of the Third International Symposium. Cardiff: University of Wales Press. (ALLC, 1974)
Lusignan, S., & North, J. S. (Eds.). (1977). Computing in the humanities: Proceedings of the Third International Conference on Computing in the Humanities. Waterloo, Ontario: University of Waterloo Press.
Miall, D. S. (1990). Humanities and the computer: New directions. Oxford: Oxford University Press. (Conference on computers and teaching in the humanities, 1988)
Mitchell, J. L. (Ed.). (1974). Computers in the humanities. Edinburgh: Edinburgh University Press. (ICCH, 1973)
Patton, P. C., & Holoien, R. A. (Eds.). (1981). Computing in the humanities. Lexington, MA: Heath.
Raben, J., & Marks, G. (Eds.). (1980). Databases in the humanities and social sciences. Amsterdam: North Holland.
Rahtz, S. (Ed.). (1987). Information technology in the humanities: Tools, techniques and applications. Chichester: Ellis Horwood; New York: Halsted.
Roper, J. P. G. (Ed.). (1988). Computers in literary and linguistic research: Proceedings of the Thirteenth International ALLC Conference. Geneva: Slatkine.
Wisbey, R. A. (Ed.). (1971). The computer in literary and linguistic research. Cambridge: Cambridge University Press. (Cambridge conference, 1970)

Periodicals

Bulletin of the Association for Literary and Linguistic Computing ("ALLC Bulletin") (1973-1985).
Three issues per year.
Computational Linguistics, formerly American Journal of Computational Linguistics. Now in volume 16 (1990). Quarterly published by ACL.
Computers and the Humanities (1966- ). Has had several publishers. Now published by Kluwer. Four issues per year (six from 1989). Covers language, literature, history, archaeology, music, and education. Sponsored by ACH.
ICAME Journal, formerly ICAME News, International Computer Archive of Modern English, Norwegian Computing Centre for the Humanities, PO Box 53, Bergen, Norway.
Journal of the Association for Literary and Linguistic Computing ("ALLC Journal") (1980-1985). Was also published by the ALLC. Two issues per year.
Linguistica Computazionale, Giardini, Pisa.
Literary and Linguistic Computing (1986- ). In 1986, the ALLC publications were merged into a single journal, Literary and Linguistic Computing, published by Oxford University Press. It covers all aspects of computer usage in literary and linguistic research.
Revue: Informatique et Statistique dans les Sciences Humaines.

Newsletters

Bits and Bytes Review (1986- ). Bits and Bytes Computer Resources, 623 North Iowa Avenue, Whitefish, MT 59937, USA. Reviews of software, hardware, and new publications.
Computers in Literature (1990). Newsletter of the CTI Centre for Literature and Linguistic Studies, OUCS, 13 Banbury Road, Oxford, UK.
There are also a number of newsletters for specific subjects, some of which, for example, CALCULI (Classics) and CAMDAP (Medieval Studies), are now defunct but contain useful information. The Humanities Computing Newsletter, Office for Humanities Communication, Bath, UK, and Ontario Humanities Computing, obtainable from CCH, Toronto are two of the best general ones.

English for Language Research-Corpus Linguistics

Garside, R., Leech, G., & Sampson, G. (Eds.). (1987). The computational analysis of English: A corpus-based approach. New York: Longman.
Sinclair, J. M. (Ed.). (1987).
Looking up: An account of the COBUILD project in lexical computing. London: Collins.

Stylistic Analysis

Burrows, J. F. (1987). Computation into criticism: A study of Jane Austen's novels and an experiment in method. Oxford: Oxford University Press.
Dolezel, L., & Bailey, R. W. (1969). Statistics and style. New York: Elsevier.
Ellegard, A. (1962). Who was Junius? Stockholm: Almqvist and Wiksell.
Kenny, A. (1978). The Aristotelian ethics. Oxford: Clarendon.
Kenny, A. (1982). The computation of style: An introduction to statistics for students of literature and humanities. New York: Pergamon.
Morton, A. Q. (1978). Literary detection-How to prove authorship and fraud in literature and documents. Epping, England: Bowker; New York: Scribner.
Morton, A. Q., & Winspear, A. D. (1971). It's Greek to the computer. Montreal: Harvest House.
Mosteller, F., & Wallace, D. L. (1964). Inference and disputed authorship: The Federalist. Reading, MA: Addison Wesley.
Muller, C. (1973). Initiation aux méthodes de la statistique linguistique. Paris: Hachette.

-------------------------- end of text ------------------------------