From corpora-request@uib.no Mon Nov 16 03:04:29 1992
id <06759-0@alf.uib.no>; Mon, 16 Nov 1992 02:02:53 +0100
Date: Mon, 16 Nov 1992 02:04:29 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: German Text

********************* Text Corpora List: Addresses ***************************
CORPORA@NORA.HD.UIB.NO          for messages to the list
CORPORA-REQUEST@NORA.HD.UIB.NO  for messages to list administrator
FILESERV@NORA.HD.UIB.NO         for requests to file server (try sending HELP)
******************************************************************************

I am sorry about the delay of this message. -Knut

Send-date: Fri, 6 Nov 1992 13:09:45 UTC-0500
From: (Jeffery D Martin)
Subject: German Text

I am looking for machine-readable German corpora to test the coverage of a German parsing system. Information on any kind of online German text would be greatly appreciated; I am especially interested in text which contains morphological and syntactic errors, since my system incorporates an error diagnosis component. I heard something about a "Miami corpus" but have not been able to find any references to it.

__
Jeffery D. Martin        | jeffmar@umiacs.umd.edu
Linguistics              | work: (301) 405 7040
University of Maryland   | home: (301) 779 5981
College Park, MD 20740   |

"Processes common to all living things are nutrition, digestion, exhaustion, and discretion."
-- anonymous student

From corpora-request@uib.no Mon Nov 16 03:07:35 1992
id <06880-0@alf.uib.no>; Mon, 16 Nov 1992 02:06:00 +0100
Date: Mon, 16 Nov 1992 02:07:35 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: RE: Enqiry

I am sorry about the delay of this message. -Knut

Send-date: Sat, 7 Nov 1992 11:35:31 UTC
From:
Subject: RE: Enqiry

Dear colleagues,

There is some material on language policies in Africa in my "English in Africa. An Introduction" (Longman Linguistics Library, 1991); it includes economic points in a flow diagram of decision making. If you have specific questions, I can also establish contacts with African colleagues.

Best wishes,
Josef

From corpora-request@uib.no Thu Nov 19 13:29:54 1992
id <01853-0@alf.uib.no>; Thu, 19 Nov 1992 12:28:17 +0100
Date: Thu, 19 Nov 1992 12:29:54 +0100
From: knut@nora.hd.uib.no (Knut Hofland)
To: corpora@nora.hd.uib.no
Subject: Oxford Text Archive
Cc: knut@nora.hd.uib.no

As a result of the query about German texts, I got the snapshot list of texts from Oxford. This list is 3555 lines long, so I will not redistribute it. But the list (and other information about the Oxford archive) is available on our file servers in the directory INFO.
I have split the list file into subfiles for each language; see the enclosed extract of the directory. If you want information about German texts and the order form, send the following message to FILESERV@NORA.HD.UIB.NO !!!

-------------------------------------------
To: fileserv@nora.hd.uib.no
Subject: bla bla bla ...

send info ota.German
send info oxford.textarchive.form
-------------------------------------------

To get more information about our file servers, send (another) message with the line

send icame file.servers

to FILESERV@NORA.HD.UIB.NO

The files are also available with FTP or Gopher from nora.hd.uib.no

Knut Hofland
Norwegian Computing Centre for the Humanities,
Harald Haarfagres gt. 31, N-5007 Bergen, Norway
Phone +47 5 212954/5/6  Fax: +47 5 322656
E-mail: knut@x400.hd.uib.no

================================================================
Extract of contents of the INFO directory:

   496 Nov 17 14:49 ota.Arabic
   220 Nov 17 14:49 ota.Danish
   378 Nov 17 14:49 ota.Dutch
120382 Nov 17 14:49 ota.English
  6879 Nov 17 14:49 ota.French
   634 Nov 17 14:49 ota.Fufulde
  2285 Nov 17 14:49 ota.Gaelic
  4376 Nov 17 14:49 ota.German
 14956 Nov 17 14:49 ota.Greek
   922 Nov 17 14:49 ota.Hebrew
   424 Nov 17 14:49 ota.Icelandic
  2833 Nov 17 14:49 ota.Italian
   246 Nov 17 14:49 ota.Japanese
   321 Nov 17 14:49 ota.Kurdish
  9452 Nov 17 14:49 ota.Latin
   129 Nov 17 14:49 ota.Latvian
   449 Nov 17 14:49 ota.Malayan
   974 Nov 17 14:49 ota.Miscellaneous
  1210 Nov 17 14:49 ota.Non-linguistic
   289 Nov 17 14:49 ota.Pali
   111 Nov 17 14:49 ota.Portuguese
   378 Nov 17 14:49 ota.Provençal
   111 Nov 17 14:49 ota.Russian
   986 Nov 17 14:49 ota.Sanskrit
   968 Nov 17 14:49 ota.Serbo-Croat
  1142 Nov 17 14:49 ota.Spanish
   109 Nov 17 14:49 ota.Swedish
   618 Nov 17 14:49 ota.Turkish
  1179 Nov 17 14:49 ota.Welsh
  1612 Nov 17 14:49 ota.snapshot.intro
  7060 Feb 11  1992 oxford.textarchive.form
  3408 Nov 17 13:34 oxford.textarchive.ftp
  6152 Nov 17 13:33 oxford.textarchive.info
175069 Nov 17 13:25 oxford.textarchive.list

From
corpora-request@uib.no Thu Nov 19 13:34:32 1992
id <02311-0@alf.uib.no>; Thu, 19 Nov 1992 12:33:01 +0100
Date: Thu, 19 Nov 1992 12:34:32 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: parsing techniques

Send-date: Mon, 16 Nov 1992 10:23:00 UTC
From: lcjohn
Subject: parsing techniques

I'm sorry that I can't help by providing German texts with errors, but I was particularly interested in the parsing techniques mentioned by Jeffery Martin. Any chance he could give me/us more details?

Thanks,
John Milton
Language Centre, Hong Kong University of Science and Technology

From corpora-request@uib.no Thu Nov 19 13:35:54 1992
id <02406-0@alf.uib.no>; Thu, 19 Nov 1992 12:34:18 +0100
Date: Thu, 19 Nov 1992 12:35:54 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Longman's Dictionary

Send-date: Wed, 18 Nov 1992 11:06:10 UTC-0500
From:
Subject: Longman's Dictionary

I understand that there is an electronic, public domain version of Longman's Dictionary around. Does anyone know how to obtain a copy and any other details? Thanks!
Elizabeth Adams              adams@merlin.hood.edu
Math & Computer Science      301-696-3733
Hood College, Frederick MD 21701

From corpora-request@uib.no Thu Nov 19 13:36:54 1992
id <02536-0@alf.uib.no>; Thu, 19 Nov 1992 12:35:17 +0100
Date: Thu, 19 Nov 1992 12:36:54 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: request for tagger output

Send-date: Thu, 19 Nov 1992 10:48:25 UTC+0200
From: ko (Kemal Oflazer)
Subject: request for tagger output

Hello All,

I am in the process of developing a tagger for Turkish text based on a two-level morphological analyzer that we have developed here. I would greatly appreciate it if someone could send me the tagged output of a corpus tagger for a short English text, so that we can get an idea of what types of information one should try to produce with a tagger.
Thanks in advance

Kemal Oflazer
Bilkent University
Computer Engineering Department
Bilkent, ANKARA, 06533 TURKIYE
e-mail: ko@trbilun.bitnet
fax: (90) 4 - 266-4127
tel: (90) 4 - 266-4133

From corpora-request@uib.no Fri Nov 20 02:04:24 1992
id <14735-0@alf.uib.no>; Fri, 20 Nov 1992 01:02:46 +0100
Date: Fri, 20 Nov 1992 01:04:24 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Czech corpora

Send-date: Thu, 19 Nov 1992 14:43:04 UTC+0100
From: Eric Akkerman
Subject: Czech corpora

Does anyone know if there are any Czech corpora available? A student of mine would like to experiment with that language as part of her final assignment for a course in computer-assisted text analysis.
Eric Akkerman
Free University of Amsterdam
eric@let.vu.nl

From corpora-request@uib.no Fri Nov 20 02:04:37 1992
id <14739-0@alf.uib.no>; Fri, 20 Nov 1992 01:02:59 +0100
Date: Fri, 20 Nov 1992 01:04:37 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Longman's Dictionary

1) --------------------------------------------------
From: Adam Kilgarriff
Subject: Re: Longman's Dictionary
2) --------------------------------------------------
From: ted
Subject: Longman's Dictionary

1) ==================================================
Send-date: Thu, 19 Nov 1992 16:03:30 UTC+0100
From: Adam Kilgarriff
Subject: Re: Longman's Dictionary

> I understand that there is an electronic, public domain version of
> Longman's Dictionary around. Does anyone know how to obtain a copy and
> any other details? Thanks!
> Elizabeth Adams adams@merlin.hood.edu

Longmans have been making LDOCE (Longman Dictionary of Contemporary English, First Edition, 1978) available to academic researchers for quite a while now. It is, however, definitely not public domain. The lexical information is a valuable asset, and the terms on which it is made available are that it may only be used for research. The type of research under consideration must be described as part of a contract permitting the researcher to use it, with reports and papers based on it sent to Longman.

The book `Computational Lexicography for Natural Language Processing' edited by Boguraev and Briscoe (Longman 1989) describes some of the work done using LDOCE.
If you are interested in obtaining a copy, please contact:

Della Summers
Director
Longman Dictionaries
Longman House
Burnt Mill
Harlow
Essex
England

Thanks,

Adam Kilgarriff
Computational Linguist
Longman Dictionaries
(e-mail to change shortly)

2) ==================================================
Send-date: Thu, 19 Nov 1992 12:10:40 UTC-0700
From: ted
Subject: Longman's Dictionary

   Date: Thu, 19 Nov 1992 12:35:54 +0100
   From: corplst%nora.hd.uib.no (CORPORA list)

   Send-date: Wed, 18 Nov 1992 11:06:10 UTC-0500
   From:
   Subject: Longman's Dictionary

   I understand that there is an electronic, public domain version of Longman's Dictionary around. Does anyone know how to obtain a copy and any other details? Thanks!

there is *NO* public domain version of any of longman's dictionaries around. in fact, if somebody is passing copies around without permission from longman's then they are doing a massive disservice to the research community because the dictionary publishers are enormously touchy about the possibility that they will lose control of their materials. having this come true in the slightest will cause an enormous twitch that none of us would like to see. if we all can avoid this sort of situation then i think that the publishers will soon loosen up considerably. but until then, please be very careful.
From corpora-request@uib.no Fri Nov 20 16:07:36 1992
id <01708-0@alf.uib.no>; Fri, 20 Nov 1992 15:05:58 +0100
Date: Fri, 20 Nov 1992 15:07:36 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: OCS Codex Marianus

Send-date: Fri, 20 Nov 1992 12:28:27 UTC+0200
From: (Jouko Lindstedt)
Subject: OCS Codex Marianus

The Dept. of Slavonic Languages at the University of Helsinki has prepared a computer version of the Old Church Slavonic Codex Marianus (see the description below). We are ready to e-mail it to any OCS scholar interested in it. For our records, please tell us what use you expect to make of it. We would of course be grateful if you report any errors you notice in the text, and even more grateful if you can send us e-texts in Slavonic, Baltic or classical languages in exchange.

An e-text of the Codex Assemanianus is in preparation.

Jouko Lindstedt
Institutum Slavicum, Universitas Helsingiensis
----------------------------------------------------------------------
Department of Slavonic Languages, University of Helsinki
or letters: Hallituskatu 11, 00100 Helsinki, Finland
(From Jan 1, 1993: P.O.Box 4, 00014 University of Helsinki, Finland)
fax: +358-0-1912974
----------------------------------------------------------------------

AN ELECTRONIC TEXT OF THE CODEX MARIANUS

The e-text version of the Codex Marianus consists of four files, one for each Gospel.
The file sizes under Unix are as follows:

105622 marmt.txt
 76309 marmc.txt
131799 marlc.txt
 92189 marjo.txt

The e-text should be considered a tertiary source, as it is not based on the manuscript itself but on Vatroslav Jagic's edition thereof. The files are not meant to be completely self-explanatory: they must be used with the edition.

The lines in the files do not correspond to the manuscript lines, being arranged according to Gospel chapters and verses. Each line begins with a seven-digit number which is to be interpreted as follows:

- first digit: Gospel (1=Matthew, 2=Mark, 3=Luke, 4=John)
- the following two digits: Gospel chapter
- the following two digits: verse
- the last but one digit: line number inside the verse (in the file, not in the codex: 0,1,2,...)
- the last digit: always 0 (reserved for special uses)

At the beginning of each Gospel (except for Matthew, the beginning of which is missing in the codex) there is a "00-chapter, 00-verse" section into which the pericope lists found in the codex are placed. In these passages the line division does correspond to that of the codex.

The transliteration used in the text only makes use of the 7-bit ASCII code so as to ensure maximal portability. Upper-case letters are used to represent different graphemes than the corresponding lower-case letters. A transliteration table will be mailed with the files.
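
[ Editorial sketch, not part of the original posting. The seven-digit line numbers described above could be decoded roughly as follows. This is a minimal Python illustration that assumes only the digit layout given in the description; the names LineRef and parse_line_number, and the sample value "3020710", are hypothetical and do not come from the distributed files. ]

```python
from typing import NamedTuple

# Gospel codes exactly as listed in the file description.
GOSPELS = {1: "Matthew", 2: "Mark", 3: "Luke", 4: "John"}


class LineRef(NamedTuple):
    """Decoded form of one seven-digit line number (hypothetical helper type)."""
    gospel: str
    chapter: int        # 0 marks the "00-chapter, 00-verse" pericope section
    verse: int
    line_in_verse: int  # counts lines in the file, not in the codex
    reserved: int       # described as always 0


def parse_line_number(code: str) -> LineRef:
    """Split a seven-digit line number into its documented fields."""
    if len(code) != 7 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected seven ASCII digits, got {code!r}")
    gospel = GOSPELS.get(int(code[0]))
    if gospel is None:
        raise ValueError(f"unknown Gospel digit in {code!r}")
    return LineRef(
        gospel=gospel,
        chapter=int(code[1:3]),
        verse=int(code[3:5]),
        line_in_verse=int(code[5]),
        reserved=int(code[6]),
    )
```

[ Under these assumptions, parse_line_number("3020710") would give Luke, chapter 2, verse 7, second file line of that verse (line number 1), and any number whose chapter and verse fields are both 00 would belong to a pericope-list section. ]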
From corpora-request@uib.no Tue Nov 24 16:07:19 1992
id <23975-0@alf.uib.no>; Tue, 24 Nov 1992 15:05:39 +0100
Date: Tue, 24 Nov 1992 15:07:19 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Hard/Software for corpus analyses

Send-date: Mon, 23 Nov 1992 12:33:11 UTC+0100
From: M.M.B.Corley
Subject: Hard/Software for corpus analyses

Our new grant, sponsored by the (British) ESRC, will involve us, among other things, in counts of syntactic structures in various corpora. Before we can begin on this work, we need appropriate hardware and software to access the materials available. We are complete novices to the field, and would greatly appreciate any advice on suitable hardware etc. from those of you in the know!
--
Don Mitchell         D.C.Mitchell @ cen.ex.ac.uk
Martin Corley        M.M.B.Corley @ cen.ex.ac.uk
Dept of Psychology
University of Exeter         0392 264626
Exeter EX4 4QG       direct: 0392 264622

From corpora-request@uib.no Fri Nov 27 22:44:46 1992
id <15342-0@alf.uib.no>; Fri, 27 Nov 1992 21:43:05 +0100
Date: Fri, 27 Nov 1992 21:44:46 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: data format of annotated corpora

Send-date: Fri, 27 Nov 1992 17:40:12 UTC+0100
From: (Helmut Feldweg)
Subject: data format of annotated corpora

I'd like to know if there are any strong feelings out there in corpora-land concerning the format of annotated corpora. We are about to prepare an annotated corpus of German and are facing the question whether to do the annotation in an SGML-based TEI-conformant format or on the basis of a verticalized text with one wordform plus annotations per line.

Is TEI merely a format for information interchange, or is it feasible to do basic analysis (frequencies, concordances, distributional analysis etc.) with this format? As we do not want to spend time developing yet another freq- and kwic-program, we'd like to know if there are any programs available to do this kind of analysis on the basis of SGML-tagged texts. Are there tools which allow me to deal with an SGML-tagged text as easily as I can manipulate a verticalized text with grep, awk, icon and similar tools? Has anybody *worked* with annotated texts in TEI format?

It is our understanding that the final product will be a TEI-conformant SGML-tagged text. The question is whether this is the right format to use during data preparation.
--
Helmut Feldweg
Seminar f"ur Sprachwissenschaft, Universit"at T"ubingen
Wilhelmstr. 113, D-7400 T"ubingen 1, Germany
email: feldweg@mailserv.zdv.uni-tuebingen.de
       feldweg@bach.sns.neuphilologie.uni-tuebingen.de
phone: +49 (0)7071 29-4279

From corpora-request@uib.no Mon Nov 30 11:37:16 1992
id <29604-0@alf.uib.no>; Mon, 30 Nov 1992 10:35:33 +0100
Date: Mon, 30 Nov 1992 10:37:16 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: data format of annotated corpora

Send-date: Mon, 30 Nov 1992 9:34:25 UTC+0100
From: (Torbjoern Lager)
Subject: Re: data format of annotated corpora

> SGML-tagged text. The question is whether this is the
> right format to use during data preparation.

I'd like to know as well. Could you please share answers you get, if any?
Regards,
Torbjoern Lager

---------------------------------**-------------------------------------*------
Torbjoern Lager              E-mail: lager@ling.gu.se
Department of Linguistics    Phone: +46 31 7731175
University of Gothenburg     Fax: +46 31 7734853
Renstroemsparken
412 98 Gothenburg
Sweden
**-*-----*-*------------------*------------------------------------------------

From corpora-request@uib.no Tue Dec 1 02:04:58 1992
id <18045-0@alf.uib.no>; Tue, 1 Dec 1992 01:03:14 +0100
Date: Tue, 1 Dec 1992 01:04:58 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Spanish corpora (2 msgs + note)

Send-date: Mon, 30 Nov 1992 11:45:13 UTC-0500
From: (Doug Mckee x7820)
Subject: Spanish corpora

I'm looking for online Spanish corpora, preferably newspaper or magazine articles. I've heard there is a collection at the University of Miami, but I haven't been able to find it. Can anyone help me out?

BTW, I already know what is available in the Oxford Text Archive.

----------------------------------------------------------------
Doug McKee               E-mail: mckeed@sra.com
SRA Corp.                Phone: (703) 558-7820
2000 15th St. N          Fax: (703) 558-4723
Arlington, VA 22201 USA
----------------------------------------------------------------

Send-date: Mon, 30 Nov 1992 11:56:37 UTC-0800
From: (Jane Edwards)
Subject: Spanish corpora?

I am posting this on behalf of Sheryl. I will forward posted responses to her, but would like to know the answer to this myself, as well.
Many thanks,

-Jane Edwards (edwards@cogsci.berkeley.edu)

>
> Date: Sun, 29 Nov 92 11:43:09 MST
> From:
> Subject: Spanish corpora
>
> Does anyone know of any databases of Spanish corpora and how I could access
> them?
>
> Thanks!
>
> Sheryl (scoleman@vm.ucs.ualberta.ca)

[ From the list administrator:
I would like to mention the Catalogue of Projects in Electronic Text (CPET) at Georgetown University, Washington DC. This catalogue can be accessed via Telnet to guvax3.georgetown.edu with username CPET (you will need VT-100 keys). A manual can be fetched from our file server (FILESERV@NORA.HD.UIB.NO) by sending

send info cpet.manual

either as the subject or as the only line in the message. A list of Romance language projects (as of Feb. 1991, 64 KB) can be fetched from the file server with the line:

send info roman.projects

For further information about CPET, contact Margaret Friedman (mfriedman@guvax.georgetown.edu)
- Knut Hofland ]

From corpora-request@uib.no Tue Dec 1 02:14:46 1992
id <18288-0@alf.uib.no>; Tue, 1 Dec 1992 01:13:02 +0100
Date: Tue, 1 Dec 1992 01:14:46 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: New files at server

Send-date: Mon, 30 Nov 1992 12:19:02 UTC+0100
From: (Lothar Lemnitzer)
Organization: Westfaelische Wilhelms-Universitaet, Muenster, Germany Department of Mathematics

I uploaded four files giving a profile of our institute, in particular a list of corpus resources at hand.
The names are

ms_index:    list of files
ms_database: description of our lexical database
ms_korpora:  description of our corpora
ms_tools:    corpus-related UNIX tools

Regards
lothar@hendrix.uni-muenster.de

[ These files have been placed in the corpora directory and can be fetched by sending a request to FILESERV@NORA.HD.UIB.NO like:

send corpora ms_index
send corpora ms_korpora
.....

(can also be fetched via FTP or Gopher to nora.hd.uib.no)
- Knut Hofland ]

From corpora-request@uib.no Wed Dec 2 01:38:14 1992
id <25364-0@alf.uib.no>; Wed, 2 Dec 1992 00:36:30 +0100
Date: Wed, 2 Dec 1992 00:38:14 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Linguistic Research - Call For Partners

Send-date: Tue, 1 Dec 1992 14:43:12 UTC+0100
From: (Miriam Mulders)
Subject: Linguistic Research - Call For Partners

======================================================================
======================CALL FOR PARTNERS===============================
================LINGUISTIC RESEARCH AND ENGINEERING===================
======================================================================

The Institute for Language Technology and Artificial Intelligence (ITK), Tilburg University, The Netherlands, intends to submit a proposal for the Linguistic Research and Engineering programme of the European Communities. We are looking for partners to form a consortium that will formulate the definitive proposal and carry out the research.
Partners should preferably satisfy the following profile:

* be located in the United Kingdom, Denmark, or Germany; the consortium will consist of one participant per language domain (English, Danish, German, Dutch);
* have access to and experience with tools for natural language processing;
* have experience with automatic processing and annotation of natural language corpora.

The proposal to be submitted will be an internationalization of a Dutch project that is currently being executed by ITK, the Dutch software firm Syllogic, and the Dutch institute for development of educational courses for all kinds of professions (SLO). The goal of this small-scale project is to use NLP techniques to translate NL sentences into data structures in order to support storage, retrieval and comparison of professional skills and educational objectives.

The extension of this project to other language domains of the European Communities will stimulate the development and use of standard encodings for professional skills and educational objectives and will make comparison possible between educational programmes of different countries.

The following tasks will be part of the EC project:

* definition of an international standard encoding format (e.g. TEI) for the specific corpora used in the educational field;
* automatic analysis and coding of large corpora of sentences.

These tasks will be carried out for each language domain by a research institute in close collaboration with an institute for educational programme development.

We are looking forward to hearing from any institute that is willing to cooperate in this kind of research embedded in an international project. Institutes interested are kindly requested to react within a few days; signs of interest will, of course, not be considered as binding.

====Miriam Mulders=================================================
====Inst.
for Language Technology and Artificial Intelligence======
====Tilburg University, The Netherlands============================
====E-mail: miriam@kub.nl==========================================
====Phone: +3113 662692============================================

From corpora-request@uib.no Wed Dec 2 01:38:44 1992
id <25385-0@alf.uib.no>; Wed, 2 Dec 1992 00:37:00 +0100
Date: Wed, 2 Dec 1992 00:38:44 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Survey of Modern Greek corpora

Send-date: Tue, 1 Dec 1992 14:54:06 UTC+0100
From: GOUTSOSD
Subject: Survey of Modern Greek corpora

Postal address:
School of English
University of Birmingham
Birmingham B15 2TT
UK
fax: (int +) 44 21 414 3600
e-mail: goutsosd@uk.ac.birmingham

27 November 1992

Dear Colleague,

We have recently become aware of the lack of communication between researchers on Modern Greek and the need for exchange of information, and so we are taking the initiative to distribute this survey of machine-readable corpora of Modern Greek. Its aim is to collect information about the nature and structure of collections of text in machine-readable form and the specifications of hardware and software tools. This information will be available to interested researchers and is intended to provide a basis for discussion and exchange of information on the future of Modern Greek corpora.

By corpus, we mean broadly a text collection, comprising texts to be studied individually, not linked in any coordinated way, collected works of an author, texts selected to study a particular author, textbanks, databases or bibliographies.
If you are not personally involved in the compilation of such a machine-readable corpus, could you pass the survey on to others or suggest their names to us? We would hope to complete the results of the survey by March 1993; depending on the extent of the response we may come back to you for more detail.

We would like to thank you in advance for your help and we'd be happy to hear any suggestions from you.

Dionysis Goutsos
Rania Hatzidaki
Philip King

Modern Greek Corpus Initiative
Survey of machine-readable corpora of Modern Greek

A. CORPUS PROFILE

A1. By what name is the corpus known?
A2. Who compiled the corpus?
A3. Where was it compiled? (Institution)
A4. Contact
    Address
    Telephone
    Fax
    E-mail
A5. When did the compilation start?
A6. What was the incentive for starting the compilation?

B. COMPUTER FACILITIES AND SOFTWARE

B1. How are texts entered? (word-processor, text-editor, typesetting tapes, optical scanning, other)
B2. How is the corpus stored and in what format?
B2.1. What computer facilities do you use? (IBM Personal Computer or compatible, Apple Macintosh - workstation - mainframe)
B2.2. What software do you use for corpus processing? (please specify item and function: word frequency, concordancing of selected items etc.)
B2.3. Do you use ready-made or customized software?
B2.4. If you use your own software, which programming language do you use?
B3. How do you handle the special problem of Greek characters?
    - in input processing
    - in screen output
    - in printing
B4. Do you have software for linguistic annotation (tagging, parsing, lemmatization)? If yes, specify

C. TEXT DETAILS

C1. How was the text acquired?
C2. How is the corpus organized?
C3. Can you give some details of the content?
C3.1. Written texts:
C3.1.1. What genres are included in your collection?
C3.1.2. What are the media of the original texts? (printed book, periodical, manuscript, ephemera, other)
C3.1.3. Do you encode typographic and layout information? If so, specify
C3.2.
Spoken texts (transcriptions):
C3.2.1. What genres are included in your collection?
C3.2.2. What is the medium of the original source? (TV, radio, telephone, direct: talk, conversation, other)
C3.2.3. Is the material spontaneous or not, surreptitious or not?
C3.2.4. Do you encode information about speakers (e.g. age, sex) or about the recording?
C3.2.5. What transcription system do you use? (phonetic, phonological, enhanced orthographical, orthographical)
C4. What period do the texts in the corpus represent? from _____________ to ____________
C5. What is the total amount of data stored in your collection?
    - in bytes
    - in words
    - in minutes of spoken text recording
C6. What use is made of the corpus? (specify, where appropriate)
    - to build up a multifunctional linguistic corpus
    - for lexicographic purposes
    - for literary research
    - for stylistic research
    - for preparation of a scholarly edition
    - for research in linguistics
    - for research in language learning/teaching
    - for commercial applications
    - for natural language processing applications
    - other
C7. Is it available to other interested parties? If so, under what conditions?

D. VIEWS AND PERSPECTIVES:

D1. Do you plan any changes in the composition of your corpus?
D2. Are you planning to develop new text-handling software?
D3. Are there any specialized areas of Modern Greek for which a corpus approach would be particularly useful?
D4.1. What are your views on the development of a general corpus of Modern Greek (such as the Brown Corpus of English or the Birmingham English Corpus)?
D4.2. What would you consider to be the optimal size of it?
D5. Do you prefer a 'clean text' strategy (i.e. plain orthographic files) as opposed to annotated, phonologically coded, parsed etc. text?
D6. Do you think that multilingual corpora or corpora containing 'parallel texts' are needed?
D7. Do you have any other views on the development of Modern Greek corpora and software for processing them?

E.
PUBLICATIONS: Please list any publications that you are aware of that were based on the electronic text you describe From corpora-request@uib.no Tue Dec 1 08:13:36 1992 id <26113-0@alf.uib.no>; Wed, 2 Dec 1992 01:14:04 +0100 Date: Tue, 1 Dec 92 16:13:36 -0800 From: sundheim@cod.nosc.mil (Beth M. Sundheim) To: corpora@nora.hd.uib.no Subject: 5th Message Understanding Conference--Call for Participation ------- * * * CALL FOR PARTICIPATION * * * FIFTH MESSAGE UNDERSTANDING SYSTEM EVALUATION AND MESSAGE UNDERSTANDING CONFERENCE (MUC-5) 1 MARCH - 27 AUGUST, 1993 Preparation: 1 March - 23 May 29 May - 25 July Evaluations: 24-28 May (dry run) 26-30 July (formal run) Conference: 25-27 August Sponsored by: Defense Advanced Research Projects Agency Software and Intelligent Systems Technology Office (DARPA/SISTO) The Message Understanding Conferences have provided on ongoing forum for assessing the state of the art and practice in text analysis technology and for exchanging information on innovative computational techniques. They have also encouraged experimentation in the context of fully implemented systems that perform the realistic task of extracting factual information from free text. The first two conferences focused on short naval messages; the two most recent conferences challenged the systems with longer and stylistically varied terrorism news stories. The four conferences have seen the application of a wide variety of approaches to the information extraction task. There is a growing appreciation of the potential utility of the technologies. At the same time, performance constraints attributed to inadequate computational methods are becoming serious issues for the more highly developed systems. The Fifth Message Understanding Conference (MUC-5) will continue the technology assessment cycle, with new information extraction tasks in new domains. MUC-5 will also continue the effort to define an insightful, objective set of performance evaluation criteria. 
DARPA sponsors the Message Understanding Conferences as part of the TIPSTER Text program. Participation in MUC-5 is actively sought from both new and veteran organizations. Veteran evaluation participants will be able to measure their progress in designing robust, end-to-end information extraction systems and to continue the fruitful interchange of ideas about systems and evaluation. New participants will also contribute to and benefit from such interactions, while learning to manage the challenges posed by the evaluation task. In this process, all organizations enjoy some advantages and suffer from some disadvantages in the evaluation. These differing circumstances are recognized by the evaluators and should not deter organizations from participating. The conference itself will consist primarily of presentations and discussions of test results, system design, and innovative techniques. Attendance at the conference is limited to evaluation participants and to guests invited by DARPA. A conference proceedings, including all test results, will be published. Modest amounts of financial support will be made available to selected participants in an effort to maximize the number of participants and to attract the widest possible variety of technical approaches and system architectures. This funding is intended only as a supplement to other support. Both U.S. and non-U.S. participants are eligible for this funding. 
SCHEDULE:

3 January 1993       Deadline for applications that include funding requests
15 January 1993      Final application deadline (no funding requests)
1 February 1993      Notification of acceptance and funding
1 March 1993         Release of system development corpus and evaluation software
24-28 May 1993       Performance evaluation (dry run) on test corpus
26-30 July 1993      Performance evaluation (formal run) on new test corpus
25-27 August 1993    Fifth Message Understanding Conference

DATA AND TASK DESCRIPTION:

Subject to successful completion of negotiations to obtain proper permissions concerning the data, the data and task to be used for MUC-5 will be the same as those already in use for the data extraction portion of the DARPA/SISTO TIPSTER Text program. There are two languages, English and Japanese, and two domains, joint ventures and microelectronic chip fabrication. These form four separate corpora. The texts are newswire articles selected to produce the desired mix of relevant and nonrelevant texts, and they were blindly divided into pools of development (training) and test data.

The task is to extract information about the nature and status of activities in the domain, the entities involved, etc. Analysts have been doing software-assisted manual generation of the "key" templates against which the system-generated templates will be evaluated. The template design is object oriented, and each slot in the template has its own fill specifications for data type, valency, etc. The fill specifications in each domain vary slightly between English and Japanese, reflecting differences in language usage; however, the general design of the template is the same for both languages.

An English and a Japanese sample text and corresponding template in the joint ventures domain are available from the program chair (address at end of this announcement). Please specify which language(s) you are interested in. A microelectronics example may be available shortly.
The total amount of data that will be available in March to support system development is expected to be between 200 and 1,000 templates and corresponding texts. This number will vary according to the corpus and the data rights that are obtained. To receive the data, participants will be required to acknowledge its copyright status by signing agreements to safeguard the data and to use it for research purposes only. TEST PROTOCOL AND EVALUATION CRITERIA: MUC-5 participants may elect to do either language or both languages; they are limited to selecting just one domain. Participants will have access to TIPSTER Government-Furnished Information and shared resources such as the training texts and templates, task documentation, gazetteers, and evaluation software. TIPSTER data extraction contractors will be participating in MUC-5, for which previously unseen test data will be used. Each test set will consist of 100-300 texts, depending on language and domain. A dry-run test will be conducted about three months after the release of the training data; the formal test will be conducted about two and one-half months after the dry run. Each test will be carried out by the participants at their own sites in accordance with a prepared test procedure and the results submitted to NRaD for official scoring by domain analysts. Systems will be evaluated using the criteria applied to the TIPSTER Text data extraction systems. These criteria, which are still under development, are likely to use the scoring categories (correct, partially correct, incorrect, spurious, missing, and noncommittal) to support not only the measures used for MUC-4 (recall, precision, overgeneration, fallout, and F-measure) but also new measures (probability of detection, probability of false alarm, and a measure that combines them). 
MUC-5 participants will be able to familiarize themselves with the evaluation criteria through usage of the evaluation software, which will be released along with the training data. INSTRUCTIONS FOR RESPONDING TO THE CALL FOR PARTICIPATION: Organizations within and outside the U.S. are invited to respond to this call for participation. Minimal requirements include development before the dry-run test of a system that can accept texts without manual preprocessing, process them without human intervention, and output templates in the expected format. Organizations should plan on allocating at least three person-months of effort for participation in the evaluation and conference; a substantially greater level of effort is likely to be needed in order to achieve relatively high performance. It is understood that organizations will vary with respect to experience with information extraction, domain expertise/engineering, resources, contractual demands/expectations, etc. Recognition of such factors will be made in any analyses of the results. Organizations wishing to participate in the evaluation and conference must respond by submitting a summary of their text analysis approach and a system architecture description, not to exceed five pages in total. The summary should include the strengths of the approach and highlight its innovative aspects. Acceptance or rejection of each application will be determined on the basis of a technical assessment by the program committee. The body of the application will serve as the basis for an article in the conference proceedings. Participants will have the opportunity to make revisions prior to publication. The application must also include the following information: 1. Domain (choose only one) a. Joint ventures b. Microelectronics 2. Language (choose one or two) a. English b. Japanese 3. 
An estimate of the degree of coverage and/or length of time under development of existing software to be applied to the MUC-5 task in the selected language(s) and domain. 4. Primary point of contact for notification of acceptance/rejection of application. Please include name, surface and email addresses, and phone and fax numbers. Those organizations wishing to request funding to supplement their own resources must provide a second statement, not to exceed two pages. This statement should include an estimate of the amount of funding available from other sources to support participation in this work and a specification of the amount of funding desired and the minimal acceptable amount. In addition, it should describe any software to be used for MUC-5 that the organization is willing to deliver to NRaD and MUC participants for possible redistribution. Please indicate clearly whether the organization is interested in participating in MUC-5 even if no funding is available. Evaluators of funding requests will not include any MUC system developers. RESPONSES THAT INCLUDE FUNDING REQUESTS MUST BE SUBMITTED BY JANUARY 3, 1993. THE DEADLINE FOR OTHER RESPONSES IS JANUARY 15, 1993. All participants are expected to have Internet access and to be able to do electronic file transfer via anonymous FTP. All responses should be submitted to the program chair via email to sundheim@nosc.mil. If Internet access is currently unavailable, responses may be sent via surface mail to Beth Sundheim, NCCOSC/NRaD, Code 444, San Diego, CA 92152-5000, and if a quick reply to questions is needed, the program chair may be reached by phone at 619/553-4145. PROGRAM COMMITTEE: Beth Sundheim, NCCOSC/NRaD, program chair Sean Boisen, BBN Systems and Technologies Lynn Carlson, U.S. 
Department of Defense Nancy Chinchor, Science Applications International Jim Cowie, New Mexico State University Ralph Grishman, New York University Jerry Hobbs, SRI International Joe McCarthy, University of Massachusetts, Amherst Mary Ellen Okurowski, U.S. Department of Defense Boyan Onyshkevych, U.S. Department of Defense Lisa Rau, General Electric R&D Center Carl Weir, Paramax Systems Corporation REFERENCE: _Proceedings_of_the_Fourth_Message_Understanding_Conference_ (MUC-4)_, Morgan Kaufmann, June, 1992. To order, call (800)745-7323 (toll free in North America) or (415)578-9928 (direct), send fax to (415)578-0672 or email to morgan@unix.sri.com. Please refer to ISBN 1-55860-273-9. ------- From corpora-request@uib.no Wed Dec 2 13:09:03 1992 id <01679-0@alf.uib.no>; Wed, 2 Dec 1992 12:06:47 +0100 Wed, 2 Dec 92 12:09:04 +0100 To: corplst@nora.hd.uib.no, jgt@vinga.hum.gu.se Subject: Re: Spanish corpora (2 msgs + note) Date: Wed, 02 Dec 92 12:09:03 +0100 From: Jan-Gunnar Tingsell X-Mts: smtp > From: (Doug Mckee x7820) > Subject: Spanish corpora > > I'm looking for online Spanish corpora, preferably newspaper or > magazine articles. I've heard there is a collection at the University > of Miami, but I haven't been able to find it. Can anyone help he out? > BTW, I already know what is available in the Oxford Text Archive. > There is a swedish archive at Gothenburg University containing spanish newspaper and magazine articles. 
Please contact: David Mighetto /Jan-Gunnar Tingsell From corpora-request@uib.no Wed Dec 2 15:42:36 1992 id <10697-0@alf.uib.no>; Wed, 2 Dec 1992 14:40:57 +0100 Date: Wed, 2 Dec 1992 14:42:36 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: 5th Message Understanding Conference--Call for Participation Send-date: Tue, 1 Dec 1992 16:13:36 UTC-0800 From: (Beth M.
Sundheim)
Subject: 5th Message Understanding Conference--Call for Participation

Preparation: 1 March - 23 May, 29 May - 25 July
Evaluations: 24-28 May (dry run), 26-30 July (formal run)
Conference: 25-27 August

Sponsored by:
Defense Advanced Research Projects Agency
Software and Intelligent Systems Technology Office (DARPA/SISTO)

The Message Understanding Conferences have provided an ongoing forum for assessing the state of the art and practice in text analysis technology and for exchanging information on innovative computational techniques. They have also encouraged experimentation in the context of fully implemented systems that perform the realistic task of extracting factual information from free text.

[rest of message deleted]

The whole message can be fetched by sending the line

    send corpora message.understand.conference

to FILESERV@NORA.HD.UIB.NO

From corpora-request@uib.no Fri Dec 4 10:19:21 1992
        id <20606-0@alf.uib.no>; Fri, 4 Dec 1992 09:17:36 +0100
Date: Fri, 4 Dec 1992 09:19:21 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: data format of annotated corpora (2 msgs/297 lines)

1) ==========================================================================
Send-date: Thu, 3 Dec 1992 18:18:00 UTC
From: Lou Burnard
Subject: Data formats for annotated corpora

2) ==========================================================================
Send-date: Thu, 3 Dec 1992 10:14:45 UTC-0600
From: C. M.
Sperberg-McQueen Subject: Re: data format of annotated corpora 1) ========================================================================== Send-date: Thu, 3 Dec 1992 18:18:00 UTC From: Lou Burnard Subject: Data formats for annotated corpora Helmut Feldweg asked: "Is TEI merely a format for information interchange or is it feasible to do basic analysis (frequencies, concordances, distributional analysis etc.) with this format?" He also said his team was "facing the quesiton whether to do the annotation in a SGML-based TEI-conformant format or on the basis of a vertikalized text with one wordform plus annotations per line." As European editor of the TEI I am rather depressed by the fact that the TEI has clearly been so unsuccessful in getting across the nature of its objectives! I hope you will allow me to elaborate a little on the subject. First, the TEI is very emphatically a format for information interchange. (and, by the way, I don't see why the word "merely" goes in there!) If you want to build a resource which is usable with different pieces of software, or the same piece of software in different computing environments, you *must* have some sort of interchange format. Resources like corpora cost lots of time and money to create, so I hope that *all* in "corpora-land" have some concern for re-usability. That is one of the reasons for the existence of the TEI and why it is seen as so important by research funders as well as researchers. Contrary to what Dr Feldweg seems to imply in the sentences quoted above, the TEI format is not limited in its application to interchange only. It uses SGML for the very good reason that SGML is a defined international standard for which general purpose software is around now and will continue to be around for many years to come. That means also that more and more SGML-aware software will become available, for the kinds of specialized activities which delight the readers of this list. 
But an honest answer to the question "Is there a KWIC concordance generator which will directly process TEI-tagged texts?" now, as of December 1992, is *NO*. Let me say why I don't think this is a disastrous state of affairs:

- the TEI recommendations are not yet finalized, so any software developer who claims to 'support TEI' is using the word "support" in some rather general sense which needs to be more exactly specified

- the general principles of the TEI are however very clearly and straightforwardly expressed so that writing software which conforms to them is easy (provided you understand SGML)

- few people want to use a KWIC concordance *only*

If you put your data into the format that the KWIC program wants, then you can't get it formatted properly; if you put it into the format that the formatter wants, then the KWIC program is confused... so you wind up having to invent an interchange format all of your own anyway. In which case, why not use SGML from the start?

But enough of these generalities! Let me try to answer some of Dr Feldweg's specific questions.

>Are there tools which allow me to deal
>with an SGML-tagged text as easily as I can manipulate a vertikalized
>text with grep, awk, icon and similar tools?

Yes. Use the public-domain SGML parser sgmls to produce a normalized form of your document, which can be piped into whatever set of unix tools you like. sgmls (which you can download from lots of ftp servers, not excluding this one, I think!) outputs a very simple form of an SGML document (technically known as the ESIS) in which all the various aspects of an SGML document that are variable have been normalized (see further below), thus making it very easy to pipe into any tool you like. Two trivial examples:

    sgmls foo.bar | \
    awk '/^-/ {print substr($0,2)}' | \
    sed 's/\\n/ /g' | tr -s ' ' '\012' | sort -u

will give you a sorted list of the tokens in the contents of the SGML file foo.bar (i.e. just the text, without the markup or attributes)

    sgmls foo.bar | \
    awk '/^\(BLORT/,/^\)BLORT/ {if ($0 ~ /^-/) print substr($0,2)}' | \
    sed 's/\\n/ /g' | tr -s ' ' '\012' | sort -u

will do the same thing, but only for the tokens in content marked as being a BLORT (i.e. between <BLORT> and </BLORT> in the conventional SGML format)

Why should you use the SGML parser? Let's suppose that you're using a simple dtd (or only using a small amount of a complex one) in which documents consist of paragraphs that consist of sentences. You are following the TEI and so you have marked the sentences with the S tag and the paras with the P tag. You also want identifiers on the sentences. So you have an input text like this

The cat sat on the hippopotamus. L'&eacute;tat, c'est moi.

Or like this

The cat sat on the hippopotamus. L'&eacute;tat, c'est moi.

(I've arbitrarily taken some liberties with the tagging in order to show the range of possibilities which all produce *exactly the same* ESIS -- i.e. the same document)

The parser does three things: it expands the entity references (so, the '&eacute;' will turn into whatever the right character for your machine is); it validates the tags (so, if you have spelled 'P' with a q at the start by mistake it can complain); and it outputs a regularised version of the document (so, all the tagnames appear in uppercase, the end-tags are supplied etc etc). That's probably enough low-grade techy stuff for this note, for now.

>Has anybody *worked* with annotated texts in TEI-format?

Since, see above, the TEI Guidelines are not yet finalized this would be difficult. However, there are many examples of projects which are applying the TEI principles and getting real experience of using the draft recommendations. If by 'annotated texts' you mean 'texts with word-class tagging attached', I know of two or three projects, of which the British National Corpus is probably the largest.

>It is our understanding that the final product will be an
>TEI-conformant SGML-tagged text. The question is whether this is the
>right format to use during data preparation.

For data PREPARATION one should definitely use the most convenient tool at one's disposal. That might be WordPerfect (which, by the bye, recently announced a beta test of an SGML-aware version), emacs, pagemaker... Some people like to define abbreviatory macros or conventions to save keystrokes one way, and some another; what's crucial (I think) is that the distinctions made by the local data preparation format can be straightforwardly mapped onto the interchange format. I would suggest that 'data capture' is one of the many processor-specific application formats amongst which the interchange format mediates, albeit a particularly important one.

I hope this discussion will be of interest to your readers, and is also helpful to Dr Feldweg.
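The two awk pipelines given earlier in this message can also be sketched in Python, for readers who prefer to post-process the parser output in a script. This is only an illustration under stated assumptions: the sample stream is hand-written rather than real sgmls output, it assumes one ESIS command per line ('(' for a start-tag, ')' for an end-tag, 'A' for an attribute of the next start-tag, '-' for data, with \n standing for a record end), and the names SAMPLE_ESIS and esis_tokens are invented here.

```python
# Hand-written stand-in for sgmls output (an assumption, not a real run).
SAMPLE_ESIS = r"""(P
AID TOKEN S1
(S
-The cat sat on the\n
(BLORT
-hippopotamus
)BLORT
-.
)S
AID TOKEN S2
(S
-The hippopotamus did not mind.
)S
)P
C
"""


def esis_tokens(esis_text, within=None):
    """Return the sorted distinct tokens found on ESIS data lines.

    If `within` names an element type, only data inside elements of
    that type is counted (the BLORT case in the second pipeline).
    """
    depth = 0          # how many `within` elements are currently open
    tokens = set()
    for line in esis_text.splitlines():
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(" and rest == within:
            depth += 1
        elif code == ")" and rest == within:
            depth -= 1
        elif code == "-" and (within is None or depth > 0):
            # "\n" in the data marks a record end; treat it as a space.
            tokens.update(rest.replace("\\n", " ").split())
    return sorted(tokens)


if __name__ == "__main__":
    print("\n".join(esis_tokens(SAMPLE_ESIS)))
    print("\n".join(esis_tokens(SAMPLE_ESIS, within="BLORT")))
```

Like the shell version, this splits on white space only, so punctuation stays attached to the neighbouring word; a real tokenizer would need language-specific rules.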
--------------------------------------------------------------------- Lou Burnard tel. +44 865 273200 Euro Editor TEI fax. +44 865 273275 Oxford University Computing Services lou@ox.ac.uk ---------------------------------------------------------------------- 2) ========================================================================== Send-date: Thu, 3 Dec 1992 10:14:45 UTC-0600 From: C. M. Sperberg-McQueen Subject: Re: data format of annotated corpora On Fri, 27 Nov 1992 21:44:46 +0100 you said: >Send-date: Fri, 27 Nov 1992 17:40:12 UTC+0100 >From: (Helmut Feldweg) >Subject: data format of annotated corpora > >I'ld like to know if there are any strong feelings out there in >corpora-land concerning the format of annotated corpora. My colleague Lou Burnard has already responded to this query, I think, but I'd like to throw in my own two cents' worth as well. >We are about to prepare an annotated corpus of German and are facing >the quesiton whether to do the annotation in a SGML-based >TEI-conformant format or on the basis of a vertikalized text with one >wordform plus annotations per line. Whether you are using the TEI DTDs or not, SGML provides a much firmer basis for reusable text than a verticalized format. The vertical formats I am familiar with are all very simple to process, but all are also fundamentally non-extensible. If I discover, the year after next, that I would like to enrich my text by adding a new type of annotation, the vertical format is much less likely to support this process than is SGML. This is true, I think, even for word-by-word annotation, but it is critically true for phrase-level annotation and other phenomena which don't map one-to-one with the tokens of the source text. (How many complications and compromises are required in the Brown and LOB corpora by the limitation of annotation to the token level! 
The 'New York-born financier' would not cause problems for annotation if one were not forced to annotate the three tokens 'New', 'York-born', and 'financier' -- similarly the Queen of England's taxes could be annotated properly, with the apostrophe and S applying to the NP, not just to the last word, if phrase-structure annotation were aided, instead of being impeded, by the notation.)

If resources are to be reusable, they must be represented in a notation which can support their enrichment and modification over a period of time. SGML is far and away the most suitable notation now publicly defined for such enrichment.

>Is TEI merely a format for information interchange or is it feasible
>to do basic analysis (frequencies, concordances, distributional
>analysis etc.) with this format? As we do not want to spend time for

The TEI Guidelines define an interchange format. This allows those who already have a strong commitment to some locally supported format to support the TEI encoding scheme for import and export of files to and from their local format. It also provides an opaque interface behind which one can optimize or simplify or otherwise do exactly as one wants.

(In just the same way, ISO 646, ISO 10646, and ANSI X3.4 (ASCII) do not define character sets for use in computers or on data storage devices; they only define character sets for interchange. What the CPU, disk drive, or tape drive store internally is their own business as long as they provide ASCII at the interface to other devices. On high-density tapes, in fact, I am told ASCII bytes are *not* used -- instead, other codes, optimized for high-density reading and writing, are used. Many devices, however, find it simpler to store internally exactly what they receive and transmit at their external interfaces, so your RAM and disk drives may well use ASCII internally, just as some tape drives do.)
Just as ASCII, though defined as an interchange format, is also (intentionally) suitable for internal use, so the TEI tag set, though formally defined for interchange, also tries to avoid features which might make it wholly unsuitable for local use. As Lou has explained, the output of sgmls, a public-domain SGML parser based on the ARCSGML engine, can be used quite profitably with existing pipeline tools to do useful work. So I submit to you that the answer to your question is yes, it *is* feasible to do basic analysis and manipulation using the TEI format. Personally, I think it's more feasible, and easier, with SGML than with any other text format I know, because the logical model of text supported by SGML (and represented in the element structure information set, or ESIS, mentioned by Lou) is so much stronger and more informative than the simple (simplistic) stream model supported by many other formats. >developping yet another freq- and kwic-program, we'ld like to know if >there are any programs available to do this kind of analysis on the >basis of SGML-tagged texts? Are there tools which allow me to deal >with an SGML-tagged text as easily as I can manipulate a vertikalized >text with grep, awk, icon and similar tools? Not yet, but I am confident that they are coming. If those now developing stream-oriented tools for their own use can be induced to share them with others, we can have a useful library in very short order. Obviously one useful filter would take an sgmls data stream and verticalize it so one can work with it using other existing tools. >Has anybody *worked* with annotated texts in TEI-format? 
Among the projects which have worked with texts using SGML and tag sets taken from or based on the first version of the TEI guidelines are the British National Corpus, the Stockholm-Umeaa Swedish Text Corpus, and the Brown Women Writers Project; the Perseus Project at Harvard uses SGML (not specifically TEI-derived) internally, for all manipulation and annotation, and distributes HyperCard materials derived from the SGML. >It is our understanding that the final product will be an >TEI-conformant SGML-tagged text. The question is whether this is the >right format to use during data preparation. This depends on your local environment, as Lou has pointed out. In general, if you are making a choice more or less from scratch, and are not already locked in to a format, I would urge you very strongly to consider using SGML as your local storage format, from which you can transduce data into application-specific formats required by existing applications. The list of application programs which understand an ad hoc vertical format will grow only as you yourselves, or other users of your data, write programs for it. The list of application programs which understand an SGML file will grow, and grow, and continue to grow, independent of your local software development budget. Storing your data in SGML is the best way to ensure that you can use these applications as they arrive. Commercial vendors very seldom aim to support research work with their software; but when they support SGML, researchers using suitably designed SGML tag sets will be able to exploit commercial software much more easily than is now the case. Isn't it about time we were able to take advantage of commercial software development instead of being forced to develop everything on our own because the research-oriented market is too small for commercial developers to take seriously? -C. M. 
Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago

From corpora-request@uib.no Fri Dec 4 21:48:54 1992
        id <21750-0@alf.uib.no>; Fri, 4 Dec 1992 20:47:12 +0100
Date: Fri, 4 Dec 1992 20:48:54 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: reply to FELDWEG data format of annotated corpora

Send-date: Fri, 4 Dec 1992 15:50:18 UTC+0100
From: (Winfried Bader)
Subject: reply to FELDWEG data format of annotated corpora

Because I think the discussion about the data format of annotated corpora is of general interest, I am answering Mr. Feldweg here on the list, even though his institute is only a few hundred meters from mine; until now we have not been in contact.

There is no question: using a generic markup like SGML or TEI is always better than a verticalized format (as Lou Burnard and Sperberg-McQueen mentioned in their contributions). My experience is that it is not necessary to follow the proposals of SGML and TEI exactly, but it is necessary to apply their underlying idea very strictly: that is, you have to think about the structure of the *content* of your text, and then tag each element of your text that has a distinct content (e.g. object language, reference, grammatical analysis, etc.) with an unequivocal sign or string (e.g. like SGML <....> ). Then you can be sure that later you can do whatever you want with this text by converting it with a program. I have experience of this in several projects in different fields: text-critical editions, lexicography, linguistics.
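As one hedged illustration of such a conversion program, here is a minimal Python sketch of the filter suggested earlier in this thread: it takes an sgmls-style data stream and verticalizes it, one wordform per line, with the identifier of the enclosing sentence as an annotation. Everything specific here is an assumption for the sake of the example: the sample stream is hand-written, the attribute lines are assumed to have the form "Aname TYPE value", and the names SAMPLE_ESIS and verticalize are invented.

```python
# Hand-written stand-in for sgmls output (an assumption, not a real run).
SAMPLE_ESIS = r"""(P
AID TOKEN S1
(S
-The cat sat on the\nhippopotamus.
)S
AID TOKEN S2
(S
-The hippopotamus did not mind.
)S
)P
C
"""


def verticalize(esis_text, unit="S", id_attr="ID"):
    """Yield (token, unit_id) pairs, one per white-space separated token.

    `unit` is the element type whose identifier should label each token;
    tokens outside any such element are labelled "-".
    """
    pending = {}     # attributes announced for the next start-tag
    open_ids = []    # stack of identifiers of the open `unit` elements
    for line in esis_text.splitlines():
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            name, _, value = rest.partition(" ")
            kind, _, data = value.partition(" ")
            if kind != "IMPLIED":      # implied attributes carry no value
                pending[name] = data
        elif code == "(":
            if rest == unit:
                open_ids.append(pending.get(id_attr, "-"))
            pending = {}               # attributes belong to one tag only
        elif code == ")":
            if rest == unit and open_ids:
                open_ids.pop()
        elif code == "-":
            unit_id = open_ids[-1] if open_ids else "-"
            # "\n" in the data marks a record end; treat it as a space.
            for token in rest.replace("\\n", " ").split():
                yield token, unit_id


if __name__ == "__main__":
    for token, unit_id in verticalize(SAMPLE_ESIS):
        print(f"{token}\t{unit_id}")
```

The tab-separated output can then be handed to grep, awk and similar line-oriented tools, while the SGML file remains the master copy.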
In volume 4, 1992 of the periodical Historical Social Research (which comes out at the beginning of 1993) there will be an article treating this theme.

Feldweg asked for KWIC and frequency programs. A very strong tool for this kind of analysis, including on SGML-tagged texts, is the TUebingen System of Text-Processing Programs, TUSTEP. TUSTEP is a very strong tool not only for KWIC and frequency analysis but also for data retrieval in tagged texts and for conversion programs, and it runs identically under UNIX and MS-DOS (also: VMS, MVS, VM, BS2000). In your institute, Helmut Feldweg, TUSTEP is in use for exactly this purpose.

For further information please contact me.

Winfried Bader
Tuebingen University - Center for Data Processing
Brunnenstrasse 27
7400 Tuebingen
email: bader@mailserv.zdv.uni-tuebingen.de
phone: +49 - 7071 - 29 6973
FAX: +49 - 7071 - 29 5912

From corpora-request@uib.no Wed Dec 9 14:59:27 1992
        id <06006-0@alf.uib.no>; Wed, 9 Dec 1992 13:57:38 +0100
Date: Wed, 9 Dec 1992 13:59:27 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: data format of annotated corpora (2 msgs/123 lines)

1) ===========================================================================
Send-date: Mon, 7 Dec 1992 9:58:31 UTC+0100
From: (Helmut Feldweg)
Subject: Re: data format of annotated corpora (2 msgs/297 lines)

2) ===========================================================================
Send-date: Sat, 5 Dec 1992 11:55:07 UTC
From: ZWEIG
Subject: RE: Re: data format of annotated corpora (2 msgs/297 lines)

1) ===========================================================================
Send-date: Mon, 7
Dec 1992 9:58:31 UTC+0100
From: (Helmut Feldweg)
Subject: Re: data format of annotated corpora (2 msgs/297 lines)

Lou Burnard writes:
> As European editor of the TEI I am rather depressed by the fact that the
> TEI has clearly been so unsuccessful in getting across the nature of
> its objectives! I hope you will allow me to elaborate a little on the
> subject. First, the TEI is very emphatically a format for information

I think there is no reason for Lou Burnard to be "rather depressed". As I said in my original posting, the final textual product of our enterprise will be a TEI-conformant SGML-tagged text. That should make Lou happy.

The question was not whether a verticalized format is better than SGML/TEI. I was already convinced that SGML/TEI is the better system! My concern was, and still is, the feasibility of *working* with this format *here and now*, and I have learned from the responses to my posting that, whereas some software is already around which allows processing of SGML texts by converting the SGML format to some other format (like sgmls and TUSTEP), software with direct support of this format is still to come.

Helmut Feldweg
Seminar f"ur Sprachwissenschaft, Universit"at T"ubingen
Wilhelmstr. 113, D-7400 T"ubingen 1, Germany
email: feldweg@mailserv.zdv.uni-tuebingen.de
       feldweg@bach.sns.neuphilologie.uni-tuebingen.de
phone: +49 (0)7071 29-4279

2) ===========================================================================
Send-date: Sat, 5 Dec 1992 11:55:07 UTC
From: ZWEIG
Subject: RE: Re: data format of annotated corpora (2 msgs/297 lines)

In the message of 4 Dec 1992, Lou Burnard says:
>Use the public-domain SGML parser sgmls ...
>... (which you can download from lots of ftp servers, not
>excluding this one, I think!)

and -C. M. Sperberg-McQueen:
> As Lou has explained,
>the output of sgmls, a public-domain SGML parser based on the ARCSGML
>engine, can be used quite profitably...
I could not find sgmls on fileserv@nora.hd.uib.no or on listserv@uicvm.bitnet. Could someone tell me where I could find it?

Thanks a lot,
Pierre Zweigenbaum

Bitnet: zweig@frsim51
Tel: (+33) 1 45.83.67.28
Fax: (+33) 1 45 86 56 85
DIAM (Departement Intelligence Artificielle et Medecine)
INSERM U.194 & Service d'Informatique Medicale
91, bd de l'Hopital
F-75635 Paris Cedex 13
----------------------------------------------------------------------------
[ FROM LIST ADM.: Here are the results of a search with Archie for FTP sites:

Host ccadfa.cc.adfa.oz.au
  Location: /pub/other
    FILE -rw-r--r-- 352575 Oct 14 11:44 sgmls-1.0.tar.Z

Host dsrbg2.informatik.tu-muenchen.de
  Location: /physik/ftp.uu.net
    FILE -rw-r--r-- 352575 Oct 9 19:27 sgmls-1.0.tar.Z

Host ftp.uu.net
  Location: /pub/text-processing/sgml
    FILE -rw-r--r-- 352575 Sep 28 16:59 sgmls-1.0.tar.Z
    FILE -rw-r--r-- 79400 Sep 28 16:53 sgmls1_0.zip

Host ifi.uio.no
  Location: /pub/SGML/SGMLS
    FILE -rw-r--r-- 352575 Oct 6 23:05 sgmls-1.0.tar.Z
    FILE -rw-r--r-- 79400 Oct 6 23:07 sgmls1_0.zip

Host mailer.cc.fsu.edu
  Location: /pub/sgml/SGMLS
    FILE -rw-r--r-- 352575 Nov 3 16:30 sgmls-1.0.tar.Z

Host pinus.slu.se
  Location: /pub/text-processing/sgml
    FILE -r--r--r-- 352575 Sep 28 16:59 sgmls-1.0.tar.Z
    FILE -r--r--r-- 79400 Sep 28 16:53 sgmls1_0.zip

Host reseq.regent.e-technik.tu-muenchen.de
  Location: /physik.archive/ftp.uu.net
    FILE -rw-r--r-- 352575 Oct 9 18:27 sgmls-1.0.tar.Z

Host rs3.hrz.th-darmstadt.de
  Location: /pub/text/sgml/sgmls
    FILE -rw-rw-r-- 352575 Sep 28 22:59 sgmls-1.0.tar.Z
    FILE -rw-rw-r-- 79400 Sep 28 22:53 sgmls1_0.zip

Host rusmv1.rus.uni-stuttgart.de
  Location: /.serv2/soft/mac/tips/unix
    FILE -rw-r--r-- 352575 Nov 18 20:16 sgmls-1.0.tar.Z

Host src.doc.ic.ac.uk
  Location: /text/sgml/SGMLS
    FILE -r--r--r-- 352575 Nov 3 16:30 sgmls-1.0.tar.Z

Host unix.hensa.ac.uk
  Location: /pub/uunet/pub/text-processing/sgml
    FILE -rw-r--r-- 352575 Sep 28 21:59 sgmls-1.0.tar.Z
    FILE -rw-r--r-- 79400 Sep 28 21:53 sgmls1_0.zip
]

From
corpora-request@uib.no Thu Dec 10 01:55:37 1992
id <07769-0@alf.uib.no>; Thu, 10 Dec 1992 00:53:48 +0100
Date: Thu, 10 Dec 1992 00:55:37 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Reply to FELDWEG

Send-date: Wed, 9 Dec 1992 17:42:37 UTC+0100
From: (Winfried Bader)
Subject: Reply to FELDWEG

Dear Mr Feldweg,

let us continue discussing via Bergen. I think in your new reply you misunderstood a sentence of my message.

> A very strong tool to do this kind of analysis also with SGML-tagged texts
> is the TUebingen System of Text-Processing Programs TUSTEP. Not only for
> KWIC and frequency analysis but also for data retrieval in tagged texts
> and for converting programs TUSTEP is a very strong tool which runs under
> UNIX and MS-DOS identically (also: VMS, MVS, VM, BS2000).

I wanted to express two things in one sentence and you didn't get it.

First: TUSTEP is a tool - not a ready-to-use program - for making a KWIC index very easily and rapidly, and - if you want, and if you give some parameters to your TUSTEP KWIC program - it can use a file in SGML format as data input. The only condition is that the tagging of your text keeps all the information you want to interpret with the KWIC and frequency programs.

Second: TUSTEP is not only a tool for KWIC. You can do a lot of things with it: comparing versions, preparing editions, writing algorithmic programs, and - this was the second thing I said in my sentence - you can write conversion programs with TUSTEP, if you need to do so.

You write:
> software with direct support of this
> format is still to come.
The question which arises here is what you mean by "direct support". Do you want to push one button and get one result which the program designer has decided you need, or is it the work of a scholar to ask new questions and therefore to find new solutions (with the help of available tools)?

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Winfried Bader
University of Tuebingen - Center for Data Processing
Brunnenstrasse 27
D-7400 Tuebingen
email: bader@mailserv.zdv.uni-tuebingen.de
phone: +49 - 7071 - 29 6973
FAX : +49 - 7071 - 29 5912
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From corpora-request@uib.no Thu Dec 10 01:55:51 1992
id <07775-0@alf.uib.no>; Thu, 10 Dec 1992 00:54:03 +0100
Date: Thu, 10 Dec 1992 00:55:51 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: where to find sgmls

Send-date: Wed, 9 Dec 1992 18:08:44 UTC+0100
From: (Helmut Feldweg)
Subject: where to find sgmls

Pierre Zweigenbaum wrote:
>
> I could not find sgmls on fileserv@nora.hd.uib.no or on listserv@uicvm.bitnet.
> Could someone tell me where I could find it?
>
> Thanks a lot,
>

'sgmls' is available via anonymous ftp at scheria.nmsu.edu (alias clr.nmsu.edu, 128.123.1.12) in the directory pub/tools/sgml/sgmls.

--
Helmut Feldweg
Seminar f"ur Sprachwissenschaft, Universit"at T"ubingen
Wilhelmstr.
113, D-7400 T"ubingen 1, Germany
email: feldweg@mailserv.zdv.uni-tuebingen.de
       feldweg@bach.sns.neuphilologie.uni-tuebingen.de
phone: +49 (0)7071 29-4279

From corpora-request@uib.no Wed Dec 16 02:46:59 1992
id <14140-0@alf.uib.no>; Wed, 16 Dec 1992 01:45:10 +0100
Date: Wed, 16 Dec 1992 01:46:59 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Query: Software tools for bi/multilingual corpora

Send-date: Tue, 15 Dec 1992 13:47:22 UTC+0300
From:
Subject: Query: Software tools for bi/multilingual corpora

Dear Colleagues,

Currently, I am working on semiotic aspects of language for special purposes (LSP) including, but not limited to, technical communication and translation, computationally tractable methods for writers and translators, natural language processing (parsing) and computer-assisted translation and composition.

The ability to somehow compare texts in DIFFERENT languages is of crucial importance in translation studies; i.e. linking together parts of texts which would be considered to be correspondences on a sentential and a suprasentential level, as well as on a global text level, within a bilingual or multilingual corpus. Does anybody know of any available software tools for this purpose?

I am aware of the work by Eugenio Picchi and his group in Pisa, by Ken Church et al. at AT&T, and by Poul Soeren Kjaersgaard in Odense. The work done by Victor Sadler and the DLT group in Utrecht on a Bilingual Knowledge Bank is also very interesting. However, none of these seems to be immediately applicable to studies in LSP.
I do have some notion of how such linking could be done, but that would require extensive programming, for example in Prolog. That is why it seems reasonable first to make a few enquiries and see if any tools of this kind already exist.

Yours,
Arne Larsson
-------------------------------------------------------------------------
Arne Larsson                   Nokia Telecommunications
Translator                     Transmission Systems, Customer Services
larsson@ntc02.tele.nokia.fi    P.O. Box 12, SF-02611 Espoo, Finland
Phone +358 0 5117476, Fax +358 0 51044287
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

From corpora-request@uib.no Wed Dec 16 12:12:56 1992
id <10905-0@alf.uib.no>; Wed, 16 Dec 1992 11:11:03 +0100
Date: Wed, 16 Dec 1992 11:12:56 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Query: Software tools for bi/multilingual corpora

Send-date: Wed, 16 Dec 1992 9:11:49 UTC
From: P.Holmes-Higgin
Subject: Re: Query: Software tools for bi/multilingual corpora

As part of the Translators Workbench (TWB) ESPRIT project we are extending the facilities of our Machine-Assisted Terminology Elicitation (MATE) system to take advantage of parallel texts in different languages, to help establish foreign-language equivalences and so on. We are using the term "shadow corpora" to cover this work - there is an original text which may have several translations (shadows).

Currently MATE uses a variety of concordance, collocation and frequency techniques to establish LSP from corpora of technical texts - our aim is to extend its functionality over the next year to cover shadow corpora.
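For readers new to the concordance and frequency techniques mentioned in this thread, here is a minimal sketch of a keyword-in-context (KWIC) listing and a frequency count. It is not how MATE or TUSTEP works (MATE is written in Prolog); whitespace tokenisation, the function name kwic and the sample sentence are assumptions made purely for illustration.

```python
from collections import Counter

def kwic(tokens, keyword, width=3):
    """Yield (left context, keyword, right context) for every occurrence."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, tok, right

# Hypothetical sample text, tokenised naively on whitespace.
text = "the corpus is tagged and the tagged corpus is stored"
tokens = text.split()

freq = Counter(tokens)  # word-form frequencies
for left, key, right in kwic(tokens, "corpus", width=2):
    print(f"{left:>20} [{key}] {right}")
```

A real tool would add proper tokenisation, sorting of the contexts and, for tagged corpora, a step that separates markup from text before counting.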
MATE has been developed using Prolog and runs on Unix under X-Windows. A subset of the tools has been ported to PCs under MS Windows, and we hope to port the full-function system over the next year.

Needless to say, we are keen to hear from anyone with ideas on how to exploit "shadow corpora". I am the principal system developer, so queries or suggestions of a linguistic nature are best directed to Andrea Davies here (A.Davies@mcs.surrey.ac.uk).

Many thanks,
Paul.
--
Paul Holmes-Higgin                JANET: P.Holmes-Higgin@surrey.ac.uk
Artificial Intelligence Group
Department of Mathematical and Computing Sciences
University of Surrey
Guildford GU2 5XH
England

From corpora-request@uib.no Thu Dec 17 23:42:32 1992
id <07109-0@alf.uib.no>; Thu, 17 Dec 1992 22:40:39 +0100
Date: Thu, 17 Dec 1992 22:42:32 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Query: Software tools for bi/multilingual corpora

Send-date: Wed, 16 Dec 1992 12:30:42 UTC-0500
From: (John Fought)
Subject: Re: Query: Software tools for bi/multilingual corpora

I have no helpful suggestions at the moment, but I am most interested in hearing from you as the work goes on. I have wanted to do something quite a lot like your project, and (naturally) I think it is a fine plan. When some of the dust settles here after the holidays, perhaps we can email at greater length. Meanwhile, anything you can send me about the project would be welcomed.

John Fought
Director, Language Analysis Center
Univ.
of Pennsylvania

From corpora-request@uib.no Mon Dec 28 09:26:45 1992
id <27345-0@alf.uib.no>; Mon, 28 Dec 1992 08:24:46 +0100
Date: Mon, 28 Dec 1992 08:26:45 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Query: Software tools for bi/multilingual corpora

Send-date: Fri, 18 Dec 1992 9:45:49 UTC+0200
From:
Subject: Re: Query: Software tools for bi/multilingual corpora

I cannot answer your question but got interested in your research. I've recently finished a contrastive textlinguistic analysis of LSP texts in Finnish and English (academic texts) and would be very interested in seeing how far, if at all, the kinds of analytical tools I have used in a mainly qualitative study could be applied to corpus study. So I'd like to know what sorts of variables you have in mind at suprasentential and global text levels.

Anna Mauranen
researcher
Language Centre for Finnish Universities
University of Jyväskylä

From corpora-request@uib.no Mon Dec 28 09:26:58 1992
id <27349-0@alf.uib.no>; Mon, 28 Dec 1992 08:24:59 +0100
Date: Mon, 28 Dec 1992 08:26:58 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: overview of formats?

Send-date: Mon, 21 Dec 1992 14:42:17 UTC-0600
From: stan kulikowski ii <@livid.uib.no:STANKULI@UWF.bitnet>
Subject: overview of formats?

corpora cognoscente,

forgive me if this seems naive, but this seems like the place to ask this. if not, please give me a nudge in the proper direction.

i am in the process of preparing a series of proposals which will include the creation of a corpus of textual data. i work in education and i expect to collect machine-readable school materials from K-12 students in several countries and several languages. i expect to work with several 100 megabytes of material (probably less than a gigabyte), some of it collected from student network activities (both WAN and LAN), and some of it scanned in from hardcopy materials.

i joined this list to learn the technical features of corpus construction, but going has been slower than i hoped. in the past weeks i have followed up on your source references to sgmls, at least enough to find the ISO 8879 references to 'standard generalized markup language'. i presume i can follow this onward to determine if it will help me collect and distribute my work on the networks.

i have looked through the corpora file index for titles which might provide me with an overview of the common formats for a textual corpus. i am hoping that someone out there will assist me with this. how do i look up the meaning of 'TEI-compliance' or 'TUSTEP' which i have seen referred to here? do you all have an intro textbook on these things? or is there a network source for this kind of information?
i need something that gives an overview to determine what will fill my needs then pointers to enough technical details to follow through with implementation.

thanks for any assistance you may give me,

                                                   stan
                                                   stankuli@UWF.bitnet
  .
 ===   we all help each other get a little further down the road,
 : :   or be damned for the fools that we are.
 ---                    -- the motorcycle modificationist's motto

From corpora-request@uib.no Tue Dec 29 17:01:50 1992
id <17865-0@alf.uib.no>; Tue, 29 Dec 1992 15:59:51 +0100
Date: Tue, 29 Dec 1992 16:01:50 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Announcement of a List on NL Processing In Turkish

Send-date: Mon, 28 Dec 1992 12:13:26 UTC+0200
From: ko (Kemal Oflazer)
Subject: Announcement of a List on Natural Language Processing In Turkish

Dear Colleagues

We announce the formation of a mailing / discussion list for natural language processing and computational linguistics studies on the Turkish language. Detailed information follows.

Kemal Oflazer                      Cem Bozsahin
Bilkent University                 Middle East Technical Univ.
Computer Engineering Department    Computer Engineering Dept
Bilkent, ANKARA, 06533 TURKIYE     Ankara, TURKIYE
e-mail: ko@trbilun.bitnet          bozsahin@trmetu.bitnet
fax: (90) 4 - 266-4126
tel: (90) 4 - 266-4133

-------------------------------------------
Turkish Natural Language Processing Discussion Group

The purpose of this list is to form a discussion group on natural language processing (nlp) studies on the Turkish language.
We welcome all submissions that are on, or related to,

(a) computer-based analysis or synthesis of Turkish,
(b) application of linguistic theories to the language,
(c) linguistic tools and their applicability,
(d) implications/adaptation of current computational linguistic models to Turkish,
(e) announcements of relevant events (seminars, colloquia, etc.),
(f) announcements of software tools and databases such as parsers, morphological analyzers, MRDs and lexicons, Turkish text corpora, etc.

The list is not moderated at this time. Contributions may be in Turkish, English or any other language that may find an audience in the group.

To subscribe, please send a message to:
    listserv@trmetu.bitnet
with
    sub bildil
in its body.

To post articles, send your message to:
    bildil@trmetu.bitnet

----- End Included Message -----

From corpora-request@uib.no Tue Dec 29 17:02:03 1992
id <17876-0@alf.uib.no>; Tue, 29 Dec 1992 16:00:03 +0100
Date: Tue, 29 Dec 1992 16:02:03 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: PC-KIMMO specification for Turkish Morphology

Send-date: Mon, 28 Dec 1992 15:01:48 UTC+0200
From: ko (Kemal Oflazer)
Subject: PC-KIMMO specification for Turkish Morphology

A full-scale two-level description of Turkish morphology, based on 24K root words and implemented using PC-KIMMO, is now available from the Bilkent University Archive Server (bilserv@trbilun.bitnet).
To get a copy of this description, send mail to bilserv@trbilun.bitnet with contents

    send turklex.tar.Z

The UNIX version 1.08 of the public domain program PC-KIMMO is also available from the same archive using the commands

    send pckimmo.tar.Z
    send pckimmo.man.Z

To get more information about Bilkent Archive Server send the command

    send help

to the same server. Please let us know of any problems.

Kemal Oflazer
Bilkent University
Computer Engineering Department
Bilkent, ANKARA, 06533 TURKIYE
e-mail: ko@trbilun.bitnet
fax: (90) 4 - 266-4126
tel: (90) 4 - 266-4133

From corpora-request@uib.no Thu Dec 31 09:01:38 1992
id <25985-0@alf.uib.no>; Thu, 31 Dec 1992 07:59:37 +0100
Date: Thu, 31 Dec 1992 08:01:38 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: overview of formats?

Send-date: Tue, 29 Dec 1992 16:10:07 UTC
From: (Peter Flynn)
Subject: Re: overview of formats?

Stan writes:
>
> corpora cognoscente,
>

"Corporaphiliacs", perhaps :-)

> forgive me if this seems naive, but this seems like the place to ask this.
> if not, please give me a nudge in the proper direction.

Nope, exactly the right place to ask, IMHO.

> i am in the process of preparing a series of proposals which will include
> the creation of a corpus of textual data. i work in education and i expect
> to collect machine-readable school materials from K-12 students in several
> countries and several languages. i expect to work with several 100 megabytes
> of material (probably less than a gigabyte), some of it collected from
> student network activities (both WAN and LAN), and some of it scanned in from
> hardcopy materials.
This sounds fascinating: I can think of lots of people who would like to see this kind of project developed further. Do you have any kind of prospectus or project description yet?

> i joined this list to learn the technical features of corpus construction,
> but going has been slower than i hoped. in the past weeks i have followed
> up on your source references to sgmls, at least enough to find the ISO 8879
> references to 'standard generalized markup language'. i presume i can follow
> this onward to determine if it will help me collect and distribute my work
> on the networks.

IMHO (again) SGML would be a good way to proceed. You can certainly plow onwards by yourself, but it's heavy going in the early stages. I found there was no substitute for sitting down with some of the more experienced people and discussing the project in detail. A pity you missed the SGML '92 meeting in October: I don't know when the next gathering of cognoscenti will be, but the ALLC/ACH 93 meeting is in Georgetown, DC in early June, and there will be a lot of expertise there (which I hope to tap...)

> i have looked through the corpora file index for titles which might provide
> me with an overview of the common formats for a textual corpus. i am hoping
> that someone out there will assist me with this. how do i look up the
> meaning of 'TEI-compliance' or 'TUSTEP' which i have seen referred to here?

There is a list, TEI-L@UICVM (listserv), for the Text Encoding Initiative which is worth joining. I don't know of a list for TUSTEP (what's TUSTEP?)

SGML is a non-trivial concept, but worth the effort. You will need some bucks for software and hardware, possibly quite a lot of them, and a lot of manual help to get the text into usable form.

Please keep us in touch with developments.
///Peter

From corpora-request@uib.no Sun Jan 3 08:03:43 1993
id <26330-0@alf.uib.no>; Sun, 3 Jan 1993 07:01:41 +0100
Date: Sun, 3 Jan 1993 07:03:43 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: overview of formats? (4 msgs/229 lines)

1) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 11:51:07 UTC+0300
From:
Subject: Re: overview of formats?

2) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 8:54:15 UTC
From: W Schipper
Subject: TUSTEP

3) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 10:35:23 UTC-0600
From: C. M. Sperberg-McQueen
Subject: Re: overview of formats?

4) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 12:53:09 UTC-0500
From: (Heather Davenport)
Subject: Re: overview of formats?

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 11:51:07 UTC+0300
From:
Subject: Re: overview of formats?

Stan wrote:

[...stuff left out...]
> i joined this list to learn the technical features of corpus construction,
> but going has been slower than i hoped. in the past weeks i have followed
> up on your source references to sgmls, at least enough to find the ISO 8879
> references to 'standard generalized markup language'. i presume i can follow
> this onward to determine if it will help me collect and distribute my work
> on the networks.
[...]
To everybody who would like to use SGML in real life, I would suggest a book by Eric van Herwijnen of CERN:

    van Herwijnen, Eric, Practical SGML, Kluwer Academic Publishers, Dordrecht, 1990. ISBN 0-7923-0635-X.

The author covers a lot of ground, from the basics to SGML applications, handling math and graphics, and implementations, all the way to using SGML for databases, CALS and EDI. The book even contains an Appendix E entitled 'How to read ISO 8879'.

I am myself setting out to do corpus work in translation studies using a bi/multilingual corpus consisting of aligned texts in technical sublanguages (Finnish, Swedish, English, perhaps also German), so I would greatly appreciate being kept up to date with Stan's work (if possible).

Yours,
Arne
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*
Arne Larsson                   Nokia Telecommunications
Translator                     Transmission Systems, Customer Services
larsson@ntc02.tele.nokia.fi    P.O. Box 12, SF-02611 Espoo, Finland
Phone +358 0 5117476, Fax +358 0 51044287
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*

2) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 8:54:15 UTC
From: W Schipper
Subject: TUSTEP

TUSTEP is a text analysis program developed in the 70s at Tuebingen, Germany, under the direction of Prof. Wilhelm Ott. Originally developed to run on mainframes, it is now being rewritten for microcomputers as well.

WS
--
.......................................................................
W. Schipper               Email: schipper@morgan.ucs.mun.ca
Department of English,    Tel: 709-737-4406
Memorial University       Fax: 709-737-4000
St John's, Nfld. A1C 5S7
........................................................................

3) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 10:35:23 UTC-0600
From: C. M. Sperberg-McQueen
Subject: Re: overview of formats?
On Mon, 28 Dec 1992 08:26:58 +0100 Stan Kulikowski said:
> i have looked through the corpora file index for titles which might provide
>me with an overview of the common formats for a textual corpus. i am hoping
>that someone out there will assist me with this. how do i look up the
>meaning of 'TEI-compliance' or 'TUSTEP' which i have seen referred to here?
>do you all have an intro textbook on these things? or is there a network
>source for this kind of information? i need something that gives an overview
>to determine what will fill my needs then pointers to enough technical
>details to follow through with implementation.

Don't know if there is a network source other than the one you're using (viz. the rest of us out here), but I can at least tell you that the Text Encoding Initiative markup language is defined in its publication Guidelines for Electronic Text Encoding and Interchange (TEI P2), which is being published chapter by chapter as the chapters become ready. New chapters are announced, and relevant topics may be discussed, on the list TEI-L@UICVM (or on the internet TEI-L@uicvm.uic.edu), which is a Listserv list which functions much the way this one (corpora-L) does. To subscribe, send mail to LISTSERV@UICVM containing the line

    subscribe tei-l Stan Kulikowski

(those of you other than Mr. Kulikowski are encouraged, of course, to use your own names rather than his).

The formal definition of TEI compliance is to be included in a chapter which is not yet published (and, given the press of other chapters, probably won't be published for several months), but the TEI markup language can be used without reference to strict definitions of compliance. If you really really want to know about compliance issues, send me a note asking for a copy of document TEI ML W43, which contains the current definition of TEI conformance.
TUSTEP is the TUebingen System for TExt Processing, developed over the past twenty years or so by Wilhelm Ott and his associates at the University of Tuebingen. It is a toolkit of small functional programs a lot like Unix filters; it has a great many virtues, including a commitment to modularity and generality of tools which anyone must admire. Like Unix, however, it seems to inspire and possibly to require an almost religious commitment in its users; most problematic for me personally is that because it uses its own unique file structure, Tustep files can be reliably edited only with the Tustep editor. The circumstances of its development make this design feature perfectly understandable, but still it makes me nervous. It runs under a number of operating systems including DOS and some mainframe systems, and for further information you can contact Wilhelm Ott at Tuebingen. My email address for him is Ott@mailserv.zdv.uni-tuebingen.de --- but this may be out of date.

Hope this is some help. If there is some more general treatment of formats in common use for corpora, I hope someone will mention it.

-C. M. Sperberg-McQueen
ACH / ACL / ALLC Text Encoding Initiative
University of Illinois at Chicago

4) --------------------------------------------------------------------------
Send-date: Thu, 31 Dec 1992 12:53:09 UTC-0500
From: (Heather Davenport)
Subject: Re: overview of formats?

(This was originally sent only to Stan Kulikowski in response to his "overview of formats" query; however, I thought there might be others out there who might want some of this info...)

In regards to...

| 'standard generalized markup language' ...
| i presume i can follow this onward to determine if it will help me collect
| and distribute my work on the networks.

This is indeed the single most important aspect of your corpus creation. To ensure longevity and interchange capability it is vital that you use a standard, easy-to-use format... which is SGML.
| how do i look up the
| meaning of 'TEI-compliance' or 'TUSTEP' which i have seen referred to here?
| do you all have an intro textbook on these things? or is there a network
| source for this kind of information? i need something that gives an overview...

Ok, first of all, the Text Encoding Initiative (TEI) is a body that puts out _guidelines_ for tagging certain kinds of texts, such as poetry, tables of contents, etc. So don't worry so much about being TEI-conformant ... concentrate on being SGML-conformant. Inevitably, the TEI guidelines are somewhat subjective, although they can be helpful. This year's edition has not come out yet; however, it's called _Guidelines for the Encoding and Interchange of Machine-Readable Texts_, and Michael Sperberg-McQueen and Lou Burnard are the editors.

As for an intro book, the BIBLE for SGML is Charles F. Goldfarb's _The SGML Handbook_, which you cannot be without. He includes the entirety of ISO 8879, with comments (Clarendon Press, Oxford, 1990. ISBN: 0-19-853737-9). However, it's fairly technical and I would suggest _also_ getting Eric van Herwijnen's _Practical SGML_, and reading them simultaneously (Kluwer Academic Publishers, London, 1990. ISBN: 0-7923-0635-X).

For network information, there's a newsgroup called "comp.text.sgml" which you can post to, plus an ftp site at "ftp.ifi.uio.no".

| ...to collect machine-readable school materials from K-12 students in several
| countries and several languages.

Just wanted to add that since you're in the tricky business of multilingual texts, be very careful with your 8-bit (non-ASCII) characters. If you don't have 8-bit fonts available, then make sure whatever character mappings you use are _consistent_, so that they can be mapped over easily later. Also, if you can, use the ISO 8-bit sets that are designed specifically for this purpose (Goldfarb lists most of them in the back of his book), because this will be a _very_ important part of your texts being SGML conformant.
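A minimal sketch of why consistent mappings matter: if a 7-bit convention is applied uniformly, one small table is enough to map it over to real characters later. The convention assumed here (a double quote before a vowel marks an umlaut, as in the  Universit"at T"ubingen  of Helmut Feldweg's signature in this archive) and the names UMLAUTS and restore_umlauts are illustrative assumptions, not a standard.

```python
# Assumed 7-bit convention: '"' + vowel stands for the umlauted vowel.
# Other archives use other conventions; the table must match the data.
UMLAUTS = {'"a': "ä", '"o': "ö", '"u': "ü", '"A': "Ä", '"O': "Ö", '"U': "Ü"}

def restore_umlauts(text: str) -> str:
    """Replace every 7-bit umlaut sequence with the real character."""
    for ascii_form, char in UMLAUTS.items():
        text = text.replace(ascii_form, char)
    return text

print(restore_umlauts('Seminar f"ur Sprachwissenschaft, Universit"at T"ubingen'))
```

The sketch also shows the risk of an ambiguous convention: an ordinary quotation mark that happens to precede one of these vowels would be converted as well, which is one reason to prefer the ISO character sets or SGML entity references where possible.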
And lastly, | ...in the past weeks i have followed up on your source references to sgmls... Just in case you didn't know, "sgmls" is a free SGML parser, which you can ftp from the site I mentioned above. Hope this helped a little. Or maybe you knew all that. If so, happy tagging. Cheers, Heather Davenport Tagging Coordinator Language Analysis Center: Univ. of Pennsylvania Univ. City Science Center 3700 Market St., Suite 202 Ph: 215-898-2988 Philadelphia, PA USA 19104-3147 Fx: 215-573-2126 email: heather@apollo.lap.upenn.edu From corpora-request@uib.no Tue Jan 5 16:22:22 1993 id <14435-0@alf.uib.no>; Tue, 5 Jan 1993 15:20:19 +0100 Date: Tue, 5 Jan 1993 15:22:22 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: New gopher server Send-date: Mon, 4 Jan 1993 11:02:02 UTC+0100 From: Jan-Gunnar Tingsell Subject: New gopher server There is now a new gopher server available at the Faculty of Arts, Göteborg University, Sweden. We try to give this gopher a "humanistic" profile, and we wish to link to other gophers of special interest to humanists. Any suggestions are welcome.
Host=vinga.hum.gu.se Port=70 /Jan-Gunnar Tingsell -- ****************************************************************** Jan-Gunnar Tingsell Humanistiska fakultetens dataservice tel: +46 (0)31 773 4553 Göteborgs universitet fax: +46 (0)31 773 4455 From corpora-request@uib.no Wed Jan 6 16:37:52 1993 id <12121-0@alf.uib.no>; Wed, 6 Jan 1993 15:35:51 +0100 Date: Wed, 6 Jan 1993 15:37:52 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: Pedagogic corpora questionnaire Send-date: Wed, 6 Jan 1993 13:42:45 UTC+0100 From: BARNBROG Subject: Pedagogic corpora questionnaire Postal address: School of English University of Birmingham Birmingham B15 2TT UK fax: UK +21 414 3600 e-mail: barnbrog@uk.ac.bham 6th January 1993 Dear Colleague, The development and ease of use of computer (especially PC) facilities in recent years means that there has been a move beyond the large carefully set-up corpora held on mainframes to a growing "cottage industry" of users who have set up their own corpora for their own purposes. These purposes may reflect an interest not simply in linguistic research but specifically in pedagogic ends. It is in the belief that there is a good number of such small-scale, personally assembled corpora out there that we are attempting to collate information about them, ultimately in a paper for publication. Hence the questionnaire which follows. Its aim is to collect information about the nature and structure of collections of text in machine-readable form and specifications of hardware and software tools.
This information will be available to interested parties and is intended to provide a basis for discussion and exchange. By corpus, we mean broadly a text collection. We are particularly interested in corpora of present-day English which are (or could be) used in the teaching of English as a Second or Foreign Language. If you are not personally involved in the compilation of such a machine-readable corpus, could you pass the survey to others or suggest their names to us? We would hope to complete the results of the survey by April 1993; depending on the extent of the response we may come back to you for more detail. We would like to thank you in advance for your help and we'd be happy to hear any suggestions from you. Geoff Barnbrook Philip King Survey questionnaire: machine-readable corpora of Modern English A. CORPUS PROFILE A1. By what name is the corpus known? A2. Who compiled the corpus? A3. Where was it compiled? (Institution) A4. Contact Address Telephone Fax E-mail A5. When did the compilation start? A6. What was the incentive for starting the compilation? B. COMPUTER FACILITIES AND SOFTWARE B1. How are texts entered? (word-processor, text-editor, typesetting tapes, optical scanning, other) B2. How is the corpus stored and in what format? B2.1. What computer facilities do you use? (IBM Personal Computer or compatible, Apple Macintosh - workstation - mainframe) B2.2. What software do you use for corpus processing? (please specify item and function: word frequency, concordancing of selected items etc.) B2.3. Do you use ready-made or customized software? B2.4. If you use your own software, which programming language do you use? B3. Do you use any special characters in addition to standard ASCII ones? If so, how do you handle them? - in input processing - in screen output - in printing B4. Do you have software for linguistic annotation (tagging, parsing, lemmatization)? If yes, specify C. TEXT DETAILS C1. How was the text acquired? C2.
How is the corpus organized? C3. Can you give some details of the content? C3.1. Written texts: C3.1.1. What genres are included in your collection? C3.1.2. What are the media of the original texts? (printed book, periodical, manuscript, ephemera, other) C3.1.3. Do you encode typographic and layout information? If so, specify C3.2. Spoken texts (transcriptions): C3.2.1. What genres are included in your collection? C3.2.2. What is the medium of the original source? (TV, radio, telephone, direct: talk, conversation, other) C3.2.3. Is the material spontaneous or not, surreptitious or not? C3.2.4. Do you encode information about speakers (e.g. age, sex) or about the recording? C3.2.5. What transcription system do you use? (phonetic, phonological, enhanced orthographical, orthographical) C4. What period do the texts in the corpus represent? from _____________ to ____________ C5. What is the total amount of data stored in your collection? - in bytes - in words - in minutes of spoken text recording C6. What use is made of the corpus? (specify, where appropriate) - to build up a multifunctional linguistic corpus - for lexicographic purposes - for literary research - for stylistic research - for preparation of a scholarly edition - for research in linguistics - for research in language learning/ teaching - for commercial applications - for natural language processing applications - other C7. Is it available to other interested parties? If so, under what conditions? D. VIEWS AND PERSPECTIVES: D1. Do you plan any changes in the composition of your corpus? D2. Are you planning to develop new text-handling software? D3. Are there any specialized areas for which a corpus approach is in your experience particularly useful? D4. Do you prefer a 'clean text' strategy (i.e. plain orthographic files) as opposed to annotated, phonologically coded, parsed etc. text? D5. 
Have you worked with, or have you considered working with, multilingual corpora or corpora containing 'parallel texts'? E. PUBLICATIONS: Please list any publications that you are aware of that were based on the corpus you have described From corpora-request@uib.no Wed Jan 6 16:37:40 1993 id <12108-0@alf.uib.no>; Wed, 6 Jan 1993 15:35:37 +0100 Date: Wed, 6 Jan 1993 15:37:40 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: RE:overview of formats? Send-date: Wed, 6 Jan 1993 11:57:42 UTC-0600 From: Subject: RE:overview of formats? Stan wrote: i am in the process of preparing a series of proposals which will include the creation of a corpus of textual data. i work in education and i expect to collect machine-readable school materials from K-12 students in several countries and several languages. i expect to work with several 100 megabytes of material (probably less than a gigabyte), some of it collected from student network activities (both WAN and LAN), and some of it scanned in from hardcopy materials. I agree this is a very worthwhile and interesting project. For the last year I have been trying to mark up 120,000 words of students' informal writing-to-learn mathematics. The idiosyncratic formatting of each piece of writing, the use of arrows to link bits of the text, the lack of sentence structure, and the use of personal ideograms have frustrated every attempt to reduce the text to a structured machine-readable form. The decision as to what information is preserved in the coding and what is lost depends on the use to be made of the corpus.
In education the main interest, it seems to me, is development. If this is the focus, then standard tools don't handle non-standard texts all that well. I am at present meeting this same problem in a more technical setting. I have been trying to plan the construction of a corpus of mathematical writing. To do this sensibly I need to preserve the semantics of the mathematical expressions within the text. I can't see any way of doing this. (As far as I can gather the group working on the TEI standards for coding mathematics is using a typesetting model rather than a structural model, as implemented in the rest of SGML. This is all pretty fuzzy to me - I will follow up the Van Herwijnen book.) Does anyone have any suggestions? I joined this list for much the same reasons as Stan. I'm not an expert but need to develop corpora to carry through with research questions. To date the list has focussed on where resources are. Stan's request focusses on the how of constructing resources. I would really appreciate seeing more of this sort of discussion. PS Stan, if you are interested in using Australian data, there are large amounts of K-6 writing on LANs in schools here. I have been wanting to collect some of this for a while and I would be happy to have a reason and a deadline for doing it. Andrew Waywood, | Phone : 61 3 563 3628 | Fax : 61 3 563 3605 Christ Campus, | E-mail : awaywood@christ.acu.edu.au Australian Catholic University. | Post : PO Box 213, Oakleigh, 3166.
From corpora-request@uib.no Tue Jan 12 14:09:11 1993 id <03642-0@alf.uib.no>; Tue, 12 Jan 1993 13:07:04 +0100 Date: Tue, 12 Jan 1993 13:09:11 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: 'empiricist' list Send-date: Wed, 6 Jan 1993 11:50:45 UTC-0800 From: (Jane Edwards) Subject: "empiricist" list [FORWARDED FROM "LN"] Date: Thu, 24 Dec 92 10:30:34 +0100 From: yarowsky@unagi.cis.upenn.edu (David Yarowsky) ************************************************************ Mailing List on Statistics, Natural Language, and Computing ************************************************************ We will be maintaining a special-purpose mailing list to provide a platform for - discussing technical issues, - distributing abstracts of new papers, - locating and sharing information, and - announcements (workshops, jobs) related to corpus-based studies of natural language, statistical natural language processing, methods that enable systems to deal with and scale up to actual language use, psycholinguistic evidence for the representation of distributional properties of language, as well as applications in such areas as information retrieval, human-computer interaction, and machine translation. Special care will be taken to keep uninformed or redundant messages to a minimum. To be added to or dropped from the distribution list send a message to empiricists-request@csli.stanford.edu. Contributions should go to empiricists@csli.stanford.edu.
Martin Roscheisen roscheis@cs.stanford.edu David Yarowsky yarowsky@unagi.cis.upenn.edu David Magerman magerman@watson.ibm.com Ido Dagan dagan@research.att.com From corpora-request@uib.no Tue Jan 12 14:09:27 1993 id <03646-0@alf.uib.no>; Tue, 12 Jan 1993 13:07:20 +0100 Date: Tue, 12 Jan 1993 13:09:27 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: Re: RE:overview of formats? Send-date: Wed, 6 Jan 1993 15:51:05 UTC-0600 From: C. M. Sperberg-McQueen Subject: Re: RE:overview of formats? On Wed, 6 Jan 1993 15:37:40 +0100 Andrew Waywood said: >I have been trying to plan the construction of a corpus of >mathematical writing. To do this sensibly I need to preserve the >semantics of the mathematical expressions within the text. I can't see >any way of doing this. (As far as I can gather the group working on the TEI >standards for coding mathematics is using a typesetting model rather than a >structural model, as implemented in the rest of SGML. ... Clarification may be in order. The TEI work group on mathematical formulae has not, in fact, developed a tag set for mathematics at all. This is not because they were too lazy or too unimaginative to do so; instead, they decided in very short order that one thing definitely high on the list of Things The World Does Not Need Right Now was Yet Another SGML Tag Set for Mathematics. Their net recommendation, therefore, was for the TEI to use one of the existing schemes; best of all, to wait until the current effort at reconciling the major existing schemes has achieved success, and use *that one*.
(The schemes now being reconciled include, to the best of my recollection, those developed by the Euromath(s) project and by the Association of American Publishers, and that included in ISO TR 9573.) All of these do include some largely typographic constructs, though the ISO tag set, thanks to Anders Berglund, seems to allow reasonably clear structural markup for at least basic algebra and probably somewhat beyond (sorry --- it's been too long since I read it, I can't remember it all). The Euromaths project seems to have debated at some length whether to take a typographic or a semantic/structural/whatever-you-call-it approach, and decided that practicing mathematicians would find the typographic tagging more useful. (Before you start crying, remember that practicing mathematicians don't always use well established notation: it's research, remember, so sometimes they must be inventing new notation, recombining existing typographic effects/signals and extending the already massive ambiguity of most existing typographic effects in mathematics.) For people who do not see their normal use of mathematics as involving the invention of new notation, and who in fact seldom use any notation not in common use for the last few decades --- this seems to include 90% of the users of mathematics I know --- it would be much more useful to have a workable SGML notation for the constructs encountered in (say) algebra, analytic geometry, and calculus (and first-order predicate calculus, if you count logic as part of math). With any luck, the coming reconciliation of AAP, Euromath, and the ISO tag set will include something of the sort. (If Andrew Waywood's corpus includes serious mathematical literature, though, he will presumably need to try to capture *both* the semantics and the presentation of the notation actually used. Good luck!) It is gratifying to see people's thoughts turning, whenever SGML comes up, to the TEI.
But for better or worse the major activity in SGML markup for mathematics is happening elsewhere. -C. M. Sperberg-McQueen ACH / ACL / ALLC Text Encoding Initiative University of Illinois at Chicago From corpora-request@uib.no Wed Jan 13 01:23:21 1993 id <28616-0@alf.uib.no>; Wed, 13 Jan 1993 00:21:16 +0100 Date: Wed, 13 Jan 1993 00:23:21 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: CFP: ACM TIS Special Issue on Text Categorization Send-date: Tue, 12 Jan 1993 13:07:00 UTC-0500 From: (David Lewis) Subject: CFP: ACM TIS Special Issue on Text Categorization Call For Papers Special Issue on Text Categorization ACM Transactions on Information Systems Submissions due: June 1, 1993 Text categorization is the classification of units of natural language text with respect to a set of pre-existing categories. Reducing an infinite set of possible natural language inputs to a small set of categories is a central strategy in computational systems that process natural language. Some uses of text categorization have been: --To assign subject categories to documents in support of text retrieval and library organization, or to aid the human assignment of such categories. --To route messages, news stories, or other continuous streams of texts to interested recipients. --As a component in natural language processing systems, to filter out nonrelevant texts and parts of texts, to route texts to category-specific processing mechanisms, or to extract limited forms of information. --As an aid in lexical analysis tasks, such as word sense disambiguation.
--To categorize nontextual entities by textual annotations, for instance to assign people to occupational categories based on free text responses to survey questions.

ACM Transactions on Information Systems is the leading forum for presenting research on text processing systems. For this special issue we encourage the submission of high quality technical descriptions of algorithms and methods for text categorization. Experiments comparing alternative methods are especially welcome, as are results on deploying systems into regular use.

Five copies of each manuscript should be submitted to either of the special issue editors at the addresses below:

David D. Lewis                  Philip J. Hayes
AT&T Bell Laboratories          Carnegie Group, Inc.
600 Mountain Ave.               Five PPG Place
Room 2C409                      Pittsburgh, PA 15222
Murray Hill, NJ 07974           USA
USA                             hayes@cgi.com
lewis@research.att.com

Submission      June 1, 1993
Notification    October 1, 1993
Revision        February 1, 1994
Publication     mid-1994

The July 1990 issue of TIS contains a description of the style requirements.