From knut Wed Sep 23 14:08:06 1992
Send-date: Tue, 22 Sep 1992 0:06:08 UTC+0200
From: Knut Hofland
To:
Subject: Welcome to corpora list
Status: O

Welcome to the text corpora distribution list!

I hope that you will be able to share information, questions, answers, programs etc. with the other members on the list. The list is unmoderated, but if necessary, it may be moderated in the future. It is not a LISTSERV list; the addresses are added/deleted manually.

Please send administrative messages to:

    CORPORA-REQUEST@X400.HD.UIB.NO

and not to the list.

The list will be logged on the machine nora.hd.uib.no; see below how to access the log files. It will be possible to upload files, either by mail to corpora-request (text files and binhex/uuencoded files) or by FTP. Files can also be sent on diskettes.

The list is hosted at the Norwegian Computing Centre for the Humanities in Bergen, Norway. Information stored at our machine nora.hd.uib.no can be accessed in several ways:

FILESERV
========

The machine nora.hd.uib.no has been established as a mail based server for the Norwegian Computing Centre for the Humanities. Information is grouped in different directories, some of which have information in Norwegian only. Some of the available directories:

    corpora      Information from the distribution list CORPORA, log files etc.
    icame        International Computer Archive of Modern English
    info         Information on texts, projects etc., mostly in English
    konferanser  Information on conferences
    mac          Macintosh programs
    ncch         Norwegian Computing Centre for the Humanities, information in English
    nettinfo     Information on network resources, mostly in English
    pc           MS-DOS programs
    unix         Unix programs

The server is called FILESERV and runs the DECWRL archive server. FILESERV accepts three types of commands; several commands can be placed in the body of the mail message. However, the results will be sent in one file, so do not request several large files in one message.
The commands are:

    Help                  Help file
    Index                 Top level index
    Index directory       Index for a directory
    send directory file   Fetch a file in a directory

Example: You want to get the index for the CORPORA and the KONFERANSER directories and the file cogn-lin.kon in the KONFERANSER directory. Send the following two notes ("index" and "send" commands cannot be put in the same message; the "send" commands will then be ignored):

----------------
To: fileserv@nora.hd.uib.no
Subject: whatever

index corpora
index konferanser
----------------
To: fileserv@nora.hd.uib.no
Subject: whatever

send konferanser cogn-lin.kon
----------------

FTP SERVER
==========

The files are also available via anonymous FTP from nora.hd.uib.no (129.177.24.42). To make use of this server, you must have access to a machine connected to Internet with TCP/IP and a program running the FTP protocol.

Example: To get the directories of the server write the following:

    ftp nora.hd.uib.no
    cd pub
    dir

The server has a directory for uploading; this is writeable but not readable.

    cd incoming
    (binary)             (if transfer of programs or 8-bit data)
    put xx-program.zip

Please send a note and a description to CORPORA-REQUEST if you upload any files!

Other commands:

    get file
    mget files   (to get several files, example: mget *.ex)
    cd directory (change directory)
    cd ..        (up one level in the directory tree)
    binary       (set binary transfer, for transfer of programs or 8-bit files)
    ascii        (set transfer of 7-bit text data)

GOPHER SERVER
=============

The information is now also available through our Gopher server at nora.hd.uib.no (port 70). If you are connected to the Internet (with TCP/IP protocol), you can get client versions of Gopher for MS-DOS, Macintosh and Unix. Gopher is a tree structured menu system and several hundred servers are connected.

Main menu on the nora.hd.uib.no machine:

              Internet Gopher Information Client v1.02

                 Root gopher server: nora.hd.uib.no

    -->  1.  About this Gopher at NCCH.
         2.  Andre Gopher tjenere/
         3.  Corpora (distribution list)/
         4.  Forskjellig (various) Info/
         5.  Humanistisk datasenter/
         6.  ICAME (Text Corpora)/
         7.  Konferanser (Conferences)/
         8.  NCCH file servers.
         9.  Nettverk (Network) Info/
         10. Nordic Linguistic Bulletin/
         11. Norwegian Computing Centre for Humanities/
         12. Programs/

    Press ? for Help, q to Quit, u to go up a menu          Page: 1/1

Questions about these services can be directed to:

Knut Hofland (knut@x400.hd.uib.no)
Humanistisk datasenter,
Norwegian Computing Centre for the Humanities,
Harald Haarfagres gt. 31, N-5007 Bergen, Norway
Phone +47 5 212954/5/6
Fax: +47 5 322656

============================================================================
From knut Wed Sep 23 14:08:09 1992
Send-date: Tue, 22 Sep 1992 1:02:44 UTC+0200
From:
To:
Subject: German corpora
Status: O

I am looking for on-line German corpora to use in testing a spelling checker program. Any information would be appreciated.

Ken Beesley
beesley.parc@xerox.com

============================================================================
From knut Wed Sep 23 14:08:11 1992
Send-date: Tue, 22 Sep 1992 1:04:08 UTC+0200
From:
To:
Subject: Khalkha (Modern Mongolian)
Status: O

I am looking for on-line corpora in Khalkha (Modern Mongolian). Cyrillic orthography and Ulaanbaatar dialect preferred. Any information would be appreciated.

Ken Beesley
beesley.parc@xerox.com

============================================================================
From knut Wed Sep 23 14:08:14 1992
Send-date: Tue, 22 Sep 1992 2:14:30 UTC+0200
From: bert peeters
To:
Subject: French corpora
Status: O

Would anyone know of French corpora (on-line or available via ftp or so) suitable for lexicological/semantic research? I do know about the ARTFL and the Oxford Text Archive. Any others?

Thanks.
---------------------------------------------------------------------
Dr Bert Peeters                      Tel: +61 02 202344
Department of Modern Languages               002 202344
University of Tasmania at Hobart     Fax:    002 207813
GPO Box 252C                         Bert.Peeters@modlang.utas.edu.au
Hobart TAS 7001
Australia

============================================================================
From knut Wed Sep 23 14:08:16 1992
Send-date: Tue, 22 Sep 1992 2:43:48 UTC+0200
From: Bob Clark
To: CORPORA
Subject: French corpora
Status: O

Not an answer to Bert Peeters' query but a request that he (or someone) post info about ARTFL and the Oxford Text Archive for the uninitiated.

Thanks.

Bob Clark
Kansas State Univ.

============================================================================
From knut Wed Sep 23 14:08:19 1992
Send-date: Tue, 22 Sep 1992 3:55:35 UTC+0200
From: bert peeters
To:
Subject: French corpora
Status: O

I mentioned the Oxford Text Archive because it did not contain as much as I had hoped for. I am therefore not qualified to give an unbiased view. As for ARTFL, there is an email address from which full info will be provided on simple request:

    artfl@artfl.uchicago.edu

---------------------------------------------------------------------
Dr Bert Peeters                      Tel: +61 02 202344
Department of Modern Languages               002 202344
University of Tasmania at Hobart     Fax:    002 207813
GPO Box 252C                         Bert.Peeters@modlang.utas.edu.au
Hobart TAS 7001
Australia

============================================================================
From knut Wed Sep 23 14:08:21 1992
Send-date: Tue, 22 Sep 1992 18:18:32 UTC+0200
From: Dr G Knowles
To: ,
Subject: Re: Khalkha (Modern Mongolian)
Status: O

A small corpus of Khalkha is being prepared at Lancaster University. The researcher is currently on a UNESCO project somewhere in the Gobi, and I am expecting her back in early December. We have already worked on a set of grammatical tags, and I am expecting her to devise an improved set on her return. Her name is Dorothy Bond.
She won't have an e-mail address yet, but I can relay any messages.

Gerry Knowles

============================================================================
From knut Wed Sep 23 14:08:23 1992
Send-date: Wed, 23 Sep 1992 3:52:57 UTC+0200
From:
To:
Subject: Finnish and Estonian corpora
Status: O

Does anyone know of any corpora in Finnish and Estonian? I am also interested in research carried out using such corpora. Any information would be appreciated.

Kazuto MATSUMURA
ILCAA, Tokyo University of Foreign Studies
Nishigahara 4-51-21
Kita-ku, Tokyo 114
Japan
E-mail: G00814@sinet.ad.jp

============================================================================
From knut Wed Sep 23 14:08:30 1992
Send-date: Wed, 23 Sep 1992 9:28:22 UTC+0200
From: (Stig Johansson)
To:
Subject: Corpora of Finnish and Estonian
Status: O

An Estonian corpus is under development. Contact: Laboratory of the Estonian Language, Tartu University, EE2400 Tartu, Estonia.

As regards Finnish corpora, contact: Fred Karlsson, Department of General Linguistics, University of Helsinki, Hallituskatu 11, SF-00100 Helsinki, Finland.

Stig Johansson
Oslo

============================================================================
From knut Wed Sep 23 14:08:33 1992
Send-date: Wed, 23 Sep 1992 10:44:00 UTC+0200
From: Vincent Ooi
To:
Subject: Wrongly directed messages
Status: O

From the number of wrongly directed messages to "corpora" instead of "corpora-request", as well as a message which tried to retrieve files from "corpora" (i.e. treating "corpora" as a listserv), perhaps I could help clarify the way the system has been set up, as I understand it:

(1) a message sent to "corpora@x400.hd.uib.no" gets forwarded to ALL members of the list, i.e. everyone gets to read your message;

(2) a message sent to "corpora-request@x400.hd.uib.no" gets forwarded to the list administrator, presumably Knut;

(3) a message sent to "fileserv@nora.hd.uib.no" is the proper way to access the Nora fileserver.
The ftp number given in Knut's message is the alternative to retrieving files from the Nora machine.

Hope this helps.

Vincent Ooi

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 04:31:06 1992
Date: Thu, 24 Sep 1992 02:31:06 +0200
From: Knut Hofland
To: corpora
Subject: Adm. message

Due to some "noise" on the list, I am now inspecting mail to CORPORA before it is resent, filtering out requests for joining etc. This will mean that the turn-around time for messages will be about half a day. The original sender will not figure in the From: field, so it will not be possible to use the reply function of your mail program to make a simple reply.

Knut Hofland

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 04:32:45 1992
Date: Thu, 24 Sep 1992 02:32:45 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Corpora of Arabic?

Send-date: Wed, 23 Sep 1992 17:41:00 UTC+0200
From: Prof G N Leech
To:
Cc:
Subject: Corpora of Arabic?

I am interested in learning about any corpora of Arabic, or work on the collection, processing, etc. of texts in Arabic, currently in progress. Main interest is in modern Arabic. Information gratefully received.

Geoffrey Leech, Lancaster University

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 04:33:30 1992
Date: Thu, 24 Sep 1992 02:33:30 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Mandarin Corpora

Send-date: Wed, 23 Sep 1992 17:17:15 UTC+0200
From: Yen Ketty
To: CORPORA
Subject: Mandarin Corpora
RFC-822-HEADERS:
Return-Receipt-To: "Yen Ketty"
==================

I need to know of any Mandarin corpora available (preferably in characters) to test a parsing system. I would appreciate any information provided.

Ketty Yen
PRC Inc.
1500 PRC Dr.,
McLean, Va 22102
MS:5S3
U.S.A.
tel: +703-556-1033
fax: +703-556-1174
yen_ketty@po.gis.prc.com (internet)

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 05:05:40 1992
Date: Thu, 24 Sep 1992 03:05:40 +0200
From: Knut Hofland
To: corpora@x400.hd.uib.no
Subject: Mail loop: I am sorry

I am sorry that I caused a mail loop, so instead of getting less mail you get more mail. You probably got several copies of the last messages and also some error messages. I have now corrected this.

Knut Hofland

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 04:33:06 1992
Date: Thu, 24 Sep 1992 02:33:06 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Oxford Text Archive information

Send-date: Wed, 23 Sep 1992 18:24:46 UTC+0200
From: Lou Burnard
To:
Cc:
Subject: Oxford Text Archive information

As someone on this list asked what the Oxford Text Archive was, I take the liberty of sending a copy of our standard information. Those who have already seen it several times should feel free to delete it at once!

Lou Burnard
Oxford Text Archive

========================================================================
=This file (about 250 lines) contains general information about the    =
=Oxford Text Archive, about how to get copies of its catalogue, and how=
=to get copies of the texts deposited there.                           =
=                        Last update: 15 Aug 92                        =
========================================================================

WHAT IS THE OXFORD TEXT ARCHIVE?

The Oxford Text Archive is a facility provided by Oxford University Computing Services. It has no connexion with Oxford University Press or any other commercial organisation and exists to serve the interests of the academic community by providing archival and dissemination facilities for electronic texts at low cost.

The Archive offers scholars long term storage and maintenance of their electronic texts free of charge.
It manages non-commercial distribution of electronic texts and information about them on behalf of its depositors.

WHAT TEXTS DOES IT CONTAIN?

The Archive contains electronic versions of literary works by many major authors in Greek, Latin, English and a dozen or more other languages. It contains collections and corpora of unpublished materials prepared by field workers in linguistics. It contains electronic versions of some standard reference works. It has copies of texts and corpora prepared by individual scholars and major research projects worldwide. The total size of the Archive exceeds a gigabyte and there are about a thousand titles in its catalogue.

WHERE CAN I GET A CATALOGUE?

The Catalogue is available in paper form by post from the address below. New editions are published at least twice a year. It is also available in electronic form, either as a formatted file for display at a terminal or in a tagged form using SGML. These files are available from a number of different places under various names...

(1) on the Oxford VAX Cluster as OX$DOC:TEXTARCHIVE.LIST and OX$DOC:TEXTARCHIVE.SGML

(2) from various ListServers, e.g. LISTSERV@BROWNVM (send the mail message GET HUMANIST FILELIST for details)

(3) by anonymous FTP from Internet site black.ox.ac.uk (129.67.1.165) in the directory /ota

Wherever you are, you can send a note to ARCHIVE@VAX.OXFORD.AC.UK specifying which form you want.

WHAT ARE THE TEXTS LIKE?

Because the texts come from so many different sources, they are held in many different formats. The texts also vary greatly in their accuracy and the features which have been encoded. Some have been proof read to a high standard, while others may have come straight from an optical scanner. Some have been extensively tagged with special purpose analytic codes, and others simply designed to mimic the appearance of the printed source. The Archive does not require texts to conform to any standard of formatting or accuracy.

HOW USABLE ARE THE TEXTS?
Most of the texts can be used with commonly available text indexing and concordancing software, or can easily be converted for that purpose. All texts are held as `plain ASCII' files on magnetic tape, with no special formatting codes. Documentation of the coding scheme used in each text is supplied with it, wherever possible.

WHAT ABOUT COPYRIGHT?

Many of the texts in the Archive are subject to some form of copyright restriction. The Archive's obligations to its depositors generally restrict use of the texts to private study and research. In some cases, depositors have also authorised use of the texts in teaching. In all cases, users of the texts must agree not to use the texts commercially and not to redistribute copies of them without consultation.

HOW DO I ACCESS THE TEXTS?

If you are a registered user of Oxford University Computing Services (i.e. you have an account on OXFORD.VAX or black), just send an e-mail message to the username ARCHIVE (on either machine) specifying which texts you want to use and for what purpose.

If you are not a registered OUCS user, you can access only texts in categories P, U and A as described further below.

P category texts are in the public domain. No formality is needed for these texts. They can be downloaded directly by anonymous FTP, from black.ox.ac.uk or from other sites offering this facility. At present, very few texts are in this category; subject to agreement with our depositors we hope to increase the number greatly in the future.

U and A texts are usually distributed on magnetic tape or cartridge, though smaller texts can be sent on diskette. We will also send copies to you via the network, if you send us the required information (i.e. a secure account-name and password), provided that this can be done with reasonable success. Where copies are made on disk or tape, we make a small distribution charge to cover media and postage which *must* be paid in advance.

WHAT DO THE CODES IN THE CATALOGUE MEAN?
Each title in the list is preceded by a code made up of a single letter indicating the availability of the text (U, A, P, or X), in some cases followed by a star, a number identifying the text and another single letter which gives some idea of the size of the text.

Availability codes:

    X    Available only to registered OUCS users. May not be copied.
    U    Freely available for scholarly use in private research.
    U*   Freely available for scholarly use in private research and also for teaching purposes.
    A    Available for scholarly use, but only with written authorisation from the depositor.
    P    Public domain text. Available without formality to anyone.

Size codes:

    A    Size less than 512 Kb
    B    Size between 512 Kb and 1 Mb
    C    Size between 1 and 2 Mb
    D    Size between 2 and 5 Mb
    E    Size greater than 5 Mb

Depending on format, a standard 600 foot magnetic tape will hold up to 50 texts of size category A. Most texts of size code A will fit on a standard double density floppy diskette; any text of size code A or B will fit on a standard high density diskette.

WHAT DO I DO TO ORDER A COPY OF A TEXT?

Texts with availability code P may be downloaded directly, either from our anonymous FTP server at black.ox.ac.uk [129.67.1.165] or from other FTP servers on the InterNet. For more information on using FTP, please contact your local computing service.

For all other texts, you must complete and return the following proforma. For texts with availability code U, the only authorisation needed is your signature on the Order Form. For A category texts, you must also provide written authorisation from the depositor of the text; you should therefore ask us for depositor details before ordering.

All orders must be prepaid to the account of Oxford University Computing Service, in sterling or in US dollars. We cannot issue invoices, and any orders which are not prepaid or not submitted on the standard order form will be ignored.
======================================================================
Oxford Text Archive          email  ARCHIVE @ UK.Ac.Oxford.VAX
OUCS, 13 Banbury Road        voice  +44 (865) 273 238
Oxford OX2 6NN, UK           fax    +44 (865) 273 275
======================================================================

            OXFORD TEXT ARCHIVE OFFICIAL ORDER FORM

*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*
* Hardcopy of this form must be returned, duly completed, to  *
*                     Oxford Text Archive                     *
*                       13 Banbury Road                       *
*                       Oxford OX2 6NN                        *
*                             UK                              *
* NB The whole of this document must be returned IN HARDCOPY  *
*    All relevant parts of the form must be completed         *
*    Payment must accompany the order                         *
*    Forms returned electronically will be ignored            *
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*

SECTION ONE: User Declaration

Please sign the following, in the place indicated:

In consideration of The Oxford Archive agreeing to supply me certain texts in machine-readable form together with supporting documentation as listed in Part Two below, hereinafter called 'the texts', I hereby undertake:-

(1) To use the texts for purposes of private scholarly research only and not for profit (this shall not preclude the publication in a scholarly context of analyses or interpretations derived from the texts). To use and make available to others for educational purposes only texts specifically designated as `available for teaching purposes'.

(2) To acknowledge in any work, published or unpublished, based in whole or in part on analyses made of the texts both the original depositors and the Archive.

(3) Not to copy in whole or in part the text, except insofar as this may be necessary for security purposes or for my own personal use. Not to distribute the text to third parties, nor to publish or reproduce it in any way, except for teaching purposes, where so permitted. Copyright of all machine-readable texts issued by the Archive is reserved to the Depositors.
(4) To give access to the text only to persons directly associated with me or working under my control and to require of such persons signed undertakings neither to use the text except in connexion with my academic purposes nor to give access to the text to others; these signed undertakings to be made available to the Archive on request.

(5) Not to hold the Archive liable for any errors of transcription which may be found in the texts, but to notify the Archive of such errors wherever possible.

(6) To pay such charges as the Archive may determine from time to time to cover the cost of supplying the texts.

SIGNATURE : [                         ]     DATE : [              ]

PART TWO: Texts Required

Please enter for each text the NUMBER and SHORT TITLE, as given in the current Archive Shortlist

number    Short title
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]
[        ][                                        ]

NOTE: Only texts with an availability code of P, U or A may be ordered. Texts with Availability Code of A may be included in this list only if authorisation from the depositor accompanies this form. Depositor details are available on demand.

[This page may be copied as necessary]

SECTION THREE: Order Details

-------------------------Return Address-------------------------

Name
Institution
Department
Street
Post town
Post code
Country (if not UK)
Electronic mail address

---------------------Format Required-------------------------

Texts may be supplied on Magnetic Tape, Diskette or Data Cartridge. Texts may also be transferred over the network. Pricing is different for each format. Use tape if you are ordering more than a megabyte or so of data.
Please complete ONE of the following sections A, B, C or D

A-------------------Texts required on TAPE-------------------------

For each heading below, please circle or tick ONE choice only

Tape density:   1600       or   6250
Tape format:    ASCII      or   EBCDIC
                Labelled   or   Unlabelled
                Fixed      or   Variable length

Number of texts ordered    [    ] @ 5 (pounds)
Number of tapes required   [    ] @ 15 (within Europe)
                           [    ] @ 25 (outside Europe)

B-------------------Texts Required on Diskette------------------

DD (360/720 Kb)   or   HD (1.2/1.4 Mb)
MS/DOS            or   Macintosh
3.5"              or   5.25"

Number of diskettes:       [    ] @ 15 (pounds)

C-------------------Texts Required on Data Cartridge--------------

DC300, TAR format only

Number of cartridges:      [    ] @ 30 (pounds)

D-------------------Texts Required via network--------------

Network:   JANET   InterNet   (tick one)

Account-name:
Password:

This account is under my personal control. I undertake to ensure security of the text stored here, in particular to ensure that others do not have access to it.

Signature:

At present, there is no charge for texts supplied over the network.
---------------------------------------------------------------------

TOTAL SUM ENCLOSED [          ] (pounds)

Payments in US Dollars should be converted at $2 = 1 pound

--------------------For Text Archive Use Only-------------------

Order number [        ]
Received     [        ]
Processed    [        ]
Despatched   [        ]

*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*
* Hardcopy of this form must be returned, duly completed, to  *
*                     Oxford Text Archive                     *
*                       13 Banbury Road                       *
*                       Oxford OX2 6NN                        *
*                             UK                              *
* NB The whole of this document must be returned IN HARDCOPY  *
*    All relevant parts of the form must be completed         *
*    Payment must accompany the order                         *
*    Forms returned electronically will be ignored            *
*                                                             *
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 17:40:30 1992
Date: Thu, 24 Sep 1992 15:40:30 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: ACL European Corpus Initiative

Send-date: Thu, 24 Sep 1992 13:07:38 UTC+0200
From: Henry S. Thompson
To:
Message-ID: corpora:19 313.9209241102(a)fine.cogsci.ed.ac.uk
Subject: ACL European Corpus Initiative

                  European Corpus Initiative
                   Call For Contributions
                      September, 1992

The European Corpus Initiative was founded to oversee the acquisition and preparation of a large multi-lingual corpus to be made available in digital form for scientific research at cost and without royalties. We believe that widespread easy access to such material would be a great stimulus to scientific research and technology development as regards language and language technology.

We support existing and projected national and international efforts to carefully design, collect and publish large-scale multi-lingual written and spoken corpora, but also believe it will be some time before the scientific and material resources necessary to bring these projects to fruition will be found.
In the interim, a small and rapid effort to collect and distribute existing material can serve to show the way. No amount of abstract argument as to the value of corpus material is as powerful as the experience of actually having access to some in one's laboratory. We aim to make that experience possible very soon, at a very low cost.

The ECI is carrying out the first phase of this activity on a purely voluntary basis, under the guidance of an ad-hoc steering committee, using facilities donated by the Human Communication Research Centre at the University of Edinburgh and a small sum for expenses and production costs provided by the European Network for Language and Speech under its Linguistic Resources programme together with the Network of European Reference Corpora.

Our present goal is to produce in short order (we're currently aiming for November 1992) a multi-lingual corpus covering as many as possible of the major European languages, in a consistent format, with standardised (TEI-conformant) markup, insofar as resources allow. Our primary focus in this first effort is on textual material of all kinds, including transcriptions of spoken material, but if space and resources permit we may be able to include some sampled speech data as well. If in doubt as to the appropriateness of a contribution, please contact us before assuming we won't want it.

As our main method of distribution for this corpus, we will produce a CD-ROM, possibly two if enough material can be collected and prepared in time. We estimate that we should be able to make the results available for around 25 ECU.

Because of the low level of resource available for this effort, we are entirely dependent on the goodwill of those members of the research community who have appropriate corpus material, to make it available to us for wide distribution. PLEASE SEND US YOUR DATA.
We have promises of material for many, but by no means all, of the languages we would like to cover, and in only one or two cases do we have as much as we would like. We can't guarantee to use everything which is offered, but please, let us judge whether it would be useful. If you know of someone with material which might be appropriate, who may not have received this notice, please pass it on to them.

To contribute data, please send electronic or paper mail to one of the addresses given below, describing the data, its current format and the medium it is stored in, and the restrictions on its use, if any, which you would have to impose in making it available to us.

European Corpus Initiative Steering Committee

The current members of the Steering Committee are Nicoletta Calzolari (University of Pisa), Robert Dale (ELSNET), Mark Liberman (University of Pennsylvania), Wolf Paprotte (University of Munster), Henry Thompson (University of Edinburgh) and Susan Warwick-Armstrong (ISSCO, Geneva).

Addresses for further information and offers of material for inclusion:

Henry S. Thompson (ECI)
HCRC
2 Buccleuch Place
Edinburgh EH8 9LW
SCOTLAND
Fax: +44 31 650-4587
eucorp@cogsci.ed.ac.uk

Susan Warwick-Armstrong (ECI)
ISSCO
54 route des Acacias
CH-1227 Geneve
SWITZERLAND
Fax: +41 22 300 1086
susan@divsun.unige.ch

--- end of message ---

============================================================================
From postmaster@x400.hd.uib.no Thu Sep 24 17:45:00 1992
Date: Thu, 24 Sep 1992 15:45:00 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Re: Mandarin Corpora

Date: Thu Sep 24 12:14:59 1992
From: HSCHUREN
Subject: Re: Mandarin Corpora
To: corpora@nora.hd.uib.no

++Mandarin Corpora++

The ROC Computational Linguistics Society (ROCLING) has a 10 million character untagged corpus that is available to its members for research for a small fee.
There are several other Mandarin corpora that are being developed by ROCLING members and will be open to the research community in due time.

You can join ROCLING by sending a request with your snail mail address to rocling@ccvax.as.edu.tw. You can then fill out the forms and send them back with your US $50.00 annual fee. ROCLING members are entitled to discounts on all ROCLING conferences, tutorials, and publications. They will also receive monthly newsletters (sent through air mail to overseas addresses).

Chu-Ren Huang
Institute of History and Philology
Academia Sinica
hschuren@ccvax.as.edu.tw

============================================================================
From postmaster@x400.hd.uib.no Fri Sep 25 00:43:35 1992
Date: Thu, 24 Sep 1992 22:43:35 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Follow-up to ACL/ECI Corpus call

Send-date: Thu, 24 Sep 1992 16:28:12 UTC+0200
From: Henry S. Thompson
To:
Message-ID: corpora:22 575.9209241423(a)fine.cogsci.ed.ac.uk
Subject: Follow-up to ACL/ECI Corpus call

With respect to the ACL/ECI call for data, I omitted to mention that Lori Lamel from LIMSI is also a member of the ECI steering committee.

David McKelvie

============================================================================
From postmaster@x400.hd.uib.no Fri Sep 25 00:44:02 1992
Date: Thu, 24 Sep 1992 22:44:02 +0200
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Old English corpus?

Send-date: Thu, 24 Sep 1992 18:35:52 UTC+0200
From: (John Bro)
To:
Message-ID: corpora:23 9209241633.AA20050(a)elm.circa.ufl.edu
Subject: Old English corpus?
RFC-822-HEADERS:
X-Mailer: ELM [version 2.3 PL11]
==================

Does anyone have an Old English corpus (preferably glossed) in machine readable format (preferably ascii)? Even something as small as 10 or 20 Kb would be nice. The goal is merely for an exercise in taxonomy for a beginning Historical Linguistics class.
I could also contribute a pair of transcripts of parallel role-plays by 2 groups of 4 speakers: the first by American education majors, the second in English but by native speakers of Mandarin. Both groups were given the same roles and background info. More details on request.

Thanks.

============================================================
John Bro               | bro@elm.circa.ufl.edu
Linguistics            | bougie@pine.circa.ufl.edu
University of Florida  | bougie@ufpine.bitnet
Gainesville, Fl 32611  | bro@reef.cis.ufl.edu

============================================================================
From postmaster@x400.hd.uib.no Fri Sep 25 12:03:15 1992
Date: Fri, 25 Sep 1992 10:03:15 +0200
From: Knut Hofland
To: Receivers of list CORPORA
Subject: ACH-ALLC Conference 1993

          ASSOCIATION FOR COMPUTERS AND THE HUMANITIES
       ASSOCIATION FOR LITERARY AND LINGUISTIC COMPUTING

              1993 JOINT INTERNATIONAL CONFERENCE

                          ACH-ALLC93
                       JUNE 16-19, 1993
            GEORGETOWN UNIVERSITY, WASHINGTON, D.C.

                        CALL FOR PAPERS

This conference is the major forum for literary, linguistic and humanities computing. It is concerned with the development of new computing methodologies for research and teaching in the humanities, the development of significant new network-based and computer-based resources for humanities research, and the application and evaluation of computing techniques in humanities subjects.

TOPICS: We welcome submissions on topics such as text encoding; statistical methods for text analysis; hypertext; text corpora; computational lexicography; morphological, syntactic, semantic and other forms of text analysis; also, computer applications in history, philosophy, music and other humanities disciplines.
For the 1993 conference, ACH and ALLC extend a special invitation to members of the library community to contribute to the conference on the topics of creating and cataloguing network-based resources in the humanities, developing and integrating databases of texts and images of works central to the humanities, and refining retrieval techniques for humanities databases. LOCATION: Georgetown, an historic residential district along the Potomac River, is a six-mile ride by taxi from Washington National Airport. International flights arrive at Dulles Airport, which offers regular bus service to the Nation's Capital. REQUIREMENTS: Proposals should describe substantial and original work. Proposals describing the development of new computing methodologies should make clear how these methodologies are applied to research and/or teaching in the humanities. Those concerned with a particular application (e.g., a study of the style of an author) should cite previous approaches to the problem and should include some critical assessment of the computing methodologies used. All proposals should include references to important sources. ABSTRACT LENGTH: Abstracts of 1500-2000 words in length should be submitted for presentations of thirty minutes including questions. SESSION PROPOSALS: Proposals for sessions (90 minutes) are also invited. These should take the form of either: (a) Three papers. The session organizer should submit a 500-word statement describing the session topic, include abstracts of 1000-1500 words for each paper, and indicate that each author is willing to participate in the session. (b) A panel of up to 6 speakers. The panel organizer should submit an abstract of 1500-2000 words describing the panel topic, how it will be organized, the names of all the speakers, and an indication that each speaker is willing to participate in the session. 
DEADLINE FOR SUBMISSIONS: November 1, 1992 NOTIFICATION OF ACCEPTANCE: February 1, 1993 FORMAT FOR SUBMISSIONS: Electronic submissions are strongly encouraged, and should follow strictly the format given below. Submissions that do not conform to this format will be returned to the authors for reformatting, or may not be considered if they arrive near the deadline. All submissions should include a header in the following format: TITLE: title of paper AUTHOR(S): names of authors AFFILIATION: affiliations of author(s) CONTACT ADDRESS: full postal address of main author (for contact) E-MAIL: electronic mail address of main author followed by other authors (if any) FAX NUMBER: fax for main author PHONE NUMBER: phone for main author ELECTRONIC SUBMISSIONS: Please submit plain ASCII text files. Files that include formatting by a wordprocessor, TAB characters, and soft hyphens are not acceptable. Paragraphs should be separated by blank lines. Headings and subheadings should be on separate lines and be numbered. References (up to six) and notes should appear at the end of the abstract. Where necessary, a simple markup scheme for accents and other characters that cannot be transmitted by electronic mail should be used; provide an explanation of the markup scheme after the title information. If diagrams are necessary for the evaluation of an electronic submission, they should be faxed to 1-202-687-6003 (after dialing one's international access code) or 202-687-6003 (from within the US), and a note to indicate the presence of diagrams should be inserted at the beginning of the abstract. Address for electronic submissions: Neuman@GUVAX.Georgetown.edu (include a subject line " Submission for ACH-ALLC93"). PAPER SUBMISSIONS: Submissions should be typed or printed on one side of the paper only, with ample margins. Six copies should be sent to ACH-ALLC93 (Paper submission) Dr. Michael Neuman Academic Computer Center 238 Reiss Science Building Georgetown University Washington, D.C. 
20057 PUBLICATION: A selection of papers presented at the conference will be published in the series Research in Humanities Computing edited by Susan Hockey and Nancy Ide, published by Oxford University Press. INTERNATIONAL PROGRAM COMMITTEE Chair: Marianne Gaunt, Rutgers University (ACH) Thomas Corns, University of Wales, Bangor (ALLC) Paul Fortier, University of Manitoba (ACH) Jacqueline Hamesse, Universite Catholique Louvain-la-Neuve (ALLC) Susan Hockey, Rutgers and Princeton Universities (ALLC) Nancy Ide, Vassar College (ACH) Randall Jones, Brigham Young University (ACH) Michael Neuman, Georgetown University (ACH) (Local organizer) Antonio Zampolli, University of Pisa (ALLC) INQUIRIES Please address all inquiries to: ACH-ALLC93 Dr. Michael Neuman, Local Organizer Academic Computer Center 238 Reiss Science Building Georgetown University Washington, D.C. 20057 Phone: 202-687-6096 FAX: 202-687-6003 Bitnet: Neuman@Guvax Internet: Neuman@Guvax.Georgetown.edu Please include your name, full mailing address, telephone and fax numbers, and e-mail address with any inquiry. ============================================================================ From postmaster@x400.hd.uib.no Fri Sep 25 12:10:43 1992 Date: Fri, 25 Sep 1992 10:10:43 +0200 From: Knut Hofland To: Receivers of list CORPORA Subject: Large Corpora Workshop I resend this message since I got more than 250 new members after this was sent out. I apologise to those who got this message before. -Knut Hofland ================== WORKSHOP ON VERY LARGE CORPORA: ACADEMIC AND INDUSTRIAL PERSPECTIVES Call for Papers WHEN: Tuesday, June 22, 1993 (just before ACL-93) WHERE: Ohio State University Sponsored by the Association for Computational Linguistics (ACL), Chemical Abstracts, Mead Data Central (MDC), Online Computer Library Center (OCLC) Corpus linguistics is a hot topic, and for good reason. Text is more available than ever before. 
And, consequently, it is easier to use corpus data more effectively than it was in the 1950s, the last time that empiricism was in fashion. All of this data provides a great opportunity, as evidenced by all of the recent activity in Europe, Asia and America. How large is ``large''? Large can mean anything from about 10^4 words to 10^9 words. This workshop will bring together a range of people working at a range of different points along this scale. We expect to hear from industrialists who routinely deliver products based on tens of billions of words of text, and from academics who will tell us about recent advances in text analysis. The discussion will hopefully push the academics to think about even larger corpora, and the industrialists to think about somewhat more ambitious analysis techniques. Authors should submit three copies of a full-length paper (5-10 pages) to the program chair by April 1, 1993. Paper submissions are strongly preferred over electronic submissions. Notifications of acceptance or rejection will be sent out by May 1, 1993. Relevant topics include (but are not limited to) Text Analysis Techniques: - ``robust'' parsing - part of speech tagging - sense tagging - identification of phrases - collocation - morphology - discourse structure Applications: - Information Retrieval (IR) - Recognition: Speech, OCR, handwriting, etc. - Spelling Correction - Translation - Lexicography Program Chair: Kenneth Ward Church AT&T Bell Laboratories, 2b422 600 Mountain Ave Murray Hill, NJ 07974 USA tel: 908-582-5325 fax: 908-582-7550 email: kwc@research.att.com ============================================================================ From postmaster@x400.hd.uib.no Fri Sep 25 23:38:14 1992 Date: Fri, 25 Sep 1992 21:38:14 +0200 From: Knut Hofland To: Receivers of Corpora list Subject: Re: Old English corpus? In the ICAME material we have the Helsinki Diachronic Corpora, which covers the period from 850-1710.
Parts of the manual are machine readable and can be fetched by sending the following requests to fileserv@nora.hd.uib.no send icame helsinki.manual.part1 send icame helsinki.manual.part2 send icame helsinki.manual.part3 (send these as 3 letters, some of the files are > 100 KB) The files are also available via FTP in catalogue pub/icame, as described in the welcome message. To get the order form: send icame orderform.helsinki To get a description of the ICAME material: send icame icame.material ICAME = International Computer Archive of Modern English Knut Hofland Norwegian Computing Centre for the Humanities, Harald Haarfagres gt. 31, N-5007 Bergen, Norway Phone +47 5 212954/5/6 Fax: +47 5 322656 E-mail: knut@x400.hd.uib.no ============================================================================ From postmaster@x400.hd.uib.no Fri Sep 25 23:51:11 1992 Date: Fri, 25 Sep 1992 21:51:11 +0200 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: Old English corpus? From: Leena Sadeniemi Subject: Re: Old English corpus? To: corplst@nora.hd.uib.no Date: Fri, 25 Sep 92 9:14:41 EET DST In-Reply-To: <199209242043.AA21129@nora.hd.uib.no>; from "postmaster@x400.hd.uib.no" at Sep 24, 92 10:44 pm X-Mailer: ELM [version 2.3 PL11] Status: RO At Helsinki University we have an Old English corpus called the Helsinki Corpus. Ask Dr. Merja Kytö (mkyto@cc.helsinki.fi) for more detailed information. Leena Sadeniemi Computer Center of Helsinki University ============================================================================ From postmaster@x400.hd.uib.no Fri Sep 25 23:50:24 1992 Date: Fri, 25 Sep 1992 21:50:24 +0200 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: Old English corpus? Date: Thu, 24 Sep 92 22:02:23 EDT From: David Megginson Subject: Re: Old English corpus?
To: CORPORA list In-Reply-To: Your message of Thu, 24 Sep 1992 22:44:02 +0200 Status: RO On Thu, 24 Sep 1992 22:44:02 +0200 you said: > Does anyone have an Old English corpus (preferably glossed) >in machine readable format (preferably ascii) ? Even something as small >as 10 or 20 Kb would be nice. The goal is merely for an exercise in >taxonomy for a beginning Historical Linguistics class. The Dictionary of Old English in Toronto has a relatively complete OE corpus available (excluding duplicate mss.), which is now available through the Oxford Text Archive -- there is one text which requires special permission, but the others are available. They are not glossed, and occupy about 30MB depending on your markup scheme. David %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% David Megginson Department of English, dmeggins@acadvm1.uottawa.ca University of Ottawa %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% ============================================================================ From postmaster@x400.hd.uib.no Fri Sep 25 23:45:11 1992 Date: Fri, 25 Sep 1992 21:45:11 +0200 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Character Sets Send-date: Fri, 25 Sep 1992 19:22:33 UTC+0200 From: Jon Whalen To: Subject: Character Sets RFC-822-HEADERS: Summary: Looking for info Keywords: character-set, charset, font To All, I. I'm looking for pointers to documents describing the encodings used by several different character sets. In particular, I'm looking for the encodings for: 1. EBCDIC 2. ISO Latin-1 (ISO 8859-1) 3. ISO Latin-2 (ISO 8859-2) 4. Unicode [more than just the ascii part] 5. ISO 10646 If you know of any electronic document[s] describing any of the above, especially one that's ftpable, I'd really appreciate hearing from you. II. Are the following statements correct? If not, how far off the mark are they? 1. A grapheme is logical unit of meaning used in the written representation of a language. 2. 
A glyph is a pictorial representation of a grapheme. 3. A font is an ordered set of glyphs. (i.e. a mapping between numeric (integer) values and glyphs.) 4. A character set is an ordered set of graphemes. (i.e. a mapping between numeric (integer) values and graphemes.) 5. A font encoding is a mapping between a character set and a font. III. I'm looking for pointers to documentation on standard and ad-hoc mappings between 7-bit ascii and 8-bit charsets. For example, on the mailing list GAELIC-L, the accented vowels: a_acute, e_acute, i_acute, o_acute, u_acute, a_grave, e_grave, i_grave, o_grave, u_grave; are represented as vowel + / and vowel + \, respectively. What other encodings are in use? Where might I get documentation (formal or informal) on them? Are there ISO standards in existence or under development? Thanx very much for your help! :-) --jon *Jon S. Whalen Phone: (708) 576-0166* *Staff Engineer, Motorola, Inc. Fax: (708) 576-0892* *Corporate, Computer & Communications R&D * *Internet: jon@hook.corp.mot.com / Compuserve: 76665,3043 / AOL:JonSWhalen* ============================================================================ From postmaster@x400.hd.uib.no Mon Sep 28 15:33:14 1992 Date: Mon, 28 Sep 1992 14:33:14 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: 'graded' corpora Send-date: Fri, 25 Sep 1992 22:13:15 UTC+0200 From: Robert Morris To: Reply-To: Message-ID: corpora:36 199209252006.AA27612(a)claude.cs.umb.edu Subject: "graded" corpora While I know this is somewhat distant from the interests of most readers of this list, I wonder if anyone knows a source of electronic American english corpora of fixed reading difficulty. Reading difficulty should be measured by any standard test and should be around secondary school levels. I need this for reading rate studies in which I would like to control for the difficulty of the text. I need at least 50,000 words, and probably much more. 
It should be sparse in proper names (which tends to eliminate news articles, the most obvious source). Thanks Bob Morris ============================================================================ From postmaster@x400.hd.uib.no Mon Sep 28 15:33:31 1992 Date: Mon, 28 Sep 1992 14:33:31 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Character Sets 1) ============================================== Send-date: Fri, 25 Sep 1992 22:47:41 UTC+0200 From: (Glenn Adams) Subject: Character Sets 2) ============================================== Send-date: Fri, 25 Sep 1992 23:04:11 UTC+0200 From: ted Subject: Character Sets 3) ============================================== Send-date: Mon, 28 Sep 1992 9:06:16 UTC+0100 From: Dominic Dunlop Subject: Re: Character Sets 1) ===================================================================== Send-date: Fri, 25 Sep 1992 22:47:41 UTC+0200 From: (Glenn Adams) To: Message-ID: corpora:37 9209252036.AA05934(a)boas.metis.com Subject: Character Sets > Date: Fri, 25 Sep 1992 21:45:11 +0200 > From: Jon Whalen > Subject: Character Sets > > 1. EBCDIC > 2. ISO Latin-1 (ISO 8859-1) > 3. ISO Latin-2 (ISO 8859-2) > 4. Unicode [more than just the ascii part] > 5. ISO 10646 ISO standard documents may be purchased in the US from Omnicom, (703) 281-1135. Unicode, and by extension ISO10646, is available in two volumes at your local computer bookstore (published by Addison-Wesley). As for EBCDIC, see "EBCDIC Bibliographic Character Sets - Sources and Uses: A Brief Report," Journal of Library Automation, Vol. 12/4, December 1979. This report documents the extended EBCDIC used by many bibliographic services. Also see IBM Publication 1403-03, Order #GA 24-3073-7, p. 36-37. > 1. A grapheme is logical unit of meaning used in the written representation > of a language.
In my recent text, "Introduction to Unicode," Proceedings of the Third Unicode Implementor's Workshop, August 1992, I defined it as follows: "A minimally distinctive unit of writing in the context of a particular writing system." > 2. A glyph is a pictorial representation of a grapheme. I had: "An abstract form which represents one or more glyph images, and which is used to visually depict encoded character data." A "glyph image" is then: "The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface." There is some distinction between "glyph" as "allograph" and "glyph image" as a particular visual instance of a glyph, since one may take the same allograph and display it under different transformations, linear and non-linear. > 3. A font is an ordered set of glyphs. > (i.e. a mapping between numeric (integer) values and glyphs.) OK. I had: "A collection of glyphs used for the visual depiction of character data. A font is often associated with a set of parameters, e.g., size, posture, weight, serifness, etc., which, when set to particular values, generate a collection of imagable glyphs." > 4. A character set is an ordered set of graphemes. > (i.e. a mapping between numeric (integer) values and graphemes.) Unfortunately, it doesn't quite work this way. The reason being that what constitutes a grapheme can only be determined in the context of a particular writing system, i.e., a particular language, a particular set of orthographic rules, and one or more collections of symbols (scripts) used to depict writing. Because character sets (generally) exclude language and orthographic differences, the characters therein end up encoding the elements of scripts independently of their status as graphemes.
Also, because character sets in the past have often been confused with the enumeration of the elements of a font, allographic symbols and arbitrary glyphic elements have been encoded along with the more independent graphemic symbols. The best working definition for a character set is simply as "a collection of elements used to organize, control, or represent information." A character is then simply "an element of a character set," i.e., an element which "organizes, controls, or represents information." Trying to read anything else into "character," at least as currently implemented by existing character set standards, will soon lead you to great conceptual difficulties. However, to go out on a conceptual limb, I would suggest that under ideal circumstances a character represents "an abstract form which may solely or jointly be used to represent the symbol of a script, which, in turn, may represent one or more graphemes in the context of implicit or explicit assumptions about the language and orthographic rules which apply to the character encoding." For example, the ASCII character 0x41 represents the LATIN CAPITAL LETTER A symbol, which, in an American English text, represents the graphemes and . How's that for clarity? > 5. A font encoding is a mapping between a character set and a font. No. A font encoding is a mapping from glyph codes, to the glyphs of a font. The mapping between character set(s) and font(s), is, in general, much more complex. I offer the following excerpt from a message I sent recently on the newsgroup "comp.fonts." In general, this model will not work for universal character sets which cover large numbers of writing systems, e.g., Unicode.
It must be replaced by a more sophisticated model: string of unicode characters -> rendering & layout engine -> positioned glyphs In general, the rendering and layout engine must be able to do the following: - map 1 character to N (possibly discontiguous) glyphs* - map 1 character to 1 of N glyphs depending on context - map N characters to 1 glyph* - map M characters to N (possibly discontiguous) glyphs* - reorder resulting glyphs according to bidirectional display requirements - compute attachment points for glyphs which attach to other glyphs - deform glyphs for performing justification which doesn't use whitespace justification *may also require context sensitivity > III. I'm looking for pointers to documentation on standard and ad-hoc > mappings between 7-bit ascii and 8-bit charsets. The Unicode standard contains a large number of mapping tables to/from other character sets and Unicode, e.g., all of the ISO8859 series, PC code pages, Apple character sets, etc. Glenn Adams Cambridge, Massachusetts 2) ===================================================================== Send-date: Fri, 25 Sep 1992 23:04:11 UTC+0200 From: ted To: Cc: Message-ID: 9209252102.AA24598(a)NMSU.Edu Subject: Character Sets there is no single ebcdic standard. one of the better ways to get an ebcdic table is to run dd on a unix machine and have it convert a file with all the consecutive byte values in various ways. the best source on the unicode and 8859 character sets is the unicode book from addison wesley. it is supposed to have a companion disk, but a-w is apparently committed to forgetting that they sell this. i don't think that your definitions take into account script fonts like arabic and compositional fonts like korean. for that matter, many languages are essentially compositional in at least some respect (how else do you describe strike-outs, or underlines, much less the fact that you can accent or otherwise add any diacritic to any european character).
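The two points above (building an EBCDIC table by converting every consecutive byte value, and characters composed from a base letter plus diacritics) can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not anything used on this list: it relies on Python's built-in cp037 codec, which is only one of several EBCDIC variants, and on the standard unicodedata module. The helper names ebcdic_table and compose are hypothetical.

```python
import unicodedata


def ebcdic_table(codec="cp037"):
    """Map each of the 256 byte values to the character that one EBCDIC
    variant assigns to it. cp037 is an assumption; other variants differ."""
    return {value: bytes([value]).decode(codec) for value in range(256)}


def compose(base, *combining_names):
    """Attach combining marks (given by Unicode character name) to a base
    letter, then let NFC normalization merge them into a single precomposed
    character where one exists."""
    marks = "".join(unicodedata.lookup(name) for name in combining_names)
    return unicodedata.normalize("NFC", base + marks)


if __name__ == "__main__":
    table = ebcdic_table()
    # Show where a few familiar characters sit in this EBCDIC variant.
    for value in (0x40, 0x81, 0xC1, 0xF0):
        print(hex(value), repr(table[value]))

    # N characters -> 1 character: e + combining acute has a precomposed form.
    e_acute = compose("e", "COMBINING ACUTE ACCENT")
    print(len(e_acute), unicodedata.name(e_acute))

    # No precomposed form exists for x + acute, so two code points remain and
    # the layout engine has to position the accent itself.
    x_acute = compose("x", "COMBINING ACUTE ACCENT")
    print(len(x_acute), [unicodedata.name(ch) for ch in x_acute])
```

The second case is the one that matters for the rendering model quoted earlier: when no precomposed character exists, the sequence stays as several characters and it is up to the display side to combine them into one visual unit.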
with regard to ad-hoc mappings, they are just that. ad-hoc. incompatible. many of the asian character sets use a shift character after which it is assumed that the high bit is set, and many other systems are based (loosely) on tex. mac users often (accidentally, they claim) use R and S for opening and closing single quotes. 3) ===================================================================== Send-date: Mon, 28 Sep 1992 9:06:16 UTC+0100 From: Dominic Dunlop To: Message-ID: corpora:39 4348.9209280804(a)onions.natcorp.ox.ac.uk Subject: Re: Character Sets Jon Whalen writes: > I. I'm looking for pointers to documents describing the encodings used > by several different character sets. In particular, I'm looking for the > encodings for: > > 1. EBCDIC > 2. ISO Latin-1 (ISO 8859-1) > 3. ISO Latin-2 (ISO 8859-2) > 4. Unicode [more than just the ascii part] > 5. ISO 10646 For 2 and 6, use anonymous ftp from dkuug.dk, subdirectory isp. --- Dominic ============================================================================ From postmaster@x400.hd.uib.no Tue Sep 29 01:13:47 1992 Date: Tue, 29 Sep 1992 00:13:47 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Chaucer Conf. Send-date: Mon, 28 Sep 1992 14:57:14 UTC+0100 From: (Ian Lancashire) To: Message-ID: corpora:45 9209281355.AA24731(a)epas.utoronto.ca Subject: Chaucer Conf. CONFERENCE ANNOUNCEMENT Of Remembrance the Keye: Computer-Based Chaucer Studies Sponsored by the Centre for Computing in the Humanities and the Department of English, University of Toronto Friday November 6, 1:30-5:00 pm, and Saturday November 7, 9:00 am-5:00 Location: Room 140, University College 15 King's College Circle St. 
George Campus University of Toronto The full announcement can be retrieved by sending the following line to FILESERV@HD.UIB.NO send corpora chaucer.conference ============================================================================ From postmaster@x400.hd.uib.no Tue Sep 29 01:14:57 1992 Date: Tue, 29 Sep 1992 00:14:57 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: computational indirect discourse Send-date: Mon, 28 Sep 1992 16:39:45 UTC+0100 From: Martin Wynne To: Message-ID: inbox:2296 5605.9209281539(a)sun.leeds.ac.uk Subject: computational indirect discourse Has anyone done or does anyone know of computational analysis done on indirect discourse (e.g. identifying formal features of reported speech, identifying narrative voices, etc.). References to any formal work in this area would be useful. Martin Wynne Dept. Linguistics & Phonetics University of Leeds ============================================================================ From postmaster@x400.hd.uib.no Tue Sep 29 01:14:33 1992 Date: Tue, 29 Sep 1992 00:14:33 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: 'graded' corpora Send-date: Mon, 28 Sep 1992 7:38:08 UTC-0700 From: To: Message-ID: inbox:2295 01GPB6490PXE8ZJMO8(a)CCIT.ARIZONA.EDU Subject: Re: 'graded' corpora I don't know of any, but I'd welcome the info on ANY American English corpora anybody knows of. I am trying to build a large collection for research purposes, and a number of other people in TESOL would like to do this, also. The TESOL CALL Interest Section has a text collection in both MS-DOS and Mac formats, available from DHEALEY@OREGON.BITNET, but these are not "graded". The ascii texts in programs such as "Text Tanglers" could be used for any in-house purposes, and I'm sure such authors would be glad to give permission for not quite in-house use if asked. 
Although such texts are also not "graded" in this way (whatever such grading means in native speaker work does not transfer to ESL/EFL), they should be easily gradable by running them through your favorite document analyzer. Macey Taylor maceytay@ccit.arizona.edu ============================================================================ From postmaster@x400.hd.uib.no Tue Sep 29 13:13:31 1992 Date: Tue, 29 Sep 1992 12:13:31 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: 'graded' corpora From: trobb To: Message-ID: inbox:2298 00961565.D96A4920.14178(a)jpnksuvx.BITNET Subject: Re: 'graded' corpora You might try using the SRA Reading Laboratory materials. These are graded according to the Fry scale (I believe) in half grade intervals of reading difficulty. There is some question in my mind as to the validity of the Fry scale since it doesn't take many factors into account which can cause text to be less readable -- such as topic, cultural background of the reader and even such mundane things as type size and word density on the page. There are 15 passages at each level, and there are several kits containing passages at the same level, so you could probably get 50,000 words or so at each level using these materials. Another approach would be for you to take anything you can get your hands on which you think might be appropriate, and then rate the reading level yourself. There are any number of programs which can do this. Even my word processor, Nisus (Macintosh) has this feature.
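The do-it-yourself rating described above can be sketched in a few lines of Python. Note the substitution: this sketch uses the Flesch-Kincaid grade formula rather than the Fry scale, since Fry is read off a graph while Flesch-Kincaid is a closed formula. The syllable counter is a crude vowel-group heuristic assumed purely for illustration, and the function names are hypothetical, so treat the scores as rough estimates.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count runs of vowels, and discount a silent
    final 'e' (but not a final 'le', as in 'table')."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Institutional considerations necessitate comprehensive documentation."
    print(round(flesch_kincaid_grade(easy), 2))
    print(round(flesch_kincaid_grade(hard), 2))
```

Like the Fry scale, a formula of this kind sees only sentence length and word length, so the reservations above about topic, cultural background and page layout apply to it equally.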
--Tom Robb Kyoto Sangyo University, Japan ============================================================================ From postmaster@x400.hd.uib.no Tue Sep 29 13:13:51 1992 Date: Tue, 29 Sep 1992 12:13:51 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: computational indirect discourse Send-date: Tue, 29 Sep 1992 13:39:48 UTC+1000 From: LLOYD HOLLIDAY, LA TROBE UNIV, EDUCATION To: Message-ID: inbox:2300 01GPCX4UU2768WXHJR(a)lure.latrobe.edu.au Subject: Re: computational indirect discourse With regard to Martin Wynne's request, you could try John Fought in the Linguistics Dept. at the University of Pennsylvania, Philadelphia, PA 19104. He has an interest in this area and had several students who may have worked on these issues in various languages. Lloyd Holliday edulh@lure.latrobe.edu.au ============================================================================ From postmaster@x400.hd.uib.no Wed Sep 30 02:20:55 1992 Date: Wed, 30 Sep 1992 01:20:55 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Call for Papers: Corpus-Based Linguistics Send-date: Tue, 29 Sep 1992 18:59:13 UTC+0100 From: Cathy Ball To: Message-ID: corpora:52 01GPCXV90DT08ZENO6(a)guvax.acc.georgetown.edu Subject: Call for Papers: Corpus-Based Linguistics Call for Papers Georgetown University Round Table On Languages and Linguistics (GURT) Pre-Session: CORPUS-BASED LINGUISTICS Wednesday March 10, 1993 The analysis of large text corpora is engaging the interest of linguists from many subfields, as the field turns away from linguistic analysis based on introspection to data-oriented approaches. Currently, insights are not fully shared, as the subfields and related disciplines often present research at different conferences. 
For this full-day GURT pre-session, 20-minute papers are solicited on the following topics: - the design and collection of text/speech corpora - tools for searching and processing on-line corpora - critical assessments of on-line corpora and corpus-processing tools - methodological issues in corpus-based analysis - applications and results in linguistics and related disciplines, including language teaching, computational linguistics, historical linguistics, discourse analysis, and stylistic analysis Send 1 page (500-word) abstracts to cball@guvax.georgetown.edu (Internet), cball@guvax (Bitnet), or Catherine N. Ball, Dept. of Linguistics, Georgetown University, Washington DC 20057. Electronic submissions are encouraged. Please include name, institution, address, telephone number, and e-mail address. Deadline for receipt of abstracts is Dec. 1, 1992. ============================================================================ From postmaster@x400.hd.uib.no Wed Sep 30 02:21:16 1992 Date: Wed, 30 Sep 1992 01:21:16 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: 'graded' corpora Send-date: Tue, 29 Sep 1992 15:32:41 UTC+0100 From: S Hanlon To: (CORPORA list) Message-ID: inbox:2308 19218.9209291432(a)csgi05.scs.leeds.ac.uk Subject: Re: 'graded' corpora Could someone point me in the direction of the SRA Reading Laboratory material? Is it available on-line anywhere? Thanks, Steve Hanlon ============================================================================ From postmaster@x400.hd.uib.no Thu Oct 1 02:19:16 1992 Date: Thu, 1 Oct 1992 01:19:16 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: Call for Papers: Corpus-Based Linguistics Send-date: Tue, 29 Sep 1992 17:38:11 UTC-0700 From: To: Message-ID: inbox:2310 01GPD5JGT7308ZI8VS(a)CCIT.ARIZONA.EDU Subject: Re: Call for Papers: Corpus-Based Linguistics I'd like to say Hallelujah to the return of linguistics to considering real language! Wish G'town weren't so far away. 
Macey Taylor ============================================================================ From postmaster@x400.hd.uib.no Thu Oct 1 02:21:05 1992 Date: Thu, 1 Oct 1992 01:21:05 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: 'graded' corpora Send-date: Tue, 29 Sep 1992 17:40:28 UTC-0700 From: To: Message-ID: inbox:2311 01GPD5LF6HYQ8ZI8VS(a)CCIT.ARIZONA.EDU Subject: Re: 'graded' corpora You hunt up a K-12 sales rep. It's commercial stuff, widely used in elem/sec schools and some of it in ESL programs. Call your local school board office if you can't find a rep. Or the Reading Dept at the U. Macey Taylor ============================================================================ From postmaster@x400.hd.uib.no Fri Oct 2 01:28:16 1992 Date: Fri, 2 Oct 1992 00:28:16 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Re: 'graded' corpora Send-date: Wed, 30 Sep 1992 23:08:38 UTC-0700 From: (Willis Johnson) To: , Message-ID: corpora:59 9210010608.AA20624(a)violet.berkeley.edu Subject: Re: 'graded' corpora For what it's worth, SRA Reading Laboratory was the system used in my elementary school in the 60s. It's a self-paced system of a large number of graded readings. I loved it. Willis Johnson U.C. Berkeley willis@violet.berkeley.edu ============================================================================ From postmaster@x400.hd.uib.no Fri Oct 2 01:29:25 1992 Date: Fri, 2 Oct 1992 00:29:25 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: Computer Science Send-date: Thu, 1 Oct 1992 11:46:00 UTC From: To: Message-ID: inbox:2313 07A4D48740603028(a)usthk.ust.hk Subject: Computer Science RFC-822-HEADERS: X-Organization: The Hong Kong University of Science & Technology (HKUST) Does anyone know of any compilations of machine-readable texts in Computer Science? We have built such a corpus (1,000,000 words) in China/Hong Kong, and would like to compare our data with other sources. Any information would be welcome. 
Alex Fang Chengyu (Guangzhou) lcalex@usthk.bitnet Gregory James (Hong Kong) lcgjames@usthk.bitnet ============================================================================ From postmaster@x400.hd.uib.no Fri Oct 2 01:29:11 1992 Date: Fri, 2 Oct 1992 00:29:11 +0100 From: " (CORPORA list)" To: corpora@x400.hd.uib.no Subject: re: 'graded' corpora Send-date: Thu, 1 Oct 1992 6:47:43 UTC-0400 From: Robert Morris To: Reply-To: Message-ID: corpora:60 199210011047.AA02138(a)claude.cs.umb.edu Subject: 'graded' corpora Date: Thu, 1 Oct 1992 01:21:05 +0100 From: " (CORPORA list)" Send-date: Tue, 29 Sep 1992 17:40:28 UTC-0700 From: To: Message-ID: inbox:2311 01GPD5LF6HYQ8ZI8VS(a)CCIT.ARIZONA.EDU Subject: Re: 'graded' corpora You hunt up a K-12 sales rep. It's commercial stuff, widely used in elem/sec schools and some of it in ESL programs. Call your local school board office if you can't find a rep. Or the Reading Dept at the U. Macey Taylor I'm talking about electronic form. I'd like to avoid scanning half a million words and proofreading it all for OCR errors. All the school people do is respond "Umm, computers? Why don't you call an IBM sales office?". But I confess, I didn't try calling the publishers of the written material to ask whether they can supply electronically. I'll do that, but I'm not optimistic about the results. Most publishers live in the stone age (or really, the phototype age) about computer technology and can't even _accept_ stuff in electronic form, let alone produce it. 
Bob
============================================================================
From postmaster@x400.hd.uib.no Fri Oct 2 16:11:39 1992
Date: Fri, 2 Oct 1992 15:11:39 +0100
From: " (CORPORA list)"
To: corpora@x400.hd.uib.no
Subject: Non-English taggers and tagged corpora

Send-date: Fri, 2 Oct 1992 14:13:26 UTC+0100
From: 02-Oct-1992 1358
To:
Message-ID: corpora:63 9210021306.AA21516(a)vbormc.vbo.dec.com
Subject: Non-English taggers and tagged corpora

I was wondering whether there are people reading this list who are working on (or have references to) taggers for languages other than English. I am especially interested in German, French and Spanish. Alternatively, pointers to previously tagged corpora are also welcome.

Pim van der Eijk.
============================================================================
From corplst Tue Oct 6 07:26:45 1992
Date: Tue, 6 Oct 1992 07:26:45 +0100
From: corplst (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: 'graded' corpora

Send-date: Mon, 5 Oct 1992 5:04:21 UTC-0700
From:
To:
Message-ID: inbox:2331 01GPKSTA1EJG9AMWMI(a)CCIT.ARIZONA.EDU
Subject: Re: 'graded' corpora

I, too, like some SRA materials. The ones I have used and like are boxes of cards dealing with different "thinking" skills and at several levels of difficulty within a box (set). My students have really enjoyed (and talked up a storm in English) with these, especially the kinds of problems where you have to figure out who sits where at the table or who is the spy, etc. from a set of statements about the people in the problem. John Higgins has done two computer programs for CALL with this kind of idea, one dealing with comparative structures and set in Venn diagrams, one a simple physical layout with collections of words from the same semantic domain to be placed in the proper places. The value of these in ELT is similar to that of the SRA cards, HOTS and communication. Plus the fact that they are fun, thus motivational.
Macey Taylor maceytay@ccit.arizona.edu ============================================================================ From corplst Tue Oct 6 12:29:49 1992 Date: Tue, 6 Oct 1992 12:29:49 +0100 From: corplst (CORPORA list) To: corpora Subject: Re: Non-English taggers and tagged corpora Send-date: Mon, 5 Oct 1992 5:11:13 UTC-0700 From: To: Message-ID: inbox:2332 01GPKT0XCWRM9AMWMI(a)CCIT.ARIZONA.EDU Subject: Re: Non-English taggers and tagged corpora The concordancer Letteratura Amica (or Literary Amiga in English) developed by Raffaele Cocchi of the U of Bologna tags and works in most European languages. He's still working on improving the allophones and algorithms for the speech function (this talks in its 9 [?] languages, too), but the concordancer part is well developed. His address: Via Toffano, 6; 40125 Bologna, Italy Macey Taylor maceytay@ccit.arizona.edu ============================================================================ From corplst Tue Oct 6 13:24:23 1992 Date: Tue, 6 Oct 1992 13:24:23 +0100 From: corplst (CORPORA list) To: corpora Subject: Re: Non-English taggers and tagged corpora Send-date: Mon, 5 Oct 1992 8:43:12 UTC-0600 From: ted To: Cc: eijk_p Message-ID: corpora:69 9210051443.AA12311(a)NMSU.Edu Subject: Non-English taggers and tagged corpora From: 02-Oct-1992 1358 To: Subject: Non-English taggers and tagged corpora I was wondering whether there are people reading this list who are working on (or have references to) taggers for languages other than English. I am especially interested in German, French and Spanish. we are working on a part of speech tagger for spanish. we should have something to distribute via the consortium for lexical research by the end of the year. we are also very interested in non-english tagged corpora. 
for more information on the consortium, contact lexical@nmsu.edu ============================================================================ From corplst Tue Oct 6 13:24:37 1992 Date: Tue, 6 Oct 1992 13:24:37 +0100 From: corplst (CORPORA list) To: corpora Subject: Need help finding these corpus Send-date: Mon, 5 Oct 1992 20:15:37 UTC+0100 From: (Circle Noetic Svc, A Nizhnikov,PAS) To: Message-ID: corpora:71 718310969.5138762(a)AppleLink.Apple.COM Subject: Need help finding these corpus We have heard of some text corpuses but we don't know how to access any of them. Does anyone out there have an address, phone number or e-mail address for any of these: TIPSTER -supposedly contains a dbase of 10^6 documents, the CACM collection, the NPL collection -supposedly 12000 old documents, the TREC collection, the British National Corpus -supposedly 100Mb, or the Tresor de la Langue Francaise? Any help you can offer would be great. Thank you, Gillian Smith ============================================================================ From corplst Tue Oct 6 13:23:46 1992 Date: Tue, 6 Oct 1992 13:23:46 +0100 From: corplst (CORPORA list) To: corpora Subject: Character Sets: Summary (Part I) Send-date: Sat, 3 Oct 1992 22:58:58 UTC+0100 From: Jon Whalen To: Message-ID: corpora:66 9210032155.AA08022(a)pobox.mot.com Subject: Character Sets: Summary (Part I) RFC-822-HEADERS: Summary: Summary of responses Keywords: character-set, charset, font, glyph To All: A week ago I posted an article asking for information on the definition of several character code sets, including ISO 10646/Unicode. Also I asked for a review of the definitions I gave for the terms "grapheme", "glyph", "font", "character set" and "font encoding". And thirdly, I asked for information pertaining to various encodings of non-ascii character sets into ascii. 
I received many, many very detailed and thoughtful replies to my posting and I'd like to thank all those who responded, in particular:

unicode-inc@hq.m4.metaphor.com - Steven A. Greenfield, Office Manager, Unicode
andras@gatekeeper.calera.com - Andra1s Kornai
mark@kirk.retix.com - Mark Hoy
<00V0HORVATH@LEO.BSUVC.BSU.EDU> - Vera Horvath
cuong@haydn.stanford.edu - Cuong T. Nguyen
ath@linkoping.trab.se - Anders Thulin
glenn@metis.com - Glenn Adams
KRAFT@PENNDRLS.UPENN.EDU - Bob Kraft
Harald.Alvestrand@delab.sintef.no - Harald Tveit Alvestrand
lou@vax.ox.ac.uk - Lou Burnard
dominic@british-national-corpus.oxford.ac.uk - Dominic Dunlop
enag@ifi.uio.no - Erik Naggum
connolly@memstvx1.memst.edu - Leo Connolly
churchh@emx.cc.utexas.edu - Henry Churchyard
keld@login.dkuug.dk - Keld J|rn Simonsen

In the sections below, I've tried to summarize my requests and the responses I received. In general, I tried to paraphrase, merge and otherwise condense the replies without distorting the meaning. In many cases, however, I didn't think I would do justice to the original and so I have included whole sections of text from various replies, with (I hope) appropriate accreditation. I apologize in advance for the length and for any typos or omissions I may have made.

Some of what follows will be familiar to those of you who have read the follow-up articles, especially on comp.fonts. However, in fairness to those in other news groups or on various mailing lists, I have included information from both public responses and those e-mailed directly to me.

Due to the number and length of responses, I have split this summary into two pieces. Part I covers questions 1) and 2) in my original posting. Part II includes responses to question 3).

--jon
*Jon S. Whalen Phone: (708) 576-0166*
*Staff Engineer, Motorola, Inc.
Fax: (708) 576-0892*
*Corporate, Computer & Communications R&D *
*Internet: jon@hook.corp.mot.com / Compuserve: 76665,3043 / AOL:JonSWhalen*

[DISCLAIMER: Please note, I do _not_ speak for Motorola, Inc. Nor do I have any commercial, financial or other interest in or affiliation with any other company or organization mentioned herein.]
------------------------------------------------------------------------------
Summary of Responses (Part I)

I asked:

>To All,
>
>I. I'm looking for pointers to documents describing the encodings used
>by several different character sets. In particular, I'm looking for the
>encodings for:
>
>1. EBCDIC
>2. ISO Latin-1 (ISO 8859-1)
>3. ISO Latin-2 (ISO 8859-2)
>4. Unicode [more than just the ascii part]
>5. ISO 10646

Replies:

I. Standards References and Sources

A. ISO Unicode/ISO 10646

To obtain more information on Unicode or to order their printed material and/or diskettes contact:

Steven A. Greenfield
Unicode Office Manager
1965 Charleston Road
Mountain View, CA 94043
Tel. 415-966-4189
Fax. 415-966-1637

In the information packet I received from him, the current prices on the materials (in the US and all prices in US dollars) are:

Volume One of the Unicode 1.0 Character Encoding Standard
  Paperback book -- $32.95
  Implementers Book with Mapping Diskette -- $37.50
Volume Two of the Unicode 1.0 Character Encoding Standard
  Paperback book -- $29.95
  Implementers Book with Mapping Diskette -- $33.95
Unicode Implementers Workshop #3 Proceedings -- $15.00
Diskettes:
  Cross Reference Diskette for Volume 1 -- $5.00
  Cross Reference Diskette for Volume 2 -- $5.00
  Character Name List -- $5.00

Shipping and handling are additional.

For other material... Glenn Adams writes:

>ISO standard documents may be purchased in the US from Omnicom, (703) 281-1135.
>Unicode, and by extension ISO10646, is available in two volumes at your
>local computer bookstore (published by Addison-Wesley).
As for EBCDIC, >see "EBCDIC Bibliographic Character Sets - Sources and Uses: A Brief Report," >Journal of Library Automation, Vol. 12/4, December 1979. This report >documents the extended EBCDIC used by many bibliographic services. Also >see IBM Publication 1403-03, Order #GA 24-3073-7, p. 36-37. and in a response to pfk@rz.uni-jena.de (Frank Klemm) Glenn says: >No fonts exist yet that cover all of Unicode. The Association for Font >Information and Interchange (AFII) is in the process of creating a font >which will be used to print the ISO10646 code charts. I understand this >font *may* become available when complete. The Unicode Consortium is >in the process of creating online mapping files from other character sets >to Unicode, by means of which one could obtain a partial mapping in the >reverse direction. I believe they intend to make these available by FTP >once they are verified for accuracy. Dominic Dunlop writes: >For [ISO Latin-1] and [ISO 10646], use anonymous ftp from dkuug.dk, > subdirectory isp. >--- >Dominic The files there are locally produced I believe, not authorized by ISO or Unicode, so I'm not sure what the restrictions on their use might be. At a glance the 8859-1 character set seems to be complete, but the 10646 file only contains a subset of the 10646/Unicode characters. B. CCITT Among the relevant CCITT standards are: T.50 - International Alphabet No. 5 This is the same as ISO 646, the 7-bit invariant character set. T.51 - Coded Character Sets for Telematic Services This describes the use of the ISO 2022 escape sequence mechanism for specifying a desired code set in terms of control groups C0 and C1 and graphic groups G0, G1, G2, and G3. It also contains several tables in Annex A, sec. A.2 through A.5 which define a set of identifiers for the Latin (extended) character set, digits, and various symbols. Annex B contains a table cross-referencing the various T-series recommendations defining alternative C0, C1, G0, G1, G2, G3 groups. 
[The tables are quite usable. The description of ISO 2022 is impenetrable.]

T.61 - Character Repertoire and Coded Character Sets for the International Teletex Service

This recommendation is specifically applicable to teletex. It contains material from T.50, T.51, ISO 646, ISO 2022, ISO 6429 and ISO 6937, as well as new material and emendations. It also includes a set of identifiers for the extended Latin character set (which appears to be the same as in T.51).

X.408 - Message Handling Systems: Encoded Information Type Conversion Rules

"This Recommendation specifies the algorithms the MHS [Message Handling System] uses when converting between different types of encoded information." This recommendation offers conversion rules and tables for converting among Telex (F.1), IA5 (T.50), Teletex (T.60, T.61 and F.200), Group3 fax, Group4 class1 fax, Videotex, and mixed mode.

C. Internet RFC

Keld Simonsen writes:

>I have written the RFC 1345 with information on the above,
>although only about 2000 characters of UNICODE and ISO 10646
>is covered. It is available by ftp via the normal RFC archives.

Harald Tveit Alvestrand adds:

>
>1345 Simonsen, K. Character Mnemonics & Character Sets. 1992 June; 103 p.
>     (Format: TXT=249738 bytes)
>
>He says he is working on an alternative document that he likes, but has not
>yet (as far as I know) published it.

Erik Naggum writes:

>I think it would be fair to include the reasons I don't like it, rather
>than imply that it's just a matter of "liking" a format or not: the
>tables are generally unreadable, they're full of errors, and they can't
>be debugged by inspection, so you can't even find the errors without
>doing very time-consuming comparisons with the original material. I
>started doing this time-consuming work, but found it easier to go to the
>original sources myself, and start over. That's why I don't "like" RFC
>1345.
All other users of it will also have to do this painstaking
>checking all over, because the RFC's content can't be trusted. (The
>author has announced a new, improved edition, but again, we have to
>trust it, since its correctness and accuracy is extremely hard to
>inspect, even for character sets you know well by heart.)

[By way of comparison, Erik included a sample encoding from RFC 1345 and a sample of his encoding for the code set ISO 8859-1]

>Slightly more verbose :-), but also easily debuggable: it's parsable,
>and the character number is explicitly identified with the character
>name. (Missing characters and resulting "shifts" account for about 400
>errors in RFC 1345.) The names are drawn from ISO/IEC DIS 10646-1.2,
>and will be updated to include the official names from the published
>standard. The fact that these are actually delimited strings also makes
>it possible to construct conversion tables by name lookup, instead of
>typing in (with errors) a pre-composed conversion table.
>
>If this piques your interest, please drop me a line.
>
>Best regards,
>

[There was further discussion on both sides, not included here]

Internet RFCs are available by anonymous ftp from nic.ddn.mil and many other places. There may also be e-mail based ftp servers available; check with your local network administrator.

II. Definitions

I asked:

>II. Are the following statements correct? If not, how far off the mark are they?

And the replies were:

1. Grapheme

I wrote:

> 1. A grapheme is a logical unit of meaning used in the written representation
>    of a language.

Glenn Adams replies:

>In my recent text, "Introduction to Unicode," Proceedings of the Third
>Unicode Implementor's Workshop, August 1992, I defined it as follows:
>
>  "A minimally distinctive unit of writing in the context of a particular
>  writing system."

Leo Connolly replies:

>No way. A grapheme is a minimal unit of writing.
Example: and >, or even capital and lower case , are different graphemes >because they are distinct and non-interchangeable in the English >writing system. (Graphemes are written between the angled brackets ><>.) On the other hand, Roman _a_ (with curve extending up from the >right side and over the top) and "manuscript" _a_ (o-like) are >"allographs" of a single grapheme because they are interchangeable in >our writing system; the choice depends on the writer or font designer. > >Graphemes have *NOTHING WHATSOEVER* to do with meaning. I totally munged it, sorry. I understand now that it is dangerous even to speak of a grapheme except in the context of a particular language and writing system. 2. Glyph I wrote: > 2. A glyph is a pictorial representation of a grapheme. Glenn Adams replies: >I had: > > "An abstract form which represents one or more glyph images, and which is > used to visually depict encoded character data." > >A "glyph image" is then: > > "The actual, concrete image of a glyph representation having been > rasterized or otherwise imaged onto some display surface." > >There is some distinction between "glyph" as "allograph" and "glyph image" >as a particular visual instance of a glyph, since one may take the same >allograph and display it under different transformations, linear and >non-linear. I should not have tried to link glyphs and graphemes, it appears that there's no direct connection, in general. 3. Font I wrote: > 3. A font is an ordered set of glyphs. > (i.e. a mapping between numeric (integer) values and glyphs.) Glenn Adams replies: >OK. I had: > > "A collection of glyphs used for the visual depiction of character data. > A font is often associated with a set of parameters, e.g., size, posture, > weight, serifness, etc., which, when set to particular values, generate > a collection of imagable glyphs." 4. Character Set I wrote: > 4. A character set is an ordered set of graphemes. > (i.e. 
a mapping between numeric (integer) values and graphemes.)

Glenn Adams replies:

>Unfortunately, it doesn't quite work this way. The reason being that
>what constitutes a grapheme can only be determined in the context of
>a particular writing system, i.e., a particular language, a particular
>set of orthographic rules, and one or more collections of symbols (scripts)
>used to depict writing.
>
>Because character sets (generally) exclude language and orthographic
>differences, the characters therein end up encoding the elements of
>scripts independently of their status as graphemes. Also, because
>character sets in the past have often been confused with the enumeration
>of the elements of a font, allographic symbols and arbitrary glyphic
>elements have been encoded along with the more independent graphemic
>symbols.
>
>The best working definition for a character set is simply as "a collection
>of elements used to organize, control, or represent information." A character
>is then simply "an element of a character set," i.e., an element which
>"organizes, controls, or represents information." Trying to read anything
>else into "character," at least as currently implemented by existing character
>set standards, will soon lead you to great conceptual difficulties.
>
>However, to go out on a conceptual limb, I would suggest that under ideal
>circumstances a character represents "an abstract form which may solely
>or jointly be used to represent the symbol of a script, which, in turn, may
>represent one or more graphemes in the context of implicit or explicit
>assumptions about the language and orthographic rules which apply to the
>character encoding." For example, the ASCII character 0x41 represents
>the LATIN CAPITAL LETTER A symbol, which, in an American English text,
>represents the graphemes and . How's that for clarity?

5. Font Encoding

I wrote:

> 5. A font encoding is a mapping between a character set and a font.

Glenn Adams replies:

>No.
A font encoding is a mapping from glyph codes, to the glyphs of a font.
>The mapping between character set(s) and font(s), is, in general, much more
>complex. I offer the following excerpt from a message I sent recently
>on the newsgroup "comp.fonts."
>
> In general, this model will not work for universal character sets which
> cover large numbers of writing systems, e.g., Unicode. It must be replaced
> by a more sophisticated model:
>
>   string of unicode characters ->
>   rendering & layout engine ->
>   positioned glyphs
>
> In general, the rendering and layout engine must be able to do the following:
>
> - map 1 character to N (possibly discontiguous) glyphs*
> - map 1 character to 1 of N glyphs depending on context
> - map N characters to 1 glyph*
> - map M characters to N (possibly discontiguous) glyphs*
> - reorder resulting glyphs according to bidirectional display requirements
> - compute attachment points for glyphs which attach to other glyphs
> - deform glyphs for performing justification which doesn't use whitespace
>   justification
>
> *may also require context sensitivity

I goofed. I believe what I was trying to describe was not a font encoding, but (at least a part of) a rendering.

Reference definitions:

CCITT Recommendation T.51:

character - A member of a set of elements used for the organization, control or representation of data.

coded character set; code - A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their bit combinations.

The Unicode Standard 1.0, Volume 1:

character -
  (1) The smallest component of written language that has semantic value. Character refers to the abstract idea, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader's understanding.
  (2) The basic unit of encoding for the Unicode character encoding, 16 bits of information.
  (3) Synonym for "code element."
  (4) The English name for the ideographic written elements of Chinese origin.

character encoding - association of a unique number with each character in a set of characters. The distinction between characters and glyphs is not absolutely clear in all cases, and so most large character encodings also encode some set of glyphs.

code set - A character encoding; this term is widely used by programmers.

glyph -
  (1) The actual shape (bit pattern, outline) of a character image. For example, an italic "a" and a roman "a" are two different glyphs representing the same underlying character. In this strict sense, any two images which differ in shape constitute different glyphs. In this usage, "glyph" is a synonym for "character image" or simply "image."
  (2) A kind of idealized surface form derived from some combination of underlying characters in some specific context, rather than an actual character image. In this broad usage, two images would constitute the same glyph whenever they have essentially the same topology (as in oblique "a" and roman "a"), but different glyphs when one is written with a hooked top and the other without (as in italic "a" and roman "a"). In this usage, "glyph" is a synonym for "glyph type," where glyph is defined as in sense 1.

[ This is the end of Part 1 -- Part 2 follows in a separate posting ]
============================================================================
From corplst Tue Oct 6 13:24:12 1992
Date: Tue, 6 Oct 1992 13:24:12 +0100
From: corplst (CORPORA list)
To: corpora
Subject: Character Sets: Summary (Part II)

Send-date: Sat, 3 Oct 1992 23:00:52 UTC+0100
From: Jon Whalen
To:
Message-ID: corpora:67 9210032157.AA08047(a)pobox.mot.com
Subject: Character Sets: Summary (Part II)
RFC-822-HEADERS:
Summary: Summary of responses
Keywords: character-set, charset, font, glyph
==================

The following is a summary of the replies to my posting Re: Character Sets

[This is the second of a two-part posting.]

--jon
*Jon S.
Whalen Phone: (708) 576-0166*
*Staff Engineer, Motorola, Inc. Fax: (708) 576-0892*
*Corporate, Computer & Communications R&D *
*Internet: jon@hook.corp.mot.com / Compuserve: 76665,3043 / AOL:JonSWhalen*
----------------------------------------------------------------------------
III. Encodings

I asked:

>III. I'm looking for pointers to documentation on standard and ad-hoc mappings
>between 7-bit ascii and 8-bit charsets. For example, on the mailing list
>GAELIC-L, the accented vowels: a_acute, e_acute, i_acute, o_acute, u_acute,
>a_grave, e_grave, i_grave, o_grave, u_grave; are represented as vowel + /
>and vowel + \, respectively. What other encodings are in use? Where might I
>get documentation (formal or informal) on them? Are there ISO standards
>in existence or under development?

I received the following replies:

A. Hungarian

Andra1s Kornai writes:

>on the Hungarian mailing lists you often find x1 for TeX \'{x}, x2 for
>\"{x}, and x3 for \H{x}, where x is any vowel a u i o e (the
>combinations a2 a3 i2 i3 e2 e3 don't actually occur in Hungarian).
>This code, originated by Ga1bor Pro1sze1ky at around 1980, is very
>popular among people working with Hungarian corpora, less popular (but
>still frequently seen) in political, cultural, etc. Hungarian e-mail
>forums (fora?) of which there are at least five different ones. A
>Hungarian Electronic Resources FAQ is often posted on
>soc.culture.magyar, you'll find more info there if you are interested.

and Vera Horvath writes:

>In response to your query on the Corpora list I would like to share with you
>what I know about character-mappings in Hungarian.
>At the Linguistic Institute of the Academy (in Budapest), the compilers of
>a Hungarian
>electronic corpus (which is still under preparation) decided on the use of
>numbers after vowels to mark accent. Consequently, accented ("long") a is a1,
>long "i" is i1, long e is e1. The number for the umlaut is 2 : o2, u2
>the number for the long umlaut is 3: o3, u3.
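
The digit notation described in the two replies above (1 for the acute accent, 2 for the umlaut, 3 for the long umlaut) can be sketched in Python as follows. This is only an illustration under stated assumptions, not anyone's actual conversion tool: the function name decode_hungarian is hypothetical, the output is assumed to be Unicode text, and the sketch assumes that no ordinary digit directly follows a vowel in the input (a string such as "section a1" would be converted too).

```python
import re
import unicodedata

# Combining marks for the three digit codes (assumed Unicode equivalents).
_MARKS = {
    "1": "\u0301",  # acute accent: a1, e1, i1, o1, u1
    "2": "\u0308",  # diaeresis (umlaut): o2, u2
    "3": "\u030b",  # double acute (long umlaut): o3, u3
}

# Digit 1 may follow any vowel; digits 2 and 3 only follow o and u,
# since a2 a3 i2 i3 e2 e3 do not occur in Hungarian.
_PATTERN = re.compile(r"([aeiouAEIOU])(1)|([ouOU])([23])")


def decode_hungarian(text: str) -> str:
    """Replace vowel+digit pairs with the corresponding accented letter."""
    def repl(match: re.Match) -> str:
        vowel = match.group(1) or match.group(3)
        digit = match.group(2) or match.group(4)
        # Compose vowel + combining mark into a single precomposed letter.
        return unicodedata.normalize("NFC", vowel + _MARKS[digit])
    return _PATTERN.sub(repl, text)


print(decode_hungarian("Ga1bor Pro1sze1ky"))  # Gábor Prószéky
```

Combinations that the replies say do not occur, such as a2, are left untouched rather than guessed at.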
>
>On the Hungarian mailing lists there is no consensus at all. Some people use the
>numbers, others omit accents altogether, still others use single quote (') for
>one accent, tilde (~) for umlaut and double quote (") for two accents.
>(Graphically, these are the closest approximations of the "real" diacritics.)

B. Vietnamese

Cuong T. Nguyen writes:

>For Vietnamese, please refer to the VIQR (VIetnamese Quoted-Readable)
>standard from Viet-Std, which models the prevalent convention used
>on electronic media worldwide. A PostScript version of the standards
>is available for anonymous FTP from Sonygate.Sony.COM (192.65.137.2),
>under /tin/viet-std/viet-std.eng.ps.Z.

C. The Text Encoding Initiative

Lou Burnard responded with a copy of the introduction to:

Guidelines for Electronic Text Encoding and Interchange
Edited by C. M. Sperberg-McQueen and Lou Burnard
TEI P2, Chapter 21
Text Encoding Initiative
Chicago, Oxford
(c) 1990, 1992 ACH, ACL, ALLC
July 15, 1992
Draft Version 2, July 15, 1992

In the body of that document it states:

> Updates on the status of the draft as a whole will be distributed
>automatically to all subscribers of the Listserv list TEI-L. To
>subscribe to TEI-L, send electronic mail to the address LISTSERV@UICVM (or
>Listserv@uicvm.uic.edu) containing the single line
>
> subscribe tei-l J. Smith
>
>(substituting your name for "J. Smith"). TEI-L is also the appropriate
>place to pose questions or offer public comments on the TEI guidelines
>and other relevant issues. The TEI-L file server contains all the
>sections of TEI P2 thus far released, as well as other TEI materials.
Its >contents are shadowed at various other sites around the world; for more >information consult one of the following documents: > >* TEI ED J8, "Obtaining the Second Version of the TEI Guidelines (TEI > P2)" (describes how to retrieve electronic copies of TEI P2 and the > various formats they are available in) >* TEI ED J9, "Obtaining Paper Copies of the Second Version of the TEI > Guidelines (TEI P2)" (describes how to request paper copies of TEI > P2, for those without electronic mail access) > >These documents are available from the TEI-L file server or through the >editors at the addresses given on the User Response and Comment Form. In the introductory note for the TEI-L list it says: > Contributions sent to this list are automatically archived. You can obtain a >list of the available archive files by sending an "INDEX TEI-L" command to >LISTSERV@UICVM. These files can then be retrieved by means of a "GET TEI-L >filetype" command, or using the database search facilities of LISTSERV. Send an >"INFO DATABASE" command for more information on the latter. D. Hebrew, Aramaic, Greek, Arabic, Coptic Bob Kraft replies: >The following material is easily forwardable so I append >it for your interest. It probably is not what you had in >mind! Bob Kraft, UPenn/CCAT >===== [Editor's note: due to the length of the selection, I haven't inserted '>' at the beginning of the following, rather I have surrounded it with a separator.] -------------------------------- BEGIN INCLUDED ---------------------------- Appendix 1: CODING FOR TRANSLITERATION OF HEBREW The Hebrew, Syriac, and Aramaic texts are coded according to the Michigan- Claremont scheme: Hebrew Coding Hebrew Coding alef ) patah A bet B qametz F gimel G hireq I dalet D segol E he H tsere " waw W holam O zayin Z qibbuts U het X shureq W. tet + schwa : yod Y holem waw OW kaf K hateph-pathah :A lamed L hateph-qametz :F mem M hateph-segol :E nun N maqqeph - samek S dagesh . 
ayin ( rape , pe P ketiv * zade C qere ** qof Q resh R sin/shin # sin & shin $ taw T HEBREW ACCENTS/CANTILLATION CODING The accent and cantillation markings are named and cross referenced as in the TABULA ACCENTUM insert card in BHS. The Michigan-Claremont coding is in the first column. The alternate coding used by the French and Belgian projects is also noted. French Belgian Michigan-Claremont BHS CATAB CIB at end (to left) of word, above 00 ; --- sop pasuq [end of verse] - - _ , rape [see above] 80 _ - //--- abbreviation - 83 _ - /--- abbreviation - 84 01 .:--- segolta I.3 14 7 02 )--- zarqa, sinnor I.9,II.7 20 - 03 --- pashta, azla legarmeh I.10,II.12 21 - 04 &--- telisha parvum I.25 47 30 05 |--- paseq [separator] "Nota" 11 - - |-,-- legarmeh (74 + 05) I.18 40 + 11 - at start (to right) of word, below 10 ---< yetib (yetiv) I.11 23 (42) 13 --- dehi or tipha II.9 (49) - at start (to right) of word, above 11 ---/ (81 + ) mugrash II.5 - - 14 ---% telisha magnum I.17 27 31 above word 24 -&-- telisha qetannah (med) - - - 44 -%-- telisha magnum (med) - - - 60 --<- ole or mahpakatum (~I.2) - 43 61 -/-- geresh or teres I.13 25 81 62 -"-- garshajim I.14 26 82 63 --- azla, azla or qadma I.24,II.19 45=46 8 64 -,-- illuj II.15 - 44 65 -#-- shalshelet (magn,parv) I.4,II.6+20 15 33 80 -:-- zaqep parvum I.5 16 6 81 -.-- rebia (magnum=parvum) I.7,II.4=8 19 5 (cf 80) 82 --)- sinnorit II.21 (20) 9 83 -+-- pazer I.15,II.10 28 90 84 -&%-- pazer mag. 
or qarne para I.16 29 32 85 -|:-- zaqep magnum I.6 17 60 below word 35 -F|:-- meteg (med) - (12) (,) 70 -<-- mahpak or mehuppak I.20,II.11+18 43 42 71 -/-- mereka I.21,II.14 41 1 72 -//-- mereka kepulah (duplex) I.22 42 11 73 --- tipha, tarha I.8,II.16 18 2 (munah) - ---- majela [= 73] I.27 49 - 74 -,-- munah I.18-19,II.13 40 4 (dehi/tarha) 75 -|-- silluq [meteg (left)] I.1,II.1 12 , 91 -./-- tebir I.12 22 10 92 -^-- atnah I.2,II.3 13 3 93 -v-- galgal or jerah I.26,II.17 48 41 94 -s-- darga I.23 44 40 95 -|-- meteg (right) [cf 35,75] - (12) - rak 4/02/86 Appendix 2: CODING FOR TRANSLITERATION OF GREEK AND COPTIC Letter Greek (TLG) Coptic (RAK) alfa A A beta B B gamma G G delta D D epsilon E E digamma/vau (=6) V V zeta Z Z eta H H theta Q Q iota I I kappa K K lamda L L mu M M nu N N ksi C C omicron O O pi P P koppa (=90) #3 rho R R sigma S S [sigma final j ] tau T T upsilon U U phi F F chi X X psi Y Y omega W W sampi(=900) #5 smooth breathing ) shai s rough breathing ( fai f iota subscript | chai(Bo) acute accent / hori h grave accent  janjia j circumflex acc. = gima g ti t chi-rho x diaeresis + overline  (backslash) midpoint punct. : dash - (hyphen) capital letter * (precedes) RAK 5/23/88 Appendix 3: CODING FOR THE TRANSLITERATION OF ARMENIAN (Leiden, AIAS) Armenian Code Code Uncial Minuscule ayb A a ben B b gim G g da D d ec E e za Z z e E/ e/ et' E^ or E: e^ or e: t'o T^ t^ ze Z^ z^ ini I i liwn L l xe X x ca C c ken K k ho H h ja J j lat L^ l^ ce C/ c/ men M m yi Y y nu N n sa S^ s^ o O o c'a C^ c^ pe P p je J^ j^ ra R^ r^ se S s vew V v tiwn T t re R r c'o C| c| hiwn W or U w or u p'iwr P^ p^ k'e Q q O O/ o/ fe F f full stop : semicolon ; half-comma  question ? acute accent ] Other codes in the text correspond to various editorial markings which are difficult to translate into ASCII. A little experimentation will render them useful. A few characters need no transliteration. 
The following characters and editorial markings do not appear to be in the documents in the collection, either by coincidence or because they were filtered out during transmission and/or reformatting, but are included in the AIAS coding scheme: dash - - tilde ~ = crux % left curly bracket { < right curly bracket } > upper left half bracket Z [ upper right half bracket ? $ lower left half bracket @ + lower right half bracket * [rvh 2/88] Appendix 4: CHART OF CODINGS FOR TRANSLITERATION OF HEBREW-ARABIC There is as yet no substantial agreement as to what codes to use for the transliteration of Hebrew and Arabic. This collection uses the Michigan- Claremont scheme for Hebrew texts, and the Oxford schemes for Arabic. The following chart records various codings, along with some suggestions towards standardization of Semitic language coding: current proposed ANSI-Heb Arabic Hebrew Mich-Clar by Kraft 1975 Oxford Arabic alef ) A @ A elif bet B B B B ba gimel G G G J, G g'im, g'ain dalet D D D D, d dal, d_al he H H H O, X ha, h_a waw W W W W waw zayin Z Z Z Z, & zay, z.a het X X X H h.a tet + t J V t.a yod Y Y Y ? kaf K K, k K, K/ K kaf lamed L L L L lam mem M M, m M, M/ M mim nun N N, n N, N/ N nun samek S S S ? ayin ( J & E (ain pe P P, p P, P/ F fa zade C C, c C, C/ C, c s.ad, d.ad qof Q Q Q Q k.af resh R R R R ra sin/shin # s F sin & f $ S, s sin, s'in shin $ F ? ? taw T T T T, t ta, t_a ? ta marbut.a ---------------------------------- END INCLUDED ---------------------------- patah A h A/ P hamsa qametz F a A hireq I i I segol E e E tsere " y E/ holam O o O qibbuts U u U shureq W. W* W* holem waw OW oW OW schwa : % % hateph-pathah :A %h %O hateph-qametz :F %a %A hateph-segol :E %e %E maqqeph - - dagesh . * : shadda rape , ^ ketiv * | qere ** || [rak-rvh 2/88] Appendix 5: CODINGS FOR SANSKRIT Each Sanskrit document in this collection is coded in a slightly different way. 
In order to reduce the possibilities of confusion, we shall first list the fifty symbols needed, then chart them by document. The assistance of J. Hubbard and D. Wujastyk in this endeavor is gratefully acknowledged:

Description of phoneme / standard scholarly representation

 1  Short a / a
 2  Long a / a with macron above
 3  Short i / i
 4  Long i / i with macron above
 5  Short u / u
 6  Long u / u with macron above
 7  Short vocalic r / r with dot beneath
 8  Long vocalic r / r with dot beneath and macron above
 9  Vocalic l / l with dot beneath
10  e / e
11  Diphthong ai / ai
12  o / o
13  Diphthong au / au
14  Unaspirated unvoiced velar / k
15  Aspirated unvoiced velar / kh
16  Unaspirated voiced velar / g
17  Aspirated voiced velar / gh
18  Velar nasal / n with a dot above
19  Unaspirated unvoiced palatal / c
20  Aspirated unvoiced palatal / ch
21  Unaspirated voiced palatal / j
22  Aspirated voiced palatal / jh
23  Palatal nasal / n with tilde above
24  Unaspirated unvoiced retroflex / t with dot beneath
25  Aspirated unvoiced retroflex / t with dot beneath followed by h
26  Unaspirated voiced retroflex / d with dot beneath
27  Aspirated voiced retroflex / d with dot beneath followed by h
28  Retroflex nasal / n with dot beneath
29  Unaspirated unvoiced dental / t
30  Aspirated unvoiced dental / th
31  Unaspirated voiced dental / d
32  Aspirated voiced dental / dh
33  Dental nasal / n
34  Unaspirated unvoiced labial / p
35  Aspirated unvoiced labial / ph
36  Unaspirated voiced labial / b
37  Aspirated voiced labial / bh
38  Labial nasal / m
39  Palatal semivowel / y
40  Retroflex semivowel / r
41  Dental semivowel / l
42  Labial semivowel / v
43  Palatal sibilant / s with acute accent above
44  Retroflex sibilant / s with dot beneath
45  Dental sibilant / s
46  Sonant aspirate h / h
47  Pure nasal / m with dot beneath
48  Hard breathing / h with dot beneath
49  Elided short vowel / apostrophe
50  Punctuation mark / diagonal slash

LETTER CODES BY DOCUMENT

Phoneme    Heart                Bhagavad-   Kali-
by number  Sutra     Rigveda    gita        dasa

 1         a         A          a           a
 2         A         A:         aa          aa
 3         i         I          i           i
 4         I         I:         ii          ii
 5         u         U          u           u
 6         U         U:         uu          uu
 7         r         /R         .r          *r
 8         RI        R:         r*          &r
 9         L         /L         .l          /l
10         e         E          e           e
11         ai        AI         ai          ai
12         o         O          o           o
13         au        AU         au          au
14         k         K          k           k
15         kh        KH         kh          kh
16         g         G          g           g
17         gh        GH         gh          gh
18         ng        /G         [?]         n*
19         c         C          c           c
20         ch        CH         ch          ch
21         j         J          j           j
22         jh        /H         jh          /h
23         nj        /J         n#          n#
24         T         /T         .t          /t
25         TH        /TH        .th         [?]
26         D         /D         .d          .d
27         DH        /DH        .dh         ? /dh
28         N         /N         .n          .n
29         t         T          t           t
30         th        TH         th          th
31         d         D          d           d
32         dh        DH         dh          dh
33         n         N          n           n
34         p         P          p           p
35         ph        PH         ph          ph
36         b         B          b           b
37         bh        BH         bh          bh
38         m         M          m           m
39         y         Y          y           y
40         R         R          r           r
41         l         L          l           l
42         v         V          v           v
43         S         [?]        z           z
44         SH        /S         .s          .s
45         s         S          s           s
46         h         H          h           h
47         M         /M         .m          .m
48         :         /H         .h          .h
49         '         [?]        '           '
50         / /

[rvh 2/88]
----------------------------------------------------------------------------
The End :-)
============================================================================
From corplst Tue Oct 6 13:24:17 1992
Date: Tue, 6 Oct 1992 13:24:17 +0100
From: corplst (CORPORA list)
To: corpora
Subject: ICAME bibliography

Send-date: Mon, 5 Oct 1992 12:47:36 UTC+0100
From: ALTEN
To:
Message-ID: corpora:68 (q)alf.uib.no.706:05.09.92.11.46.07(q)(a)uib.no
Subject: ICAME bibliography

Dear CORPORA subscriber,

As you know, one important aim of ICAME is to provide information about studies based on or related to the English text corpora distributed through ICAME. For this purpose, a regularly updated bibliography is available on the ICAME fileserver: FILESERV@HD.UIB.NO. To obtain the bibliography, send a message to this address giving the following lines in the body of the message:

send icame bibliography.1991
send icame bibliography.1992

To make the bibliography as useful and up-to-date as possible we need your help.
Could you please check the latest version of the bibliography and send any corrections and/or additions to the compiler Bengt Altenberg (not to Bergen!), by e-mail or ordinary mail, under the following address: Department of English Helgonabacken 14 S-223 62 Lund, Sweden E-mail: Alten@seldc52.bitnet Please note the following: o Send information about published or readily accessible works only (leave forthcoming titles till they have appeared in print); o Give full bibliographical information: name of author(s), year of publication, title of work, and (where relevant) name of periodical/collection, editor(s), page references, place of publication and publisher; o As far as possible, use the stylistic conventions of the ICAME bibliography (or the LSA style sheet); o For each title, indicate which corpus or corpora the study is based on or related to. Thank you, Bengt Altenberg ============================================================================ From corplst Tue Oct 6 13:24:29 1992 Date: Tue, 6 Oct 1992 13:24:29 +0100 From: corplst (CORPORA list) To: corpora Subject: Re: Non-English taggers and tagged corpora Send-date: Mon, 5 Oct 1992 13:10:44 UTC-0500 From: (Jim Barnett) To: Cc: Message-ID: corpora:70 9210051810.AA04366(a)paintbrush.mcc.com Subject: Non-English taggers and tagged corpora The University of Kyoto has a Japanese segmenter/morphological analyzer, called Juman, that is freely available. We tested it on some newspaper stories and found it roughly 93% accurate (that is, 93% of the words it identified were both correctly segmented and correctly tagged.) 
- Jim Barnett ============================================================================ From corplst Tue Oct 6 15:14:38 1992 Date: Tue, 6 Oct 1992 15:14:38 +0100 From: corplst (CORPORA list) To: corpora Subject: lost mail Send-date: Tue, 6 Oct 1992 12:08:58 UTC+0100 From: M Wynne To: Message-ID: corpora:75 3009.9210061104(a)gps.leeds.ac.uk Subject: lost mail Sorry to be such a clot but I accidentally erased an interesting reply to my query on indirect discourse. It was from William Rapaport, so if you're out there, could you please send me it again. ============================================================================ From pedersen@parc.xerox.com Tue Oct 6 02:50:08 1992 From: Jan Pedersen Sender: Jan Pedersen Fake-Sender: pedersen@parc.xerox.com To: corpora@nora.hd.uib.no In-Reply-To: CORPORA list's message of Tue, 6 Oct 1992 05:24:37 -0700 <199210061224.AA05578@nora.hd.uib.no> Subject: Re: Need help finding these corpus Date: Tue, 6 Oct 1992 09:50:08 PDT From: (Circle Noetic Svc, A Nizhnikov,PAS) To: Subject: Need help finding these corpus We have heard of some text corpuses but we don't know how to access any of them. Does anyone out there have an address, phone number or e-mail address for any of these: TIPSTER -supposedly contains a dbase of 10^6 documents, the CACM collection, the NPL collection -supposedly 12000 old documents, the TREC collection, the British National Corpus -supposedly 100Mb, or the Tresor de la Langue Francaise? Any help you can offer would be great. Thank you, Gillian Smith TIPSTER corpus and the TREC corpus are one and the same, a large text collection put together under the auspices of DARPA (US Defense Advanced Research Projects Agency) for the purpose of evaluating information retrieval systems. (TIPSTER is a DARPA project on information retrieval. TREC stands for Text Retrieval Evaluation Corpus.) It is not yet publicly available, but will eventually be distributed by NIST (The US Bureau of Standards) on CDROM's.
The texts in the TIPSTER/TREC corpus have been extracted from the ACL/DCI text collection. There are roughly one million documents in the collection, occupying over 1.5 Gigabytes in ascii form. Along with documents there are 50 queries each paired with a set of documents judged relevant.

The CACM collection is a classic information retrieval reference collection. Like TREC it pairs queries with sets of documents judged relevant. Unlike TREC it is very small (approximately 1.5 Megabytes) with many documents only having titles and the rest titles and abstracts. The CACM corpus, along with other IR reference collections is available on CDROM from Edward Fox (fox@fox.cs.vt.edu).

Jan Pedersen
============================================================================
From pedersen@parc.xerox.com Tue Oct 6 04:27:06 1992
From: Jan Pedersen
Sender: Jan Pedersen
Fake-Sender: pedersen@parc.xerox.com
To: corpora@nora.hd.uib.no
Subject: TREC/TIPSTER corpus
Date: Tue, 6 Oct 1992 11:27:06 PDT

I've received the following corrections to my previous message on the TREC/TIPSTER corpus:

From:
To: pedersen@parc.xerox.com
Subject: TREC

TREC (text retrieval evaluation CONFERENCE)

> (The US Bureau of Standards) on CDROM's. The texts in the TIPSTER/TREC
> corpus have been extracted from the ACL/DCI text collection.

there are a large number of texts that were not part of the ACL/DCI stuff.

> There are roughly one million documents in the collection, occupying
> over 1.5 Gigabytes in ascii form.

in fact, each disk contains about 1.2 GB of text after you account for compression.

> Along with documents there are 50 queries each paired with a set of
> documents judged relevant.

and a list of the documents that were examined and found irrelevant.

J.P.
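
For readers new to IR test collections, here is a rough sketch in Python of why the list of documents judged irrelevant matters as well as the relevant set. With both lists in hand, a retrieved document that nobody examined can be told apart from one that was examined and rejected. This is only an illustration: the document IDs and judgments are made up, the function names are invented here, and nothing below reflects the actual TREC file formats or scoring procedure.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved = list(retrieved)
    relevant = set(relevant)
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def unjudged(retrieved, judgments):
    """Return retrieved documents that were never examined by a judge."""
    return [doc for doc in retrieved if doc not in judgments]


# Made-up judgments for one query: document ID -> judged relevant?
judgments = {"doc1": True, "doc2": False, "doc3": True, "doc4": True}
relevant = {doc for doc, ok in judgments.items() if ok}

# Made-up system output for the same query.
retrieved = ["doc1", "doc2", "doc5"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
print(unjudged(retrieved, judgments))
```

In this toy case doc2 counts against the system because it was examined and rejected, while doc5 is simply unknown. How unjudged documents ought to be scored is a separate decision that the sketch does not settle.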
============================================================================ From dominic@natcorp.ox.ac.uk Tue Oct 6 20:49:26 1992 id <06548-0@oxmail.ox.ac.uk>; Tue, 6 Oct 1992 16:00:01 +0100 Tue, 6 Oct 92 15:59:35 +0100 From: Dominic Dunlop Date: Tue, 6 Oct 92 15:59:30 BST To: corpora@nora.hd.uib.no Subject: Re: Need help finding these corpus Cc: D1634@AppleLink.Apple.com X-Project: British National Corpus X-Organization: Oxford University Computing Service X-Address: 13 Banbury Road, Oxford OX2 6NN, U.K. X-Phone: +44 865 273280 X-Fax: +44 865 273275 Gillian Smith writes: > We have heard of some text corpuses but we don't know how to access any of > them. Does anyone out there have an address, phone number or e-mail address for > any of these: ... the British National Corpus -supposedly 100Mb The British National Corpus is to be a balanced, TEI-conformant, part-of-speech tagged, corpus of 100 million WORDS of modern spoken and written British English (making it, according to my guess, about 2 gigabytes uncompressed), but will not be available until April, 1994. --- Dominic Dunlop, BNC Project Manager, OUCS Contact information in mail header ============================================================================ From chrisbr@cogsci.edinburgh.ac.uk Tue Oct 6 17:04:02 1992 Via: uk.ac.edinburgh.cogsci; Tue, 6 Oct 1992 16:13:29 +0100 To: ianj <@prg.oxford.ac.uk:ianj@sharp> Cc: corpora@nora.hd.uib.no Subject: Re: Non-English taggers and tagged corpora In-Reply-To: Your message of Tue, 06 Oct 92 13:24:29 +0100. <199210061224.AA05575@nora.hd.uib.no> Date: Tue, 06 Oct 92 16:04:02 +0100 From: Chris Brew > Send-date: Mon, 5 Oct 1992 13:10:44 UTC-0500 > From: (Jim Barnett) > To: > Cc: > Message-ID: corpora:70 9210051810.AA04366(a)paintbrush.mcc.com > Subject: Non-English taggers and tagged corpora > > > The University of Kyoto has a Japanese segmenter/morphological > analyzer, called Juman, that is freely available.
We tested it on some > newspaper stories and found it roughly 93% accurate (that is, 93% of > the words it identified were both correctly segmented and correctly > tagged.) > > - Jim Barnett Got this message off a mailing list. Don't know if you're interested. Chris ============================================================================ From corplst Tue Oct 6 22:23:17 1992 Date: Tue, 6 Oct 1992 22:23:17 +0100 From: corplst (CORPORA list) To: corpora Subject: Re: Need help finding these corpus Directory: 1)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 9:45:09 UTC-0400 From: Subject: Need help finding these corpus 2)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 9:30:22 UTC-0600 From: ted Subject: Need help finding these corpus 3)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 19:18:32 UTC+0100 From: Subject: IR collections 4)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 13:38:57 UTC-0500 From: C. M. Sperberg-McQueen Subject: Re: Need help finding these corpus Messages: 1)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 9:45:09 UTC-0400 From: To: Message-ID: corpora:88 9210061345.AA18619(a)sanborn.bbn.com Subject: Need help finding these corpus > Date: Tue, 6 Oct 1992 13:24:37 +0100 > From: CORPORA list > > Send-date: Mon, 5 Oct 1992 20:15:37 UTC+0100 > From: (Circle Noetic Svc, A Nizhnikov,PAS) > To: > Message-ID: corpora:71 718310969.5138762(a)AppleLink.Apple.COM > Subject: Need help finding these corpus > > We have heard of some text corpuses but we don't know how to access any of > them. 
Does anyone out there have an address, phone number or e-mail address for > any of these: TIPSTER -supposedly contains a dbase of 10^6 documents, the CACM > collection, the NPL collection -supposedly 12000 old documents, the TREC > collection, the British National Corpus -supposedly 100Mb, or the Tresor de la > Langue Francaise? Any help you can offer would be great. Thank you, Gillian > Smith > You can get information about both the TIPSTER corpora and the TREC collection (perhaps also CACM) from : Donna K. Harman NIST Building 225, A216 Gaithersburg, MD 20899 301/975-3569 harman@magi.ncsl.nist.gov ........................................ Sean Boisen -- sboisen@bbn.com BBN Systems and Technologies, Cambridge MA 2)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 9:30:22 UTC-0600 From: ted To: Message-ID: corpora:90 9210061530.AA21055(a)NMSU.Edu Subject: Need help finding these corpus any of these: TIPSTER -supposedly contains a dbase of 10^6 documents, the CACM collection, the NPL collection -supposedly 12000 old documents, the TREC collection, the British National Corpus -supposedly 100Mb, or the Tresor de la Langue Francaise? just a note... the TIPSTER english and TREC corpora are identical. i believe that the tipster corpus is still not publically available. the easiest large corpus to get right now is the acl/dci text corpus which comes on cdrom. contact the dci for this. 3)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 19:18:32 UTC+0100 From: To: Message-ID: corpora:93 E4F69708921F81FB75(a)cs.umass.EDU Subject: IR collections Some of the corpora mentioned by Gillian Smith are standard test collections used in Information Retrieval (IR). These are the CACM collection and the NPL collections. Both of them (as well as a number of other test collections) are available from Ed Fox (fox@fox.cs.vt.edu). 
The Tipster corpus is available from Donna Harman (harman@magi.ncsl.nist.gov). It is part of an effort to test information retrieval systems using a truly large corpus (multiple gigabytes). The TREC corpus is the same as the Tipster corpus. TREC is an open competition to test IR systems, and Tipster is a government-sponsored research effort. Please contact Donna Harman for more information about TREC. Bob krovetz@cs.umass.edu 4)------------------------------------------------------------------ Send-date: Tue, 6 Oct 1992 13:38:57 UTC-0500 From: C. M. Sperberg-McQueen To: CORPORA list Message-ID: corpora:95 (q)alf.uib.no.973:06.09.92.18.56.31(q)(a)uib.no Subject: Re: Need help finding these corpus RFC-822-HEADERS: Organization: ACH/ACL/ALLC Text Encoding Initiative On Tue, 6 Oct 1992 13:24:37 +0100 Gillian Smith said: > ... Does anyone out there have an address, phone number or e-mail > address for any of these: > > TIPSTER -supposedly contains a dbase of 10^6 documents, > the CACM collection, My current CACM (35.10, Oct 1992) says (ad on p. 116) that all ACM computer science literature is available -- on an article-by-article basis, apparently -- from Engineering Information Inc. Document Delivery Service. Phone Ruth Miller at Engineering Information, Inc., +1 (800) 221-1044, +1 (201) 216-8537, fax +1 (201) 216-8526. I doubt that this is what you want, or are talking about. I have vague memories of seeing ads in CACM for a CD-ROM of computer science material, but I can't find them in any copies of CACM on my shelf, and no CD that I can recognize is listed among the ACM publications in the front matter of the journal. Perhaps it's described in SIGIR documents somewhere? > the NPL collection -supposedly 12000 old documents, > the TREC collection, > the British National Corpus -supposedly 100Mb, or Not completed yet. Contact Lou Burnard, lou@vax.ox.ac.uk for information. > the Tresor de la Langue Francaise? Contact Mark Olsen, mark@gide.uchicago.edu for information. 
> Any help you can offer would be great. Thank you, Gillian >Smith > ============================================================================ From corplst Tue Oct 6 22:22:51 1992 Date: Tue, 6 Oct 1992 22:22:51 +0100 From: corplst (CORPORA list) To: corpora Subject: Non-English taggers and tagged corpora Send-date: Tue, 6 Oct 1992 9:32:24 UTC-0600 From: ted To: Message-ID: corpora:91 9210061532.AA21417(a)NMSU.Edu Subject: Non-English taggers and tagged corpora Date: Tue, 6 Oct 1992 13:24:29 +0100 From: corplst%nora.hd.uib.no (CORPORA list) Send-date: Mon, 5 Oct 1992 13:10:44 UTC-0500 From: (Jim Barnett) To: Cc: Message-ID: corpora:70 9210051810.AA04366(a)paintbrush.mcc.com Subject: Non-English taggers and tagged corpora ... kyoto's tagger juman is 93% accurate ... the first version of juman was pretty slow and not terribly accurate due to its small lexicon. the new version is rumored to remedy both defects. ============================================================================ From ingria@BBN.COM Tue Oct 6 13:48:46 1992 To: ram@claude.cs.umb.edu Cc: corpora@x400.hd.uib.no, corplst@nora.hd.uib.no In-Reply-To: " (CORPORA list)"'s message of Fri, 2 Oct 1992 00:29:11 +0100 <199210012328.AA05767@nora.hd.uib.no> Subject: 'graded' corpora Reply-To: ingria@BBN.COM Date: Tue, 6 Oct 92 17:48:46 EDT From: ingria@BBN.COM Sender: ingria@BBN.COM From: Robert Morris To: Subject: 'graded' corpora But I confess, I didn't try calling the publishers of the written material to ask whether they can supply electronically. I'll do that, but I'm not optimistic about the results. Most publishers live in the stone age (or really, the phototype age) about computer technology and can't even _accept_ stuff in electronic form, let alone produce it. You might try to negotiate with them about getting hold of the typesetting tapes. Even though this may not be a regular product, it is sometimes possible to get hold of a copy of the typesetting tape. 
I've never negotiated directly with a publisher to get hold of such a tape, but those who have indicate that the responses range from ``If you'll actually take care of the original tapes for us, we'll give them to you'' to demands for scads and scads of money. The problem is (1) publishers' main business currently is text publishing, so they are leery about giving away their product (fear of piracy or use by competitors); and (2) since there is no general market for ``electronic books'' etc., there are no market forces to set a price, so they do not have any idea of what an electronic version of their product is worth, and so may wildly over- or under- estimate the value. If you are willing to accept electronic versions of texts that are no longer produced, this may help out. Publishers are sometimes more willing to distribute E-versions of texts that they no longer have any market interest in, and, hence, have little or nothing to lose if they are pirated. You will almost certainly have to sign a licensing agreement if you pry something loose, agreeing not to re-distribute the texts, use them in commercial products, etc. Good luck. -30- Bob Ingria ============================================================================
From ingria@BBN.COM Tue Oct 6 14:07:59 1992 To: corplst@nora.hd.uib.no In-Reply-To: CORPORA list's message of Tue, 6 Oct 1992 12:29:49 +0100 <199210061129.AA05416@nora.hd.uib.no> Subject: Non-English taggers and tagged corpora Reply-To: ingria@BBN.COM Date: Tue, 6 Oct 92 18:07:59 EDT From: ingria@BBN.COM Sender: ingria@BBN.COM From: Subject: Re: Non-English taggers and tagged corpora The concordancer Letteratura Amica (or Literary Amiga in English) developed by Raffaele Cocchi of the U of Bologna tags and works in most European languages.
He's still working on improving the allophones and algorithms for the speech function (this talks in its 9 [?] languages, too), but the concordancer part is well developed. Some questions: (1) Does this work for all the EC languages? (2) What sorts of tags does it have for nouns and verbs? e.g. for a language with rich morphological Case, such as German or Modern Greek, one might expect the noun tags to include the Case information, whereas for English and Dutch, say, where only pronouns bear overt Case, the noun tags probably wouldn't. Similarly for verbs and aspect, mood, and voice. (3) How large are the lexicons for each language for the tagger functions? (4) How does the tagger deal with unknown words? Is it stochastic? Rule-based? Stochastic with knowledge-based overlay? His address: Via Toffano, 6; 40125 Bologna, Italy Does he have an EMail address? -30- Bob