From corpora-request@uib.no Thu Jan 21 01:39:07 1993
Date: Thu, 21 Jan 1993 00:39:07 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Repeated structures and fuzzy matching

********************* Text Corpora List: Addresses ***************************
CORPORA@NORA.HD.UIB.NO for messages to the list
CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator
FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP)
******************************************************************************

Send-date: Wed, 20 Jan 1993 12:27:27 UTC+0100
From: Magnus Merkel
Subject: Repeated structures and fuzzy matching

Three questions on Repeated Structures and Fuzzy Matching:

1. I have found Lessard & Hamm's article on repeated structures in the Journal of ALLC, 4/91, but can anybody give me more references to studies of recurrent patterns in text (sentences, phrases and syntactic patterns)?

2. Is there anything on how to identify phrases with minimal linguistic information, e.g. on the basis of function words and punctuation marks?

3. Where can I find out more about fuzzy matching of sentences and strings? These things are implemented in some translation memory-based translation software, but unfortunately, they are not very good. So any hints here are very welcome.

The background for this is that we are interested in automatic text analysis, with the specific goal of giving a "diagnosis" of how translatable a given text is. The results of the analysis will give the translator a translation profile and certain measurements that indicate what kind of translation tool (if any) should be used for a particular text. We have done some preliminary studies of computer handbook texts and constructed tools for analysing recurrent sentences and phrases. The results we are getting are measurements of how large a proportion of the text is made up of these recurrent structures.
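As a rough illustration of the kind of measurement described above, the following Python sketch computes the proportion of running words that fall inside sentences occurring more than once. It is not the tool used in the study: the sentence splitter, the normalization step, and the names `split_sentences`, `normalize` and `recurrent_coverage` are all assumptions made for this example.

```python
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    # Case-fold and collapse whitespace so trivially different copies match.
    return " ".join(sentence.lower().split())


def recurrent_coverage(text, min_count=2):
    """Share of running words that belong to sentences seen at least min_count times."""
    sentences = [normalize(s) for s in split_sentences(text)]
    counts = Counter(sentences)
    total_words = sum(len(s.split()) for s in sentences)
    covered = sum(len(s.split()) for s in sentences if counts[s] >= min_count)
    return covered / total_words if total_words else 0.0


sample = ("Press Enter to continue. Select a file. "
          "Press Enter to continue. The file is saved.")
coverage = recurrent_coverage(sample)
```

In the sample, two of the four sentences are identical, so 8 of the 15 running words count as recurrent. Measuring recurrent phrases rather than whole sentences would need a different unit of comparison, which this sketch does not attempt.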
We have run the tools on approximately 1 million words of computer program manuals. At the moment, the "phrases" or strings found by the analysis tool have to be manually revised before we can "measure" their coverage in the text. By looking at a bilingual corpus, and using an alignment program to construct "translation memories" of existing translations, we want to study how recurrence is handled in a) actual manual translations and b) translation memory-based translations (for example, by using IBM's Translation Mgr). Of special interest is to compare how the translation quality is affected, for example in translation consistency and text binding.

Comments and suggestions are very welcome.

Regards,
Magnus Merkel
Dept. of Computer and Information Science
Linkoping University
S-581 83 Linkoping
Sweden
email: magme@ida.liu.se

From corpora-request@uib.no Mon Jan 25 08:19:07 1993
Date: Mon, 25 Jan 1993 07:19:07 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Repeated structures and fuzzy matching

Send-date: Thu, 21 Jan 1993 9:20:12 UTC-0500
From: (Ian Lancashire)
Subject: Re: Repeated structures and fuzzy matching

Toronto's TACT 2.0 system (MS-DOS) has a CollGen (Collocation Generator) program that lists all maximal phrases in a text. This is a batch procedure operating off a TACT database. I have articles on phrasal repetitions in Chaucer, Shakespeare, T.S. Eliot and Margaret Atwood at press now, but nothing in the libraries yet. TACT 2.0 is in beta-test now.
Version 1.2 (available for FTP transfer from epas.utoronto.ca (/pub/cch/tact)) has a phrase lister that produces just a highly redundant index, not a list of maximal phrases. TACT will not handle immense corpora well, although it has been used for that purpose. Its theoretical limit is any text with a word-type occurring more than 65,000 times. TACT was devised for literary analysis. Tom Horton (Florida Atlantic University, Computer Science) released a very interesting word-cluster program last year, again for MS-DOS. He might be able to help on fuzzy collocation. Phrasal studies are among the most interesting opportunities for new research. Would that some enterprising researcher could devise the software to identify repeating phrases in corpora! ---------------------------------- Prof. Ian Lancashire Dept. of English, New College Director, Centre for Computing in the Humanities Univ. of Toronto, Toronto, Ont. M5S 1A1, CANADA Voice: (416) 978-8279; FAX: (416) 978-6519 E-mail: ian @ epas.utoronto.ca From corpora-request@uib.no Tue Jan 26 16:21:51 1993 Date: Tue, 26 Jan 1993 15:21:51 +0100 From: PSP10%PHOENIX.CAMBRIDGE.ac.uk@alf.uib.no To: "corplst (CORPORA list)" Subject: Re: Repeated structures and fuzzy matching Can I get hold of the Tact phrase collector? Paul (Procter), Cambridge Language Survey From corpora-request@uib.no Tue Jan 26 16:40:04 1993 Date: Tue, 26 Jan 1993 15:40:04 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: M.L.King's Dream text? ********************* Text Corpora List: Addresses *************************** CORPORA@NORA.HD.UIB.NO for messages to the list CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP) ****************************************************************************** Send-date: Sat, 23 Jan 1993 11:28:58 UTC-0500 From: (John Bro) Subject: M.L.King's Dream text? 
Is Martin Luther King's "I have a dream" speech available online anywhere?

Thanks, j.
============================================================
John Bro              | bro@elm.circa.ufl.edu
Linguistics           | bro@oak.circa.ufl.edu
University of Florida | bro@ufoak.bitnet
Gainesville, Fl 32611 | bro@reef.cis.ufl.edu

[ From list moderator:
The text is listed in the Oxford Text Archive list:

King, Martin Luther
P-1501-A | "I have a dream" speech. New York, 1968: Pocket Books. Depositor: Michael S. Hart, 405 West Elm Street, Project Gutenberg. [Taken from the peaceful warrior]

contact: archive@ox.ac.uk or hart@vmd.cso.uic.edu

The text was also included on the Walnut Creek Desktop Bookshop CD-ROM, but this CD was temporarily withdrawn for copyright reasons. For more info about CD-ROMs from Walnut Creek, use anonymous FTP to cdrom.com, contact velte@cdrom.com or send the following line to fileserv@nora.hd.uib.no

send info walnut.cdrom.info

Knut Hofland ]

From corpora-request@uib.no Tue Jan 26 16:40:32 1993
Date: Tue, 26 Jan 1993 15:40:32 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Repeated structures and fuzzy matching

Send-date: Mon, 25 Jan 1993 17:09:27 UTC-0500
From: Joe Raben
Subject: Re: Repeated structures and fuzzy matching

How does one reach Tom Horton to inquire about his cluster program?
From corpora-request@uib.no Wed Jan 27 20:55:27 1993 Date: Wed, 27 Jan 1993 19:55:27 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: queries ********************* Text Corpora List: Addresses *************************** CORPORA@NORA.HD.UIB.NO for messages to the list CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP) ****************************************************************************** Send-date: Tue, 26 Jan 1993 23:16:24 UTC-0500 From: (Ian Lancashire) Subject: queries I'll gladly send others Tom Horton's e-mail address as I have sent it to Joe Raben. Paul Proctor asks how he can get the TACT phrase generator. An older version can be had by anonymous FTP from epas.utoronto.ca (/pub/cch/tact); the latest version is in beta-testing and will be released in June in the same way. CollGen can only be used on a TACT database file, not a plain text, so that you have to buy into the entire program. A few people have expressed willingness to help us beta-test the new version, but this involves more work than many would be eager to do. Ian Lancashire -- Dept. of English, New College Director, Centre for Computing in the Humanities Univ. of Toronto, Toronto, Ont. M5S 1A1, CANADA Voice: (416) 978-8279; FAX: (416) 978-6519 E-mail: ian @ epas.utoronto.ca From corpora-request@uib.no Wed Jan 27 20:57:32 1993 Date: Wed, 27 Jan 1993 19:57:32 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: Re: M.L.King's Dream text? 
********************* Text Corpora List: Addresses *************************** CORPORA@NORA.HD.UIB.NO for messages to the list CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP) ****************************************************************************** Send-date: Wed, 27 Jan 1993 17:35:29 UTC+1000 From: CALIX Subject: Re: M.L.King's Dream text? I think it is available at oes.orst.edu in pub/alamnac/etext or mrcnext.cso.uiuc.edu in /etext Lloyd edulh@lure.latrobe.edu.au Oops almanac, above. From corpora-request@uib.no Wed Jan 27 20:57:56 1993 Date: Wed, 27 Jan 1993 19:57:56 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: Re: Repeated structures and Fuzzy matching ********************* Text Corpora List: Addresses *************************** CORPORA@NORA.HD.UIB.NO for messages to the list CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP) ****************************************************************************** Send-date: Wed, 27 Jan 1993 7:57:32 UTC+0100 From: Magnus Merkel Subject: Repeated structures and Fuzzy matching Here's a compilation of some of the response I got after my request about Repeated Structures and Fuzzy Matching. Thanks to all who responded. Magnus Merkel ************************************************************************ Pim van der Eijk of Digital Equipment in The Netherlands, has submitted a paper to EACL-93 with the title "Automating the Acquisition of Bilingual Terminology". This is very interesting for people who are into "minimal phrase parsing" and making use of bilingual corpora for translation purposes. If you are interested you should contact Pim van der Eijk directly (eijk@cecehv.enet.dec.com). Magnus M. 
***************************************************************************** From: kornai@Csli.Stanford.EDU (Andras Kornai) Message-Id: <9301202033.AA03107@Csli.Stanford.EDU> Subject: Re: Repeated structures and Fuzzy matching To: mme@ida.liu.se (Magnus Merkel) Date: Wed, 20 Jan 93 12:33:53 PST In-Reply-To: <9301201930.AA29952@Csli.Stanford.EDU>; from "Magnus Merkel" at Jan 20, 93 11:30 am The standard reference on string matching is David Sankoff & Joseph Kruskal: Time warps, string edits, and macromolecules (Addison Wesley 1983). There are some newer developments, but the basic techniques are all there... Andras Kornai (kornai@csli.stanford.edu) ***************************************************************************** From roscheis@Csli.Stanford.EDU Wed Jan 20 21:44:22 1993 From marti@banyan.Berkeley.EDU Wed Jan 20 23:35:13 1993 Date: Wed, 20 Jan 93 14:30:27 -0800 Message-Id: <9301202230.AA02053@banyan.Berkeley.EDU> From: Marti Hearst Sender: marti@banyan.Berkeley.EDU To: magme@ida.liu.se Subject: empiricist query 3. Where can I find out more about fuzzy matching of sentences and strings? These things are implemented in some translation memory-based translation software, but unfortunately, they are not very good. So any hints here are very welcome. I don't know if this is what you want, but on page 74 of the National Research Council's report Computing the Future there is a discussion of how Lewis and Chandy of Cal Tech modified a string search algorithm by Knuth,Morris, and Pratt (the KMP Algorithm) and Boyer and Moore, to do fuzzy matching on DNA sequences. Unfortunately there are no real references but the blurb was written by Chuck Seitz of Cal Tech, but the DNA literature might be a good place to look. 
Marti Hearst

*****************************************************************************
From ska@dou.dk Thu Jan 21 10:34:17 1993
Date: 21 Jan 93 10:34 +0100
From: Sabine Kirchmeier-Andersen
To:
Message-Id: <52*ska@dou.dk>
Subject: tools for measuring texts

... I saw your request on the corpus list on tools for measuring recurring structures in texts in order to determine whether they are suitable for translation or not. Here at the University of Odense, we have started to work on corpus linguistics, particularly with a view to syntactic valency. We have some background in MT - a couple of years in the Eurotra project. For our research we are developing a general purpose tool that can help us pick out valency patterns for specified verbs in a corpus of 4 million words, but it is possible to specify other patterns and structures as well. ...

Sabine Kirchmeier-Andersen
Institute for Language and Communication
University of Odense
Campusvej 55
DK-5230 Odense M

*****************************************************************************
From rousse@steinway.u-strasbg.fr Thu Jan 21 17:01:43 1993
Date: Thu, 21 Jan 93 17:00:28 +0100
From: Rousselot Francois
Message-Id: <9301211600.AA00477@steinway.u-strasbg.fr>
To: Magnus Merkel (by way of roscheis@cs.stanford.edu (Martin Roscheisen))
Subject: Re: Repeated structures and Fuzzy matching
Status: O

I just received a book that will surely interest you: "Patterns of Lexis in Text", author Michael Hoey, Oxford University Press, 1991 ...

F Rousselot

*****************************************************************************
Message-Id: <9301221949.AA25930@ida.liu.se>
Date: Fri, 22 Jan 1993 14:46-EST
From: Marc.Ringuette@GS80.SP.CS.CMU.EDU
To: Magnus Merkel (by way of roscheis@cs.stanford.edu (Martin Roscheisen))
Subject: Re: Repeated structures and Fuzzy matching
Status: O

It sounds like you're considering some very fancy solutions -- but you may want to consider an inductive learning approach.
Write a program to extract a likely-looking set of features from each text. Then produce a thousand or so "training examples" for an inductive learning system like ID3 or backpropagation, which consist of n pairs. For example, I'm working on assigning keywords to text: I extract word occurrences as features, and try to classify into keyword categories. You might consider trying to inductively learn the class "texts which are best fed to software package X". -- Marc Ringuette (mnr@cs.cmu.edu) ************************************************** From corpora-request@uib.no Thu Jan 28 11:06:51 1993 Date: Thu, 28 Jan 1993 10:06:51 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: Re: Repeated structures and Fuzzy matching ********************* Text Corpora List: Addresses *************************** CORPORA@NORA.HD.UIB.NO for messages to the list CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP) ****************************************************************************** Send-date: Wed, 27 Jan 1993 15:09:49 UTC-0500 From: Subject: Re: Repeated structures and Fuzzy matching On fuzzy matching, there was an article in BYTE (Nov 92, vol 17 #12, pp. 281-290) on agrep, with the "a" meaning approximate. There are flags you can set that control the amount of fuzziness that constitutes a match. We FTPed it here and it seems to work reasonably well. 
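One common way to make "the amount of fuzziness that constitutes a match" concrete is an edit-distance threshold, in the spirit of the string-edit techniques covered by Sankoff & Kruskal. The Python sketch below applies that idea to sentences, treating words as the units of comparison. It is only an illustration under stated assumptions: it does not reproduce agrep's algorithm or its flags, and the names `edit_distance`, `similarity`, `best_match` and the 0.7 threshold are invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1,          # delete x
                           cur[j - 1] + 1,       # insert y
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def similarity(s, t):
    """Word-level similarity in [0, 1]; 1.0 means identical after case-folding."""
    a, b = s.lower().split(), t.lower().split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_match(query, memory, threshold=0.7):
    """Return (sentence, score) for the closest stored sentence, or (None, score)."""
    scored = [(similarity(query, m), m) for m in memory]
    score, match = max(scored, default=(0.0, None))
    return (match, score) if score >= threshold else (None, score)


memory = ["Press the Return key to continue", "Select a file from the list"]
match, score = best_match("Press the Enter key to continue", memory)
```

Here the query differs from the first stored sentence by one substituted word out of six, so it scores 5/6 and clears the threshold. Raising or lowering `threshold` plays the role of the fuzziness setting; whether word-level or character-level distance suits a translation memory better is left open.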
malcolm brown Dartmouth From corpora-request@uib.no Sat Jan 30 11:35:48 1993 Date: Sat, 30 Jan 1993 10:35:48 +0100 From: corplst@nora.hd.uib.no (CORPORA list) To: corpora@nora.hd.uib.no Subject: CETH 1993 Summer Seminar in Electronic Texts in the Humanities ********************* Text Corpora List: Addresses *************************** CORPORA@NORA.HD.UIB.NO for messages to the list CORPORA-REQUEST@NORA.HD.UIB.NO for messages to list administrator FILESERV@NORA.HD.UIB.NO for requests to file server (try sending HELP) ****************************************************************************** Send-date: Thu, 28 Jan 1993 17:28:00 UTC-0500 From: Susan Hockey Subject: CETH 1993 Summer Seminar in Electronic Texts in the Humanities [This is being cross-posted to several lists. Apologies if you receive it more than once: --SH] CENTER FOR ELECTRONIC TEXTS IN THE HUMANITIES Electronic Texts in the Humanities: Methods and Tools The Second Annual Summer Seminar at Princeton University, New Jersey August 1-13, 1993 organized by The Center for Electronic Texts in the Humanities, Princeton and Rutgers with the co-sponsorship of the Centre for Computing in the Humanities, University of Toronto The Center for Electronic Texts in the Humanities (CETH) is again offering an intensive two-week seminar during August 1993. The seminar will address a wide range of challenges and opportunities that electronic texts and software offer to teachers, scholars and librarians in the humanities. Discussions on the capture, markup, retrieval, presentation, transformation, and analysis of electronic text will prepare students for extensive hands-on experience with illustrative software, e.g., MTAS, Micro-OCP, WordCruncher, Tact, and hypertext. 
Resources on CD-ROM and Internet, such as the OED, Perseus, CDWORD, and several large textual collections in classical Greek, Latin, French, Italian, and English, will be demonstrated so that participants may make informed evaluations of their significance in the light of current and future technologies. Approaches to markup, from ad hoc schemes to the systematic design of the Text Encoding Initiative, will be surveyed and considered. The focus of the Seminar will be practical and methodological, with the immediate aim of assisting participants in their own teaching, research, and advising. It will be concerned with the demonstrable benefits of using electronic texts, with typical problems and how to solve them, and with the ways in which software fits or can be adapted to common methods of textual study. Participants will be expected to work on coherent projects, preferably of their own devising, and will be given the opportunity to present them on the last day. Throughout the Seminar, the instructors will provide assistance with designing projects, locating sources for texts and software, and solving practical problems. Ample computing facilities will be available 24 hours per day. A small library of essential articles and books in humanities computing will be on hand to supplement printed seminar materials, which include an extensive bibliography. Special lectures will describe current research in the field and address research topics, as well as the role of the library in the use of electronic texts. The Seminar is intended for faculty, students, librarians, technical advisers, and academic administrators with direct responsibilities for humanities computing support. It assumes basic computing experience but not necessarily with the application of computers to academic research and teaching. The number of participants will be limited to 30. Provisional Schedule Week 1, August 1-6, 1993 Sunday, August 1. Registration and introductions Monday, August 2. 
The electronic text
a.m. What electronic texts are and where to find them; survey of existing inventories, archives, and other current resources. History of computer-assisted text analysis in the humanities. Introduction to simple concordancing with MTAS, including practical session.
p.m. Creating and capturing texts in electronic form; keyboard entry vs. optical scanning. Demonstration of optical character-recognition technology. Introduction to text encoding, surveying ad hoc methods, e.g. COCOA, WordCruncher, TLG beta code; problems of these methods. Practical exercise in deciding what to encode in typical texts.

Tuesday, August 3. Concordancing
a.m. A focussed look at computer-assisted concordance generation; types of concordances, their specific advantages and disadvantages. Alphabetization, character sequences, sorting, and forms of presentation. Introduction to Micro-OCP; practical session in its use.
p.m. Further work on concordancing with Micro-OCP.

Wednesday, August 4. The interactive concordance
a.m. Indexed, interactive retrieval vs. batch concordance generation. Textual problems and interpretative approaches particularly suitable to an interactive system; the continuing use of concordances in hardcopy. Preparation of text for indexed retrieval; differing roles of markup and external "rules"; kinds of displays and their augmentation through post-processing. Introduction to Tact.
p.m. Practical work using Tact: simple markup, compilation of a textual database, and methods of inquiry.

Thursday, August 5. Stylistics; SGML
a.m. Stylistic comparisons and authorship studies using concordance tools; basic statistics for lexical and stylistic analysis. Case studies, e.g. Federalist Papers, Kenny on Aristotle, Burrows on Jane Austen.
p.m. Introduction to the Standard Generalized Markup Language (SGML) and the Text Encoding Initiative (TEI). Document structure and SGML elements. Start-tags, end-tags, and empty tags. Document type declarations.
Group tagging of simple examples. SGML entities and their uses: character representation, boilerplate text, file management. Introduction to TEI Core tags and base tags for prose. Group tagging of examples using TEI tags.

Friday, August 6. SGML and TEI
a.m. The TEI Header: documentation for electronic texts. The file description; the encoding description; the text profile; the revision history. Overview of the TEI DTDs: base tag sets, additional tag sets, and auxiliary document types.
p.m. Using TEI in practice. Overview of available commercial and public-domain software (the latter will be distributed to participants). Creating TEI texts; validation; processing. Tools for processing SGML texts: commercial and public-domain. Examples: translating a TEI text into COCOA (for OCP), WordCruncher format, TACT format. Practical session creating and validating TEI-conformant texts.

Week 2, August 9-13, 1993

Monday, August 9. Scholarly editions
a.m. Overview of tools for preparing critical editions. Constructing glossaries and material for commentary; application of Micro-OCP and/or Tact. Collation; single-text vs. multiple-text methods. Overview of software tools. Introduction to Collate.
p.m. Electronic publication. Discussion of methods and implications.

Tuesday, August 10. Electronic Dictionaries
a.m. The electronic dictionary; from machine-readable dictionary to computational lexicon. What the New OED and other online dictionaries can do for the scholar. Uses of lexical knowledge bases in text retrieval. Building a simple online lexicon with Tact.
p.m. Individual project work.

Wednesday, August 11. Hypertext
a.m. Hypertext and hypermedia: techniques of presentation and organization of textual data for analysis; possible combinations of hypertext and concordancing methods. Reading and writing the hypertextual book; hypertextual note-taking and annotating. Practical introduction to constructing a hypertext.
p.m.
Further practical session on building a hypertextual system. Demonstration and discussion of Perseus, StorySpace and Voyager texts.

Thursday, August 12. Evaluation; Projects
a.m. Review of the previous week's work. Discussion on the limitations of existing software. Advanced analytical tools not commonly available, e.g. pattern recognizers, lemmatization systems, morphological analyzers, parsers; overview of these. The contributions of computational linguistics and artificial intelligence, and where research in these areas is headed. Examination of some existing resources.
p.m. Completion of project work.

Friday, August 13. Projects
a.m. Presentation of participants' projects.
p.m. Concluding discussion of basic questions. What from a scholarly and methodological perspective is to be gained? What are the probable effects on research and teaching? What can one learn from the collision of automatic methods with intuitive perceptions? What is the role of humanities computing: merely an efficient facilitator of traditional work or a fundamental component for pursuing new questions? Where do we go from here with software, and with its application? How can the machine better assist us in educating the imagination?

The Center for Electronic Texts in the Humanities

The Center for Electronic Texts in the Humanities was established in October 1991 by Rutgers and Princeton Universities with external support from the Mellon Foundation and the National Endowment for the Humanities. As a national focus of interest in the U.S. for those who are involved in the creation, dissemination and use of electronic texts in the humanities, it also acts as a national node on an international network of centers and projects which are actively involved in the handling of electronic texts.
Developed from the international inventory of machine-readable texts which was begun at Rutgers in 1983 and is held on RLIN, the Center is now reviewing the records in the inventory and continues to catalog new texts. The acquisition and dissemination of text files to the community is another important activity, concentrating on a selection of good quality texts which can be made available over Internet with suitable retrieval software and with appropriate copyright permission. The Center also acts as a clearinghouse on information related to electronic texts, directing enquirers to other sources of information. Instructors The seminar will be taught by Susan Hockey and Willard McCarty, with assistance from Michael Sperberg-McQueen (SGML and TEI), Elli Mylonas (Hypertext) and staff of Computing and Information Technology, Princeton. Susan Hockey is Director of the Center for Electronic Texts in the Humanities. Before moving to the USA in October 1991, she spent 16 years at Oxford University Computing Service where her most recent position was Director of the Computers in Teaching Initiative Centre for Textual Studies. At Oxford she was responsible for various humanities computing projects including the development of the Oxford Concordance Program (OCP), an academic typesetting service for British universities, and OCR scanning. She has taught courses on humanities computing for fifteen years and has given numerous guest lectures on various aspects of computing in the humanities. She is the author of three books and numerous articles on humanities computing and has been Chair of the Association for Literary and Linguistic Computing since 1984. She is a member (currently Chair) of the Steering Committee of the Text Encoding Initiative. Willard McCarty has been active in humanities computing since 1977. 
With its founding Director, Ian Lancashire, he helped to set up the Centre for Computing in the Humanities, University of Toronto, of which he is now the Assistant Director. He was the founding editor of Humanist, the principal electronic seminar for computing humanists, and has edited several other publications in the field. He regularly gives talks, papers, and lectures throughout North America and Europe. McCarty took his Ph.D. in English literature in 1984; his current literary research is in classical studies, especially the Metamorphoses of Ovid. In support of a forthcoming book, he has an electronic edition of that poem underway for the text-retrieval program Tact. Elli Mylonas is a Research Associate in Classics at Harvard University, and is currently the Managing Editor of the Perseus Project. She has co-taught tutorials on "Teaching with Hypertext" at the Hypertext meetings in San Antonio and Milan (1991, 1992). In addition to coordinating the Perseus Project, her responsibilities cover the creation and structuring of the textual component of the project, and working together with the user interface designers and documentation specialists. She is the project leader for Pandora, a Macintosh search program for the TLG and PHI disks. Elli Mylonas is a founding member and one of the two organizers of CHUG (Computing in the Humanities User's Group), a humanities computing seminar that has been meeting biweekly at Brown University for the last 4 years. She is also on the Text Representation Committee of the Text Encoding Initiative, where she has worked on identifying SGML structures for tagging reference systems, drama and verse in literary texts. She has published and spoken on hypertext, descriptive markup and literary texts, and the use of computers in education. C. M. 
Sperberg-McQueen studied Germanic medieval literature in the comparative literature program at Stanford University; since 1980 he has been working to bring computing technology to bear on problems of textual research. In 1985 and 1986, he served as a consultant for humanities computing in the Princeton University Computer Center; since 1987 he has worked at the academic computer center at the University of Illinois at Chicago, where he is now a senior research programmer. He is a member of the steering committee, and the editor in chief, of the Text Encoding Initiative. Fees The cost of participating in this Summer Seminar will be $895, including tuition, use of computer facilities, student accommodation, breakfast and lunch at Princeton for the two weeks, and banquet and reception. Students pay a reduced rate of $795. For those who prefer hotel accommodations, the cost is $645 to cover tuition, lunch, the banquet and reception, and $565 for students. There will be 24-hour access to networked microcomputers in the student accommodation throughout the seminar. Application Procedure To apply for participation in this Summer Seminar, submit a one-page statement of interest. The statement should indicate (1) how participation in the Seminar would be relevant for your teaching, research, librarianship, advising or administrative work, and possibly that of your colleagues; (2) what project you would like to undertake during the Seminar, or what area of the humanities you would most like to explore; and (3) the extent of your computing experience. Applications must be attached to a cover sheet specifying your name, current institutional affiliation and position, postal and email addresses, and phone and fax numbers, as available, as well as natural language interest and computing experience. Currently enrolled students must also include a photocopy of a valid student ID. E-mail submissions should have a subject line `Summer Seminar Application'. 
The statement must be received by the reviewing committee, consisting of members of the Center's Governing Board, by APRIL 15, 1993, at the address below. Those who have been selected to attend will be notified by May 15, 1993. Payment will be requested at this time.

Summer Seminar 1993
Center for Electronic Texts     phone: (908) 932-1384
  in the Humanities             fax: (908) 932-1386
169 College Avenue              bitnet: ceth@zodiac
New Brunswick, NJ 08903         internet: ceth@zodiac.rutgers.edu
USA

From corpora-request@uib.no Wed Feb 3 10:49:47 1993
Date: Wed, 3 Feb 1993 09:49:47 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: French quotations dictionary

Send-date: Fri, 29 Jan 1993 16:16:10 UTC+0100
From:
Subject: French quotations dictionary

Does anybody know where I can get a (computer-readable) dictionary of quotations from famous authors? The nationality of the authors doesn't matter, but their quotations should be translated into French.
Thanx,

Alain Lucien Knaff
Appartement 310b
11, rue General Mangin
38100 Grenoble
France
Email: knaff@mururoa.imag.fr
Tel. (home): (33) 76 85 23 05 (repondeur & minicom 3612)
Fax: (33) 76 54 76 15

From corpora-request@uib.no Wed Feb 3 10:50:05 1993
Date: Wed, 3 Feb 1993 09:50:05 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Corpus annotation

Send-date: Sat, 30 Jan 1993 15:26:04 UTC
From: Lou Burnard
Subject: Corpus annotation

I have a general question on the subject of corpus annotation or tagging. Indeed, it may be so general as not to be worth asking, but I'll ask it anyway. It's prompted by work I am currently doing in editing the draft section of the TEI Guidelines which deals with corpora, and in which an attempt is made to categorize the various kinds of corpus annotation and to point to model TEI-conformant solutions for them.

The question is: can you distinguish more than three fundamentally different kinds of corpus annotation? and if so, what are they?

I should probably define what I mean by corpus annotation. I don't include under this heading properties of a corpus fragment relating to its formal structure or organization (for example, page, chapter or line numbers), nor properties having to do with its context (for example, the circumstances of its production, its genre or medium).
I am thinking more of linguistic annotation (for example, following the model of the "parsed" LOB, the TOSCA project, the SUSANNE corpus and many others), but similar techniques might be used for all kinds of textual categorizations. I'm conscious of a distinctly English bias in my way of looking at the topic, however, so I am hoping for corrections to this simplistic world view.

It seems to me that there are three different levels at which such annotation is performed:

1. the token level (an individual code is associated with each token in the running text)
2. the segment level (codes are associated with particular sequences of tokens in the running text)
3. the associative level (codes are associated with associations or links between particular tokens or segments in the running text)

Token-level annotation includes such things as the LOB word-class codes. It has well-recognised problems of scope and granularity, but is well understood and widely used.

Segment-level annotation includes all kinds of syntactic analyses or "labelled bracketing" of texts. Again there are problems of scope (special steps must be taken to deal with discontinuous or nested segments) but the basic mechanisms are well understood.

Associative annotation is less widespread, perhaps because of technical implementation difficulties. Examples include the work of Leech and others at Lancaster in modelling anaphors.

I don't attempt to propose a typology of annotation, you'll note, just a typology for things-annotated. Better brains than mine have tried, and abandoned, the quest for a universal typology of annotation. For similar reasons, I'm agnostic as to how the annotation itself is represented, and whether or not (for example) it has internal structure. Though important, that issue seems to me to be quite distinct from the one I'm raising here.
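The three levels described above can be sketched as three small record types. This is only an illustration under stated assumptions: the class names, the tag strings and the example sentence are invented here, and the representation is neither a TEI encoding nor the scheme of any particular corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenAnnotation:
    """Token level: one code attached to one token of the running text."""
    index: int   # position of the token in the running text
    code: str    # e.g. a word-class tag


@dataclass(frozen=True)
class SegmentAnnotation:
    """Segment level: one code attached to a sequence of tokens."""
    start: int   # index of the first token (inclusive)
    end: int     # index one past the last token (exclusive)
    code: str


@dataclass(frozen=True)
class LinkAnnotation:
    """Associative level: one code attached to a link between two items."""
    source: object   # a TokenAnnotation or SegmentAnnotation
    target: object   # a TokenAnnotation or SegmentAnnotation
    code: str


tokens = ["The", "committee", "said", "it", "agreed"]

# Hypothetical word-class tags, not actual LOB codes.
token_level = [
    TokenAnnotation(i, tag)
    for i, tag in enumerate(["DET", "NOUN", "VERB", "PRON", "VERB"])
]

# Two noun-phrase segments over the same token sequence.
subject_np = SegmentAnnotation(0, 2, "NP")
pronoun_np = SegmentAnnotation(3, 4, "NP")
segment_level = [subject_np, pronoun_np]

# An anaphoric link from the pronoun back to its antecedent.
associative_level = [LinkAnnotation(pronoun_np, subject_np, "anaphor-of")]


def covered_text(segment, tokens):
    """Return the running text covered by a segment annotation."""
    return " ".join(tokens[segment.start:segment.end])
```

Note that the associative level only refers to items already identified at the other two levels, which is why a link can be stored separately from the text it annotates.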
Lou Burnard

From corpora-request@uib.no Wed Feb 3 10:50:20 1993
Date: Wed, 3 Feb 1993 09:50:20 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: TEXT Technology

Send-date: Sun, 31 Jan 1993 7:29:02 UTC-0600
From: Eric Johnson DSU, Madison, SD 57042
Subject: TEXT Technology

Readers of this list who would like to receive a complimentary copy of the most recent issue of the journal TEXT Technology should send a request to me by email at eric@sdnet.bitnet or johnsone@columbia.dsu.edu, or send a request to me by regular mail to the address listed at the end of this message.

TEXT Technology publishes articles and reviews about all facets of using computers for the creation, processing, and analysis of texts. It is designed for academic and corporate writers, editors, and teachers. The bi-monthly journal contains timely reviews of software for writing and publishing, discussions of applications for the analysis of literary works and other texts, notices of significant events in computing around the world, bibliographic citations, and much more.

Submissions of articles and reviews are welcome.
They should be sent as ASCII files via e-mail to the Editor, Eric Johnson, at ERIC@SDNET.BITNET, or they may be submitted on MS-DOS disks sent to

Eric Johnson
TEXT Technology
114 Beadle Hall
Dakota State University
Madison, South Dakota 57042, USA

From corpora-request@uib.no Wed Feb 3 10:50:35 1993
Date: Wed, 3 Feb 1993 09:50:35 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Fuzzy matching

Send-date: Mon, 1 Feb 1993 8:30:11 UTC-0500
From: (Malcolm Brown)
Subject: Re: Fuzzy matching

Somebody asked for exact information about the agrep source. According to the BYTE article, you can FTP it from cs.arizona.edu or find it on BIX in the frombyte92 listings area.

Malcolm Brown
Dartmouth

From corpora-request@uib.no Sat Feb 6 21:19:53 1993
Date: Sat, 6 Feb 1993 20:19:53 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Children's French corpus?

Send-date: Thu, 4 Feb 1993 17:39:00 UTC+0100
From: BLACKWELLSA
Subject: Children's French corpus?

Can anyone on the list tell me whether there is a corpus of children's French utterances in existence? I'm enquiring on behalf of a student. The relevant age range is 2 to 8 years.

Thanks!
Sue Blackwell
University of Birmingham

From corpora-request@uib.no Sat Feb 6 21:20:20 1993
Date: Sat, 6 Feb 1993 20:20:20 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: STATITEXT

Send-date: Fri, 5 Feb 1993 7:45:50 UTC-0700
From: Ron Southerland
Subject: STATITEXT

This is a request for information (from firsthand experience or otherwise) on a text analysis product called STATITEXT (version 1.0 for the Mac). I have the limited info provided by the developer but, in view of the not insignificant cost and the lack (as far as I know) of third-party reviews, would like to hear from anyone with knowledge of the product. The STATITEXT brochure promises virtually the moon in terms of text analysis (linguistic, literary, etc.) and would seem to be the answer to my own problems with respect to the analysis of longish electronic texts.

Thanks for any help that may be forthcoming.
Ron Southerland
Linguistics
University of Calgary

From corpora-request@uib.no Sat Feb 6 21:20:35 1993
Date: Sat, 6 Feb 1993 20:20:35 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Getting the ACL/DCI CD-ROM

Send-date: Fri, 5 Feb 1993 15:58:57 UTC+0100
From:
Subject: Getting the ACL/DCI CD-ROM

Could anyone tell me to whom I should apply to get a copy of the ACL/DCI CD-ROM?

Thank you in advance for your help

Mats Eeg-Olofsson
Institutionen för lingvistik/Department of Linguistics and Phonetics
Lunds universitet/Lund University
Helgonabacken 12
S-223 62 LUND
Sverige/Sweden
Telefon/Phone: Int + 46 46 108444
Fax: Int + 46 46 104210
Datorpost/E-mail: Mats.Eeg-Olofsson@lings.lu.se

From corpora-request@uib.no Sat Feb 6 21:20:07 1993
Date: Sat, 6 Feb 1993 20:20:07 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Announcement - Alvey NL Tools Release 4

Send-date: Fri, 5 Feb 1993 12:03:53 UTC+0100
From:
Subject: Announcement - Alvey NL Tools Release 4

THE ALVEY NATURAL LANGUAGE TOOLS (RELEASE 4)
BASIC DESCRIPTION AND DISTRIBUTION ARRANGEMENTS

A fourth (and final) release of the Alvey Natural Language Tools (ANLT) is now available.
The UK Alvey Programme originally funded three projects at the Universities of Cambridge, Edinburgh and Lancaster to provide tools for use in natural language processing research. The DTI and SERC have funded their continued support and enhancement. The tools, a MORPHOLOGICAL ANALYSER, PARSERS and a GRAMMAR and LEXICON, are usable individually as well as together (integrated by a GRAMMAR DEVELOPMENT ENVIRONMENT), forming a complete system for the morphological, syntactic and semantic analysis of a considerable subset of English.

DISTRIBUTION AND LICENSING

The ANLT system is available by anonymous FTP from Cambridge University, Computer Laboratory. The files containing grammars, lexicons and source code are encrypted; the reports describing the system, the specimen licence agreement and other information are not. If, after examining the documentation, you wish to purchase a licence for use of the system for research purposes, you should complete and sign the specimen agreement and return it together with a cheque for the amount specified in the agreement (currently 500 ECU -- 100 ECU upgrade -- or local currency equivalent) to:

Lynxvale WCIU Programs
20 Trumpington St.
Cambridge, CB2 1QA, UK
Fax: +223 332797

On receipt, Lynxvale will send you (by letter) the key which can be used in conjunction with the software provided to decrypt the remaining files. If you do not have access to anonymous FTP, you can write to Lynxvale for further details and obtain the system on magnetic tape or cartridge.

We are currently negotiating with Longman Group UK Ltd, who have an interest in the large lexicon, to provide a commercial licence for use of the ANLT system. A specimen commercial licence agreement will be deposited in the files shortly.

DESCRIPTION

The MORPHOLOGICAL ANALYSER provides a set of mechanisms for the analysis of complex word forms.
The analyser requires data files specifying a lexicon of base morphemes, rules governing spelling changes when concatenating morphemes, and rules describing valid combinations of morphemes in complex words. The tools include a description of English morphology in this form. The analyser should be capable, though, when provided with the necessary linguistic analyses, of being used for most European languages and many others. The morphological analyser is now available independently of the rest of the tools package by anonymous FTP from scott.cogsci.ed.ac.uk [129.215.144.3]:/pub/phonology/tools/MAP/MAP3.1.tar.Z Further enquiries may be sent to Alan W Black (awb@ed.ac.uk). There are two alternative PARSERS. The main one is an optimized chart parser, incorporating a 'packing' mechanism (making it much more efficient when parsing sentences containing multiple local ambiguities). The other parser is a non-deterministic LALR(1) parser which seems, in most cases, to be even more efficient than the chart parser. The GRAMMAR is a wide-coverage syntactic and semantic grammar of English, written in a metagrammatical formalism derived from Generalized Phrase Structure Grammar. The grammar pairs one or more formulas of the lambda calculus with each syntactic rule and these produce unscoped (mostly) first-order `event-based' compositional semantic representations. 
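To make the three kinds of data file required by the morphological analyser more concrete (a lexicon of base morphemes, spelling-change rules, and rules for valid morpheme combinations), here is a toy sketch. It is an assumption-laden illustration only: the dictionaries, rule functions and feature names are invented for this example and do not reflect the ANLT file formats or its rule formalism, and the spelling rules deliberately overgenerate.

```python
# Toy data, invented for illustration; not the ANLT lexicon or rule formalism.

# 1. Lexicon of base morphemes, with a syntactic category for each.
LEXICON = {"move": "V", "walk": "V", "box": "N", "cat": "N"}
SUFFIXES = {"s": "PLURAL_OR_3SG", "ed": "PAST", "ing": "PROG"}

# 3. Valid combinations: which suffixes may attach to which category.
COMBINATIONS = {"V": {"s", "ed", "ing"}, "N": {"s"}}


def undo_spelling(stem, suffix):
    """2. Spelling rules: yield candidate base forms for stem + suffix."""
    yield stem                        # plain concatenation: walk + ed
    if suffix in ("ed", "ing"):
        yield stem + "e"              # e-deletion: move + ed -> moved
    if suffix == "s" and stem.endswith("e"):
        yield stem[:-1]               # e-insertion: box + s -> boxes


def analyse(word):
    """Return all (base, category, feature) analyses licensed by the toy data."""
    analyses = []
    if word in LEXICON:
        analyses.append((word, LEXICON[word], None))
    for suffix, feature in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        for base in undo_spelling(stem, suffix):
            category = LEXICON.get(base)
            if category is not None and suffix in COMBINATIONS[category]:
                analyses.append((base, category, feature))
    return analyses
```

Keeping the lexicon, the spelling rules and the combination rules as separate data is what would let the same procedure be pointed at another language, as the announcement suggests.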
Full coverage is provided of the following constructions and their combinations:

- all sentence types: declaratives, imperatives and questions (yes/no, tag and wh questions),
- all unbounded dependency types: topicalisation, relativisation, wh questions,
- a relatively exhaustive treatment of verb and adjective complement types,
- phrasal and prepositional verbs of many complement types,
- passivisation, verb phrase extraposition,
- sentence and verb phrase modification,
- noun phrase complements,
- noun phrase pre- and post-modification,
- partitives,
- coordination of all major category types,
- nominal and adjectival comparatives.

The LEXICON contains 40,000 homonyms (63,000 entries in total) in the form required by the morphological analyser.

The GRAMMAR DEVELOPMENT ENVIRONMENT gives access to all of the other components of the tools, allowing grammars to be input, edited, and browsed; it also compiles them into the base grammatical formalism used by the parsers, and provides extensive grammar debugging facilities.

A simple quantifier scoping and post-processing module is supplied as an example of how the result of parsing a sentence can be converted into a representation suitable for further semantic and pragmatic processing. In addition, an illustrative database management application with a small database of wine merchants' stock is supplied.

All of the software components are written in Common Lisp and have been tested in several implementations on a wide range of machines.

We have created a BULLETIN BOARD which we hope can be used to inform existing users about developments, to provide some informal support, and as a forum for discussion between people doing research with the ANLT system. Submissions should be sent to alveynltools@cl.cam.ac.uk and requests to be added to or deleted from the distribution list should be sent to alveynltools-request@cl.cam.ac.uk.
If you are an existing user and this message has come to you direct, your email address has been added to the list already; unfortunately though, we do not have up-to-date email addresses for all known users, so please email alveynltools-request otherwise.

Two published REFERENCES to these projects are:

Briscoe, E., C. Grover, B. Boguraev & J. Carroll, 'A Formalism and Environment for the Development of a Large Grammar of English', Proceedings of 10th International Joint Conference on Artificial Intelligence, Milan, 1987, pp. 703-708.

Ritchie, G., G. Russell, A. Black & S. Pulman, 'Computational Morphology: Practical Mechanisms for the English Lexicon', MIT Press, 1991.

Technical reports describing the system in detail are available via FTP as detailed in the file `instruct'. These contain many further references to papers describing aspects of the ANLT system.

********************

ANLT distribution arrangements and instructions, and a machine-readable specimen licence agreement are available in files on the FTP server ftp.cl.cam.ac.uk (128.232.0.56). To fetch this information use anonymous FTP (login with user name anonymous, and password your e-mail address), go to the directory `nltools', and fetch the files

    licence     a machine-readable specimen licence agreement
    instruct    instructions on how to FTP technical reports and the ANLT itself

The following example shows how to fetch these files:

    $ ftp ftp.cl.cam.ac.uk
    Connected to swan.cl.cam.ac.uk.
    220- swan.cl.cam.ac.uk FTP server (Version 5.60+UA) ready.
    ...
    Name (ftp.cl.cam.ac.uk:jac): anonymous
    Password (ftp.cl.cam.ac.uk:anonymous):
    ...
    ftp> cd nltools
    250 CWD command successful.
    ftp> get licence
    ...
    ftp> get instruct
    ...
    ftp> quit
    221 Goodbye.

(The $ is the Unix shell command prompt.) If the FTP command does not know about the address ftp.cl.cam.ac.uk, try giving the command the internet number (128.232.0.56) instead.
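The same anonymous FTP steps can also be scripted. The following is a minimal sketch using Python's standard ftplib module; the function name, the `dest` parameter and the example e-mail address are assumptions made for illustration, and nothing here implies that the server described in the announcement is still reachable.

```python
from ftplib import FTP
from pathlib import Path

HOST = "ftp.cl.cam.ac.uk"          # host named in the announcement
DIRECTORY = "nltools"              # directory named in the announcement
FILES = ("licence", "instruct")    # the two information files


def fetch_anlt_info(ftp, email, dest=".", files=FILES, directory=DIRECTORY):
    """Log in anonymously, change to `directory`, and save each file under `dest`.

    `ftp` is an already connected ftplib.FTP instance (or any object offering
    the same login/cwd/retrbinary/quit methods). Returns the local paths written.
    """
    ftp.login(user="anonymous", passwd=email)   # password is your e-mail address
    ftp.cwd(directory)
    written = []
    for name in files:
        target = Path(dest) / name
        with open(target, "wb") as out:
            ftp.retrbinary(f"RETR {name}", out.write)
        written.append(target)
    ftp.quit()
    return written


# Example call (needs network access and a reachable server):
#     fetch_anlt_info(FTP(HOST), "you@example.org")
```

Passing the connection in as an argument keeps the steps (login, cd, get, quit) separate from the question of which address is used, so the numeric address 128.232.0.56 could be substituted when the host name does not resolve.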
From corpora-request@uib.no Tue Feb 9 00:54:49 1993
Date: Mon, 8 Feb 1993 23:54:49 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Getting the ACL/DCI CD-ROM

Send-date: Mon, 8 Feb 1993 8:42:51 UTC
From: (Wu Zhibiao)
Subject: Re: Getting the ACL/DCI CD-ROM

> Send-date: Fri, 5 Feb 1993 15:58:57 UTC+0100
> From:
> Subject: Getting the ACL/DCI CD-ROM
>
> Could anyone tell me to whom I should apply to get
> a copy of the ACL/DCI CD-ROM?
>
> Thank you in advance for your help
>
> Mats Eeg-Olofsson
>
> Institutionen för lingvistik/Department of Linguistics and Phonetics
> Lunds universitet/Lund University
> Helgonabacken 12
> S-223 62 LUND
> Sverige/Sweden
>
> Telefon/Phone: Int + 46 46 108444
> Fax: Int + 46 46 104210
> Datorpost/E-mail: Mats.Eeg-Olofsson@lings.lu.se
>

The following message answers the above question:

From myl@unagi.cis.upenn.edu Tue Jan 28 09:32:45 1992
To: wuzhibia@iscs.nus.sg

This is in response to your request for information about ACL/DCI material.

We are currently able to send out a CD-ROM, in ISO 9660 format, containing about 300 Mb of Wall Street Journal text, a large collection of scientific abstracts, the full text of the 1979 edition of the Collins English Dictionary, and some samples of tagged and parsed text from the Penn Treebank project.

In order for us to send you this CD-ROM, we need a copy of our User Agreement, signed by you or by some responsible party on behalf of your institution. Please send your mailing address to Rafi Khan (khanr@unagi.cis.upenn.edu), and he will send a paper copy of this form, which you can sign (or have signed) and return to him.
When you return this form, we will also ask you to send a check for $25, payable to the ACL.

The User Agreement should be signed by whoever is in charge of the group --- department, research institute, laboratory, company or whatever --- that will be using the CD-ROM. If you are the only person who will be using it, then you can sign for your own use, but in that case, you should not transfer the disk or its contents to others. Of course, if at some later date you want a larger (or simply different) group to have access, you can arrange for another User Agreement to be signed by an appropriate person on behalf of the new group.

Please let us know who will be signing, and the name of the administrative unit (if other than "self") on whose behalf they will sign.

Regards,
Mark Liberman
myl@unagi.cis.upenn.edu

best
zhibiao

From corpora-request@uib.no Tue Feb 9 00:55:04 1993
Date: Mon, 8 Feb 1993 23:55:04 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Conference Kiel 3-5/3-1993

Originator: UPK013@DBNRHRZ1.bitnet
Send-date: Fri, 5 Feb 1993 8:35:00 UTC+0100
From: ger07
Subject: Conference Kiel 3-5/3-1993

The following conference announcement is for your attention!

----------------------------Original message----------------------------

GLDV Annual Meeting 1993 (GLDV-Jahrestagung 1993)
University of Kiel, 3.3.-5.3.1993
Language Technology: Methods, Tools, Perspectives

PROGRAMME

Wednesday, 03.03.93

from 8.00 Registration
09.00 Opening panel: Linguistic Data Processing/Computational Linguistics: A Discipline's Self-Conception
      Chair: Prof. Dr. W. Lenders, Univ. Bonn
      Participants: I. Batori (Koblenz), H. Haller (Saarbruecken), H.J. Neuhaus (Muenster), B. Rieger (Trier)

Section: Quantitative Linguistics (Chair: Reinhard Koehler)

10.30 Koehler, R.: Units, Dimensions and Measures
11.00 Schmidt, P.: Possibilities for Metricization in Morphosyntax
11.30 Leopold, E.: Linguistic Adaptation Processes in the Time Dimension
13.30 Altmann, G.: Rank-Frequency Distributions
14.30 Boroda, I.: Units and Measurements of Musical Texts
15.00 Coffee break

Contributed papers (parallel tracks: Software / Linguistics)
==========================================================

15.30 Seewald, U.: Object-Oriented Programming as a Tool for Linguistic Data Processing
16.00 Lutz, H.-D.: Software Ergonomics for Language Software
16.30 Marx, J.: The Use of Natural-Language Components in a Multimodal User Interface for Materials Databases
-----------
15.30 I. Bátori / M. Volk (Koblenz): Grammar Engineering and Linguistic Research
16.00 Domenig, M. et al.: Tools for the Acquisition and Management of Morphological and Phrasal Knowledge
16.30 Weber, N.: Computer-Assisted Analysis of Definition Texts in a German Dictionary

17.00 Welcome and reception hosted by the University of Kiel

Thursday, 04.03.93
----------------------------------------------------

Section: Fuzzy Linguistics (Chair: Burghard Rieger)

09.00 Rieger, B.: LLAMA*) - a Pilot System for Referential Language Learning with Fuzzy Meaning Acquisition
09.30 Badry, B.: Linguistic Vagueness and Its Experimental Modelling in LLAMA
10.00 Reichert, M.: Cluster Structures of the Intermediate Representations in LLAMA
10.30 Galle, M.: Methodological Foundations and Formal Prerequisites for the Practical Analysability of Very Large Linguistic Corpora (VLLC)**)

*) LLAMA := Language Learning And Meaning Acquisition
**) VLLC := Very Large Linguistic Corpora (> 10^7 running words)

11.00 Coffee break

Section: Machine Translation (Chair: Johann Haller)

11.30 Schwall, U.: METAL - Progress and New Developments
12.00 Roesner, D.: Multilingual Generation from Knowledge Bases
14.00 Bruckert, F.: LOGOS in Europe - the New Translator's Workstation
14.30 Haller, J.: CAT2 - from Research System to Pre-Industrial Prototype
15.00 Schubert, K.: Between User Training and Scholarship: Language Technology in Translator Training
16.00 GLDV general meeting of members

Evening event

Friday, 5.3.93

Section: Machine-Readable Corpora (Chair: Winfried Lenders)

09.00 Wothke, Klaus: Statistically Based Word-Class Tagging of German Text Corpora - Some Experiments
09.30 Schroeder, B.: Questions of the Representativeness of Linguistic Corpora
10.00 Willee, G.: Experiences with Morphological Tagging, Using the LIMAS Corpus as an Example
10.30 Lenders, W.: Tagging - Forms and Tools
11.00 Working group meetings

Approx. 13.00 End of the conference

Information

Registration fees:
Non-members: DM 150
Members: DM 100
Students (non-members): DM 75 (without proceedings)
Students (members): DM 50 (without proceedings)

For registrations after 25.02., the fee increases by DM 30.

Please transfer the registration fee to account no. 25 291 170 at Sparkasse Kiel, bank code (BLZ) 210 501 70, quoting the reference "GLDV-93".

Please make your own room reservations.
From corpora-request@uib.no Tue Feb 9 00:55:22 1993
Date: Mon, 8 Feb 1993 23:55:22 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: PhD dissertation available

Send-date: Mon, 8 Feb 1993 11:36:00 UTC+0100
From:
Subject: PhD dissertation available

===================================================================
Ph.D. DISSERTATION AVAILABLE
on
Neural Networks, Natural Language Processing, Information Retrieval
===================================================================

A copy of the dissertation "Neural Networks in Natural Language Processing and Information Retrieval" by Johannes C. Scholtes can be obtained at cost price, with fast airmail delivery, for US$ 25.

Payment by major credit cards (VISA, AMEX, MC, Diners) is accepted and encouraged. Please include the name on the card, the number and the expiry date. Your credit card will be charged Dfl. 47,50.

Within Europe one can also send a Euro-Cheque for Dfl. 47,50 to:

University of Amsterdam
J.C. Scholtes
Dufaystraat 1
1075 GR Amsterdam
The Netherlands

Do not forget to mention a surface shipping address. Please allow 2-4 weeks for delivery.

Abstract

1.0 Machine Intelligence

For over fifty years the two main directions in machine intelligence (MI), neural networks (NN) and artificial intelligence (AI), have been studied by various persons with many different backgrounds. NN and AI seemed to conflict with many of the traditional sciences as well as with each other. The lack of a long research history and well defined foundations has always been an obstacle for the general acceptance of machine intelligence by other fields.
At the same time, traditional schools of science such as mathematics and physics developed their own tradition of new or "intelligent" algorithms. Progress made in the field of statistical reestimation techniques such as the Hidden Markov Models (HMM) started a new phase in speech recognition. Another application of the progress of mathematics can be found in the application of the Kalman filter in the interpretation of sonar and radar signals. Many more examples of such "intelligent" algorithms can be found in the statistical classification and filtering techniques of the study of pattern recognition (PR).

Here, the field of neural networks is studied with that of pattern recognition in mind. Although only global qualitative comparisons are made, the importance of the relation between them is not to be underestimated. In addition it is argued that neural networks do indeed add something to the fields of MI and PR, instead of competing or conflicting with them.

2.0 Natural Language Processing

The study of natural language processing (NLP) has existed even longer than that of MI. Already in the beginning of this century people tried to analyse human language with machines. However, serious efforts had to wait until the development of the digital computer in the 1940s, and even then, the possibilities were limited. For over 40 years, symbolic AI has been the most important approach in the study of NLP. That this has not always been the case may be concluded from the early work on NLP by Harris. As a matter of fact, Chomsky's Syntactic Structures was an attack on the lack of structural properties in the mathematical methods used in those days. But, as the latter's work remained the standard in NLP, the former has been forgotten completely until recently.
As the scientific community in NLP devoted all its attention to the symbolic AI-like theories, the only useful practical implementations of NLP systems were those that were based on statistics rather than on linguistics. As a result, more and more scientists are redirecting their attention towards the statistical techniques available in NLP. The field of connectionist NLP can be considered as a special case of these mathematical methods in NLP.

More than one reason can be given to explain this turn in approach. On the one hand, many problems in NLP have never been addressed properly by symbolic AI. Some examples are robust behavior in noisy environments, disambiguation driven by different kinds of knowledge, commonsense generalizations, and learning (or training) abilities. On the other hand, mathematical methods have become much stronger and more sensitive to specific properties of language such as hierarchical structures.

Last but not least, the relatively high degree of success of mathematical techniques in commercial NLP systems might have set the trend towards the implementation of simple, but straightforward algorithms.

In this study, the implementation of hierarchical structures and semantical features in mathematical objects such as vectors and matrices is given much attention. These vectors can then be used in models such as neural networks, but also in sequential statistical procedures implementing similar characteristics.

3.0 Information Retrieval

The study of information retrieval (IR) was traditionally related to libraries on the one hand and military applications on the other. However, as PCs grew more popular, most common users lost track of the data they had produced over the last couple of years. This, together with the introduction of various "small platform" computer programs, made the field of IR relevant to ordinary users.
However, most of these systems still use techniques that have been developed over thirty years ago and that implement nothing more than a global surface analysis of the textual (layout) properties. No deep structure whatsoever, is incorporated in the decision whether or not to retrieve a text.

There is one large dilemma in IR research. On the one hand, the data collections are so incredibly large, that any method other than a global surface analysis would fail. On the other hand, such a global analysis could never implement a contextually sensitive method to restrict the number of possible candidates returned by the retrieval system. As a result, all methods that use some linguistic knowledge exist only in laboratories and not in the real world. Conversely, all methods that are used in the real world are based on technological achievements from twenty to thirty years ago.

Therefore, the field of information retrieval would be greatly indebted to a method that could incorporate more context without slowing down. As computers are only capable of processing numbers within reasonable time limits, such a method should be based on vectors of numbers rather than on symbol manipulations. This is exactly where the challenge is: on the one hand keep up the speed, and on the other hand incorporate more context. If possible, the data representation of the contextual information must not be restricted to a single type of media. It should be possible to incorporate symbolic language as well as sound, pictures and video concurrently in the retrieval phase, although one does not know exactly how yet...

Here, the emphasis is more on real-time filtering of large amounts of dynamic data than on document retrieval from large (static) data bases. By incorporating more contextual information, it should be possible to implement a model that can process large amounts of unstructured text without providing the end-user with an overkill of information.
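As a point of reference for what "vectors of numbers rather than symbol manipulations" can mean in retrieval, here is a minimal sketch of ranking texts by cosine similarity of word-count vectors. It is an assumption for illustration only: it shows the plain surface-level baseline, not the context-sensitive method the dissertation proposes, and the function names and sample texts are invented.

```python
import math
from collections import Counter


def vectorise(text):
    """Turn a text into a sparse word-count vector (a surface analysis only)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, best match first."""
    q = vectorise(query)
    scored = [(cosine(q, vectorise(doc)), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Because every comparison is a handful of multiplications and additions, this kind of representation stays fast on large collections; what it lacks is any notion of context, which is the gap the abstract describes.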
4.0 The Combination

As this study is a very multi-disciplinary one, the risk exists that it remains restricted to a surface discussion of many different problems without analyzing any one in depth. To avoid this, some central themes, applications and tools are chosen. The themes in this work are self-organization, distributed data representations and context. The applications are NLP and IR; the tools are (variants of) Kohonen feature maps, a well-known model from neural network research.

Self-organization and context are more related to each other than one may suspect. First, without the proper natural context, self-organization is not possible. Next, self-organization enables one to discover contextual relations that were not known before.

Distributed data representation may solve many of the unsolved problems in NLP and IR by introducing a powerful and efficient knowledge integration and generalization tool. However, distributed data representation and self-organization trigger new problems that should be solved in an elegant manner.

Both NLP and IR work on symbolic language. Both have properties in common, but they focus on different features of language. In NLP, hierarchical structures and semantic features are important. In IR, the amount of data sets the limitations of the methods used. However, as computers grow more powerful and the data sets get larger and larger, both approaches share more and more common ground. By using the same models on both applications, a better understanding of both may be obtained.

Both neural networks and statistics would be able to implement self-organization, distributed data and context in the same manner. In this thesis, the emphasis is on Kohonen feature maps rather than on statistics. However, it may be possible to implement many of the techniques used with regular sequential mathematical algorithms.
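
For readers unfamiliar with Kohonen feature maps, the following is a minimal sketch of the standard textbook training loop, written as a plain sequential algorithm. It is not the variant used in the thesis: the one-dimensional map layout, the linear decay of learning rate and neighbourhood radius, and the names `train_som` and `best_matching_unit` are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda i: sum((w - xi) ** 2 for w, xi in zip(weights[i], x)),
    )


def train_som(data, n_units, epochs=50, lr0=0.5, seed=0):
    """Train a one-dimensional Kohonen map and return its weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = n_units / 2.0
    # Start from random weights in [0, 1).
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / float(epochs)
        lr = lr0 * (1.0 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1.0 - frac), 0.5)  # neighbourhood shrinks, floor at 0.5
        for x in rng.sample(data, len(data)):      # present inputs in random order
            bmu = best_matching_unit(weights, x)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood around the winning unit on the map.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
    return weights
```

No class labels are supplied during training; inputs that are close in the data space end up on the same or neighbouring units, which is the sense of "self-organization" used above.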
So, the true aim of this work can be formulated as the understanding of self-organization, distributed data representation, and context in NLP and IR, by in-depth analysis of Kohonen feature maps.

==============================================================================

From corpora-request@uib.no Tue Feb 9 00:55:41 1993
Date: Mon, 8 Feb 1993 23:55:41 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: STATITEXT

Send-date: Mon, 8 Feb 1993 20:19:02 UTC+0100
From: Patrick John Coppock
Subject: STATITEXT

Have no information on STATITEXT myself, but I am VERY interested in any information you might receive. So if you get direct replies, please post on the list. Thanks in advance....... :-)

pat coppock
the multimedia lab
university of trondheim avh
n-7055 dragvoll
patCoppock@avh.unit.no

From corpora-request@uib.no Thu Feb 11 00:38:12 1993
Date: Wed, 10 Feb 1993 23:38:12 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Polish and Swedish corpora

Send-date: Tue, 9 Feb 1993 10:24:56 UTC+0100
From: Ela Dura
Subject: send info

Hi,

We would like to get some information on Polish and Swedish corpora. We are doing research on contrastive lexicology.
Regards
Yours
Ela Dura and Maria Toporowska-Gronostaj
gronostaj@svenska.gu.se

From corpora-request@uib.no Thu Feb 11 13:58:26 1993
Date: Thu, 11 Feb 1993 12:58:26 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Polish and Swedish corpora

Send-date: Wed, 10 Feb 1993 18:05:01 UTC-0800
From: andras
Subject: Re: Polish and Swedish corpora

You might want to get in touch with Ingrid Maier (slingma@lobster.hsc.uu.se) at the Slaviska institutionen, Uppsala Universitet.

Andras Kornai

From corpora-request@uib.no Fri Feb 12 21:37:37 1993
Date: Fri, 12 Feb 1993 20:37:37 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Q: statistics-based NLP

Send-date: Thu, 11 Feb 1993 12:56:50 UTC+0100
From: (Alvaro Sanchez)
Subject: Q: statistics-based NLP

Dear readers,

I would like to receive information on statistics-based
NL processing (excepting approaches to speech); in particular:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- references or basic bibliography (state of the art,
  algorithms, mathematical methods and statistical techniques
  in NLP, lines of research, applications, etc...)
- international workshops, conferences or symposia

- available practical implementations of NLP systems based
  on statistics

Thanks in advance,

Alvaro Sanchez
alvaro@alcala.dia.fi.upm.es

From corpora-request@uib.no Fri Feb 12 21:37:48 1993
Date: Fri, 12 Feb 1993 20:37:48 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Flaubert

Send-date: Thu, 11 Feb 1993 9:21:00 UTC-0500
From: E_Dean.Detrich <22743MGR@msu.edu>
Subject: Flaubert

Does any one know where I could find the text of _Madame Bovary_, or other works by Flaubert in electronic text? Please reply privately.

-------
E. Dean DETRICH                                  22743mgr@msu.bitnet
Department of Romance and Classical Languages    22743MGR@MSU.EDU
Michigan State University
East Lansing, Michigan 48824

From corpora-request@uib.no Sat Feb 13 11:12:37 1993
Date: Sat, 13 Feb 1993 10:12:37 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Q: statistics-based NLP (3 msgs)

1) ------------------------------------------------------------------------
Send-date: Sat, 13 Feb 1993 8:05:44 UTC
From: (Wu Zhibiao)
Subject: Re: Q: statistics-based NLP

>
> Send-date: Thu, 11 Feb 1993 12:56:50 UTC+0100
> From: (Alvaro Sanchez)
> Subject: Q: statistics-based NLP
>
> Dear readers,
>
> I would like to receive
information on statistics-based
> NL processing (excepting approaches to speech); in particular:
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> - references or basic bibliography (state of the art,
>   algorithms, mathematical methods and statistical techniques
>   in NLP, lines of research, applications, etc...)
>
> - international workshops, conferences or symposia
>
> - available practical implementations of NLP systems based
>   on statistics
>
> Thanks in advance,
>
> Alvaro Sanchez
> alvaro@alcala.dia.fi.upm.es
>

Several months ago I wrote a survey, "A survey of Statistical-based approaches to NLP". It is available in LaTeX and PostScript form. If you would like to read it, I can send it to you via email.

best
zhibiao

2) ------------------------------------------------------------------------
Send-date: Sat, 13 Feb 1993 10:04:39 UTC-0600
From: (Keh-Yih Su)
Subject: Re: Q: statistics-based NLP

You can find some papers in the proceedings of Coling-92, ACL-92, and TMI-92.

Regards,
Keh-Yih Su

3) ------------------------------------------------------------------------
Send-date: Sat, 13 Feb 1993 3:05:27 UTC-0500
From: (Nipon Charoenkitkarn)
Subject: Re: Q-Statistics-based NLP

Hi,

The following books and articles may interest you:

- Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, edited by Uri Zernik, 1991.
- Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, edited by Paul S. Jacobs, 1992.
- 'Structural Ambiguity and Lexical Relations' by D. Hindle and M. Rooth, from ACL 1991, pp. 229-236.
- The Computational Analysis of English by Garside, R. et al., 1987.

If anyone knows of more interesting books or articles, please suggest them (I am new to the field). Also, if you have any info on workshops or conferences, please post it.
Thanks,
Nipon Charoenkitkarn
charoen@ie.utoronto.ca

From corpora-request@uib.no Mon Feb 15 02:29:51 1993
Date: Mon, 15 Feb 1993 01:29:51 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Re: Q: statistics-based NLP (2 msgs)

1) ---------------------------------------------------------------------
Send-date: Sun, 14 Feb 1993 16:07:48 UTC-0600
From: Robert Goldman
Subject: Q: statistics-based NLP

AAAI will soon be releasing, as a technical report, the working notes of the Fall Symposium on Probability and Natural Language Processing. I am afraid they haven't been issued yet, so I have no information about price, etc., but we are assembling them now.

Best,
R

2) ---------------------------------------------------------------------
Send-date: Sun, 14 Feb 1993 10:25:06 UTC-0500
From:
Subject: Re: Q: statistics-based NLP

My current PhD thesis is on this topic of

i/ 'Proximity' or 'co-occurrence' statistics of words in technical texts.
ii/ The possibilities to filter semantic networks from these data.
iii/ The use and application of networks obtained in that way.

Here are some references to my work, along with other pertinent references to the field:

- ---------------------------------------------------------------------------
@INPROCEEDINGS{aaai92, AUTHOR = "G. Grefenstette and M.
Hearst", TITLE = "A Knowledge-Poor Method for Refining Automatically-Discovered Lexical Relations: Combining Weak Techniques for Stronger Results", BOOKTITLE = "AAAI Workshop on Statistically-Based NLP Techniques", PUBLISHER = "Tenth National Conference on Artificial Intelligence", MONTH = "July", YEAR = 1992 } @INPROCEEDINGS{FallSymposium92, AUTHOR = "G. Grefenstette", TITLE = "Finding Semantic Similarity in Raw Text: the Deese Antonyms", BOOKTITLE = "Fall Symposium on Probability and Natural Language", PUBLISHER = "AAAI", MONTH = "October 23-25", YEAR = 1992 } @TECHREPORT{sextant-tr, AUTHOR = "G. Grefenstette" , TITLE = "{SEXTANT:} Extracting Semantics from Raw Text, Implementation Details", INSTITUTION = "University of Pittsburgh, Computer Science Dept.", YEAR = "1992", MONTH = "February", NUMBER = "CS92-05", NOTE = "To appear in Heuristics, The Journal of Knowledge Engineering, Special Issue on Extraction of Information from Text", ANNOTE = "" } @INPROCEEDINGS{sigir92, AUTHOR = "G. Grefenstette" , TITLE = "Use of Syntactic Context to Produce Term Association Lists for Text Retrieval", ORGANIZATION = "ACM", BOOKTITLE = "Proceedings of SIGIR'92", YEAR = "1992", MONTH = "June 21-24", ADDRESS = "Copenhagen, Denmark", ANNOTE = "" } @INPROCEEDINGS{acl92, AUTHOR = "G. Grefenstette", TITLE = "SEXTANT: Exploring Unexplored Contexts for Semantic Extraction from Syntactic Analysis", BOOKTITLE = "30th Annual Meeting of the Association for Computational Linguistics", PUBLISHER = "ACL'92", ADDRESS = "Newark, Delaware", MONTH = "28 June -- 2 July", YEAR = 1992 } - ------------------------------------------------------------------------ - ------------- @inproceedings{brent91a, author = "Michael R. 
Brent", title = "Automatic Acquisition of Subcategorization Frames from Untagged, Free-Text Corpora", booktitle = "Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics", year = 1991 } @inproceedings{yarowsky92, author = "David Yarowsky", title = "Word-Sense Disambiguation Using Statistical models of Roget's Thesaurus Categories Trained on a large Corpus", booktitle = "Proceedings {COLING} '92", year = 1992 } @INPROCEEDINGS{lewis90, AUTHOR = "D. D. Lewis and W. B. Croft", TITLE = "Term Clustering of Syntactic Phrases", BOOKTITLE = "13th International Conference on Research and Development in Information Retrieval", EDITOR = "J.L Vidick", PUBLISHER = "Association for Computing Machinery", ORGANIZATION = "SIGIR'90", YEAR = "1990", ADDRESS = "New York", PAGES = "385--404", MONTH = "September 5-7" } @INPROCEEDINGS{jacobs, AUTHOR = "P. S. Jacobs and U. Zernik", TITLE = "Acquiring lexical knowledge from text: A case study", BOOKTITLE = "Proceedings of the Seventh National Conference on Artificial Intelligence", PUBLISHER = "Morgan Kaufmann", ADDRESS = "St. Paul, MN", PAGES = "739--744", YEAR = 1988 } @ARTICLE{SPIRIT, AUTHOR = "F. Debili and C. Fluhr and P. Radasao", TITLE = "About Reformulation in Full-Text IRS", JOURNAL = "Information Processing and Management", VOLUME = 25, YEAR = 1989, PAGES = "647--657" } @ARTICLE{crouch, TITLE = "An Approach to the Automatic Construction of Global Thesauri", AUTHOR = "C. J. Crouch", JOURNAL = "Information Processing and Management", YEAR = 1990, VOLUME = 26, NUMBER = 5, PAGES = "629--640" } @BOOK{longman, EDITOR = "P. Proctor", TITLE = "Longman Dictionary of Contemporary English", YEAR = "1978", ADDRESS = "London", PUBLISHER = "Longman", ANNOTE = "Machine readable dictionary, using restricted vocabulary, and standardized entry structures" } @TECHREPORT{claritproject, AUTHOR = "David A. Evans and Steve K. Henderson and Robert G. Lefferts and Ira A.
Monarch", TITLE = "A Summary of the {CLARIT} project", MONTH = "November", YEAR = 1991, NUMBER = "CMU-LCL-91-2", INSTITUTION = "Laboratory for Computational Linguistics, Carnegie-Mellon University", ANNOTE = "presentation of pertinent noun-phrase extraction from a corpus, and use in indexing and retrieval. A fuzzy matching between noun-phrases avoids drawbacks of previous noun-phrase systems" } @ARTICLE{peat, AUTHOR = "Helen J. Peat and Peter Willet", TITLE = "The limitations of term co-occurrence data for query expansion in document retrieval systems", JOURNAL = "Journal of the American Society for Information Science", YEAR = 1991, VOLUME = "42", NUMBER = 5, PAGES = "378--383", ANNOTE = "" } @INPROCEEDINGS{choueka, AUTHOR = "Yaacov Choueka", BOOKTITLE = "RIAO'88 Conference Proceedings" , TITLE = "Looking for a Needle in a Haystack, or Locating Interesting Collocational Expressions in Large textual Databases", YEAR = "1988", ADDRESS = "MIT,Cambridge,Mass", MONTH = "Mar", PAGES = "609--623", ANNOTE = "" } @INPROCEEDINGS{evans, AUTHOR = "David A. Evans and K. Ginther-Webster and Mary Hart and R. G. Lefferts and Ira A. Monarch", TITLE = "Automatic Indexing Using Selective {NLP} and First-Order Thesauri", PAGES = "624--643", ADDRESS = "Barcelona", BOOKTITLE = "RIAO'91", PUBLISHER = "CID, Paris", MONTH = "April 2--5", YEAR = 1991 } @INPROCEEDINGS{ruge, AUTHOR = "Gerda Ruge", TITLE = "Experiments on Linguistically Based Term Associations", PAGES = "528--545", BOOKTITLE = "RIAO'91", ADDRESS = "Barcelona", PUBLISHER = "CID, Paris", MONTH = "April 2--5", YEAR = 1991 } @INCOLLECTION{krovetz, AUTHOR = "R. Krovetz", TITLE = "Lexical Acquisition and information retrieval", BOOKTITLE = "Lexical Acquisition: exploiting on-line resources to build a lexicon", PAGES = "45--65", EDITOR = "U. Zernik", YEAR = "1991", PUBLISHER = "Lawrence Erlbaum Associates", ADDRESS = "Hillsdale, New Jersey" } @ARTICLE{charlesmiller, AUTHOR = "Walter G. Charles and George A. 
Miller", TITLE = "Contexts of antonymous adjectives", JOURNAL = "Applied Psycholinguistics", VOLUME = "10", NUMBER = "3", PAGES = "357--375", YEAR = "1989", ANNOTE = "Found that antonyms don't have same contexts when not in same sentence" } @ARTICLE{lewis67, AUTHOR = "P. A. W. Lewis and P. B. Baxendale and J. L. Bennett", TITLE = "Statistical Discrimination of the Synonymy/Antonymy Relationship between Words", JOURNAL = "Journal of the ACM", YEAR = "1967", VOLUME = "14", NUMBER = "1", PAGES = "20--44", MONTH = "January", ANNOTE = "" } @ARTICLE{deese, AUTHOR = "J. E. Deese", TITLE = "The associative structure of some common English adjectives", JOURNAL = "Journal of Verbal Learning and Verbal Behavior", VOLUME = 3, NUMBER = 5, PAGES = "347--357", YEAR = 1964, ANNOTE = "Tests response times on common antonyms" } @ARTICLE{deerwester, AUTHOR = "Scott Deerwester and Susan T. Dumais and George W. Furnas and Thomas K. Landauer and Richard Harshman", TITLE = "Indexing by Latent Semantic Analysis", JOURNAL = "Journal of the American Society for Information Science", VOLUME = "41", NUMBER = "6", PAGES = "391--407", MONTH = "October", YEAR = "1990" } @ARTICLE{morris, AUTHOR = "J. Morris and G. Hirst", TITLE = "Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of Text", JOURNAL = "Computational Linguistics", VOLUME = "17", NUMBER = "1", MONTH = "March", YEAR = "1991", PAGES = "21--48", ANNOTE = "" } @ARTICLE{justeson, AUTHOR = "John S. Justeson and Slava M.
Katz", TITLE = "Co-occurrences of Antonymous Adjectives and Their Contexts", JOURNAL = "Computational Linguistics", VOLUME = "17", NUMBER = "1", MONTH = "March", YEAR = "1991", PAGES = "1--19", ANNOTE = "Finds that antonyms occur frequently in same sentence" } @ARTICLE{church, AUTHOR = "Kenneth Ward Church and Patricia Hanks", TITLE = "Word Association Norms, Mutual Information, and Lexicography", JOURNAL = "Computational Linguistics", VOLUME = "16", NUMBER = "1", MONTH = "March", YEAR = "1990", PAGES = "22--29" } @inproceedings{calzolari90, author = "Nicoletta Calzolari and Remo Bindi", title = "Acquisition of Lexical Information from a Large Textual Italian Corpus", booktitle = "Proceedings of the Thirteenth International Conference on Computational Linguistics", address = "Helsinki", year = 1990, } @BOOK{phillips85, AUTHOR = "Martin Phillips", TITLE = "Aspects of Text Structure: An investigation of the lexical organization of text", PUBLISHER = "Elsevier", ADDRESS = "Amsterdam", YEAR = "1985" } @ARTICLE{hersh, AUTHOR = "W. R. Hersh and D. A. Evans and I. A. Monarch and R. G. Lefferts and S. K. Henderson and P. N. Gorman", TITLE = "Indexing Effectiveness of Linguistic and Non-Linguistic Approaches to Automated indexing", JOURNAL = "Unpublished manuscript", YEAR = "1991", ADDRESS = "Laboratory for Computational Linguistics, Carnegie-Mellon University", } @INCOLLECTION{hearst92, AUTHOR = "Marti A. Hearst", TITLE = "Automatic Acquisition of Hyponyms from Large Text Corpora", BOOKTITLE = "Proceedings of the Fourteenth International Conference on Computational Linguistics", PUBLISHER = "COLING'92", ADDRESS = "Nantes, France", MONTH = "July", YEAR = 1992 } @INCOLLECTION{vossen, AUTHOR = "P. Vossen and W. Meijs and M.
{den Broeder}", BOOKTITLE = "Computational Lexicography for Natural Language Processing", EDITOR = "Bran Boguraev and Ted Briscoe", TITLE = "Meaning and Structure in Dictionary Definitions", PUBLISHER = "Longman Group UK Limited", YEAR = "1989", ADDRESS = "London", PAGES = "171--190" } @incollection{wilks92, author = "Yorick Wilks and Louise Guthrie and Joe Guthrie and Jim Cowie", title = "Combining Weak Methods in Large-Scale Text Processing", booktitle = "Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval", editor = "Paul S. Jacobs", publisher = "Lawrence Erlbaum Associates", pages = "35--58", year = 1992 } @BOOK{blois, AUTHOR = "Marsden S. Blois", TITLE = "Information and Medicine", PUBLISHER = "University of California Press", YEAR = "1984", ADDRESS = "Berkeley, CA" } @PHDTHESIS{debili, AUTHOR = "Fathi Debili", TITLE = "Analyse Syntaxico-Semantique Fondee sur une Acquisition Automatique de Relations Lexicales-Semantiques", SCHOOL = "University of Paris XI, France", YEAR = "1982", ANNOTE = "" } @BOOK{ksj64, AUTHOR = "Karen {Sparck Jones}", TITLE = "Synonymy and Semantic Classification", PUBLISHER = "Edinburgh University Press", ADDRESS = "Edinburgh", NOTE = "PhD thesis delivered by University of Cambridge in 1964", YEAR = "1986", ANNOTE = "" } @BOOK{ksj71, AUTHOR = "Karen {Sparck Jones}", TITLE = "Automatic Keyword Classification and Information Retrieval", PUBLISHER = "Butterworths", ADDRESS = "London", YEAR = 1971, ANNOTE = "" } @ARTICLE{ksj91, AUTHOR = "Karen {Sparck Jones}", TITLE = "Notes and References on Early Automatic Classification Work", JOURNAL = "{SIGIR} Forum", MONTH = "Spring", VOLUME = 25, NUMBER = 1, YEAR = 1991, PAGES = "10--17" } @INPROCEEDINGS{hindle-acl, AUTHOR = "D.
Hindle", TITLE = "Noun Classification from Predicate-Argument Structures", BOOKTITLE = "Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics", PUBLISHER = "ACL", PAGES = "268--275", ADDRESS = "Pittsburgh", YEAR = "1990" } @inproceedings{smadja90, author = "Frank A. Smadja and Kathleen R. McKeown", title = "Automatically Extracting and Representing Collocations for Language Generation", booktitle = "Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics", pages = "252--259", year = 1990 } @book{kelly75, author = "Edward Kelly and Philip Stone", title = "Computer recognition of english word senses", series = "North-Holland Linguistics Series", volume = 13, publisher = "North-Holland", address = "Amsterdam", year = 1975 }

--Gregory Grefenstette -- grefen@cs.pitt.edu -- Computer Science Dept. -- University of Pittsburgh -- Pittsburgh, PA. 15260

From corpora-request@uib.no Thu Feb 18 01:01:39 1993
Date: Thu, 18 Feb 1993 00:01:39 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: RE: Statistics-based NLP (3 msgs)

1) ---------------------------------------------------------------
Send-date: Mon, 15 Feb 1993 10:02:00 UTC+0100
From:
Subject: RE: Statistics-based NLP

In a few months Rodopi publishers (Amsterdam, Holland) hope to publish a volume relevant to the issue of statistics in NLP in the series "Language and Computers: Studies in Practical Linguistics" (series editors Jan Aarts and Willem Meijs).
This will be a volume edited by Ezra Black, Roger Garside and Geoffrey Leech, with further contributions by Elizabeth Eyes, Anthony McEnery, John Lafferty, David Magerman and Salim Roukos. This book, entitled "Statistically-driven Computer Grammars of English: The IBM/Lancaster Approach", reports on five years of collaboration in this area between researchers at the IBM T.J. Watson Research Centre, Hawthorne, New York, USA, and the Unit for Computer Research on the English Language (UCREL), University of Lancaster, UK. It thus provides a detailed state-of-the-art account of work which combines the best of two traditions: a soundly linguistic analysis of corpus data and a solidly statistical handling of large amounts of language data. The resulting approach, illustrated in great detail in this book, can be characterized as one in which the grammarian *supplies* the rules of linguistic analysis and the statistical algorithm *applies* them to English sentences of the sort we find around us every day. More details about how to obtain the book, price etc. will follow when the book comes out. Willem Meijs Arts Faculty Computing Centre Amsterdam University Spuistraat 134 1012 VB Amsterdam, Holland wmeijs@alf.let.uva.nl 2) --------------------------------------------------------------- Send-date: Mon, 15 Feb 1993 18:06:15 UTC From: (DR. DEKAI WU) Subject: Re: Q: statistics-based NLP AAAI/MIT Press will also soon be publishing the proceedings from the AAAI-92 Workshop on Statistically-Based NLP Techniques, July 1992, San Jose, CA. Regards, Dekai Wu (dekai@uxmail.ust.hk) 3) --------------------------------------------------------------- Send-date: Mon, 15 Feb 1993 15:41:21 UTC-0500 From: (Gregory Grefenstette) Subject: Q: statistics-based NLP, Citation Correction I gave an erroneous citation in my last post. 
The lines:

> @ARTICLE{church,
>    AUTHOR = "Kenneth Ward Church and Patricia Hanks",
>    TITLE = "Word Association Norms, Mutual Information,
>             and Lexicography",

should read:

     AUTHOR = "Kenneth Ward Church and Patrick Hanks",

my apologies to Mr. Hanks.

--gregory grefenstette

From corpora-request@uib.no Thu Feb 18 10:19:40 1993
Date: Thu, 18 Feb 1993 09:19:40 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Bilingual Corpora ???

Send-date: Thu, 18 Feb 1993 1:43:00 UTC-0500
From: KEITH J. MILLER
Subject: Bilingual Corpora ???

Hello ---

Does anyone have any information on the availability of bilingual French / English corpora? I am not particularly concerned about the dialect of French at the moment (i.e., any regional variety of French would suit my present needs). Any leads will be greatly appreciated.

If you wish to e-mail responses to me directly, I can be reached at MILLERK@GUVAX.GEORGETOWN.EDU

Thank you in advance.

Keith J.
Miller

From corpora-request@uib.no Thu Feb 25 03:25:26 1993
Date: Thu, 25 Feb 1993 02:25:26 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: (historical) german databases

Send-date: Tue, 23 Feb 1993 14:26:30 UTC+0100
From: (Beatrice Santorini)
Subject: (historical) german databases

i am interested in receiving information about online databases of german. i'm aware of a couple of sources (see below) but would appreciate any further references or leads. i'm particularly interested in the early stages of the language (from old high german on) and vernacular/spoken texts. many thanks.

beatrice santorini

1. institut für angewandte kommunikations- und sprachforschung (c/o gerd willee)
2. institut für deutsche sprache: mannheimer korpus (subsuming freiburger korpus), bonner zeitungskorpus, dialogstrukturenkorpus
3. oxford text archives

From corpora-request@uib.no Thu Feb 25 03:25:39 1993
Date: Thu, 25 Feb 1993 02:25:39 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Corpora

Send-date: Wed, 24 Feb 1993 10:00:00 UTC+0100
From:
Subject: Corpora

From Jan Svartvik, Lund University to those interested in (1) access to corpora & (2) Jespersen.
(1) In view of the constant demand for corpora I want to supply information about a recent publication "Linguistic research in Sweden" which has an appendix on corpora and language databases produced in Sweden, including a large number of languages. Orders to: Swedish Science Press, Box 118, S-75104 Uppsala, Sweden.

(2) There will be an Otto Jespersen Symposium in Copenhagen on 29-30 April this year with Randolph Quirk giving the keynote address. There is no conference fee but registration is required by 1 April to English Department, University of Copenhagen, Njalsgade 96, DK-2100 Copenhagen, Denmark.

Enjoy!
Jan Svartvik

From corpora-request@uib.no Wed Feb 24 15:07:10 1993
From: Robert Goldman
Date: Wed, 24 Feb 93 21:07:10 CST
To: corplst@nora.hd.uib.no
Subject: Spanish Corpora

Can anyone on this list point me to a large Spanish corpus? And, if at all possible, an online dictionary? A colleague and I are doing some preliminary work, so we are particularly interested in finding free materials for present use, although we hope to prove ourselves enough to afford corpora we will have to pay for.

R

From corpora-request@uib.no Wed Mar 3 03:01:08 1993
Date: Wed, 3 Mar 1993 02:01:08 +0100
From: corplst@nora.hd.uib.no (CORPORA list)
To: corpora@nora.hd.uib.no
Subject: Word frequency

Send-date: Tue, 2 Mar 1993 14:24:58 UTC-0500
From: David Michael Seaman
Subject: Word frequency

A colleague of mine is working on word frequencies and other items in a series of educational texts for children grades 3-9.
Does anyone know of an existing electronic corpus of this material (English language), or of electronic texts of word frequency lists derived from similar material? David Seaman Phone: 804-924-3230 Electronic Text Center Fax: 804-924-1431 Alderman Library email: etext@virginia.edu University of Virginia Charlottesville, Virginia 22903 From corpora-request@uib.no Wed Mar 3 03:19:53 1993 Date: Wed, 3 Mar 1993 02:19:53 +0100 From: JOHN FLOWERDEW To: corpora@x400.hd.uib.no Subject: Lecture Corpus I have been working with a corpus of lectures given by native speakers of English to non-natives. I would now like to compare this corpus with comparable data given by natives to natives. I would therefore appreciate any information on the availability of native/native corpora. John Flowerdew From corpora-request@uib.no Wed Mar 3 05:00:33 1993 To: corplst@nora.hd.uib.no (CORPORA list) Subject: Re: Word frequency Date: Wed, 03 Mar 1993 10:00:33 EST From: Elizabeth Hodas The Consortium for Lexical Research (CLR) at New Mexico State University in Las Cruces, New Mexico does include word lists and statistical data among their resources. You might try contacting them for information. Their email address is: lexical@nmsu.edu Like the Linguistic Data Consortium (LDC), CLR charges an annual fee for membership. Members then have access to the data provided by the Consortium. CLR distributes lexical data and tools (such as word lists, dictionaries, and glossaries), while the LDC distributes primarily recorded speech and natural language text data. We are also hoping to have some parallel text data available later this year. Hope this helps. Regards, -- Elizabeth Hodas ----------------------------------------------------------------------------- Elizabeth Hodas | Linguistic Data Consortium Administrative Assistant, LDC | 441 Williams Hall ehodas@unagi.cis.upenn.edu | University of Pennsylvania Tel: +1/215/898-0464 | Philadelphia, PA 19104-6305 Fax: +1/215/573-2175 | U.S.A. 
----------------------------------------------------------------------------- From corpora-request@uib.no Wed Mar 3 03:41:35 1993 Date: Wed, 03 Mar 93 09:41:35 CST From: stan kulikowski ii Subject: re: Word frequency To: corpora@nora.hd.uib.no david seaman, i saw your notice on corpora. i too am studying the educational use of text, so i would be interested in hearing what you find. i have submitted proposals to collect exactly this kind of data on a large scale, but few in peer review seem to think it is a good project. i have heard that there is an american heritage corpus of 3 million words graded by school levels, but so far have not gotten access to it. the general problem is that textbooks are proprietary information and publishers who have them in electronic formats usually do not want to make them available for study. it is their cash cow after all, so you have to be in-house to have access, effectively preventing detailed cumulative studies. the rest of us are then doomed to endless OCR scanning and thus tiny sample sizes. stan stankuli@UWF.bitnet . === we all help each other get a little further down the road, : : or be damned for the fools that we are. --- -- the motorcycle modificationist's motto From corpora-request@uib.no Wed Mar 3 19:21:06 1993 Date: Wed, 3 Mar 1993 18:21:06 +0100 From: Henry Kucera To: corpora@x400.hd.uib.no Subject: re: Word frequency >Posted on 3 Mar 1993 at 11:40:22 by stan kulikowski ii > i have heard that there is an american heritage corpus of 3 million >words graded by school levels, but so far have not gotten access to it. >the general problem is that textbooks are proprietary information and >publishers who have them in electronic formats usually do not want to make >them available for study. it is their cash cow after all, so you have to be >in-house to have access, effectively preventing detailed cumulative studies. >the rest of us are then doomed to endless OCR scanning and thus tiny sample >sizes.
> stan > Yes, there is such a corpus, actually based on some 5 million words, if I am not mistaken. It was assembled by John B. Carroll et al. in the late 1960's, and the results were published in the American Heritage Frequency Book. There are also tapes of the corpus (a great many, dozens maybe), now in the possession of the current copyright owner, Houghton Mifflin Company in Boston. The problem with the tapes seems to be that they are all in the old BCD seven-track coding format and there are very few (if any) seven-track drives around these days. Try to contact Dr. Win Carus in Houghton Mifflin's Software Division, 1 Memorial Drive, Cambridge, MA 02142; Fax 617-252-3145 Good luck, Henry Kucera, Brown University From corpora-request@uib.no Wed Mar 3 17:24:35 1993 Subject: re: Word Frequency To: corpora@nora.hd.uib.no Date: Wed, 3 Mar 93 17:24:35 GMT From: A.Davies%mcs.surrey.ac.uk@alf.uib.no From corpora-request@uib.no Thu Mar 4 14:26:02 1993 Date: Thu, 4 Mar 1993 13:26:02 +0100 From: cs%scs.leeds.ac.uk@alf.uib.no To: corplst@nora.hd.uib.no Subject: Re: Word frequency There is a corpus of children's spoken English (British) called the Polytechnic of Wales corpus. It is transcribed and fully parsed using systemic functional grammar. Children were aged 6, 8, 10 and 12 for the study. It contains about 65,000 words and is available from ICAME and the Oxford Text Archive. It was originally collected to explore the development of various syntactico-semantic constructs in children's English. I wrote a short handbook to the corpus which I append below.
Mail me back if you have any queries, Clive Souter University of Leeds School of Computer Studies Leeds LS2 9JT UK ------------------------- A Short Handbook to the Polytechnic of Wales Corpus Clive Souter Centre for Computer Analysis of Language and Speech (CCALAS) School of Computer Studies University of Leeds Leeds LS2 9JT Janet: cs@uk.ac.leeds.ai Tel: (0532) 335460 Introduction This booklet is intended to accompany the machine readable version of the Polytechnic of Wales (PoW) Corpus, to be distributed through the International Computer Archive of Modern English (ICAME) at Bergen. It aims to introduce the reader to the corpus notation and format, and to list the systemic functional grammar codes which have been used in the hand parsing of the corpus (Appendix 1). A very brief description of the corpus was supplied to the Lancaster Preliminary Survey of Machine-Readable Language Corpora [Taylor and Leech 89] and this is also included as Appendix 2. Papers related to the compilation of the corpus, and its subsequent use for computational linguistic research in the COMMUNAL project at Leeds are given in the references section. Queries regarding the corpus and the grammar can be addressed to the author, or to Dr Robin Fawcett, one of its original compilers, at The Computational Linguistics Unit, SESJP, Aberconway Building, University of Wales College of Cardiff, Cardiff CF1 3XA, Wales, UK. (Janet: fawcett@uk.ac.cardiff.abcy.vaxc). Any suggestions for additions or improvements to this handbook are most welcome, and should be addressed to the author in Leeds. Background The corpus was originally collected between 1978-84 for a child language development project to study the use of various syntactico-semantic constructs in children between the ages of six and twelve. A sample of approximately 120 children in this age range from the Pontypridd area in South Wales was selected, and divided into four cohorts of 30, each within three months of the ages 6, 8, 10, and 12. 
These cohorts were subdivided by sex (B,G) and socio-economic class (A,B,C,D). The latter was achieved using details of i) `highest' occupation of both the parents of the child, or one in single-parent families. ii) educational level of the parents. The children were selected in order to minimise any Welsh or other second language influence. The above subdivision resulted in small homogeneous cells of three children. Recordings were made of a play session with a Lego brick building task for each cell, and of an individual interview with the same "friendly" adult for each child, in which the child's favourite games or TV programmes were discussed. Transcription The first 10 minutes of each play session commencing at a point where normal peer group interaction began (ie: when the microphone was ignored) were transcribed by 15 trained transcribers. Likewise for the interviews. Transcription conventions were adopted from those used in the Survey of Modern English Usage at University College London, and a similar project at Bristol. Intonation contours were added by a phonetician to produce a hard copy version, and the resulting transcripts published in four volumes [Fawcett and Perkins 80]. A short report on the project was also published [Fawcett 80]. Syntactic analysis Again ten trained analysts were employed to manually parse the transcribed texts, using Fawcett's version of Systemic-Functional Grammar (SFG), the main architect of which is Michael Halliday. The SFG used in the analysis handles phenomena such as raising, dummy subject clauses and ellipsis. Despite thorough checking, some inconsistencies remain in the text owing to several people working on different parts of the corpus. The grammar used in this hand parsing process is described in more detail below. The parsed version is available in machine readable form but does not contain any prosodic information. 
Availability and Conditions The resulting parsed corpus consists of approximately 65,000 words (Footnote 1) in 11,396 (sometimes very long) lines, each containing a parse tree. The corpus of parse trees fills 1.1 Mb. There are 184 files, each with a reference header which identifies the age, sex and social class of the child, and whether the text is from a play session or an interview. The corpus is also available in wrap-round form with a maximum line length of 80 characters, where one parse tree may take up several lines. The four-volume transcripts can be supplied by the British Library Inter-Library Loans System. **************************************************************************** (Footnote 1) NB: Earlier papers quote the size of the corpus as being approximately 100,000 words. The latest automatic extraction of a wordlist from the machine readable corpus shows it to be just over 65,000 words, but this figure can only be approximate. Noise in the original typing of the corpus in the form of omissions of category labels, or of the spaces between such labels and the words in the text, makes it difficult to give an accurate figure. The difference between the two totals is almost certainly the difference between the total for the recorded spoken texts, and the total for those which have been hand-parsed. ***************************************************************************** The following conditions apply to the distribution of the Polytechnic of Wales Corpus from ICAME: a) The original source of the corpus should be mentioned in any documents published which derive from the data in the corpus in any way, and copies of such documents should be sent to ICAME and Dr Robin Fawcett at the address given in the introduction. b) The corpus is made available to specialist scholars for scientific linguistic research purposes only, and is not to be used for commercial purposes without the prior agreement of Dr Fawcett. 
c) The corpus will not be further distributed or reproduced in part or whole for any purpose other than scholarly research, and will only be supplied to a third party with the prior written permission of ICAME. d) If these conditions are not complied with, any tape(s) of the corpus (including backup copies) must be returned to ICAME at the Norwegian Computing Centre for the Humanities, Bergen, Norway. Systemic-Functional Grammar Categories The grammatical theory on which the manual parsing is based is Robin Fawcett's development of a Hallidayan Systemic-Functional Grammar, described informally but in detail in [Fawcett 81]. The grammar is traditionally formalised in a system network of semantic choices (systems), and a set of realisation rules to be used in natural language generation. From the point of view of natural language analysis, grammars formalised for parsing can be extracted from the corpus automatically in the form of phrase structure rules or a recursive transition network [Atwell and Souter 88, Souter 89a, 89b]. The terminology of SFG is quite complicated at first sight, but I will attempt to introduce it clearly below. A syntax tree is characterised by having two alternating types of category labels. The first are called elements of structure, such as Subject (S), Complement (C), Adjunct (A), head (h), modifier (mo) and qualifier (q). Note that, in a hand-analysis, capital letters are used for elements of clause structure, and lower case letters for elements of group (and cluster) structure. In the computational analysis, capitals are used throughout. Elements of structure are typically filled by the second type of category, ie: groups (which are also called units, cf phrases in TG or GPSG) such as nominal group (ngp), prepositional group (pgp) and quantity-quality group (qqgp), or clusters such as genitive cluster (gc). Terminal elements of structure are expounded by lexical items. 
The top-level symbol is Z (sigma) and is invariably filled by one or more clauses (Cl). Trees tend to be fairly flat immediately below the clause level, and this has a direct effect on the size and shape of the formal grammar which can be extracted from the parsed corpus. Some areas have a very elaborate description, eg: there are 15 types of adjuncts, six types of modifiers, nine different determiners, and ten auxiliaries. Other categories are relatively simple, eg: main-verb (M), head (h), and apex (ax). (The apex is typically expounded by an adverb or adjective in a quantity-quality group). A list of all the categories used in the parsing of the corpus is given in Appendix 1, with details of whether the symbol is used as a non-terminal or terminal category, and some example lexical items which expound the terminal categories. Notation The tree notation employs numbers rather than the more traditional bracketed form to define mother-daughter relationships, in order to capture discontinuous units. The number directly preceding a group of symbols refers to their mother. The mother is itself found immediately preceding the first occurrence of that number in the tree. In the example section of a corpus file given below in Figure 1, the first tree shows a sentence (Z) consisting of two daughter clauses (Cl), as each clause is preceded by the number one. The long lines have been folded manually for ease of reading. The first number in each tree is a sentence reference, and I have edited the file below to show these with a right bracket ")" symbol, which does not appear in the actual corpus. I also include below (Figure 2) a few hand-drawn syntax trees which correspond to sentences from Figure 1. All alphabetic characters are in upper case. The only lower case alphabetical characters are in the sentence references, which have occasionally been subdivided into 24a, 24b etc, where what was initially analysed as one sentence was, on checking, reanalysed as two (or more). 
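As a rough illustration of the numbering scheme just described, the following Python sketch reads one numerically encoded tree and prints it in bracketed form. It is not one of the programs distributed with the corpus: the names Node and parse_pow_tree are made up for this example, and it assumes that the sentence reference has already been removed, that tokens are separated by white space, that square-bracketed annotations contain no spaces and can simply be skipped, and that no word in the text consists only of digits.

```python
class Node:
    """One category label or word in a parse tree."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def bracketed(self):
        """Return the subtree in conventional bracketed form."""
        if not self.children:
            return self.label
        inner = " ".join(child.bracketed() for child in self.children)
        return "(" + self.label + " " + inner + ")"


def parse_pow_tree(text):
    """Parse one numerically encoded tree (sentence reference removed).

    Assumption for this sketch: annotations such as [FS:...] or [RP:...]
    are single tokens and are ignored.
    """
    root = None
    last = None        # most recently created node
    attach_to = None   # node that the next symbol will hang from
    mothers = {}       # number -> mother node
    for token in text.split():
        if token.startswith("["):
            continue
        if token.isdigit():
            number = int(token)
            # The mother is the symbol immediately preceding the
            # first occurrence of the number.
            if number not in mothers:
                mothers[number] = last
            attach_to = mothers[number]
            continue
        node = Node(token)
        if root is None:
            root = node
        else:
            attach_to.children.append(node)
        # Symbols that follow without a number form a chain downwards.
        last = node
        attach_to = node
    return root


sentence_2 = ("Z CL 1 S NGP 2 DD THAT 2 HP ONE 1 OM 'S "
              "1 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H TRUCK")
print(parse_pow_tree(sentence_2).bracketed())
# (Z (CL (S (NGP (DD THAT) (HP ONE))) (OM 'S) (C (NGP (DQ A) (MO (QQGP (AX LITTLE))) (H TRUCK)))))
```

Because the mother of each number is looked up rather than inferred from nesting, daughters that are separated in the string still attach to the same node, which is how the notation can record discontinuous units.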
Occasionally when the correct analysis for a structure is uncertain, the one given is followed by a question mark. Cases where unclear recordings have made word identification difficult are treated similarly.

Apart from the grammatical categories and the words themselves, the only other symbols in the tree are three types of bracketing:

i) square [NV...], [UN...], [RP...], [FS...], for non-verbal, unclear/unfinished, repetition, false start, etc.
ii) round (...) for ellipsis of items recoverable from previous text.
iii) angle <...> for ellipsis of items not so recoverable, eg: in rapid speech.

Figure 1: A Sample Section of a POW Corpus File

**** 58 1 1 1 0 59 6ABICJ
1) [FS:Y...] Z 1 CL F YEAH 1 CL 2 S NGP 3 DD THAT 3 HP ONE 2 OM 'S 2 C NGP 4 DQ A 4 H RACING-CAR
2) Z CL 1 S NGP 2 DD THAT 2 HP ONE 1 OM 'S 1 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H TRUCK
3) [HZ:WELL] Z 1 CL 2 S NGP HP I [RP:I] 2 AI JUST 2 HAD 2 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H THINK 1 CL 4 & THEN 4 S NGP HP I 4 M THOUGHT 4 C CL 5 BM OF 5 M MAKING 5 C NGP 6 DD THIS 6 HP ONE
4) Z 1 CL 2 S NGP HP I 2 AI JUST 2 M FINISHED 2 C NGP 3 DD THAT 3 HP ONE 1 CL 4 & AND 4 S NGP HN FRANCIS 4 M HAD 4 C NGP 5 DD THE 5 H IDEA 5 Q CL 6 BM OF 6 M MAKING 6 C NGP 7 DQ A 7 RACING-CAR
5) [FS:THEN-I] Z CL 1 & THO 1 S NGP HP I 1 M MADE 1 C NGP DD THIS
6) Z CL 1 & THEN 1 S NGP HP FRANCIS 1 OX WAS 1 AI JUST 1 X GOING-TO 1 M MAKE 1 C NGP HP ONE 1 A CL 2 B WHEN 2 S NGP H YOU 2 M CAME 2 CM QQGP AX BACK 2 CM QQGP AX IN
7) [NV:MM] Z 1 CL F NO [FS:FRAN...] 1 CL 2 S NGP HP WE 2 M HAD 2 C NGP 3 DQ AN 3 H IDEA 3 Q CL 4 BM OF 4 M MAKING 4 C NGP 5 DQ FOUR 5 H THINGS
8) Z 1 CL F YEAH 1 CL 2 S NGP HP I 2 M PLAYED 2 C PGP 3 P WITH 3 CV NGP HP IT 2 A PGP 4 P AT 4 CV NGP H HOME
9) Z CL F YEAH
10) [FS:I] [FS:I] Z 1 CL F NO 1 CL 2 S NGP HP I 2 OX 'VE 2 AI JUST 2 M GOT 2 C NGP 3 DQ ONE 3 MO QQGP AX BIG 3 H TIN [FS:OF?] 3 Q QQGP 4 AX FULL 4 SC PGP 5 P OF 5 CV NGP HP IT
11) [NV:ER] Z CL 1 (S) 1 (M) 1 C NGP 2 DQ NGP 3 DQ ALL 3 H SORTS 2 VO OF 2 H THINGS
12) Z 1 CL 2 S NGP HP I 2 M MAKE 2 C NGP H CARS 2 A QQGP AX ALWAYS 1 CL 3 & AND 3 A SOMETIMES 3 S NGP HP I 3 M MAKE 3 C NGP H HOUSES 1 CLUN & AND
13) Z 1 CL F YEAH 1 CL 2 S NGP HP I 2 M GOT 2 C NGP HN KERPLUNK
14) [FS:IT] [FS:IT] [NV:UM] Z 1 CL 2 S NGP HP YOU 2 M PUT 2 C NGP H STRAWS 2 C PGP 3 PM INTA 3 CV NGP 4 DQ A [RP:A] [RP:A] 4 MOTH NGP H GLASS 4 H TUB 4 Q PGP 5 P WITH 5 CV NGP 6 H HOLES 6 Q PGP 7 P IN 7 (CV) 1 CL 8 & THEN 8 S NGP HP YOU 8 M PUT 8 C NGP 9 DD THE 9 H STRAWS 8 C PGP 10 PM IN 10 CV NGP 11 DD THE 11 H HOLES 1 CL 12 & THEN 12 S NGP HP YOU 12 M PUT 12 C NGP 13 DD THE 13 H MARBLES 12 CM QQGP AX DOWN 1 CL 14 & AND 14 (S) 14 M PULL 14 C NGP 15 DQ A 15 H STRAW 14 CM QQGP AX OUT 14 A CL 16 I TO 16 M SEE 16 C CL 17 B IF 17 S NGP 18 DQ A 18 H MARBLE 17 M GOES 17 C PGP 19 P INTO 19 CV NGP 20 DQ A 20 H POINT
15) Z CL 1 S NGP HP I 1 ON DUN 1 M NO 1 (C)
16) [NV:ER] [NV:ER] Z CL 1 S NGP HP I 1 M PLAY 1 C PGP 2 P WITH 2 CV NGP 3 DD MY 3 H BIKE
17) Z 1 CL 2 S NGP HP I 2 M PLAY 2 C PGP 3 P WITH [FS:MY-CHIP] 3 CV NGP 4 DD MY [RP:MY] 4 MO QQGP AX BIG 4 H TIPPER-LORRY 1 CL 5 & AND 5 S NGP HP I [RP:I] 5 M CALL 5 C PGP 6 PM FOR 6 CV NGP HN DAVID
18) Z 1 CL F YEAH [FS:HE'S-ONE-MY] [FS:HE'S-ROUND] 1 CL 2 S NGP HP HE 2 OM 'S 2 C PGP 3 P IN 3 CV NGP 4 DD MY 4 H CLASS
19) [NV:OH] [FS:WE-JUST] Z CL 1 S NGP HP WE 1 M PLAY 1 C PGP 2 P AT 2 CV 3 NGP H FOOTBALL [HZ:AND-STUFF] 3 NGP 4 & AND 4 H CRICKET
20) [NV:ER] [FS:WE-PLAY-S...] Z CL 1 S NGP HP WE 1 M PLAY 1 C 2 NGP H FIREMEN 2 NGP 3 & AND 3 H POLICE

Figure 2: Parse trees

1) [FS:Y...] Z 1 CL F YEAH 1 CL 2 S NGP 3 DD THAT 3 HP ONE 2 OM 'S 2 C NGP 4 DQ A 4 H RACING-CAR

Z
  CL
    F  YEAH
  CL
    S
      NGP
        DD  THAT
        HP  ONE
    OM  'S
    C
      NGP
        DQ  A
        H   RACING-CAR

2) Z CL 1 S NGP 2 DD THAT 2 HP ONE 1 OM 'S 1 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H TRUCK

Z
  CL
    S
      NGP
        DD  THAT
        HP  ONE
    OM  'S
    C
      NGP
        DQ  A
        MO
          QQGP
            AX  LITTLE
        H   TRUCK

3) [HZ:WELL] Z 1 CL 2 S NGP HP I [RP:I] 2 AI JUST 2 HAD 2 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H THINK 1 CL 4 & THEN 4 S NGP HP I 4 M THOUGHT 4 C CL 5 BM OF 5 M MAKING 5 C NGP 6 DD THIS 6 HP ONE

Z
  CL
    S
      NGP
        HP  I
    AI  JUST
    HAD
    C
      NGP
        DQ  A
        MO
          QQGP
            AX  LITTLE
        H   THINK
  CL
    &   THEN
    S
      NGP
        HP  I
    M   THOUGHT
    C
      CL
        BM  OF
        M   MAKING
        C
          NGP
            DD  THIS
            HP  ONE

Acknowledgements

I would like to thank Robin Fawcett (Cardiff) for his kind help in proof reading this document, and Tim O'Donoghue (Leeds) for his assistance in producing parse trees for the sample PoW corpus text.

References

Atwell, Eric Steven and Clive Souter, (1988) "Experiments with a very large corpus-based grammar." To appear in Proceedings of the 15th International Conference on Literary and Linguistic Computing (ALLC). Jerusalem, June 5-9 1988.
Atwell, Eric Steven, Clive Souter and Tim O'Donoghue, (1988) "Prototype Parser 1." COMMUNAL Report No. 17, CCALAS, School of Computer Studies, Leeds University.
Fawcett, Robin P., (1980) "Language Development in Children 6-12: Interim Report." Linguistics 18 pp 953-958.
Fawcett, Robin P., (1981) "Some Proposals for Systemic Syntax." Department of Behavioural and Communication Studies, Polytechnic of Wales.
Fawcett, Robin P. and Michael R. Perkins, (1980) "Child Language Transcripts 6-12." With a preface, in 4 volumes. Department of Behavioural and Communication Studies, Polytechnic of Wales.
Fawcett, Robin P., (1988) "A note on the relationship between the syntactic categories used in (1) the analysis of the Polytechnic of Wales Corpus and (2) generation and analysis in the COMMUNAL project." (personal communication)
Souter, Clive, (1989a) "The COMMUNAL Project: Extracting a grammar from the Polytechnic of Wales Corpus." ICAME Journal No. 13, April 1989, pp 20-27. Norwegian Computing Centre for the Humanities, Bergen University.
Souter, Clive, (1989b) "Systemic-Functional Grammars and Corpora." Research Report 89.12, School of Computer Studies, University of Leeds. Also appeared in Jan Aarts and Willem Meijs (eds), "Theory and Practice in Corpus Linguistics", 1990, Amsterdam: Rodopi Press.
Souter, Clive and Eric Atwell, (1988a) "Constraints on Legal Syntactic Configurations."
COMMUNAL Report No. 14, CCALAS, School of Computer Studies, Leeds University.
Souter, Clive and Eric Atwell, (1988b) "Morphological Analysis." COMMUNAL Report No. 16, CCALAS, School of Computer Studies, Leeds University.
Taylor, Lita and Geoffrey Leech, (1989) "Lancaster Preliminary Survey of Machine-Readable Corpora." ICAME, The Norwegian Computing Centre for the Humanities, P.O. Box 53, Universitet, N-5027 Bergen, Norway.

Appendix 1: Systemic-Functional Grammar categories in the PoW Corpus

(This needs to be formatted as a table, using % as the separator)

Name of Category%Symbol in PoW%NT/T%Examples (for Terminals)

TEXT AND SENTENCE
Text%text%NT%-
Unfinished Text%textun%NT%-
Sentence%Z (for sigma)%NT%-

CLAUSE
Clause%Cl%NT%-
Unfinished Clause%Clun%NT%-
Adjunct (= Experiential Adjunct)%A%NT/T%really, mostly
Affective Adjunct%Aa%NT%-
Discourse organizational Adjunct%Ad%NT/T%first-of-all, anyway
Replacement Adjunct%Arepl%NT%-
Feedback-seeking Adjunct%Af%NT/T%look, right, you know
Inferential Adjunct%Ai%NT/T%just, only
Logical Adjunct%Al%NT/T%really, though, as well
Replacement Logical Adjunct%Alrepl%NT%-
Wh-logical Adjunct%Alwh%NT%-
Modal Adjunct%Am%NT/T%maybe, probably
Metalingual Adjunct%Aml%NT/T%say, I mean
Negative Adjunct%An%NT/T%never, neither
Politeness Adjunct%Ap%NT/T%there, please
Tag Adjunct%Atg%NT/T%is it, isn't it
Wh-Adjunct%Awh%NT/T%how, when, where, why
Binder%B%NT/T%because, cos, if, so, when
Main-verb-completing Binder%Bm%T%of
Negative Binder%Bn%T%-
Complement%C%NT%-
Anticipatory Complement%Cantic%NT%-
Replacement Complement%Crepl%NT%-
Main-verb-completing Complement%Cm%NT/T%across, in, on, up
Predicative Complement%Cp%NT/T%able
Formula%F%NT/T%alright, yes, no, pardon, what
Frame%Fr%NT/T%right, now
Infinitive element%I%T%to
Main verb%M%T%builds, kicked, went
Operator%O%T%did, does, do, let's
Modal Operator%Om%T%'ll, 'd, 'm, are, can, could, is
Negative Modal Operator%Omn%T%can't, couldn't, isn't, won't
Negative Operator%On%T%didn't, doesn't, don't
Auxiliary Operator%OX%T%'m, 're, 've, have, was
Negative Auxiliary Operator%OXn%T%haven't, wasn't
Subject%S%NT%-
Anticipatory Subject%Santic%NT%-
Replacement Subject%Srepl%NT%-
Dummy it Subject%Sit%T%it
Dummy there Subject%Sth%T%there
Wh-Subject%Swh%NT%-
Vocative%V%NT%-
Auxiliary%X%T%be, going to, have, used
Modal/Necessity Auxiliary%Xm%T%better, got to, have to
Negative Modal Auxiliary%Xmn%T%mustn't
Negative Auxiliary%Xn%T%don't, hadn't, haven't

NOMINAL GROUP
nominal group%ngp%NT%-
unfinished nominal group%ngpun%NT%-
deictic determiner (also in qqgp)%dd%NT/T%the, this, that, her, my
wh-deictic determiner%ddwh%T%what, which
quantifying determiner (also in qqgp)%dq%NT/T%a, an, one, four, any, all
negative quantifying determiner%dqn%NT/T%no, none
wh-quantifying determiner%dqwh%T%how many, how much
ordinative determiner%do%NT/T%first, sixth, last
partitive determiner%dp%NT/T%part
superlative determiner%ds%NT%-
typic determiner%dt%NT%-
selector (of)%vo%T%of
modifier (= experiential modifier)%mo%NT%-
affective modifier%moa%NT/T%flipping
comparison modifier%moc%NT/T%other, else, same, different
quantifying modifier%moq%NT/T%five, only, ten
situation modifier%mosit%NT/T%opening
thing modifier%moth%NT/T%plastic, square, table
head (i.e. 'common noun')%h%T%brick, books, men
('proper') name head%hn%T%America, Alf, Barry-Island, Batman
pronoun head%hp%T%anything, he, her, him, I, it
negative pronoun head%hpn%T%no-one, nobody, nothing
situation head%hsit%NT/T%painting, reading
wh-pronoun head%hwh%T%what, which, who
qualifier%q%NT/T%ago, left
replacement qualifier%qrepl%NT%-

PREPOSITIONAL GROUP
prepositional group%pgp%NT%-
unfinished prepositional group%pgpun%NT%-
preposition%p%NT/T%on, in, up, under
Main-verb-completing preposition%pm%T%about, after, at, for, into
completive%cv%NT%-
replacement completive%cvrepl%NT%-
wh-completive%cvwh%NT%-

QUANTITY-QUALITY GROUP
quantity-quality group%qqgp%NT%-
unfinished quantity-quality group%qqgpun%NT%-
temperer (also in pgp)%t%NT/T%a bit, about, all, over, very
wh-temperer%twh%T%how
apex%ax%NT/T%always, away, back, big, black
tempering apex%axt%T%biggest, better, higher, smaller
wh-apex%axwh%NT/T%how, where, why, when
scope%sc%NT/T%more
finisher%fi%NT/T%of all, together

GENITIVE CLUSTER
genitive cluster%gc%NT%-
genitive element%g%T%'s
possessor%ps%NT%-
owner%own%T%own

ELEMENTS OCCURRING IN MORE THAN ONE UNIT NOT SPECIFIED ABOVE
inferer%inf%T%just, only
Linker%&%T%and, and then, but, or, so, then

Appendix 2: Brief Description of the Corpus

Date of Compilation: 1978-84
Location: Polytechnic of Wales, Pontypridd, S. Wales.
Compiled by: Dr. Robin P. Fawcett and Dr. Michael R. Perkins
Type of Data: Spoken corpus, recordings transcribed using conventions from SMEU at UCL, and those of a similar project at Bristol, with pitch movements marked by trained phonetician. Fully hand parsed, using a Systemic Functional Grammar developed by Fawcett, with rich syntactico-semantic categories, capable of handling raising, dummy subject clauses, ellipsis, replacement strings. Parse trees stored in a numerical format (not standard bracketed) to capture discontinuities in syntactic structures. Children's English from Pontypridd, S. Wales. Informal register.
The subjects were screened to exclude those with strong second language influence (Welsh or otherwise). 120 children aged between 6-12, (all within 3 months either side of their 6th, 8th, 10th or 12th birthday ) divided equally according to sex, age, and socio-economic class established by profession and highest educational level of parents. Small cells of 3 children were recorded at play with Lego bricks, and each child also interviewed by the same `friendly' adult on his/her favourite games and TV programmes. Size: 65,000 words approximately, in 11,396 lines. 1 parsed sentence per line, hence some very long lines. (also available in 80 chars wrap round format) 1.1 Mb. storage. Tape format: 194 files, each with a reference to age, social class, sex, play session or interview, and child's initials. (each file is a sample of a single child's speech in a play session or an interview). Availability: Only the parsed corpus is available in machine readable form; the recorded tapes and 4-volume transcripts with intonation contours are available in hard copy from the British Library Inter-Library Loans System. Original recordings are available from: Dr Robin Fawcett, Computational Linguistics Unit, SESJP, Aberconway Building, University of Wales College of Cardiff. Cardiff CF1 3XA Wales Original reason for collection: Psycholinguistic research into development of childrens' English between ages of 6 and 12, investigating the growing use of a variety of syntactico-semantic structures. Current research (1987-9): COMMUNAL project; Natural Language Processing at UWCC and Leeds University Extracting machine-readable systemic functional grammars and lexicons for use in parsing. Suites of programs developed to achieve this, including converting the corpus into bracketed form. 
The grammar used for the hand parsing in the corpus was not formalised in terms of phrase-structure rules, or RTNs, but in system networks of semantic/functional features and their realisation rules more suitable for NL generation than parsing. From corpora-request@uib.no Fri Mar 12 10:29:34 1993 From: Stephen Clarke Date: Fri, 12 Mar 93 10:29:34 GMT To: CORPORA@nora.hd.uib.no Subject: FRENCH CORPORA FRENCH CORPORA I'm looking for a corpus (or corpora) of written or spoken modern French (ie not a(n) historical corpus). If I can't find a ready-made corpus then I'd be interested in building one, preferably pooling resources with other data-gatherers. We can handle electronically-held data (typesetters' tapes, newspaper archives etc) or scan texts on an OCR. It's a question of getting publishers' permission to exploit their data for research. Does anyone know of such a corpus? Or have data to contribute to one? Or is anyone interested in helping to build a corpus? From corpora-request@uib.no Mon Mar 15 03:18:08 1993 From: rocltsh@iis.sinica.edu.tw Subject: Corpus-Based Frequency Count of Modern Chinese To: linguist@tamvml.tamu.edu, corpora@nora.hd.uib.no, Date: Mon, 15 Mar 93 10:12:35 EAT Corpus-Based Frequency Count of Modern Chinese Corpus-based study of Chinese is one of the research projects of the Chinese Knowledge Information Processing Group (CKIP) at Academia Sinica. The current research is based on a Chinese newspaper corpus, which amounts to 20,698,116 characters ( 9,540,444 words after word segmentation.) Four technical reports in Chinese are published. These include: 1. Corpus-Based Frequency Count of Characters in Journal Chinese 30 pages (US$ 5) 2. Corpus-Based Frequency Count of Words in Journal Chinese 300 pages (US$ 20) 3. The Most Frequent Verbs in Journal Chinese and Their Classification 140 pages (US$ 10) 4. 
The Most Frequent Nouns in Journal Chinese and Their Classification 150 pages (US$ 10) The first report lists 5,666 distinct characters which appear in the entire corpus. The second report contains 42,686 words that occur more than three times in the corpus. The most common 14,956 words constitute more than 99.9995 percent of all the words occurring in the corpus. The third and the fourth reports include 19,907 verbs and 21,368 nouns respectively which occur more than twice in the corpus, with their syntactic or semantic classification. To order, please list the desired title(s) and enclose a cheque for the appropriate amount payable to the Computational Linguistic Society of the R.O.C. (ROCLING). The prices listed above include postage and handling. Address : Miss Tsai Shu-hui ROCLING Institute of Information Science Academia Sinica, Nankang Taipei, Taiwan 11529 R.O.C. Tel. : 886-2-788-1638 Fax : 886-2-788-1638 E-Mail : rocltsh@iis.sinica.edu.tw From corpora-request@uib.no Tue Mar 16 17:58:55 1993 Date: Tue, 16 Mar 1993 16:58:55 +0100 From: PSP10@phx.cam.ac.uk To: "corplst (CORPORA list)" Subject: Re: [Word frequency] To the person organising this corpora distribution by e-mail, please include Ian Johnson of Sharp Laboratories of Europe, Oxford, UK - e-mail ianj@uk.ac.oxford.prg. Many thanks, Paul (Procter), Cambridge Language Survey From corpora-request@uib.no Fri Mar 19 09:22:57 1993 From: "Henry S. Thompson" Date: Fri, 19 Mar 93 09:22:57 GMT To: CORPORA@hd.uib.no Subject: HCRC Map Task Corpus on CD: Audio and transcripts of natural speech The HCRC Map Task Corpus The Human Communication Research Centre (HCRC) is happy to announce the release of the Map Task Corpus. The Map Task Corpus is a set of 8 CD-ROMs containing linked audio and transcriptions of a total of about 18 hours of spontaneous speech that was recorded from 128 two-person conversations according to a detailed experimental design.
Altogether, the corpus as distributed provides a thorough and invaluable set of resources and tools for use in analyzing all levels of linguistic structure, via both text-based and speech-based investigation. The range of research questions that are addressable using this corpus span a wide spectrum of linguistic and cognitive issues. We have kept the price as low as possible to encourage researchers from many disciplines to use this corpus as a common reference point for many different kinds of research. The HCRC is an interdisciplinary research centre at the Universities of Edinburgh and Glasgow, supported by the UK Economic and Social Research Council and the Universities Funding Council. The publication of the Map Task Corpus was made possible by assistance from the Linguistic Data Consortium. Corpus Details 64 different speakers, 32 female, 32 male, all adults, each took part in four conversations in a quiet recording studio. They were all students at the University of Glasgow, 61 of them being native Scots. The conversations were carried out in an experimental setting in which each participant has a schematic map in front of them, not visible to the other. Each map is comprised of an outline and roughly a dozen labelled features (e.g. "a white cottage", "an oak forest", "Green Bay", etc). Most features are common to the two maps, but not all. One map has a route drawn in, the other does not. The task is for the participant without the route to draw one on the basis of discussion with the participant with the route. In addition to the conversations, each speaker provides a wordlist reading, consisting of the major vocabulary items contained in the conversations. All recordings were direct to Digital Audio Tape (DAT) at 48KHz, providing very good acoustic quality. The experimental design allows a number of different phonemic, syntactico-semantic and pragmatic contrasts to be explored in a controlled way. 
In particular, maps and feature names were designed to allow for controlled exploration of phonological reductions of various kinds in a number of different referential contexts, and to provide, via varying patterns of matches and mis-matches between the two maps, a range of different stimuli for referent negotiation. Also the conditions of the conversations were carefully balanced: In half of them the speakers were strangers, in half friends; in half of them the speakers could see each other's faces, in half they could not. Subjects accommodated easily to the task and experimental setting, and produced evidently unselfconscious and fluent speech. The syntax is largely clausal rather than sentential; showing good turn-taking, with modest amounts of overlap and interruption. The total corpus runs to about 18 hours of speech, with the transcripts consisting of around 150,000 word tokens drawn from just over 2,000 word form types. Transcription is at the orthographic level, quite detailed, including filled pauses, false starts and repetitions, broken words, etc. Considerable care has been taken to ensure consistency of notation, which is thoroughly documented. Although the full complexity of overlapped regions has not been reflected in the transcriptions, such regions are clearly set off from the rest of the transcripts. Transcripts are connected to the acoustic sampled data by sample numbers marked every few turns. CD-ROM Contents The waveform data are provided in "raw" (headerless) files (16-bit samples, 20 kHz sample rate, 2 channels per conversation), and alternative header files are provided for use with software based on either the NIST "SPHERE" header structure or the European "SAM" header structure. Transcriptions are provided for each conversation, marked up with TEI-compliant SGML, in a minimally intrusive and easily separated way. 
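[Editorial illustration, not part of the original announcement.] The waveform description above (headerless files, 16-bit samples, 20 kHz, 2 channels per conversation) can be made concrete with a minimal sketch of de-multiplexing such a file. Several things here are assumptions rather than facts from the announcement: that the two channels are interleaved sample by sample, that samples are signed, and the byte order (exposed as the `swap_bytes` switch). The function name `demultiplex` and the file name in the usage note are hypothetical.

```python
import array

SAMPLE_RATE_HZ = 20000  # sample rate stated in the announcement


def demultiplex(raw_bytes, channels=2, swap_bytes=False):
    """Split interleaved 16-bit PCM data into one list of samples per channel.

    Assumes signed 16-bit samples interleaved as ch0, ch1, ch0, ch1, ...
    (an assumption; the announcement does not specify the layout).
    Set swap_bytes=True if the file's byte order differs from this machine's.
    """
    samples = array.array("h")  # signed 16-bit integers on common platforms
    # Drop any trailing partial frame so the data divides evenly.
    usable = len(raw_bytes) - len(raw_bytes) % (2 * channels)
    samples.frombytes(raw_bytes[:usable])
    if swap_bytes:
        samples.byteswap()
    # Every channels-th sample, starting at offset c, belongs to channel c.
    return [samples[c::channels].tolist() for c in range(channels)]


def duration_seconds(channel_samples, rate=SAMPLE_RATE_HZ):
    """Length of one channel in seconds at the given sample rate."""
    return len(channel_samples) / rate
```

A caller would read a file in binary mode, for example `open("conversation.raw", "rb").read()` (hypothetical file name), pass the bytes to `demultiplex`, and check that the result sounds right before trusting the byte-order guess.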
PostScript files of the map images used in the experiments are provided, along with full documentation of the experimental design and data collection protocol, resources for using SGML tools on the transcriptions and other text materials, and an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs). The CD-ROMs are in High Sierra (ISO 9660) format with the RockRidge extensions, and are compatible with (inter alia) Unix, MS-DOS and Macintosh operating systems. Copies of the Map Task Corpus are available from the LDC for $200 or from HCRC for 164.50 UK pounds (including VAT) at the addresses given below, plus postage and packing as necessary. Please contact us (by e-mail if possible) for details of payment methods and shipping costs. In Europe please contact Henry Thompson University of Edinburgh Human Communication Research Centre 2 Buccleuch Place Edinburgh EH8 9LW Scotland Tel: +44 31 650-4440 Fax: +44 31 650-4587 email: maptask@cogsci.ed.ac.uk or Dawn Griesbach ELSNET 2 Buccleuch Place Edinburgh EH8 9LW Scotland Tel: +44 31 650-4594 Fax: +44 31 650-4587 email: elsnet@cogsci.ed.ac.uk Outside Europe please contact Elizabeth Hodas Linguistic Data Consortium 441 Williams Hall University of Pennsylvania Philadelphia, PA 19104-6305 Tel: (215) 898-0464 Fax: (215) 573-2175 email: ehodas@unagi.cis.upenn.edu From corpora-request@uib.no Fri Mar 19 12:41:45 1993 Date: Fri, 19 Mar 93 12:25:47 MEZ From: "Prof.Dr. Winfried Lenders" Subject: Tibetean Corpora To: corpora@hd.uib.no Does anybody know whether and where corpora or single texts in electronic form of the Tibetan language are available? Please send a short reply to Winfried Lenders, e-mail address: Lenders at uni-bonn.de %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Prof. Dr.
Winfried Lenders % % Institut fuer Kommunikationsforschung & Phonetik % % Universitaet Bonn % % Poppelsdorfer Allee 47 % % D-W - 5300 Bonn 1 / Germany % % % % phone: +49 228 / 73 - 5646 % % fax: +49 228 / 73 - 5629 % % internet: lenders at uni-bonn.de % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% From corpora-request@uib.no Fri Mar 19 06:08:12 1993 Date: Fri, 19 Mar 93 11:08:12 EST From: glenn@metis.com (Glenn Adams) To: UPK013@dbnrhrz1.bitnet Subject: Tibetean Corpora See the directory /pub/tibetan on pylos.nmsu.edu. Glenn Adams From corpora-request@uib.no Mon Mar 22 16:46:45 1993 Date: Mon, 22 Mar 1993 15:46:45 +0100 From: tekstlab.hf@ilf.uio.no To: corplst@hd.uib.no, corpora@hd.uib.no Subject: Re: RE:overview of formats? From corpora-request@uib.no Tue Mar 23 16:13:45 1993 Date: Tue, 23 Mar 1993 15:13:45 +0100 From: Knut Hofland To: corpora@hd.uib.no Subject: LEXA: Corpus processing software Lexa, a set of programs for lexical data processing, written by Raymond Hickey, is now available from the Norwegian Computing Centre for the Humanities for about 100 USD. The programs run under MS-DOS and come on 4 diskettes with a manual of 750 pages in 3 volumes. To get more information and an order form, send the following line to FILESERV@HD.UIB.NO: send icame lexa.info This file can also be fetched with FTP or Gopher from nora.hd.uib.no in the catalogue icame. Knut Hofland Norwegian Computing Centre for the Humanities, Harald Haarfagres gt. 31, N-5007 Bergen, Norway Phone: +47 5 212954/5/6, Fax: +47 5 322656, E-mail: knut@x400.hd.uib.no Here is a short description of the programs written by the author.
------------------------------------------------------------------------- Raymond Hickey, English Department, University of Munich, Germany Lexical Data Processing The present set of programmes is intended to offer a wide range of software which will carry out (i) the lexical analysis and (ii) information retrieval tasks required by linguists involved in the investigation of text corpora. The suite has been particularly adapted to be used with the corpus of historical English compiled at the University of Helsinki. The general nature of the software, however, permits its application to any set of texts, particularly those which are arranged in the so-called Cocoa format. Lexical analysis. The main programme, Lexa, puts at the disposal of the interested linguist the options he or she would require in order to process lexical data with a high degree of automation on a personal computer. The set is divided into several groups which perform typical functions. Of these the first, lexical analysis, will be of immediate concern. Lexa allows one, via tagging, to lemmatise any text or series of texts with a minimum of effort. All that is required is that the user specify what (possible) words are to be assigned to what lemmas. The rest is taken care of by the programme. In addition, one can create frequency lists of the types and tokens occurring in any loaded text, make lexical density tables, transfer textual data in a user-defined manner to a database environment, to mention just some of the procedures which are built into Lexa. The results of all operations are stored as files and can be examined later, for instance with the text editor shipped with the package. Each item of information used by Lexa when manipulating texts is specifiable by means of a setup file which is loaded after calling Lexa and used to initialise the programme in the manner desired by the user. Information retrieval. 
The second main goal of the Lexa set is to offer flexible and efficient means of retrieving information from text corpora. The programme Lexa Pat allows one to specify a whole range of parameters for combing through text files. By determining these precisely the user can achieve a high level of correct returns which are of value when evaluating texts quantitatively. A further programme, Lexa DbPat, permits similar retrieval operations to be applied to databases, for instance those generated by Lexa from text files of a corpus. Ascertaining the occurrence of syntactic contexts is catered for by the programme Lexa Context with which users can specify search strings, their position in a sentence, the number of intervening items and then comb through any set of texts in search of them. By means of the utility Cocoa it is possible to group text files of a corpus on the grounds of shared parameters from the Cocoa-format header at the beginning of each file in many text collections, e.g. the Helsinki corpus. All information retrieval operations can then have as their scope those files grouped on the basis of their contents by the Cocoa utility. In the design of the current suite of programmes, flexibility has been given highest priority. This is to be seen in the number of items, in nearly all programmes, which can be determined by the user. Furthermore, techniques have been employed which render the structure of each programme as user-friendly as possible (pull-down menus, window technology, mouse support, similarity of command structure between the 40-odd programmes of the set), permitting the linguist to concentrate on essentially linguistic matters. From corpora-request@uib.no Tue Mar 23 09:08:33 1993 Date: 23 Mar 93 14:08:33 EST From: Lenore.A.Grenoble@Dartmouth.EDU (Lenore A. 
Grenoble) Subject: To: corpora@hd.uib.no please unsubscribe me from this list lenore.grenoble@dartmouth.edu From corpora-request@uib.no Thu Mar 25 02:47:15 1993 To: jie@babel.ling.upenn.edu (Jie Liu) Subject: Re: Chinese corpus Date: Thu, 25 Mar 93 10:23:33 V From: syun tutiya You could contact Professor Chu-Ren Huang of Academia Sinica, Taiwan, to get information about projects on Chinese corpus. I do not have his address at hand but he might be reached at hschuren@csli.stanford.edu Syun Tutiya (Chiba University, Japan) From corpora-request@uib.no Wed Mar 24 08:11:59 1993 From: jie@babel.ling.upenn.edu (Jie Liu) Subject: Chinese corpus To: corpora@hd.uib.no Date: Wed, 24 Mar 93 13:11:59 EST > > >Hi, > > We are in the process of doing a paper on Centering Theory in > >Mandarin Chinese. I'd be grateful if anyone who has any corpus on line on the computer, would let me use it. Or if you know that someone has it, please tell me. > > > > Thanks a lot. > > > > Jie liu From corpora-request@uib.no Fri Mar 26 01:04:01 1993 From: wuzhibia@iscs.nus.sg (Wu Zhibiao) Subject: Re: Chinese corpus To: jie@babel.ling.upenn.edu (Jie Liu) Date: Fri, 26 Mar 93 7:54:14 WST There is a PH corpus available in the net. Contact guojin@iss.nus.sg for more information. The PH Corpus is a GB coded, preprocessed, and (automatically) segmented collection of general news of about four million Chinese graphic characters and symbols selected from publications of the Xinhua News Agency of China during a period from January 1990 to March 1991. best zhibiao From corpora-request@uib.no Fri Mar 26 17:55:22 1993 Date: Fri, 26 Mar 1993 16:55:22 +0100 From: J.Mills@cen.ex.ac.uk To: corpora@hd.uib.no Subject: Cornish I am currently assembling a corpus of the Cornish language and I would like to hear from anyone who is working in the same field and/or has any Cornish language texts in electronic form. 
Jon Mills Department of Applied Linguistics University of Exeter UK From corpora-request@uib.no Fri Mar 26 06:53:39 1993 To: J.Mills@cen.ex.ac.uk Subject: Re: Cornish Date: Fri, 26 Mar 1993 12:53:39 -0600 From: Jon Whalen Jon, You might try posting your request to gaelic-l@irlearn.ucd.ie (a Gaelic language list) and welsh-l@irlearn.ucd.ie (a Welsh language list) You can subscribe to either list by mailing to listserv@irlearn.ucd.ie or listserv@irlearn.bitnet with a message body: SUBSCRIBE WELSH-L Your Name or SUBSCRIBE GAELIC-L Your Name Both also maintain file archives containing texts in several different Celtic languages. You can get the index of files by mailing to the above listserv with the message body: INDEX WELSH-L or INDEX GAELIC-L --jon In message <18080.9303261555@queens>you write: >I am currently assembling a corpus of the Cornish language and I would >like to hear from anyone who is working in the same field and/or has >any Cornish language texts in electronic form. > > Jon Mills > Department of Applied Linguistics > University of Exeter > UK *Jon S. Whalen Phone: (708) 576-0166* *Lead Software Engineer, Motorola, Inc. Fax: (708) 576-0892* *Corporate, Computer & Communications R&D * *Internet: jon@hook.corp.mot.com / Compuserve: 76665,3043 / AOL:JonSWhalen* From corpora-request@uib.no Wed Apr 3 05:09:42 1993 Date: Wed, 31 Mar 93 13:09:42 -0800 From: edwards@cogsci.Berkeley.EDU (Jane Edwards) To: corpora@hd.uib.no Subject: Masterpiece Library (Forwarded from the Linguist list) -------------------------------- Date: Tue, 30 Mar 93 16:41:15 -0700 From: miller@defun.cs.utah.edu (Cliff Miller) Subject: Literary Corpus on CD-ROM Masterpiece Library CD-ROM Project: Search for Reviewers A group of current and former graduate students and teachers at the University of Utah have produced a CD-ROM called Masterpiece Library which contains over 1300 pieces of classic literature from the public domain. 
If you are interested in reviewing this CD-ROM, please contact us for a free disc. Brief Description: Masterpiece Library is a collection of 1338 pieces of public domain literature and texts (the entire Bible, the Koran, Twain, Thoreau, Whitman, complete Shakespeare, US govt docs such as the Constitution, hundreds of Greek works, etc.) It has a comprehensive indexing system of 175,000 words with browser and search interfaces for both Macs and PCs. Its searching capabilities allow for AND/OR searches of several words as well as complete phrases within the books and the titles of the books. All of the classic works on our disc are in the public domain, and in that spirit we'd like to offer the CD to the public at a minimal cost. (We're planning to charge $39.95.) Anyone should be able to use these works -- quote, copy, read for personal enjoyment or research -- as freely as possible. Depending upon the feedback we receive from the research community, we will consider enhancing the searching capabilities to allow for more sophisticated linguistic analysis of the corpus. Contact Information: Pacific HiTech 4530 Fortuna Way Salt Lake City, Utah 84124 email: 71175.3152@CompuServe.COM phone: 801-278-2042 800-765-8369 fax: 801-278-2666 -------------------------------------------------------------------------- LINGUIST List: Vol-4-234. From corpora-request@uib.no Wed Mar 28 16:41:57 1993 Date: Thu, 1 Apr 93 00:41:57 -0800 From: edwards@cogsci.Berkeley.EDU (Jane Edwards) To: corpora@hd.uib.no Subject: Masterpiece Library source Please send email queries to: miller@defun.cs.utah.edu rather than to me. I am not affiliated with them; I just forwarded their posting from another list. 
Thanks, -Jane From corpora-request@uib.no Fri Apr 2 21:47:09 1993 From: shimizu To: edwards@cogsci.Berkeley.EDU (Jane Edwards) Subject: Re: Masterpiece Library source Date: Fri, 02 Apr 93 12:47:09 +0900 >> Return-Path: corpora-request@uib.no >> Received: from aun.uninett.no by hakobera.isct.kyutech.ac.jp (5.65/6.4J.6) >> id AA09044; Thu, 1 Apr 93 18:09:16 +0900 >> Received: from alf.uib.no by aun.uninett.no with SMTP (PP) >> id <12581-0@aun.uninett.no>; Thu, 1 Apr 1993 10:54:50 +0200 >> Received: from nora.hd.uib.no by alf.uib.no with SMTP (PP) >> id <18942-0@alf.uib.no>; Thu, 1 Apr 1993 10:42:14 +0200 >> Received: from cogsci.Berkeley.EDU by nora.hd.uib.no with SMTP >> id AA09650 (5.65c8/IDA-1.4.4 for ); >> Thu, 1 Apr 1993 10:44:58 +0200 >> Received: by cogsci.Berkeley.EDU (5.63/1.29) id AA06493; >> Thu, 1 Apr 93 00:41:57 -0800 >> Date: Thu, 1 Apr 93 00:41:57 -0800 >> From: edwards@cogsci.Berkeley.EDU (Jane Edwards) >> Message-Id: <9304010841.AA06493@cogsci.Berkeley.EDU> >> To: corpora@hd.uib.no >> Subject: Masterpiece Library source >> Cc: edwards@cogsci.Berkeley.EDU >> >> >> Please send email queries to: miller@defun.cs.utah.edu >> rather than to me. I am not affiliated with them; I just >> forwarded their posting from another list. >> Thanks, >> -Jane Sorry. From corpora-request@uib.no Sun Apr 11 19:25:00 1993 Date: Sun, 11 Apr 1993 23:25 EDT From: MLEWELLEN@guvax.acc.georgetown.edu Subject: subscribe to corpora To: corpora@hd.uib.no Could you please put me on the mailing list for "corpora"? My name is Mark Lewellen, at: mlewellen@guvax (bitnet) mlewellen@guvax.georgetown.edu (internet) Could I also get information about getting back issues or ftp sites? Thank you very much. Mark From corpora-request@uib.no Mon Apr 12 06:50:34 1993 Date: Mon, 12 Apr 1993 10:50:34 -0400 From: Inderjeet Mani To: corpora@hd.uib.no Subject: POS tagging - Spanish Hello, I'm interested in information on any part-of-speech taggers for Spanish. 
I assume this hasn't been recently discussed in this group and archived somewhere. If you wish, you may reply to me and I will post a collected reply. Thanks, Inderjeet Mani Artificial Intelligence Technical Center Mail Station Z401 The MITRE Corporation 7525 Colshire Drive McLean, Virginia 22102-3481 mani@starbase.mitre.org From corpora-request@uib.no Tue Apr 13 12:04:10 1993 Date: Tue, 13 Apr 93 16:04:10 EDT From: ACLX@CORNELLA.cit.cornell.edu To: corpora@hd.uib.no subscribe From corpora-request@uib.no Wed Apr 14 03:39:51 1993 Date: Wed, 14 Apr 93 07:39:51 EDT From: martinp@verdi.sra.com (Pat Martin) To: corplst@hd.uib.no Subject: listings of famous person/places Does anyone know of any listings/corpora of famous people and/or places which might be available? Pat Martin SRA Corp. (martinp@sra.com) From corpora-request@uib.no Wed Apr 14 15:51:25 1993 Date: Wed, 14 Apr 93 13:27:54 BST From: pflynn@curia.ucc.ie (Peter Flynn) Subject: Re: listings of famous person/places To: corplst@hd.uib.no > Does anyone know of any listings/corpora of famous people and/or > places which might be available? Can you narrow down your target a little? ///Peter From corpora-request@uib.no Wed Apr 14 15:52:02 1993 Date: Wed, 14 Apr 93 15:05 MET From: CELEX@mpi.nl Subject: RE: listings of famous person/places To: corpora@hd.uib.no Dear Pat, > Does anyone know of any listings/corpora of famous people and/or > places which might be available? The Computer Usable Version of the Oxford Advanced Learner's Dictionary (available from the Oxford Text Archive) contains an extra list of 2,500 common British English forenames, large towns, countries, states etc. Richard Piepenbrock CELEX Nijmegen, the Netherlands celex@mpi.nl From corpora-request@uib.no Wed Apr 14 02:25:08 1993 Date: Wed, 14 Apr 93 08:25:08 MDT From: ted To: martinp@verdi.sra.com Subject: listings of famous person/places try contacting the consortium for lexical research who have an updated and cleaned up list of place names. 
i believe they should also have a proper name recognizer soon. the best contact point is lexical@nmsu.edu From corpora-request@uib.no Wed Apr 14 12:59:16 1993 From: Doug Cutting To: corpora@hd.uib.no, empiricists@csli.stanford.edu, linguist@tamvm1.tamu.edu Subject: Xerox part-of-speech tagger available Date: Wed, 14 Apr 1993 19:59:16 PDT The Common Lisp source code for version 1.0 of the Xerox part-of-speech tagger is available for anonymous FTP from parcftp.xerox.com in the file pub/tagger/tagger-1-0.tar.Z. This code has been tested in the following CL implementations: . Franz Allegro Common Lisp version 4.1 on SunOS 4.x; . CMU Common Lisp version 16e on SunOS 4.x; and . Macintosh Common Lisp 2.0p2. Enjoy. Doug Cutting , and Jan Pedersen From corpora-request@uib.no Thu Apr 15 21:25:30 1993 Date: Fri, 16 Apr 93 01:25:30 EDT From: fujii%mackay@cs.umass.edu (Hideo Fujii) To: CORPORA@hd.uib.no Subject: join to your List Dear members of the List of CORPORA, My name is Hideo Fujii, and I am a Ph.D. student at the Computer Science Dept, Univ. of Massachusetts. I am working in the information retrieval field, and currently evaluating Japanese IR performance with a Japanese corpus. I am thinking to compare Japanese and English IR performance using English and Japanese corpus - same content but the language is different. I think your List will stimulate my research, and am expecting exchange information/opinions with your member. I would like to subscribe your List. Could you send application forms to join your List, if it is necessary. I appreciate very much for your assistance. Thank you. Sincerely, Hideo Fujii IR Lab, Computer Science Dept. University of Massachusetts Amherst, MA 01003, U.S.A. 
(fujii@cs.umass.edu) From corpora-request@uib.no Tue Apr 20 06:49:41 1993 Subject: Query re LEXA and other such software To: corpora@hd.uib.no (Corpora List) Date: Tue, 20 Apr 1993 10:49:41 -0400 (EDT) From: Jon Aske Aritza I have a question about computer software for analyzing corpora, which runs on IBM pc's. Recently there was an "ad" for a Lexa package of tools for analyzing corpora which seemed very interesting. It was billed as "a set of programs for lexical data processing, written by Raymond Hickey" and "available from the Norwegian Computing Centre for the Humanities for about 100 USD". Has anyone heard of this software? Used it? What other software do people use to analyze text? I have used Shoebox up till now as a database, but will need more tools in the near future. What can people recommend or warn against? Any information will be greatly appreciated. BTW, my current research is on the pragmatic factors which influence word order in Basque and that is what I will be using the software for. Thanks a lot. Jon ------------------------------------------------------------------------- Jon Aske Political Science / Anthropology Home address: Bates College Jon Aske Lewiston, Maine 04240, USA "Aritza Enea" 12 Bardwell St. Work phone: (207) 786-6472 Lewiston, Maine 04240-6336 Fax number: (207) 786-6123 -Phone: (207) 786-0589 e-mail: jaske@abacus.bates.edu or jonaske@garnet.berkeley.edu ------------------------------------------------------------------------- From corpora-request@uib.no Wed Apr 21 07:26:03 1993 Date: Wed, 21 Apr 93 12:26:03 CDT From: "Eric Johnson DSU, Madison, SD 57042" Subject: Text analysis To: CORPORA list , Jon Aske In reply to the question by Jon Aske about software for text analysis, I realize programming is not for everyone, but a serious researcher might want to learn a programming language designed for text analysis: SNOBOL4 (or the speedy implementation called SPITBOL) or Icon. 
I could say more about these languages if anyone is interested. -- Eric Johnson eric@sdnet.bitnet JohnsonE@columbia.dsu.edu From corpora-request@uib.no Wed Apr 21 09:50:01 1993 Subject: Re: Text analysis To: ERIC@sdnet.bates.edu (Eric Johnson DSU Madison SD 57042) Date: Wed, 21 Apr 1993 13:50:01 -0400 (EDT) From: Jon Aske Aritza Eric Johnson DSU, Madison, SD 57042 | | | In reply to the question by Jon Aske about software for text analysis, | I realize programming is not for everyone, but a serious researcher | might want to learn a programming language designed for text analysis: | SNOBOL4 (or the speedy implementation called SPITBOL) or Icon. I could | say more about these languages if anyone is interested. | | -- Eric Johnson | eric@sdnet.bitnet | JohnsonE@columbia.dsu.edu | I am not against programming, and I've done a bit of it in my time, but it seems to me that it shouldn't be necessary in this day and age for most purposes. That is, if there are proven and tested applications out there they should be made available. But I, and no doubt others, would surely like to know more about SNOBOL4 and SPITBOL and Icon. I admit my total ignorance. Thanks a lot. Jon ------------------------------------------------------------------------- Jon Aske Political Science / Anthropology Home address: Bates College Jon Aske Lewiston, Maine 04240, USA "Aritza Enea" 12 Bardwell St.
Work phone: (207) 786-6472 Lewiston, Maine 04240-6336 Fax number: (207) 786-6123 -Phone: (207) 786-0589 e-mail: jaske@abacus.bates.edu or jonaske@garnet.berkeley.edu ------------------------------------------------------------------------- From corpora-request@uib.no Wed Apr 21 06:17:55 1993 Date: Wed, 21 Apr 93 12:17:55 MDT From: ted To: ERIC@SDNET.bitnet Subject: Text analysis Date: Wed, 21 Apr 93 12:26:03 CDT From: "Eric Johnson DSU, Madison, SD 57042" I realize programming is not for everyone, but a serious researcher might want to learn a programming language designed for text analysis: SNOBOL4 (or the speedy implementation called SPITBOL) or Icon. snobol and spitbol are both relics of an ancient past. icon is an excellent tool. for many applications, awk suffices, for many others perl is excellent. From corpora-request@uib.no Thu Apr 22 09:30:28 1993 Date: Thu, 22 Apr 1993 08:30:28 +0100 To: ted , ERIC@SDNET.bitnet From: eytan@dpt-info.u-strasbg.fr (Michel Eytan, LILoL) Subject: Re: Text analysis At 12:17 21/04/93 -0600, ted wrote: > Date: Wed, 21 Apr 93 12:26:03 CDT > From: "Eric Johnson DSU, Madison, SD 57042" > > > I realize programming is not for everyone, but a serious researcher > might want to learn a programming language designed for text analysis: > SNOBOL4 (or the speedy implementation called SPITBOL) or Icon. > > >snobol and spitbol are both relics of an ancient past. > > >icon is an excellent tool. > > >for many applications, awk suffices, for many others perl is excellent. I do not agree with Ted's abrupt dismissal of Snobol (that I have had a look at quite a few years ago) and Spitbol (that I know close to nothing about). It is true that Snobol was in some ways ahead of its time: "Ne trop jeune dans un monde trop vieux". But it was *very* well adapted to what it set out to do: manipulate strings of words. Of course the newer versions must be able to do much more... 
IMHO the real problem then (as now) was that "behavioural science" people are very reticent to program. Moreover don't forget that at that time there were no PC's and no Mac's! As for 'awk' and 'perl' they are tools for technicians with unix boxes -- which is *not* the case of many people. Finally in the courses I teach on NLP as well as in the Computational Linguistics community the *one* programming language people use is Prolog which can handle string-manipulation and is a very high-level language that allows consideration of incomplete structures and gives multiple solutions (if they exist). Of course, *theoretically* whatever one language does all the others can do too -- we are talking here about Pragmatics: ease of use, syntactic simplicity, closeness to human ways of doing things. In all this I believe Prolog to be superior to anything else I know of. -- Michel Eytan, Lab Info, Log & Lang eytan@dpt-info.u-strasbg.fr Dpt Info, U Strasbourg II V: +33 88 41 74 29 22 rue Descartes, 67084 Strasbourg FR F: +33 88 41 74 40 From corpora-request@uib.no Thu Apr 22 12:08:53 1993 Date: Thu, 22 Apr 93 10:08:53 +0200 From: Oliver Christ To: eytan@dpt-info.u-strasbg.fr Subject: Text analysis >>>>> On Thu, 22 Apr 1993 08:30:28 +0100, eytan@dpt-info.u-strasbg.fr (Michel Eytan, LILoL) said: Michel> As for 'awk' and 'perl' they are tools for technicians with unix boxes -- Michel> which is *not* the case of many people. Well, perl as well as (g)awk are within the GNU public domain license -- both have been ported to PCs (as far as I know). But I agree with you that perl's syntax is somewhat strange... Michel> Finally in the courses I teach on NLP as well as in the Computational Michel> Linguistics community the *one* programming language people use is Prolog Michel> which can handle string-manipulation and is a very high-level language that Michel> allows consideration of incomplete structures and gives multiple solutions Michel> (if they exist).
Of course, *theoretically* whatever one language does all Michel> the others can do too -- we are talking here about Pragmatics: ease of use, Michel> syntactic simplicity, closeness to human ways of doing things. In all this Michel> I believe Prolog to be superior to anything else I know of. I prefer Lisp (but I don't want to start a discussion on the pros and cons of Lisp and Prolog here!). But both languages aren't designed for fast string handling in huge texts. Running with Lisp through a corpus of 20MB takes several hours, whereas awk or perl can process the same amount of data in a few minutes (that's my experience on a 'unix box'). I think one can do a lot of corpus management and retrieval tasks with simple (but fast) tools like sed, awk, grep and perl. And I will always prefer them when it comes to mere string handling or tasks which have to be done quickly (user interfacing, index rebuilding,...). Additionally, their sources are available so that they may be patched for individual and specialized purposes (see, for example, the implementation of 'cgrep' by johannes@math.uni-muenster.de which does quite fast concordancing in plain ascii texts). Oli --------------------------------------------------------------------------- Oliver Christ Institute for Natural Language Processing, University of Stuttgart, Germany oli@ims.uni-stuttgart.de/christ@is.informatik.uni-stuttgart.de --------------------------------------------------------------------------- From corpora-request@uib.no Thu Apr 22 02:12:56 1993 Date: Thu, 22 Apr 93 08:12:56 MDT From: ted To: eytan@dpt-info.u-strasbg.fr Subject: Text analysis I do not agree with Ted's abrupt dismissal of Snobol (that I have had a look at quite a few years ago) and Spitbol (that I know close to nothing about). it isn't really an abrupt dismissal. it has taken me years to decide that snobol is a relic :-). It is true that Snobol was in some ways ahead of its time: "Ne trop jeune dans un monde trop vieux".
But it was *very* well adapted to what it set out to do: manipulate strings of words. Of course the newer versions must be able to do much more... that is exactly the problem. icon, awk, sed and perl *are* the newer versions of snobol in many ways. As for 'awk' and 'perl' they are tools for technicians with unix boxes -- which is *not* the case of many people. programming in awk is about as easy as it gets. basic for string hackers. programming in perl can be made difficult, but with the excellent debugger and the excellent o'riley book on the subject, it isn't very hard. icon should be even easier, but i don't have direct experience there. Finally in the courses I teach on NLP as well as in the Computational Linguistics community the *one* programming language people use is Prolog this is said from a fairly franco-centric point of view. it is true that we use prolog quite a bit here in crl, but we also use lisp and for anything to do with large corpora, prolog is pretty hopeless due to lack of good buffered and inlined character i/o. lisp can be made to do character i/o within a factor of two of c. the staples of the large corpus work here, though, are c, awk and perl. (with a little help from lex). (if they exist). Of course, *theoretically* whatever one language does all the others can do too -- we are talking here about Pragmatics: ease of use, syntactic simplicity, closeness to human ways of doing things. In all this I believe Prolog to be superior to anything else I know of. prolog is reasonably good this way, especially if all your programmers already know it. the other side of pragmatics, though, is whether the programs will complete within your lifetime. for us, 20MB is a small amount of text and some of our corpora are several GB. one handy trick is to use the tools like awk or perl to produce statistical summaries of large amounts of text which can then be converted by automatic means into prolog or c programs. 
this allows some really wonderful cross fertilization. From corpora-request@uib.no Thu Apr 22 02:16:13 1993 Date: Thu, 22 Apr 93 08:16:13 MDT From: ted To: CORPORA@hd.uib.no Subject: Text analysis Well, perl as well as (g)awk are within the GNU public domain license -- as a point of information, the gnu license does not place the gnu software in the public domain. with public domain works, you can do anything you like. with the gnu public license, you must distribute source of derived works if you distribute binaries. this should not affect academic researchers, but people in a commercial setting have different constraints. From corpora-request@uib.no Thu Apr 22 04:49:53 1993 From: Robert Goldman Date: Thu, 22 Apr 93 09:49:53 CDT To: corpora@hd.uib.no Subject: Text analysis It certainly seems that the corpora list would benefit from more clear delineation of the relationship between application and good choice of programming language. I don't think that trying to get people to buy the same tool for all purposes is helpful. I would tend to agree that the "Unix languages" are better for CORPUS analysis and Prolog and Lisp for NLP, which tends to be less I/O bound. I think that concerns about the availability of awk and perl are overstated --- any place that has a sophisticated undergraduate majoring in computer science will probably have a copy --- and I would imagine that SNOBOL and SPITBOL are harder to come by (and less likely to be free). R From corpora-request@uib.no Thu Apr 22 05:25:02 1993 Date: Thu, 22 Apr 93 11:25:02 CST From: stan kulikowski ii Subject: re: text analysis To: corpora@hd.uib.no ok, i will wade into the 'text analysis' thread which seems to be discussing the relevant advantages of various programming languages. i am a programmer so i can fuss about snobol vs prolog and the like. 
the really big drawback of those languages is the lack of limitation upon variable scope-- at least, in the last versions of these (that i saw) you were given nothing but global scope variables. this assures that only small programs would be feasible, or you got macho-brained programmers applying self-discipline and peculiar var-naming conventions to keep large projects in human memory. the same rotten habits that basic used to teach ya. at least lisp by time it got 'common' gave local scope to vars, but it still has that mind numbing right linear syntax (and endless parentheses stacking) to deal with. in these days, if a language doesn't have drag-and-drop interfaces for constructing prototype objects, it ain't diddly. i don't think that any of this is the point. tell us what we mean by 'text analysis' and that should tell us something about the kind of tool needed. off-the-shelf application software is just fine so long as you are doing relatively standard analysis tasks. i would suggest we are really talking about several levels: 1. simple analysis-- put in the text and out comes its dictionary, sorted by frequency? just about anything should do this, and so the humanity of its user interface is a proper metric in recommending its widespread use. 2. sort of simple, at least done often-- put in the text and parse by a standard grammar? that task is a little less specific and will require a narrower class of application programs-- afterall, the purpose of this might be: to classify the text by its correspondence to the grammar (as prescriptive teachers do), or to rate the grammar by its correspondence to the text (as descriptive linguists do). surely we have all seen those horrible 'grammar checker' programs which do a markup on your text questioning your use of the passive. the inflexibility of these things usually cause me to go to programming (but then i can write my own), but i have no problem in dealing will people who choose otherwise. 
the main problem here is to keep up with what program offers what alternatives. by the time off-the-shelf applications get some task flexibility, they begin to resemble programming in complexity. i think discussion lists like this ought to traffic in that kind of information. so where do i get a good sgml editor? 3. 'hm, thats tough' class of problem-- put in the text and output the grammar that best fits the data. here we are going to be bashing heads agains metalinguistics-- what means 'best fit'? what form of grammar can be considered as output? what phenomena are going to be accounted for? when we tackle stuff like this, there is no other reasonable alternative except choosing a programming language and jumping in the caca. well, that is my two cents worth. what kind of analysis we talking about? stan stankuli@UWF.bitnet . === we all help each other get a little further down the road, º º or be damned for the fools that we are. --- -- the motorcycle modificationist's motto From corpora-request@uib.no Thu Apr 22 04:13:43 1993 Date: Thu, 22 Apr 93 11:13:43 PDT From: bclarke@cogsci.UCSD.EDU (Bob Clarke) To: corpora@hd.uib.no Subject: Text analysis languages I used (and gave up on) SNOBOL twenty-five years ago. Same for Lisp fifteen years ago and Prolog ten years ago. A current text analysis project is very happy with Rogue Wave's Tools.h++, a C++ class library. Rogue Wave is at (800) 487-3217 (USA) P.O.Box 2328, Corvallis, OR 97339. From corpora-request@uib.no Thu Apr 22 13:31:40 1993 From: Robert Goldman Date: Thu, 22 Apr 93 18:31:40 CDT To: STANKULI@UWF.bitnet Subject: text analysis Stan writes " ok, i will wade into the 'text analysis' thread which seems to be discussing the relevant advantages of various programming languages. i am a programmer so i can fuss about snobol vs prolog and the like. 
the really big drawback of those languages is the lack of limitation upon variable scope-- at least, in the last versions of these (that i saw) you were given nothing but global scope variables. this assures that only small programs would be feasible, or you got macho-brained programmers applying self-discipline and peculiar var-naming conventions to keep large projects in human memory. the same rotten habits that basic used to teach ya. at least lisp by time it got 'common' gave local scope to vars, but it still has that mind numbing right linear syntax (and endless parentheses stacking) to deal with. in these days, if a language doesn't have drag-and-drop interfaces for constructing prototype objects, it ain't diddly." I'm afraid that I couldn't agree with you less. If all one cares about is getting a megabyte or so of text sorted, counted or some such, I can't imagine that the presence or absence of lexically-scoped variables is of the slightest interest. You are far better off just writing 5 lines of awk instructions than learning ANY general-purpose programming language, even if it has a "rub the magic lantern" interface for constructing prototype objects. For that matter, I think object-orientation is going to be more conceptual trouble than its worth if you are interested in stream processing or even parsing. Let's not confuse the issue of "what's a good language for producing commercial-quality software" with the issue of "what's a good language for producing quick and dirty solutions to problems we need to solve in order to get our data in a usable form." Learning C++ for the latter is pointless overkill. In fact, learning C++ even for producing quick and dirty prototypes to test concepts which we will describe in journal articles and then let someone else program into commercial-quality software, may ALSO be overkill, depending on what problem is to hand. If you are writing a simple parser and don't care about high-speed throughput, use Prolog. 
To a first approximation, you can just write the grammar and you will get an executable parser for free. Might be slow, but maybe you are more concerned about making effective use of YOUR time and less concerned about making effective use of the computer's time (after all, you are supposed to own it, and not vice versa). Of course, if you DO want to write commercial-quality software, it's a different matter. But then again, if you want to write commercial-quality software, you don't need me to choose a programming language for you (or if you do, you ought to reconsider writing commercial-quality software!) Cheerio, R From corpora-request@uib.no Thu Apr 22 18:52:00 1993 Date: Thu, 22 Apr 1993 22:52 EDT From: "Keith J. Miller" Subject: prog. lgs. and lx applications To: corpora@hd.uib.no In a recent reply to a request for textual analysis software, Michel Eytan mentions PROLOG as the *one* language that is used in current research in NLP. From what I understand, this is truer in EUROPE than in the US, where LISP gives PROLOG a good run for its money. This is not to discount the value of PROLOG (which I actually use more), but only to aknowledge a viable alternative. I think, however, that we are digressing a bit. The original question had to do with textual analysis *software*, of which there is a good deal available, as someone has already mentioned (sorry I dont have the message any longer). Programming is great for those of us who are game, but computational analysis of corpora is definitely not out of the grasp of even the least computer-literate among us.(as we all know). Some of the software I have recently used includes TACT, which is available through ftp, and Conc for the Mac. Just two more to add to the list. ----- Keith J. 
Miller
Georgetown University
Computational Linguistics

From corpora-request@uib.no Sat May 1 18:59:03 1993
Date: Sat, 1 May 1993 16:59:03 +0200
From: Henry Kucera
To: corpora@x400.hd.uib.no
Subject: Word Cruncher

I know about Word Cruncher for text processing and retrieval but I have misplaced the information about it, including address for ordering it, price, systems requirements, etc. I would appreciate any help. My understanding is that it is for PC compatibles but not (yet) for Macs.

Thanks in advance,
Henry Kucera, Brown U.

From corpora-request@uib.no Wed May 5 02:10:34 1993
Date: Wed, 5 May 1993 00:10:34 +0200
From: CORPORA list
To: corpora@hd.uib.no
Subject: text processing/analysis (3 msgs/213 lines)

[ This was sent to the corpora-request address and appears late since I have been away/busy for the last weeks, I am sorry, -Knut ]

Send-date: Wed, 21 Apr 1993 21:55:48 UTC-0500
From: Richard L. Goerwitz
To:
Subject: text processing

>I am not against programming, and i've done a bit of it in my time, but it
>seems to me that it shouldn't be necessary in this day and age for most
>purposes. That is, if there are proven and tested applications out there
>they should be made available. But I, and no doubt others, would surely
>like to know more about SNOBOL4 and SPITBOL and Icon. I admit my total
>ignorance. Thanks a lot.

In theory you're right: Applications should be available for all situations. They should be flexible and intuitive, and should be easily customized. Rarely, though, does this happen in practice.
It is therefore vital that humanists know something about programming if they are going to use the computer as a serious research tool. Or even a not-so-serious research tool :-).

SNOBOL was a neat steppingstone, but it's quite outmoded now, and the so-called "Green Book" (the usual basic text) is out of print. SNOBOL has been replaced by Icon, which is far more of a general programming language, and which has a more modular and up-to-date overall structure. Icon should be a first choice for humanities-oriented programming these days.

Here is the blurb I send out for people interested in learning Icon. It's directed at people who already understand something about programming language structure and design, though, so it may not be of equal use to everyone here:

Icon (1976) represents a combination of Prolog-like evaluation mechanisms with an Algol-based syntax and SNOBOL-derived string processing facilities. Icon offers automatic storage allocation and garbage collection, as well as built-in associative arrays, lists, "real" strings (i.e. not just char arrays), and a data type resembling mathematical sets. Icon is a strongly, though not statically, typed language offering transparent automatic type conversions (i.e. 10, depending on its context, may be converted to "10", etc.) and an elegant string processing mechanism known as "scanning."

Central to Icon is the concept of the generator, i.e. the inherent capacity on the part of expressions to produce multiple results. Central also is the notion of goal-directed evaluation - a form of backtracking in which the components of an expression are resumed until some result is achieved, or else the expression as a whole fails.

Icon was originally designed by Ralph Griswold, Dave Hanson, and Tim Korb. It was first implemented in C by Steve Wampler. Definitive references: Ralph E. and Madge T.
Griswold, _The Icon Programming Language_ (2nd ed.; Prentice Hall, 1989); _The Implementation of the Icon Programming Language_ (Princeton Univ. Pr., 1986). Icon is at its best when used as a prototyping tool, for processing text, for performing various mappings and conversions, and as a general tool for solving problems that tend to require heuristic mechanisms, rather than purely algorithmic ones. In general, Icon's design assigns a higher priority to consistency and lucidity than to functionality within one or another operating environment. For this reason, it is not a good UNIX system administration tool. Nor is it particularly fast. It is a clean, portable system implemented under VMS, MVS, SYSV, Mach, BSD, Ultrix, HP/UX, AI/X, AU/X, and many other operating systems, as well as for various micros, such as the Atari, Mac, and PC. Icon is a good language choice for theorists exploring language design, for scholars in the humanities, and generally for people interested in nonnumeric computing. ---- -Richard L. Goerwitz goer@midway.uchicago.edu Send-date: Thu, 22 Apr 1993 13:52:50 UTC-0500 From: Richard L. Goerwitz To: Subject: try this (was text analysis) >>snobol and spitbol are both relics of an ancient past. > >I do not agree with Ted's abrupt dismissal of Snobol Nor do I, but in fact it is generally true that they use an outmoded mechanism for control flow, basically corresponding to the dreaded goto we are all taught to eschew. In Snobol, also, string scanning was not integrated into the language. These problems (both in control flow and scanning) are solved with Icon. >Finally in the courses I teach on NLP as well as in the Computational >Linguistics community the *one* programming language people use is >Prolog which can handle string-manipulation and is a very high-level >language that allows consideration of incomplete structures and gives >multiple solutions... This is exactly the way Icon works. 
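[The generator and goal-directed evaluation ideas described for Icon above can be loosely imitated in other languages. The following is a minimal Python sketch, not Icon code and not taken from any of the postings; the names find and first_after are hypothetical, chosen only for illustration. A generator produces every match position on demand, and the caller resumes it until some result satisfies a condition, or else reports failure.]

```python
from typing import Iterator, Optional


def find(needle: str, haystack: str) -> Iterator[int]:
    """Yield each 0-based position at which needle occurs in haystack."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)


def first_after(needle: str, haystack: str, minimum: int) -> Optional[int]:
    """Resume find() until a position greater than minimum turns up.

    Returns None when the generator runs out of results, which plays the
    role of the whole expression failing.
    """
    for pos in find(needle, haystack):
        if pos > minimum:
            return pos
    return None


if __name__ == "__main__":
    text = "the cat and the hat"
    print(list(find("the", text)))
    print(first_after("the", text, 5))
```

[In this sketch the caller never asks for "the second match" explicitly; it simply keeps resuming the generator until the condition holds, which is roughly the flavour of backtracking the blurb describes. Real Icon integrates this into ordinary expression evaluation rather than requiring explicit loops.]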
Icon is like a Pascal with a Prolog-ish evaluation mechanism built in. The resemblance is coincidental, according to their respective designers, but the effect is similar: Both languages lend themselves to elegant expression of problems that permit multiple solutions. Prolog, of course, is a fine language, and I would never discourage anyone from learning it. I simply feel that Icon has not been given enough attention by the NLP community. Icon has a terrific OS/2 and X interface (NT and Windows in the works), and is intuitive, powerful, and easy to use. Very much a real-world system - one that has the expressiveness of a Prolog, the clarity of a Pascal, and the functionality of a Snobol (along with optimized string processing). At times I use C or LISP out of necessity. And I've been through one course in NLP for Prolog. Don't dump the other languages! Just trust me that there really is nothing quite like Icon. Note that Icon is also PD. Free. -Richard Goerwitz U of Chicago goer@midway.uchicago.edu Send-date: Thu, 22 Apr 1993 21:00:49 UTC-0500 From: Richard L. Goerwitz To: Subject: text analysis This discussion is important for people thinking about what pro- gramming language to learn, so I hope this lengthy posting won't seem an intrusion on a list devoted otherwise to corpus-related matters.... >>lisp by time it got 'common' gave local scope to vars, but it still >>has that mind numbing right linear syntax (and endless parentheses >>stacking) to deal with. in these days, if a language doesn't have >>drag-and-drop interfaces for constructing prototype objects, it ain't >>diddly. >I'm afraid that I couldn't agree with you less. If all one cares >about is getting a megabyte or so of text sorted, counted or some >such, I can't imagine that the presence or absence of lexically-scoped >variables is of the slightest interest. This is an excellent point. 
What language is widely available, will work under a number of interfaces and operating systems, and is a good, easy-to-learn, general-purpose system? An answer here will get you a language to use for about 90% of your programming needs.

Let's take some of the languages we're discussing here. I'll grade them based on what I know of them, either first or second hand. By utility, I mean, "Can you do real-world things in it?" I don't mean, "Can you do things easily?" By availability, I mean, "Are high-quality cheap or free implementations widely available?" By ease of use I mean, "Is the language easy to pick up and use, or does it tend to detour the user into things like storage, rigid static typing, and low-level operations a humanities programmer shouldn't normally have to worry about?" Speed is self-explanatory. Expressiveness refers to a language's intrinsic ability to state complex problems simply and elegantly.

I inserted question marks where I just don't have enough information to say anything intelligent, or where the variability is too great to allow meaningful categorization. Note that I've only studied C and Icon intensively. I've worked with Prolog, Common Lisp, and Pascal. I also use AWK sometimes. I've not really done much with PERL, and I haven't touched SNOBOL, except to read a few introductory articles. There's obviously going to be a lot of room for debate here:

              general utility  availability  ease of use  speed  expressiveness
pascal              B+              B+            B         ?         B-
c++                 A               B+            C+        A-        B-
c                   A               A             C         A         C
snobol              B               C             A         ?         B
icon                B+              B+            A-        B         A
awk                 B-              B             A-        B         B
perl                A               C+            B+        B         B
prolog              B               B             B+        B         A
common lisp         B+              B+            A-        ?         B+

I don't see any clear winners, myself.

Re writing quick and dirty grammars in Prolog, I might add that the following statement doesn't match my own experience:

    To a first approximation, you can just write the grammar and you will get an executable parser for free. Might be slow...

Prolog works fine _with certain kinds of grammars_, i.e.
those that can be parsed easily using a recursive descent backtracking mechanism. Anything more fancy, and it doesn't give you anything on, say, LISP. Icon also handles the kinds of grammars Prolog does quite easily, because of its similar control mechanisms. Personally, I find Lex and YACC to be exquisite tools, and rather high-level, too.

But what if we want to create a Tomita-style parser that handles ambiguity? A chart parser? I'll bet that for these, LISP, Prolog, and Icon will all do just fine. The choice between these 3 is not altogether obvious to me, although it seems that LISP and Icon are capable of real-world tasks to an extent that Prolog implementations I have seen aren't. Icon retains the expressiveness of Prolog, and keeps the high-level string-processing tools. It is therefore my favorite. But not by any wide margin.

-Richard Goerwitz goer@midway.uchicago.edu

From corpora-request@uib.no Wed May 5 02:24:49 1993
Date: Wed, 5 May 1993 00:24:49 +0200
From: Knut Hofland
To: corpora@hd.uib.no
Subject: Adm. message: sub/unsub to listserv@uib.no

It is now possible to unsubscribe/subscribe by sending messages to listserv@uib.no

Send one of these lines in the body of the message

    unsub corpora
    sub corpora firstname lastname

If you have problems with unsubscribing with messages to listserv, then you are registered with a different address and you have to send a message to corpora-request@nora.hd.uib.no for manual deletion.

Please observe that listserv@uib.no is not a Bitnet LISTSERV, it can only respond to very few commands.

Knut Hofland
Listadm. Corpora

From corpora-request@uib.no Tue May 4 12:58:44 1993
Date: Tue, 4 May 93 18:58:44 MDT
From: ted
To: corpora@hd.uib.no
Subject: text processing/analysis (3 msgs/213 lines)

i really feel it important to try to clear up this common misconception.

Send-date: Thu, 22 Apr 1993 21:00:49 UTC-0500
From: Richard L. Goerwitz
To:
Subject: text analysis ...
To a first approximation, you can just write the grammar and you will get an executable parser for free. Might be slow... Prolog works fine _with certain kinds of grammars_, i.e. those that can be parsed easily using a recursive descent backtracking mechanism. Anything more fancy, and it doesn't give you anything on, say, LISP. this is patently not true. in fact, the most commonly supplied implementation of definite clause grammars (DCG's) translates these grammars to prolog clauses which implement a recursive descent parser, but this is not necessary at all. in fact, there are versions which translate DCG's into depth first iterative deepening versions, into tables for chart parsers and almost anything else you would like. richard o'keefe's book `the craft of prolog' describes many of these translations. the nice thing is that you don't have to change a line of your grammar when changing between these alternative parsing technologies. many people think that because prolog provides depth first backtracking, that such is the only way that prolog can execute your code. not only can you change the way that grammars are translated into prolog, but you can change the way that all of your programs are translated into prolog. this means that you can transform your programs to do depth first, best first, breadth first or depth first iterative deepening execution. lisp provides this same sort of capability, but in a rather different manner. definitions of source transformations in lisp are more decentralized which can lead to better modularity of transformations but the native syntax is a bit less flexible which means that people who want fancy syntax generally don't get it. i generally don't mind. the lack of prolog's native backtracking engine in lisp, though, can make quite a difference in these sorts of programs. 
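[As an illustration of the recursive descent, depth-first backtracking strategy under discussion, here is a minimal sketch in Python rather than Prolog. It is not the DCG translation itself and not anyone's posted code; the combinator names word, seq and alt and the toy grammar are hypothetical. Each parser yields every position at which a parse could end, so alternatives are explored by resuming generators, much as Prolog retries clauses.]

```python
from typing import Callable, Iterator, List

# A parser takes the token list and a start index and yields every index
# at which a successful parse could end (no results means failure).
Parser = Callable[[List[str], int], Iterator[int]]


def word(*choices: str) -> Parser:
    """Match one token drawn from a fixed set of words."""
    def parse(tokens: List[str], i: int) -> Iterator[int]:
        if i < len(tokens) and tokens[i] in choices:
            yield i + 1
    return parse


def seq(*parts: Parser) -> Parser:
    """Match each part in turn, backtracking into earlier parts on failure."""
    def parse(tokens: List[str], i: int) -> Iterator[int]:
        if not parts:
            yield i
            return
        rest = seq(*parts[1:])
        for j in parts[0](tokens, i):
            yield from rest(tokens, j)
    return parse


def alt(*options: Parser) -> Parser:
    """Try the alternatives in order, depth first."""
    def parse(tokens: List[str], i: int) -> Iterator[int]:
        for option in options:
            yield from option(tokens, i)
    return parse


# Hypothetical toy grammar: s -> np vp; np -> det noun; vp -> verb np | verb
det = word("the", "a")
noun = word("dog", "cat")
verb = word("sees", "sleeps")
np = seq(det, noun)
vp = alt(seq(verb, np), verb)
s = seq(np, vp)


def accepts(sentence: str) -> bool:
    """True if some parse of s consumes every token."""
    tokens = sentence.split()
    return any(end == len(tokens) for end in s(tokens, 0))


if __name__ == "__main__":
    print(accepts("the dog sees a cat"))
    print(accepts("dog the sees"))
```

[Note that the grammar lines at the bottom say nothing about search order; only seq and alt encode the depth-first strategy. Swapping in different combinators would change the strategy while leaving the grammar untouched, which is a small-scale analogue of the point being made about alternative DCG translations.]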
while depth first search makes recursive descent parsers trivial, it also assists in implementing other forms, largely because iterative deepening is such a wonderful alternative to breadth first search. so please... prolog grammars are *not* limited to those grammars suitable for depth first parsing. From corpora-request@uib.no Wed May 5 05:46:15 1993 Date: Wed, 05 May 93 10:46:15 EST From: mark To: goer@midway.uchicago.edu, corpora@hd.uib.no Subject: Text processing R. Goerwitz (sp?) writes that Icon is public domain. Where can I get a DOS implementation of Icon? It sounds interesting and quite possibly useful to me. Mark A. Mandel Dragon Systems, Inc. : speech recognition : +1 617 965-5200 320 Nevada St. : Newton, Mass. 02160, USA : mark@dragonsys.com From corpora-request@uib.no Sat May 8 05:52:53 1993 Date: Sat, 8 May 1993 11:48:19 HKT From: "lcjohn@usthk.ust.hk" Subject: source of tei guidelines? To: corpora@hd.uib.no Forgive me if this is a FAQ, but could someone tell me how to obtain the latest TEI guidelines? Thanks. John Milton Hong Kong University of Science and Technology LCJOHN@USTHK.UST.HK From corpora-request@uib.no Sat May 8 10:02:13 1993 To: "lcjohn@usthk.ust.hk" Subject: Re: source of tei guidelines? Date: Sat, 08 May 93 16:57:52 V From: syun tutiya John Milton writes: > Forgive me if this is a FAQ, but could someone tell me how to obtain the > latest TEI guidelines? Thanks. In general, send a BITNET subscription message to TEI-L at LISTSERV@UICVM.BITNET or (from an Internet site) at LISTSERV@UICVM.UIC.EDU. The welcome message contains further information, including how to get the fascicles of what the project call "TEI P2," the latest draft guidelines. In your case, however, contact me at tutiya@culle.l.chiba-u.ac.jp as I am in charge of making the drafts more accessible from Asian countries. They are distributed in several formats so I may be of some help if you let me know of the computational resources available to you. 
Syun Tutiya

From corpora-request@uib.no Sat May 8 15:19:00 1993
Date: Sat, 8 May 1993 22:19 PDT
From: HSCHUREN@TWNAS886.bitnet
Subject: Abstract Deadline for PACFoCoL
To: linguist@tamvm1.tamu.edu, corpora@hd.uib.no, empiricist@csli.stanford.edu,

Pacific Asia Conference on Formal and Computational Linguistics (PACFoCoL I, 1993)
Academic Activity Center, Academia Sinica, Taipei, Taiwan
August 30-31, 1993

Abstract Deadline (by email): MAY 10 (MONDAY)

This is a reminder that the deadline for submitting an abstract for PACFoCoL is upcoming.

PACFoCoL will provide an opportunity to further the scholarly exchange among linguists in the Pacific Asia region in the areas of formal and computational linguistics and to foster a cooperative environment for better understanding of developments and new trends in theoretical and computational linguistics in the region. Topics of the conference include theoretical and computational studies in syntax, semantics, corpus linguistics and contrastive analysis of Pacific Asian languages.
The organizing committee welcomes submission of one-page abstracts (with an additional optional page for references and/or data) that address the above topics. The abstract may be submitted via e-mail or via airmail. The notice of abstract acceptance will be mailed out by June 11, 1993. All accepted papers will be collected in a volume of conference proceedings. The full paper should be camera-ready and not exceed twelve (12) pages of A-4 or letter size paper (single-sided, single-spaced, and at least 12 points in size, with at least 1" margin on the right, at top, and 1.5" margin on the bottom). The deadline for the submission of the full paper and pre-registration is June 28, 1993.

Please address all correspondence to the organizing committee at the following address:

Professor Chu-Ren Huang
ROCLING
c/o Institute of Information Science
Academia Sinica
Nankang, Taipei 115
Taiwan, R.O.C.
Tel: 886-2-788-1638
Fax: 886-2-788-1638
e-mail: churen@iis.sinica.edu.tw OR hschuren@ccvax.as.edu.tw
        nccut086@twnmoe10.bitnet (Professor Claire H. Chang)
        nccut146@twnmoe10.bitnet (Professor O.-S. Her)

Sponsored By:
The Computational Linguistics Society of R.O.C.

Co-Sponsored By:
Institute of History and Philology, Academia Sinica
Institute of Information Science, Academia Sinica
Department of English and Graduate Institute of Linguistics, National Chengchi University
The Logico-Linguistics Society of Japan
The Linguistic Society of Hong Kong

From corpora-request@uib.no Tue May 11 12:28:50 1993
Date: Tue, 11 May 1993 10:28:50 +0200
From: j.t.loenning@ilf.uio.no
To: linguist@tamvm1.tamu.edu, corpora@hd.uib.no
Subject: Job

I take the liberty to post this ad in Norwegian as an applicant who masters Norwegian will be preferred, and I do not have an English translation at hand:

---------------
Research position at the Text Laboratory (Forskerstilling ved Tekstlaboratoriet)

At the Faculty of Arts (Det historisk-filosofiske fakultet), University of Oslo, a position for a researcher is vacant from 1.8.93. The position has been established by the Research Council of Norway (Norges forskningsraad), NAVF division, for a period until 31.12.95. The person appointed will work on the project "Electronic tools in language research" ("Elektroniske verktoey i spraakforskningen"). The aim of the project is to encourage language researchers to use electronic aids in the research process, partly through training measures for the faculty's language researchers, partly through research on, and development of, such aids.

The position will be divided between teaching, administration and research roughly as for permanent academic positions. On the teaching side, the researcher will develop and hold courses for the faculty's teachers, research fellows and graduate (hovedfag) students in the use of electronic tools in language research.

The person appointed should have senior-level academic qualifications (foerstestillingskompetanse) in a language subject, or senior-level academic qualifications in informatics together with experience from language research. He or she should also have experience in using electronic tools in language research or experience in developing such tools. If no applicant with such qualifications comes forward, an applicant with qualifications at amanuensis level (amanuensiskompetanse) may be appointed.

With a doctorate, the position is paid at salary grade (l.tr.) 22 (Kr 251 606 per year). Otherwise it is paid at salary grades 16 21. Membership in Kommunal landspensjonskasse.

NAVF works to recruit more women to research. Where conditions are otherwise approximately equal, female applicants will be preferred.

Applications with copies of diplomas and testimonials should be sent in 5 copies to Tekstlaboratoriet, Institutt for lingvistiske fag, P.b. 1102 Blindern, 0317 Oslo.

A job description has been prepared which specifies the duties, what will be emphasised in the appointment, and what the application should contain. It can be obtained from Institutt for lingvistiske fag (tel. 22 85 43 48). Further information about the position can be obtained from Jan Tore Loenning (tel. 22 85 69 71).

Application deadline 27.5.
----------------------------------------------------------------------
Jan Tore Loenning
University of Oslo         phone:  + 47 22 85 69 71
Dept. of Linguistics       dept.   + 47 22 85 43 48
P.o. Box 1102, Blindern    fax:    + 47 22 85 69 19
0315 Oslo, Norway          e-mail: JTL@ilf.uio.no
----------------------------------------------------------------------

From corpora-request@uib.no Tue May 11 09:09:09 1993
Date: Tue, 11 May 93 13:09:09 EDT
From: fujii%mackay@cs.umass.edu (Hideo Fujii)
To: corpora@hd.uib.NO
Subject: Re: text processing/analysis

Hi, I saw R. Goerwitz's assessment (4/22) of various prog. languages for text processing/analysis. I would like to know your ideas about Smalltalk -- I think it has a good built-in class library as a workbench for this purpose. How would you fit it into the assessment table? I would also like to know whether, and how, anyone is working extensively with this language for text processing.

--Hideo Fujii
U. of Mass, Amherst

From corpora-request@uib.no Wed May 12 11:41:01 1993
Date: Wed, 12 May 1993 10:41:01 +0100
From: Lou Burnard
To: CORPORA@HD.UIB.NO
Subject: Help: German/English corpora?

[I pass on the following message in the hope that someone on this list will be able to provide some helpful pointers for this student:]

From: Anna Bewes
Date: Wed, 12 May 93 10:24:16 BST

I am a student at the University of Manchester, currently working on my project for an MSc in Cognitive Science. I will be working on a small machine translation system with the aim of translating German temporal prepositions into their correct English equivalents. Once I have a working program going I will need some way of testing the results. With this in mind I am looking for a bilingual corpus in German and English. From this I could extract German sentences containing temporal prepositions and pass them through my program, comparing the results with the English translation from the corpus. As yet, however, I have failed to find any such corpus.
My supervisor, David Bree, suggested that you might be able to help. If you have no knowledge of a bilingual German-English corpus, do you know of any properly tagged German corpus?

Many thanks for your help,
Anna Bewes

From corpora-request@uib.no Thu May 13 08:15:04 1993
Date: Thu, 13 May 93 12:15:04 EDT
From: Gonzalo.Silverio@um.cc.umich.edu
To: corpora@uib.no

subscribe corpora@uib.no g.silverio

From corpora-request@uib.no Fri May 14 11:44:22 1993
From: Elizabeth Garner
Subject: Advisory Dialogues
To: corpora@hd.uib.no
Date: Fri, 14 May 1993 09:44:22 +0200

>
> I am looking for a set of advisory dialogues (in text form) that I would be free to use
> as the basis of my PhD dissertation. In particular, I am interested in dialogues
> between an information-seeker and an advisor, where to answer the client's queries
> the advisor needs to elicit further information from the client. I have in mind
> perhaps bank-type situations, (applying for loans...), obtaining information from
> social service departments, etc.
>
> Does anyone know of such a corpus?
>
> Thanks in advance
>
> Elizabeth Garner
> Institut f. Medizinische Kybernetik u. AI
> Vienna
> elizabeth@ai.univie.ac.at
>

From corpora-request@uib.no Fri May 14 13:27:53 1993
From: "Henry S. Thompson"
Date: Fri, 14 May 93 10:00:45 BST
To: elizabeth@venedig.ai.univie.ac.at
Subject: Re: Advisory Dialogues

Hi; You might be interested in the HCRC MapTask corpus of transcribed speech dialogues. This is a database of dialogues between two people, one of whom is trying to explain a route on a map to the other. This is available on CD-ROM from HCRC (contact or for more information)

David McKelvie
Human Communication Research Centre, University of Edinburgh
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND
-- (44) 31 650-4439 Fax:(44)31 650-4587 email: eucorp@cogsci.ed.ac.uk

Thank you for your enquiry concerning the Map Task Corpus.
I've included below a shar file which contains a sample README from one of the disks, as well as the documentation of the basic corpus design. I hope this will answer most of your questions. ht ------------- #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh /home/data/mt-cd/cd/6/6readme <<'END_OF_/home/data/mt-cd/cd/6/6rea dme' X The HCRC Map Task Corpus X X Version 1.0 X X Human Communication Research Centre X University of Edinburgh & University of Glasgow X X Copyright 1992 Human Communication Research Centre X XLICENCE: The copyright holder grants to the purchaser of these CD-ROMs Xunrestricted licence to use all the corpus materials (speech, Xtranscription, maps, tools, documentation) included herein, subject Xonly to the following restrictions: 1) No onward distribution of the Xcorpus materials is allowed -- copies may be made only for use by the Xpurchaser and his/her research group, for ease of use by that group, Xetc.; 2) The contribution of HCRC is acknowledged in any public Xpresentation or publication of any work based on the corpus. X X The HCRC Map Task Corpus carries no warranty of any kind. X XSince HCRC continues to use the Corpus in our own research, we welcome Xcontact with colleagues engaged in similar projects. For this reason Xwe ask purchasers to notify us as a matter of courtesy of the topic of Xtheir intended work with these materials. X X Funding by Economic and Social Research Council, UK X Pre-mastering by the Linguistic Data Consortium, USA X XThis is CD-ROM 6 of a set of eight. Taken together the full set contains: X * all the materials used to collect a set of 128 spoken dialogues; X * sampled digital audio for those dialogues; X * orthographic transcriptions of the dialogues; X * documentation; X * source code for tools. X XI. 
Directory structure and file contents X XAll eight CD-ROMs have a common structure. X XThe top-level directory contains the following files on all eight: X X 0dir A complete listing of all files, giving the CD on X which each can be found X 6readme This file, with the CD number changing from one CD to X the next. X maptask.sgm A TEI Corpus file of the complete set of transcripts. X Contains basic documentation about the corpus. X XThe top-level of each CD contains the following directories in all Xcases: X X doc/ ASCII and/or PostScript(TM) versions of various papers X about the corpus: START HERE X etc/ Miscellaneous useful bits and pieces X lib/ Resources for included tools X src/ UNIX(TM) scripts and C sources for useful tools X trans/ A complete set of transcripts X XAll of these, as well as most of the others listed below, contain Xfurther `0readme' files with more detailed descriptions of their Xcontents. X XThe 0readme in the src directory contains a number of examples of use Xof the distributed tools to obtain different kinds of information from Xthe corpus. X XIn addition to the common directories, this CD also contains X X q6/ X XThis contains all sampled audio and transcripts for one eighth of the Xdesign (see doc/design.sgm for a description of the design), in a Xdirectory structure which reflects certain key aspects of the design, Xas follows: X X wordlst2.sgm The script for the wordlist recitations (see below) X e/ The eye-contact condition X n/ The no-eye-contact condition X maps/ Bitmaps and other information for the maps used here X XThe e/ and n/ directories have the same structure: X 0readme Includes a brief description of the transcript files X diagnost/ Dialect diagnosis materials: X NIST header (.nst), SAM header (.seo), X sampled speech (.ses) X talkers/ Information about the talkers X wordlist/ Sampled audio of the wordlist recitations* X NIST header (.nst), SAM header (.seo), X sampled speech (.ses) X c1/ X ... 
Conversations X c8/ X XEach conversation directory has the following files X X NIST header (.nst), SAM header (.seo), X sampled speech (.ses), transcript (.trn), X TEI entry-point (.sgm), TEI wrapper (x.pge). X X Note that the transcripts are linked to the sampled X speech files by time-stamps every few turns. X X The format of these files is described in the X 0readme file one level up, in the e/ and n/ X directories. X XSome of the wordlist recordings are split across two files, as there Xwere discontinuities at recording time. The recording for one Xsubject, q8nta2, is missing. X XThe displaced wordlists are in their proper place in the directory Xtrees, so that if it were possible to mount all 8 CDs with the same Xroot, the complete directory structure, with q1--q8 all in place, Xwould result. X XNote that the top-level file 0dir gives for each file in the corpus Xthe number(s) of the CD(s) on which it appears. In the case of files Xpresent on all eight CDs, an asterisk (*) is used. X XII. File naming conventions X XFile names for files associated with talkers (diagnostics, wordlists, Xinformation) are constructed to the following model, where [1-8] means Xa (quad) number between 1 and 8, [en] means e or n for eye-contact or Xno-eye-contact, [ab] means a or b, [12] means 1 or 2 and [dw]? means d X(for diagnostic), w (for wordlist) or nothing (for information): X X q[1-8][en]t[ab][12][dw]? X XFor example, q2etb2d.nst is the NIST header for the diagnostic reading Xby the 2nd talker of pair b of quad 2, eye-contact condition, and Xwould be found on CD 2 in q2/e/diagnost. X XIn the cases where a wordlist recitation is split across two files, Xsuffixes 'p' and 'q' are used, e.g. q8nta1wp.ses, q8nta1wq.ses. 
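[Editorial sketch, not part of the distributed README.] The talker file-name model above can be read as a regular expression. The Python below is a minimal illustration only: the function name parse_talker_file and the dictionary keys are hypothetical, and it assumes that "information" files (no d/w letter) live in the talkers/ subdirectory, which the README implies but does not state outright. It reports the directory within the q1--q8 tree, not a CD number, since some wordlists are displaced onto other CDs.

```python
import re

# Pattern from the README text: q[1-8][en]t[ab][12][dw]?, plus the optional
# 'p'/'q' suffix used when a wordlist recitation is split across two files.
TALKER_FILE = re.compile(
    r"^q(?P<quad>[1-8])(?P<cond>[en])t(?P<pair>[ab])(?P<talker>[12])"
    r"(?P<kind>[dw]?)(?P<part>[pq]?)\.(?P<ext>[a-z0-9]+)$"
)

CONDITIONS = {"e": "eye-contact", "n": "no-eye-contact"}

# (description, subdirectory) per kind letter. The mapping of the empty
# letter to talkers/ is an assumption based on the directory listing.
KINDS = {
    "d": ("diagnostic", "diagnost"),
    "w": ("wordlist", "wordlist"),
    "": ("information", "talkers"),
}


def parse_talker_file(name):
    """Return a dict describing a talker file name, or None if it does not fit."""
    m = TALKER_FILE.match(name)
    if m is None:
        return None
    # The p/q suffix is only described for split wordlist recordings.
    if m["part"] and m["kind"] != "w":
        return None
    kind, subdir = KINDS[m["kind"]]
    return {
        "quad": int(m["quad"]),
        "condition": CONDITIONS[m["cond"]],
        "pair": m["pair"],
        "talker": int(m["talker"]),
        "kind": kind,
        "part": m["part"] or None,
        "extension": m["ext"],
        "directory": f"q{m['quad']}/{m['cond']}/{subdir}",
    }


if __name__ == "__main__":
    # The README's own example: diagnostic reading, 2nd talker of pair b, quad 2.
    print(parse_talker_file("q2etb2d.nst")["directory"])  # q2/e/diagnost
```

Names that do not fit the model (for instance a quad number outside 1-8) yield None rather than raising, so the function can be used to filter a listing such as 0dir.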
X XFile names for files associated with conversations are constructed to Xthe following model, where [1-8] means a (quad or conversation) number Xbetween 1 and 8 and [en] means e or n for eye-contact or Xno-eye-contact: X X q[1-8][en]c[1-8] X XFor example, q4nc3.ses is the sampled speech for conversation 3 of Xquad 4, no-eye-contact condition, and would be found on CD 4 in Xq4/n/c3. X XNote that as each conversation has an id, and each turn has a number, Xto refer to an individual turn in a standard way, use XMTC1::, e.g. MTC1:q4nc3:32 is the Instruction XFollower saying "No. Sorry." X XIII. Use of SGML (Standard Generalized Markup Language) X XThe transcripts, documentation and some of the associated materials Xincluded in this corpus are marked up using SGML, following the draft Xguidelines of the Text Encoding Initiative (TEI). We have been quite Xscrupulous in observing the guidelines for document headers, where we Xhave changed very little of what has been distributed by the TEI. In Xthe body of the transcripts, mindful of the needs of those who will Xread them as they stand and/or process them with tools which are not Xsensitive to SGML markup, we have had to deviate rather more from TEI Xnorms. All the files anywhere in the corpus with extension ".sgm" are XSGML-conformant, as validated by version 1.0 of the public domain XUNIX(TM) tool sgmls, which is included herewith in the src directory. X XTwo different ways of accessing the transcripts as conformant SGML/TEI Xdocuments are provided: X X 1) Via the top-level corpus file maptask.sgm, which encompasses the X entire 128 transcripts; X 2) Via the individual .sgm files at the leaves of the directory X tree, which each embody exactly one transcript. X XOf course, non-SGML-based tools can access the .trn files directly, Xeither in the top-level trans directory, or at the leaves of the Xdirectory tree.
The files contained in each place are identical, and Xare provided in duplicate purely for convenience in accessing them in Xdifferent ways. X XThe file doc/editorl.sgm provides detailed information about the Xeditorial conventions and markup used in the transcripts. X XPublic entity references are used throughout for external references, Xand the script in src/mtei documents the search path which is required Xfor those references to succeed. X XFor further information about these issues, see lib/tei/0readme and Xthe DTD files in the same directory. X XIV. Contacts X XThe production work on these CDs, as opposed to the corpus itself, was Xdone by Henry S. Thompson and Miles Bader, HCRC, University of Edinburgh. X XPre-mastering was done by David Graff, LDC, University of Pennsylvania. X XThe CDs were pressed by Discovery Systems, Dublin, Ohio. X XFor further information and for notification of use of the corpus as Xper the request above, please send electronic mail to X X maptask@uk.ac.edinburgh (JANET) X maptask@edinburgh.ac.uk (INTERNET) X Xor surface mail to X X Map Task X Human Communication Research Centre X University of Edinburgh X 2 Buccleuch Place X Edinburgh EH8 9LW X SCOTLAND X X-------------------- XUNIX is a trademark of AT&T Bell Laboratories. XPostScript is a trademark of Adobe Systems Incorporated. END_OF_/home/data/mt-cd/cd/6/6readme if test 8411 -ne `wc -c ../cdtree/template/doc/lnspeech.sgm <<'END_OF_../cdtree/template/d oc/lnspeech.sgm' X X X X X X X X [The HCRC Map Task Corpus], electronic version X X Anne H. Anderson X Miles Bader X Ellen Gurman Bard X Elizabeth Boyle X Gwyneth Doherty X Simon Garrod X Stephen Isard X Jacqueline Kowtko X Jan McAllister X Jim Miller X Catherine Sotillo X Henry Thompson X Regina Weinert X Henry S. Thompson X TEI tags X UK Economic and Social Research Council X X &HCRC.dist; X X

Based on a minimally formatted version of the electronic basis of the X original paper

X X Anne H. Anderson X Miles Bader X Ellen Gurman Bard X Elizabeth Boyle X Gwyneth Doherty X Simon Garrod X Stephen Isard X Jacqueline Kowtko X Jan McAllister X Jim Miller X Catherine Sotillo X Henry Thompson X Regina Weinert X The HCRC Map Task Corpus X Language and Speech X X Kingston Press Services, Ltd. X Twickenham, UK X 1991 X X Volume 34, Number 4, pp. 351-366 X X

X Plain ascii text, with spaces and tabs used for formatting. X lnspeech.ps, in this directory, is a postscript version of the original. X


X THE HCRC MAP TASK CORPUS* X X X X X X X Anne H. Anderson, Miles Bader, Ellen Gurman Bard, X Elizabeth Boyle, Gwyneth Doherty, Simon Garrod, X Stephen Isard, Jacqueline Kowtko, Jan McAllister, X Jim Miller, Catherine Sotillo, Henry Thompson, X Regina Weinert X X X X X X University of Edinburgh X X and X X University of Glasgow X X X X X X X X X X X X Running head: Map Task Corpus X X X X *This work was supported by an Interdisciplinary Research Centre Grant Xfrom the Economic and Social Research Council (UK) to the Universities of XEdinburgh and Glasgow. The authors are grateful to Jim Hieronymus for Xtechnical advice and to Richard Shillcock for imaginative artwork. This is Xa copy of the text of an article appearing in Language and Speech, 1991: X34(4), 351-366, and should be cited as such. For figures and phonetic Xsymbols, consult the published article. Reprint requests should be sent to XE. G. Bard, Human Communication Research Centre, University of Edinburgh, 2 XBuccleuch Place, Edinburgh EH8 9LW, U.K. X X X ABSTRACT X X This paper describes a corpus of unscripted, task-oriented dialogues Xwhich has been designed, digitally recorded, and transcribed to support the Xstudy of spontaneous speech on many levels. The corpus uses the Map Task X(Brown, Anderson, Yule, and Shillcock, 1983) in which speakers must Xcollaborate verbally to reproduce on one participant's map a route printed Xon the other's. In all, the corpus includes four conversations from each Xof 64 young adults and manipulates the following variables: familiarity of Xspeakers, eye contact between speakers, matching between landmarks on the Xparticipants' maps, opportunities for contrastive stress, and phonological Xcharacteristics of landmark names. The motivations for the design are set Xout and basic corpus statistics are presented.
X X X X INTRODUCTION X X This paper describes a research tool, a corpus of dialogues, which is Xavailable on CD-ROM both as digitized speech and as verbatim orthographic Xtranscription. The corpus is a response to one of the basic difficulties Xof work on natural language: while most language use is in the form of Xunscripted dialogue, much of our knowledge of language is based on prepared Xmaterials. X X The direct study of real dialogues serving real communicative goals Xcan, of course, be methodologically dangerous. One obvious drawback is Xquantitative: Phenomena of theoretical interest may be so sparsely Xrepresented in naturally occurring speech that huge corpora may still fail Xto supply sufficient instances to support robust conclusions. Underlying Xthis difficulty is a qualitative problem: Many linguistic phenomena depend Xheavily on the linguistic and extralinguistic contexts in which they Xappear, and in corpora of spontaneous speech, critical aspects of context Xmay be either unknown or uncontrolled. As a result, much research employs Xa safer approach, depending not on spontaneous dialogue, but on carefully Xscripted monologues of various lengths or on extended texts. These may be Xanything but spontaneous, and in the case of texts, may not even be Xintended to be spoken. Although such materials assure that the needs of Xthe particular research are met, they often serve to exaggerate the problem Xof the blind men and the elephant. What the resulting research examines is Xnot so much different parts of the same living creature but elegantly Xcrafted non-contiguous plaster casts of parts of the animal. 
X X The intention behind the present corpus is not to detract from `cast' Xmaterials which allow rigorous study of linguistic phenomena, but to Xattempt to supplement them with a more than usually amenable elephant in Xthe form of a corpus of dialogues large enough and controlled enough to Xpermit profitable simultaneous study from a number of points of view. XWhile the dialogues in the corpus are unscripted, the corpus as a whole Xcomprises a large, carefully controlled elicitation exercise. X X The dialogues were produced during the performance of the Map Task X(Brown, Anderson, Yule, and Shillcock, 1983). Each of the two participants Xin this task has a schematic map which the other cannot see, but both Xcollaborate to reproduce on one of the maps a route already printed on the Xother (see Figures 1a and 1b). Although the participant with the Xpre-printed route is designated the Instruction Giver, and the other as the XInstruction Follower, no restrictions are placed on what either can say. X X Part of the current design derives from the characteristics of the Xtask as Brown et al. devised it: X X X 1. All Map Task dialogues have a similar goal which is known to the X observer independently of what can be gleaned from participants' X utterances: reproducing a route of known form and controlled X complexity on a map with comparable numbers of landmarks. X X 2. Because the goal can be achieved only by means of what the X participants say to one another, successful communication is X important. X X 3. Because the correct solution to the cooperative problem is well X defined, successful communication can be measured in terms of X the extent to which the achieved route corresponds to its model. X X 4. Because the map route and landmarks are set in advance, the X entities referred to in the dialogues are also known to the X observer independently of what is said. 
Consequently, the X observer can determine not only whether the speaker communicated X effectively, but also whether the form of expression might have X been ambiguous or misleading under the circumstances in which it X was uttered. X X 5. Because mismatches between landmarks, their names, or their X locations on a pair of maps are easy to arrange, the X experimenter is in control of information initially shared by X participants and can alter the difficulty of the task. X X The current version of the Map Task exploits its inherent design Xfeatures via three further embellishments: X X X 1. Because the range of map landmarks is constrained only by the X ingenuity of the artist, the names of the landmarks can be X designed to be of phonological interest. X X 2. Because the pairing of participants is under the experimenters' X control, the familiarity of speakers can be varied X systematically. X X 3. Because the physical setting is arranged so that participants X can hear each other but not see each other's maps, other X channels of communication can also be controlled. In this case, X placing or removal of a small barrier permitted control of eye X contact between participants. X X X X INSERT FIGURES 1A AND 1B ABOUT HERE X X Although the present corpus is suitable for many other kinds of Xresearch, it was designed specifically to furnish a common set of materials Xfor the simultaneous study of several different linguistic phenomena. In Xthe next section, the issues which motivated the design of the corpus are Xoutlined. The third section gives a brief summary of the corpus design and Xthe method for collecting speech. Finally, there is a description of the Xresulting materials. X X X BACKGROUND X XCommunicative Success and Communication Strategies X X Our understanding of the strategies which speakers can use to achieve Xcommunicative goals is hampered by the difficulty of determining which Xstrategies are successful. 
In the Map Task, however, the overall success Xachieved by any pair of speakers is measurable in terms of the deviation Xbetween the original route found on the map of the Instruction Giver and Xthat reproduced by the Instruction Follower. To measure such route Xdeviations, a 1cm grid is used on which the route is represented by filled Xgrid squares. A deviation score in grid cells gives an objective Xnon-linguistic estimate of communicative success. With this we can Xdetermine the effects of various communicative strategies employed by the Xtask participants. X X This approach has already been used with earlier versions of the Map XTask to determine components of communicative success in young speakers X(Anderson, Clark and Mullin, 1989, 1991). To date, it has been possible to Xdemonstrate great variability in the communicative skills of young subjects Xaged between 7 and 14, and to identify several interactive skills which Xcharacterize the dialogues of the more successful communicators at all the Xages studied. These skills have to do with speakers' ability to establish Xsufficient "mutual knowledge" (Clark and Marshall, 1981) to understand one Xanother's contributions to the conversation. Any mismatch between Xlandmarks on Giver's and Follower's maps makes this requirement Xparticularly salient and gives rise to different strategies in more and Xless successful communicators. More successful communicators are Xdistinguished by the forms of referring expressions chosen to introduce new Xitems in the dialogue, the sequencing of questions and answers, the ways in Xwhich information provided by a partner is assimilated, and the ways in Xwhich communication problems are signalled and responded to. X X In each of the strategies identified, however, successful interactions Xdepend on contributions from both speakers. The Map Task makes it possible Xto study such collaborations because it generates extended and comparable Xspeaker-determined dialogues. 
The present Map Task Corpus makes it Xpossible to determine whether adults exhibit a range of communicative Xstrategies similar to those displayed by child speakers, and to examine Xvariables which affect the nature of the collaboration. X X The Map Task also allows us to explore two further aspects of the Xcollaboration: its development, and the channels through which it is Xachieved. When the same speaker is observed performing the same tasks with Xfamiliar and unfamiliar partners, some account can be given of strategies Xwhich are used generally by a given speaker and those which appear as Xunfamiliar participants build the bases of their collaboration. By Xexamining collaborations between partners who have eye contact, -- as they Xmight in the most natural situations, -- with those who have only auditory Xcontact, -- as they might over a telephone link, or as a person might have Xwith a machine, -- we can determine which of the speakers' natural Xstrategies are used successfully across situations and which are linked to Xvisual contact. X X XSpeaking and Writing X X Consistent syntactic differences between spoken and written language X(Poole and Field, 1976; Kroll, 1977; Chafe, 1982; Stubbs, 1980; Beaman, X1984; Biber, 1986, 1988; Halliday, 1989) not only require explanation, but Xalso raise the issue of what it is that linguistic theories must account Xfor. The fact that there are differences is now relatively Xuncontroversial. What is controversial is the nature of these differences. XBeaman (1984) suggests that differences in the types of data which have Xbeen analysed provide one reason for disagreements among researchers. In Xparticular, the spoken/written dimension has been confounded with Xfunctional factors such as register, purpose, and formality. The Map Task Xdialogues offer insight into factors affecting spoken language which have Xnot yet been explored elsewhere (Biber, 1988). 
X X In particular, the Map Task Corpus provides spoken language data from Xinformants of known, and fairly constant, age and educational background. XPrevious studies of spoken British English concentrate on the English Xspoken by university-educated adults (the informants who contributed to the XLondon-Lund Corpus, which Biber analyzed, or the academic speakers recorded Xby Chafe and Halliday) who are fluent users of the formal, written Xlanguage. There is good reason to believe that the everyday conversation Xof the bulk of native English speakers in Britain, who use the written Xlanguage less, differs even further from the written language. This speech Xmay be better approximated by the informal speech of undergraduate Xinformants. X X In addition, the Map Task allows us to consider to what extent a Xgenuinely co-operative communicative task affects language use. Intensive Xinvolvement with the task in hand distracts speakers' attention away from Xtheir language. It has been shown (Kary, 1981) that speech between adults Xsimultaneously collaborating in simple physical tasks (washing dishes, Xtidying a classroom) more closely resembles the `simplified' speech of the Xsame adults to young children than it resembles the adults' speech to Xadults under other circumstances. Appropriately, the Map Task dialogues Xdisplay characteristics typical of informal Scottish speech (Brown and XMiller, 1980; Macafee, 1983; Macauley, 1985). As syntactic subordination Xis replaced by discourse subordination, syntactic structure becomes Xshallower. If- and (be)cause-, and headless relative clauses occur without Xmain clauses and often serve functions quite different from those Xassociated with these forms in written or formal spoken discourse. XIndependent if-clauses, for example, function as directives. 
X X Moreover, by controlling the participants' goals and pertinent Xbackground knowledge and making them independently available to the Xobserver, the task can reveal how goal-directed spoken language achieves Xparticular goals, -- how, for example, this variety of spoken language Xallows the participants to introduce, focus on, and keep track of entities; Xhow speakers describe the location of entities and movement in relation to Xthem; how types of clauses and phrases function and combine. X X XVariability in Speech X X One of the characteristics of speech which distinguishes it from Xprinted text is that in speech no two tokens of a word are ever identical, Xeven if they are uttered by the same speaker repeating the same utterance. XVariations in the duration, amplitude, and spectral composition of spoken Xwords dictate some of the complex characteristics of linguistic, Xpsycholinguistic, and technological models of speech production and Xrecognition. Were we to understand which factors condition which Xparticular differences, we would be able to further both our models of Xhuman behavior and the related technologies. X X Although particular phonetic environments permit the phonological Xmodifications which change the form of words (Brown, 1977; Gimson, 1980; XLass, 1984), they do not determine to what extent modifications will Xoperate. Some of the determinants are syntactic (Cooper and Paccia-Cooper, X1980). Others are part of a general system whereby the communicative Xburden of a word token determines how easy it is to recognize on the basis Xof its acoustic shape. Differences between words show a relationship Xbetween length and information discovered by Zipf (Zipf, 1935; see also XBard and Anderson, 1983; Fowler and Levy, 1991): Lexical items which bear Xmore information tend to be longer. Similarly, referring expressions which Xintroduce new entities into a discourse are longer than their anaphors. 
XLonger words, those which carry more information, are easier to recognize Xin speech (Rosenzweig and Postman, 1958). The system also extends to Xdifferent pronunciations of the same words: As Lieberman (1963) showed, Xwords are given longer and more intelligible pronunciations when they occur Xin contexts which do not predict them (The number which you are about to Xhear is nine.) and shorter, less intelligible pronunciations when they are Xpredictable from context (A stitch in time saves nine.) (see also XHunnicutt, 1985; Bard and Anderson, 1983; Fowler and Housum, 1987). X X In fact, a word token appears to be more susceptible to degradation Xthe more its identity can be recovered from its context, whether linguistic Xor extra-linguistic. Shortening and loss of intelligibility accompany Xsecond, coreferential mentions in extended discourse (Fowler and Housum, X1987; Bard, Lowe, and Altmann, 1989; Bard and Anderson, 1991), reference to Xobjects visible to speaker and listener (Bard and Anderson, 1991), and Xinformal or close relationships between speakers (Bard and Anderson, 1983; XMcAllister, Sotillo, and Bard, 1991). X X Phonological modifications appear to create some of the degradation Xeffects. Word boundary phonological processes may affect spectral Xcomposition (for instance the assimilation of alveolar to bilabial nasals : Xphone book, [foun # bUk], becomes [foumbUk]) or reduce duration (list some, X[lIst # sVm] becomes [lIsVm]), creating tokens which are segmentally unlike Xthe careful citation form of the word. Certain of these modifications, Xlike reduced intelligibility, characterize parents' speech to their Xchildren (Shockey and Bond, 1980) and spontaneous rather than read speech X(Shockey, 1983). X X Interestingly, because these effects are part of the speaker's means Xfor transmitting a message to the listener, they occur only in speech which Xconveys meaning. 
The effects of repetition, for example, range from no Xduration loss in word lists, through a significant effect in readings of Xtranscribed spontaneous speech, to a stronger effect in the original Xspontaneous speech itself (Fowler, 1988). X X Fertile materials for more detailed studies of these phenomena will Xtherefore consist of spontaneous conversations in which it is possible to Xcontrol the relationship between the participants and the linguistic and Xextra-linguistic information available to each. The Map Task provides such Xmaterials. Landmark names can be selected so as to offer sites for Xphonological reductions or assimilations. Participants may read out lists Xof the landmark names at the end of the task sessions for comparison with Xtokens in running speech. Critical landmarks may be excluded from one or Xother map to test the effects of visual support for lexically conveyed Xinformation. Participants may be old friends or strangers. The same Xparticipants may be observed in several cells of the design or at various Xpoints in a dialogue so that individual differences need not confound the Xother comparisons. X X XConversational Structure and Intonation X X It is an aim of much semantic and pragmatic research to relate Xspeakers' intentions to particular linguistic devices used to express those Xintentions. This enterprise can all too easily become circular unless Xintentions can be determined independently of the linguistic means used to Xexpress them. In the Map Task, however, we can use the maps and the Xinformation already conveyed in dialogue to assess both the participants' Ximmediate goals and their state of knowledge when an utterance is made. XFor this reason we are usually able to say with some confidence what the Xspeaker's purpose was in producing the utterance. 
While we can look for Xindications of the conversational role of an utterance at other linguistic Xlevels, as well, we are particularly interested in pursuing the hypothesis X(Houghton and Isard, 1987; Houghton and Pearson, 1988) that the Xconversational role of an utterance is reflected in its intonational tune, Xthat is, that the intonation will help us determine not only what a speaker Xmeans by an utterance but also what s/he means to accomplish by it. X X To carry out such a program, we need a formal account of Xconversational structure, and we have adopted an analysis (Kowtko, Isard, Xand Doherty, 1991) which builds on the work of Houghton (1986), Houghton Xand Isard (1987), and Power (1979), and views conversations as consisting Xof conversational games, which may nest and loop, and within which Xparticipants make conversational moves. X X Both Power and Houghton attempted to produce a theory of how Xnon-linguistic goals give rise to conversation. The theory was presented Xin terms of an AI model in which a pair of "robots" conversed in order to Xachieve simple goals which neither was capable of achieving on its own. In XHoughton's model, conversation was integrated with the rest of the robots' Xcapacities for interacting with their world. The robots knew that Xsuccessful conversational games could result in the transfer of information Xor the performance of some non-linguistic act by their partner. A robot Xwanting to know whether a door was bolted could consider looking at the Xdoor, pushing it, asking Fred if it was bolted, or even asking Fred to push Xthe door as an alternative means of finding out. Here "asking Fred" and X"asking Fred to push the door" are used as shorthand for "engaging Fred in Xthe FIND_OUT conversational game" and "engaging Fred in the GET_DONE Xconversational game". The actual asking is only the first move in the Xrelevant game. 
The significance of seeing asking as part of a game lies in Xthe askee knowing what sort of response is required, in the asker knowing Xwhat to make of the response, and in both players knowing that getting the Xresponse is the point of starting the game in the first place. X X The robot conversations generated in this way were sensible and Xcoherent, if somewhat stilted. The Map Task can be viewed as a form of the Xrobots' door problem interesting enough for human participants, and we have Xextended the repertoire of games and moves used by the robots to cover new Xconversational forms that arise in Map Task dialogues. For instance, Map XTask participants often seek to confirm their understanding before Xlaunching a new game ("So you're just below the swamp, right?"), whereas Xthe robots had no such tactic in their repertoire. We have begun assessing Xthe adequacy of the extended analysis for the map dialogues (Kowtko, Isard Xand Doherty, 1991). X X The purpose of such an analysis is to make it possible to test a Xtheory of intonational pragmatics in which the role of an intonational tune Xis to signal purpose in the form of the move being made in a conversational Xgame. We might, for example, expect to find a clear distinction in Xintonation between making one of the small number of scripted moves which Xthe structure of a game provides at a given point (e.g., giving a direct Xanswer to a question that has just been posed) and making an (equally Xlegitimate but) unscripted move which interrupts, abandons or otherwise Xmodifies the participants' shared understanding of the current game. XUltimately, game and move structure should offer more accurate predictions Xof intonational patterns, and accordingly, the possibility of more useful Xinterpretation, than an account of sentence types alone. X X X METHOD X XMaterials and Design X X Materials. The materials consisted of 16 pairs of maps, four pairs Xfollowing each of four different basic plans.
The plans were devised on Xgrids much like the one used for scoring to provide routes of roughly equal Xcomplexity. Each pair of maps included an Instruction Giver's map, like Xthe one reproduced in Figure 1a, which showed the intended route, and the XInstruction Follower's map, like Figure 1b, which did not. The four Xmap-pairs based on any given plan differed in the particular landmarks Xwhich occupied the landmark positions imposed by the plan. All landmarks Xwere portrayed as line drawings and all were labelled with their intended Xnames. All the maps were reproduced on A3 paper. X X Landmark types. All map routes began with a starting point marked in Xthe same way on Instruction Giver's and Instruction Follower's maps and Xended with a finishing point marked only on the Giver's. Intermediate Xlandmarks along the route alternated between those that were common to the XGiver's and Follower's map, that is, identical in name, form, and location Xfor both (see seven beeches in Figure 1a and 1b), and those that differed Xin some way. Of the landmarks not common to both, each map contained at Xleast one of each of the following types. X X - Absent/Present landmarks were found on the Follower's map but not X the Giver's (blacksmith in Figure 1b). X X - Name Change landmarks have different names but identical forms X and locations on the two maps (reclaimed fields in Figure 1a as X opposed to old flood plain in Figure 1b). X X - The 2:1 landmarks appear twice on the Giver's map, once in a X position close to the route and once far away (vast meadow in X Figures 1a and 1b). The Follower has only the one far away from X the route. X X - Finally, each basic plan was associated with a Contrast Feature, X a pair of landmarks with similar names which might elicit X contrastive pronunciations (Green Bay and Crane Bay). 
Over the X four map pairs for each basic plan, all the possible combinations X of matching and mismatching contrast were represented: Giver's X and Follower's maps have both members of the contrast, both maps X have only one member, Giver's has both and Follower's one, X Giver's has one and Follower's both. X X In addition, each map contained an Odd Man Out, a landmark which was Xalien to the stereotypical location to which all the other landmarks might Xeasily belong (crashed spaceship in Figure 1b), as well as a number of Xother landmarks located at a distance from the route. X X Phonological Modifications. Landmarks close to the route included Xthose whose names offer sites for one or other of the following Xmodifications: t-deletion (vast meadow may be pronounced [vAs # mEdou]); Xd-deletion (reclaimed fields pronounced as [riklEim # fildz]); Xglottalisation (whitewashed cottage as [wAi?wQSt # kQtIdZ]); nasal Xassimilation (crane bay as [krEim # bEi]). Each type of phonological Xmodification provided the Contrast Feature for one of the basic map plans Xand over the four map-pairs based on each plan, each type of modification Xwas applicable to one Common landmark, to each sort of mismatch, and to the XOdd Man Out. Except for the Contrast Feature, all the other examples of Xphonological modifications changed from map to map, giving altogether 22 Xlexically different sites for each of the modifications. X X Other landmarks on the route included polysyllabic names the first two Xsyllables of which were either Strong-weak (chapel) or Weak-strong X(attractive cliffs). Landmarks elsewhere on the maps varied in character. X X Design. Each Subject was recruited with a Familiar partner who knew Xhim/her well and tested in coordination with another pair of Subjects who Xwere Unfamiliar to him/her. 
The two pairs formed a quadruple of Subjects Xwho used among them a different set of four map-pairs, one from each of the Xbasic plans, and one from each of the Contrast conditions. Maps were Xassigned to quadruples by Latin Square. X X Every Subject participated in the Map Task four times, twice as XInstruction Giver, twice as Instruction Follower, once in each case with Xhis or her Familiar partner, once with an Unfamiliar partner. As XInstruction Giver, each Subject used the same map twice; as Follower s/he Xused a different map each time. Half of the Subjects gave instructions to Xa Familiar partner first, half to an Unfamiliar. Half the Subjects Xperformed all four tasks while able to see the other participant's face X(With Eye Contact), half while unable to do so (Without Eye Contact). X X Thus Subjects were nested in pairs, in Eye Contact condition, in XFamiliarity Order, and crossed with Giver/Follower role, and Familiarity. XBasic map plans were crossed with Contrast condition, Eye Contact, XFamiliarity, and Familiarity Order. Individual map-pairs (and, therefore, Xindividual landmark names) were nested in basic plans but crossed with XFamiliarity, Familiarity Order, and Eye Contact. With 64 subjects in all, Xthis design allowed opportunities for different speakers to utter each of Xthe landmark names offering a phonological modification site: Each of the Xfour more frequent Contrast landmark names was available to 16 different XInstruction Givers and 32 Followers; each of the four less frequent XContrast landmark names to 12 Givers and 24 Followers; for each potential Xphonological modification 20 lexically different landmark names were Xavailable to each of 4 Givers and 8 Followers. X X XProcedure X X Subject pairs belonging to a quadruple were isolated from one another Xbetween recordings. Subjects were tested in a recording studio facing one Xanother across a pair of drawing boards arranged back to back, which hid Xeach one's map from the other. 
Subjects in the With Eye Contact condition Xcould see one another's faces over the drawing boards; an additional Xbarrier prevented this in the No Eye Contact condition. Subjects were told Xthat the goal of the task was to enable the Giver's route to be drawn on Xthe Follower's map, that the Giver's and Follower's maps might be different Xin some respects, and that both participants could say whatever was Xnecessary to complete the task, but that neither could use gestures. X X After finishing all their Map Task problems, each Subject read aloud a Xlist of all the landmark names used in the maps employed by his/her Xquadruple and a short list of sentences used as accent diagnostics (Barry, XHoequist, and Nolan, 1989). All the dialogues were orthographically Xtranscribed. X X All materials were digitally recorded on DAT (Sony DTC1000ES) using Xone Shure SM10A close-talking microphone and one DAT channel per Xparticipant. Split-screen video recordings were also made. X X XSubjects X X Sixty-four undergraduates at the University of Glasgow (32 male and 32 Xfemale) took part. Subjects had known their partners for various lengths Xof time before the recordings were made, with an average of 2 years and a Xrange from 6 months to a lifetime. Subjects' ages ranged from 17 to 30, Xwith a mean of 20 years. Of the Familiar pairs, 13 were all female, 13 all Xmale and 6 female-male. Sixty-one of the 64 subjects were Scottish, 56 of Xthem from within a 30 mile radius of the center of Glasgow. The remaining Xsubjects were English (2) or American (1). X X X CHARACTERIZING THE CORPUS X XStandard Scottish English X X Because the Standard Scottish English of Glasgow and the surrounding Xarea is so predominant in the corpus, it may be worthwhile explaining how Xit differs from more widespread varieties of English. 
Because more Xdetailed accounts of SSE are readily available (Abercrombie, 1979; Aitken, X1984; Macafee, 1983), only a few notes on characteristics particularly Xpertinent to the Map Task Corpus will be offered here. X X Phonologically, Standard Scottish English (hereafter SSE) is treated Xas a variety of Northern English. The segmental phonology of SSE lacks Xcontrasts possessed by British Received Pronunciation (RP) and has Xcontrasts lacking in RP. Where RP has different vowels in the pairs Xbad/balm, not/nought, pull/pool, SSE has only one vowel for each pair: Xrespectively /a/, /O/, and /u/. But in certain instances where RP has one Xvowel, SSE has several: /Vi/ in side v. /ae/ in sighed; /I/ in bird v. X/V/ in word, and /E/ in heard; and /O/ in cord v. /o/ in board. Whereas XRP does not distinguish where from wear, in SSE these are a minimal pair, Xcontrasting /X/ with /w/. X X The realizations of individual phonemes also differ from RP: /u/ is Xtypically realized as a centralized and even fronted vowel; /I/ is lower Xthan in RP and in many local varieties has been lowered to /V/; /E/ as in Xbed is generally higher than in RP, and a small number of words - such as Xnever, ever, seven - have a vowel that is higher than /E/ and slightly Xretracted; /r/ is realized as a tap, a retroflex fricative or a retroflex Xapproximant; /l/ is typically velarized; post-tonic /t/ is often realised Xas a glottal stop. In Glaswegian SSE, /p/ and /k/ may also be realised as Xglottal stops in reduced forms. X X The syntax of SSE diverges from that of Standard English, sharing Xfeatures with the English of Northern Ireland and with the English of the Xsouthern United States (for greater detail, see Miller, 1992, or Wilson, X1915). In general, the incidence of SSE syntax decreases for any Xindividual in direct proportion to exposure to written English and formal Xeducation. 
In the current Corpus, non-standard forms are rare, although Xthe shallow syntactic structures described earlier are much in evidence. XThe following are the most notable SSE forms in the Corpus: negation Xindicated by no or not ("I've no/t got a castle on my map") and cliticized Xas -nae ("The route doesnae go past the castle"); a general purpose tag eh, Xroughly equivalent to American English huh ("The route goes past the Xcastle, eh?"); whereabout(s) or where....about as the equivalent of where X("Where is the bridge about?"). X X In all, the Corpus contains few words which will be unfamiliar to Xspeakers of other varieties of English. Out of a total of nearly 2000 word Xtypes in the orthographic transcription of the Corpus, only 17 Scottish Xword forms were rejected by the UNIX spell dictionary and these account Xtogether for only 46 word tokens. The words are cliticized forms with X-nae, cannae, didnae, doesnae, havenae, wasnae, wouldnae; other closed Xclass items, nae ("no, not"), doon ("down"), fae ("from"), gonnae ("going Xto"), och ("oh"), roond ("around"), tae ("to"), thae ("those"); and coo X("cow"), stramash ("mess"), and totty ("tiny"). It should be understood, Xhowever, that the orthographic distinction between Scots words and their Xsimilar English counterparts, like doon/down or fae/from, is not completely Xsystematic. Either spelling may correspond to a range of pronunciations. X X X CORPUS STATISTICS X X The corpus consists of digital recordings of 128 unscripted dialogues X(approximately 15 hours of dialogue) and 64 lists of landmark names. Of Xthe 128 dialogues, 32 belong to each of four categories: familiar speakers Xwith eye contact, familiar speakers without eye contact, unfamiliar Xspeakers with eye contact, unfamiliar speakers without eye contact. All Xdialogues have been transcribed verbatim in the standard orthography, but Xwith indication of filled pauses, false starts, repetitions, interruptions Xetc. 
in a variant of standard mark-up notation. Table 1 reports the Xamounts of material involved. X X X INSERT TABLE 1 ABOUT HERE X X X X PROSPECTS X X The Map Task Corpus is already in use for studies outlined earlier, in Xparticular for studies which relate the concerns of several fields. XAlthough our own work is still very much in progress, the corpus is Xavailable to other researchers. National and international data collection Xexercises are an indication of the need felt by the speech and language Xcommunities for extensive machine-readable corpora. Moreover, there is a Xparticular value in using the same materials for many different kinds of Xresearch: like the participants in the Map Task, the more sure we can be Xthat we are talking about the same things, the more we are likely to reach Xour mutual goals(Note 1). X X X X X FOOTNOTES X X X X X REFERENCES X XABERCROMBIE, D. (1979). The accents of Standard English in Scotland. In X A. J. Aitken and T. McArthur (eds.), Languages of Scotland (pp. X 68-84). Edinburgh: Chambers. X XAITKEN, A. J. (1984). Scottish accents and dialects. In P. Trudgill X (ed.), Language in the British Isles (pp. 94-114). Cambridge, U. K.: X Cambridge University Press. X XANDERSON, A., CLARK, A., and MULLIN, J. (1989). The development of X referential communication skills: Interactions between speakers and X listeners in extended dialogues. Paper presented at the 3rd EARLI X Conference, Madrid. X XANDERSON, A., CLARK, A., and MULLIN, J. (1991) Introducing information in X dialogues: how young speakers refer and how young listeners respond. X Journal of Child Language. 18, 663-687. X XBARD, E. G., and ANDERSON, A. (1983). The unintelligibility of speech X addressed to children. Journal of Child Language, 10, 265-292. X XBARD, E. G., and ANDERSON, A. (1991). The unintelligibility of speech to X children: effects of referent availability. Proceedings of the XIIth X International Congress of Phonetic Sciences, 4, 458-461. 
Aix-en- X Provence, France. X XBARD, E. G., LOWE, A., and ALTMANN, G. (1989). The effect of repetition on X words in recorded dictation. Eurospeech '89: Proceedings of the X European Conference on Speech Communication and Technology, 2, X 573-576. X XBARRY, W.J., HOEQUIST, C.E., and NOLAN, F.J. (1989). An approach to the X problem of regional accent in automatic speech recognition. Computer X Speech and Language, 3, 355-366. X XBEAMAN, K., (1984). Coordination and subordination revisited: syntactic X complexity in spoken and written narrative discourse. In D. Tannen X (ed.), Coherence in Spoken and Written Discourse (pp.45-80). Norwood, X NJ: Ablex. X XBIBER, D. (1986). Spoken and written textual dimensions in English: X resolving the contradictory findings. Language, 62, 384-414. X XBIBER, D. (1988). Variation Across Speech and Writing. Cambridge, U.K.: X Cambridge University Press. X XBOYLE, E. A. (1990). User's Guide to the HCRC Dialogue Database. HCRC X Internal Publication, Human Communication Research Centre, University X of Edinburgh. X XBROWN, G. (1977). Listening to Spoken English. London: Longman. X XBROWN, G., ANDERSON, A., YULE, G., and SHILLCOCK, R. (1983). Teaching X Talk. Cambridge, U. K.: Cambridge University Press. X XBROWN, E. K., and MILLER, J. E. (1980). The Syntax of Scottish English. X Final Report to SSRC(UK), Project No. 5152. X XCHAFE, W. L., (1982). How people use adverbial clauses. In C. Brugman and X M. Macaulay (eds.), Proceedings of the tenth annual meeting of the X Berkeley Linguistics Society (pp. 437-439). Berkeley: Berkeley X Linguistics Society. X XCLARK, H. H., and MARSHALL, C. R. (1981). Definite reference and mutual X knowledge. In A. K. Joshi, B. L. Webber and I. A. Sag (eds.), X Elements of Discourse Understanding (pp. 10-63). Cambridge, U. K.: X Cambridge University Press. X XCOOPER, W. E., and PACCIA-COOPER, J. (1980). Syntax and Speech. X Cambridge, MA: Harvard University Press. X XFOWLER, C. A. (1988). 
Differential shortening of repeated content words X produced in various communicative contexts. Language and Speech, 31, X 307-320. X XFOWLER, C. A., and LEVY, E. (1991). Some ways in which forms arise from X functions in linguistic communications. Proceedings of XIIth X International Congress of Phonetic Sciences, 1, 279-82. Aix-en- X Provence, France. X XFOWLER, C. A., and HOUSUM, J. (1987). Talkers' signalling of `new' and X `old' words in speech and listeners' perception and use of the X distinction. Journal of Memory and Language, 26, 489-504. X XGIMSON, A. C. (1980). An Introduction to the Pronunciation of English. X London: Edward Arnold. X XHALLIDAY, M. A. K. (1989). Spoken and Written Language. Oxford: Oxford X University Press. X XHOUGHTON, G. (1986). The Production of Language in Dialogue: A X Computational Model. Unpublished Ph.D. Thesis, University of Sussex. X XHOUGHTON, G., and ISARD, S. D. (1987). Why to speak, what to say and how X to say it. In P. Morris (ed.), Modelling Cognition (pp. 249-267). X London: Wiley. X XHOUGHTON, G., and PEARSON, M. (1988), The Production of Spoken Dialogue, in X M. Zock and G. Sabah (eds.), Advances in Natural Language Generation: X An Interdisciplinary Perspective, Vol. 1 (pp. 112-130). London: X Pinter Publishers. X XHUNNICUTT, S. (1985). Intelligibility versus redundancy -- conditions of X dependency. Language and Speech, 28, 47-56. X XKARY, A. (1981). Motherese without the child. Paper presented at the Child X Language Seminar, Edinburgh, April, 1981. X XKOWTKO, J., ISARD, S. D., and DOHERTY, G. (1991). Conversational games X within dialogue. Proceedings of the ESPRIT Workshop on Discourse X Coherence (pp. 169-180). Edinburgh, U. K. X XKROLL, B. (1977). Combining ideas in written and spoken English. In X E. O. Keenan and T. L. Bennett (eds.), Discourse across Time and Space X (pp. 69-108). Southern California Occasional Papers in Linguistics, X Vol. 5. X XLASS, R. (1984). 
Phonology: An Introduction to Basic Concepts. Cambridge, X U. K.: Cambridge University Press. X XLIEBERMAN, P. (1963). Some effects of semantic and grammatical context on X the production and perception of speech. Language and Speech, 6, X 172-175. X XMACAFEE, C. (1983). Glasgow. In the series Varieties of English Around X the World. Amsterdam: John Benjamins. X XMACAULAY, R. (1985). The narrative skills of a Scottish coal miner. In X M. Görlach (ed.), Focus on Scotland (pp. 111-124). In the series X Varieties of English around the World. Amsterdam: John Benjamins. X XMCALLISTER, J. M., SOTILLO, C., and BARD, E. G. (1991). The effect of X addressee familiarity on word duration. Proceedings of the XIIth X International Congress of Phonetic Sciences, 4, 426-429. Aix-en- X Provence, France. X XMCALLISTER, J. M., SOTILLO, C., BARD, E. G., and ANDERSON, A. H. (1990). X Using the map task to investigate variability in speech. Occasional X Paper. Department of Linguistics, University of Edinburgh. X XMILLER, J. E. (1992). Scottish English. In J. Milroy and L. Milroy X (eds.), Non-Standard English in Britain. London: Longman. X (forthcoming) X XPOOLE, M. E., and FIELD, T. W. (1976). A comparison of oral and written X code elaboration. Language and Speech, 19, 305-311. X XPOWER, R. (1979). The organization of purposeful dialogues. Linguistics, X 17, 107-152. X XROSENZWEIG, M. K., and POSTMAN, L. (1958). Frequency of usage and the X perception of words. Science, 127, 26-36. X XSHOCKEY, L. (1983). Phonetic and Phonological Properties of Connected X Speech. Ohio State Working Papers in Linguistics. X XSHOCKEY, L., and BOND, Z. S. (1980). Phonological processes in speech X addressed to children. Phonetica, 37, 267-274. X XSTUBBS, M., (1980). Language and Literacy: The Sociolinguistics of X Reading and Writing. London: Routledge and Kegan Paul. X XWILSON, J. (1915). Lowland Scotch as Spoken in the Lower Strathearn X District of Perthshire. London: Oxford University Press. 
X
XZIPF, G. (1935).  The Psycho-Biology of Language.  Cambridge, MA: MIT
X     Press.
X
X
X
X
X               Table 1.  Map Task Corpus statistics
X
X
X
X                                   Total        With         Without
X                                                Eye Contact  Eye Contact
X
X
XNumber of conversations            128          64           64
X
XNumber of word types               1,939        1,489        1,469
X
XNumber of word tokens              146,855      66,729       80,126
X
X     Instruction Giver                          46,629       54,665
X
X     Instruction Follower                       20,100       25,461
X
XAverage word tokens per
Xconversation                       1,147        1,043        1,252
X
XTranscription size (kbytes)        987          449          538
X
XDigitized speech size (kbytes)     6500
XDuration (hours)                   20           10           10
X
X     Dialogues                     15
X
X     Word lists and
X     diagnostics                   5
X
X
X
X
X
X                         FIGURE TITLES
X
X     Figure 1.  Samples of maps used in the Map Task
X
X          a.  Instruction Giver's map
X
X          b.  Instruction Follower's map
X
X
X
X                             NOTES
X
X     (1) The corpus is available on CD-ROMs and consists of spoken
Xdialogues, accent diagnostics, read word lists, time-stamped orthographic
Xtranscription of the dialogues, and basic documentation.  Please address
Xenquiries to the correspondence address for this paper and mark them "Map
XTask Corpus Distribution".  Also available from this address is more
Xdetailed documentation on the Glasgow HCRC Database, which includes Map
XTask transcriptions (Boyle, 1990), and on the design of the Map Task Corpus
Xitself (McAllister, Sotillo, Bard, and Anderson, 1990)
X

X
X
END_OF_../cdtree/template/doc/lnspeech.sgm if test 46291 -ne `wc -c <../cdtree/template/doc/lnspeech.sgm`; then echo shar: \"../cdtree/template/doc/lnspeech.sgm\" unpacked with wrong size! fi # end of overwriting check fi echo shar: End of shell archive. exit 0 From corpora-request@uib.no Fri May 14 17:11:39 1993 From: Peters W Date: Fri, 14 May 93 15:58:27 BST To: corplst@hd.uib.no Dear Netters, Could any of you tell me if there exists a tagged version of the London-Lund corpus? Thanks in advance, Wim Peters -------------------------------------------------------------- W.T.M. Peters Tel. +44 206 872092 CL/MT group Fax: +44 206 872085 Dept. of Language & Linguistics Email: wim@essex.ac.uk University of Essex Wivenhoe Park Colchester, CO4 3SQ United Kingdom From corpora-request@uib.no Fri May 14 19:54:46 1993 From: Peters W Date: Fri, 14 May 93 18:13:17 BST To: CORPORA@hd.uib.no Dear Netters, Could any of you tell me if there exists a tagged version of the London-Lund corpus? Thanks in advance, Wim Peters -------------------------------------------------------------- W.T.M. Peters Tel. +44 206 872092 CL/MT group Fax: +44 206 872085 Dept. of Language & Linguistics Email: wim@essex.ac.uk University of Essex Wivenhoe Park Colchester, CO4 3SQ United Kingdom From corpora-request@uib.no Sat May 15 05:25:29 1993 Date: Sat, 15 May 1993 10:25:29 -0500 From: Yuangshan Chuang To: corpora@hd.uib.no Subject: Part-of-speech taggers. Dear Colleagues: I am interested in any information concerning part-of-speech taggers for Spanish, French, German, English, Japanese, and Chinese. Please be kind to let me know where I can buy POS taggers. Thank you a lot for your attention and kindness. Sincerely, I would like to wish you happiness. 
Sincerely, Yuangshan Chuang 5-15-1993 From corpora-request@uib.no Mon May 17 12:26:20 1993 Date: Mon, 17 May 93 10:26:20 +0200 From: nnshi01@mailserv.zdv.uni-tuebingen.de (Erhard Hinrichs) To: corpora@hd.uib.no Subject: job announcement The Computational Linguistics research unit in the Department of Linguistics at the University of Tuebingen (Federal Republic of Germany) invites applications for a research position at the level of Wissenschaftlicher Mitarbeiter. An M.A. in linguistics, computational linguistics or a related field is required, a Ph.D. is preferred. The position is becoming available on July 1st, 1993 and is currently funded until December 1994, although a renewal for an additional term is possible. Candidates with knowledge of German and with research experience in one or more of the following areas are particularly encouraged to apply: the design and construction of lexical knowledge bases morphological parsing corpus linguistics Programming experience in Prolog, LISP or C is desirable. Interested persons should send letter of application, curriculum vitae, names of 2 referees, and one representative publication to: Prof. Erhard W. Hinrichs Seminar fuer Sprachwissenschaft Abt. Allg. Sprachw./Computerlinguistik Universitaet Tuebingen Kleine Wilhelmstr. 113 D-W-7400 Tuebingen Federal Republic of Germany For full consideration, applications should be received by May 30, 1993. From corpora-request@uib.no Wed May 19 06:54:38 1993 Date: Wed, 19 May 93 13:54:38 -0700 From: steveng@cogsci.Berkeley.EDU (Steve Greenberg) To: corpora@hd.uib.no The Department of Linguistics, University of California at Berkeley, is looking for an exceptionally qualified individual to assume technical supervision of its Instructional Multimedia and Phonology Laboratories. The successful candidate will work with Professors John Ohala and Steven Greenberg (Director) and a group of approximately ten graduate students. Please contact Steven Greenberg for further information. 
Department of Linguistics, 2337 Dwinelle Hall, University of California, Berkeley, CA 94720, (510) 643-7620; 642-4938; steveng@cogsci.berkeley.edu Completed UC-Berkeley Employment Applications must be received by the campus Personnel Office no later than June 2, 1993. Application forms are available from: Personnel Office, 2200 University Avenue, Room 7G, University of California, Berkeley, CA 94720 (510) 642-1011. 05-203-20I/CP Programmer/Analyst III - Supervisor (A & PS 5) Instructional Multimedia and Phonology Laboratories, Department of Linguistics, Supervisory Computer Systems Position - 100% time, starting July 1, 1993. Salary range: $ 41,500 - 62,300 (depending on experience and qualifications. Supervise and coordinate computer systems operation and other technical resources of the Phonology and Instructional Multimedia Laboratories. Design and supervise development of software/hardware for conducting speech perception and production experiments and for the analysis of speech sounds using RC,S UNIX shell and program-specific scripts. Develop and program instructional software for Linguistics curricula. Plan and document strategies for efficient analysis of large corpora of speech data. Develop signal processing and data collection applications for speech projects. Assist graduate students in the conduct of speech research projects. Develop and implement strategies for integrating a heterogeneous computational environment composed of Macintosh, PC and Sun computers into a coherent network based on Ethernet. Maintain computer systems by conducting regular archival backups, installing or upgrading applications, creating new user accounts, file-system organization, customization of operating environment through development of shell scripts and window/menu design. Evaluate and recommend software and computer hardware for acquisition. Qualifications: Expertise in speech acoustics and perception essential. 
Comprehensive knowledge of digital signal processing required. Experience with digital audio also essential, as is detailed knowledge of UNIX and attendant operating system components (such as shell scripting and X windows). Specific experience with Sun computers and Entropic Software (Waves+ and ESPS) highly desirable. Must be proficient in RCS programming language and capable of learning program-specific scripting languages. Experience with Macintosh operating system and Hypercard development required, as is extensive experience with DOS and Windows on PCs. Experience pertaining to networking multiplatform computers (Macintosh, IBM-compatible PCs and Sun workstations) using TCP/IP and ether talk highly desirable. Familiarity with Sun, Macintosh and PC hardware extremely useful. Must possess exceptional communication and pedagogical skills, be well organized, energetic and committed to undergraduate/graduate education. From corpora-request@uib.no Wed May 19 21:25:01 1993 Date: Wed, 19 May 93 21:25:01 GMT From: ingria@BBN.COM To: corpora@hd.uib.no Subject: Seeking Information on Tag Sets for Languages Other than English I am familiar with many of the part-of-speech tag sets for English (e.g. Brown, UPenn Treebank, LOB, etc.) However, I need information about equivalent tag sets for languages other than English. I would appreciate any descriptions, or pointers to published descriptions, of such tag sets. Thanks in advance. -30- Bob From corpora-request@uib.no Thu May 20 18:06:04 1993 From: Ms L Al-Sulaiti Subject: seeking information on Arabic corpora To: corpora@hd.uib.no Date: Thu, 20 May 1993 17:06:04 +0100 (BST) I would like to know if there are any Arabic corpora available. It doesn't matter whether they are based on spoken or written texts. 
Many thanks From corpora-request@uib.no Thu May 20 08:00:01 1993 From: Dan Conde To: CORPORA@hd.uib.no Date: Thu, 20 May 93 15:00:01 PDT Subject: Non-English tagged corpora I am looking for tagged corpora for languages other than English: primarily Japanese, French and German. A part-of-speech tagger and a corpus would be useful as well. Thank you, Daniel Conde Microsoft Corporation. Internet: danco@microsoft.COM From corpora-request@uib.no Thu May 20 18:40:13 1993 Date: Thu, 20 May 1993 23:40:13 -0500 From: Yuangshan Chuang To: CORPORA@hd.uib.no, danco@microsoft.com Subject: Re: Non-English tagged corpora Dear Daniel: I have received quite a few pieces of information concerning part-of-speech taggers. I will try to finish editing them as soon as possible. Then I will have them sent to the discussion group of CORPORA. Here, I would like to thank all the contributors whose efforts will benefit us a lot. Sincerely, Yuangshan Chuang 5-20-1993 From corpora-request@uib.no Fri May 21 08:19:03 1993 Date: Fri, 21 May 1993 13:19:03 -0500 From: Yuangshan Chuang To: corpora@hd.uib.no From corpora-request@uib.no Sat May 15 10:41:45 1993 Received: from alf.uib.no by uxa.cso.uiuc.edu with SMTP id AA28608 (5.67a/IDA-1.5 for ); Sat, 15 May 1993 10:41:37 -0500 Received: from nora.hd.uib.no by alf.uib.no with SMTP (PP) id <13506-0@alf.uib.no>; Sat, 15 May 1993 17:26:31 +0200 Received: from uxa.cso.uiuc.edu by nora.hd.uib.no with SMTP id AA17522 (5.65c8/IDA-1.4.4 for ); Sat, 15 May 1993 17:29:42 +0200 Received: by uxa.cso.uiuc.edu id AA27157 (5.67a/IDA-1.5 for corpora@hd.uib.no); Sat, 15 May 1993 10:25:29 -0500 Date: Sat, 15 May 1993 10:25:29 -0500 From: Yuangshan Chuang Message-Id: <199305151525.AA27157@uxa.cso.uiuc.edu> To: corpora@hd.uib.no Subject: Part-of-speech taggers. Status: R Dear Colleagues: I am interested in any information concerning part-of-speech taggers for Spanish, French, German, English, Japanese, and Chinese. 
Please be kind to let me know where I can buy POS taggers. Thank you a lot for your attention and kindness. Sincerely, I would like to wish you happiness. Sincerely, Yuangshan Chuang 5-15-1993 From resnik@unagi.cis.upenn.edu Sat May 15 10:54:46 1993 Received: from LINC.CIS.UPENN.EDU by uxa.cso.uiuc.edu with SMTP id AA29667 (5.67a/IDA-1.5 for ); Sat, 15 May 1993 10:54:44 -0500 Received: from UNAGI.CIS.UPENN.EDU by linc.cis.upenn.edu id AA09099; Sat, 15 May 93 11:55:38 -0400 Received: by unagi.cis.upenn.edu id AA15187; Sat, 15 May 93 11:55:37 EDT Date: Sat, 15 May 93 11:55:37 EDT From: resnik@unagi.cis.upenn.edu (Philip Resnik) Posted-Date: Sat, 15 May 93 11:55:37 EDT Message-Id: <9305151555.AA15187@unagi.cis.upenn.edu> To: ycg9915@uxa.cso.uiuc.edu In-Reply-To: Yuangshan Chuang's message of Sat, 15 May 1993 10:25:29 -0500 <199305151525.AA27157@uxa.cso.uiuc.edu> Subject: Part-of-speech taggers. Status: R Hello, Would you please forward me any useful information you receive on part of speech taggers, or post a summary to the corpora mailing list? Thanks very much, Philip resnik@linc.cis.upenn.edu From miles@minster.york.ac.uk Sat May 15 17:31:38 1993 Received: from minster.york.ac.uk by uxa.cso.uiuc.edu with SMTP id AA17835 (5.67a/IDA-1.5 for ); Sat, 15 May 1993 17:31:34 -0500 From: miles@minster.york.ac.uk Date: Sat, 15 May 93 23:21:50 Message-Id: To: ycg9915@uxa.cso.uiuc.edu Subject: taggers Status: R Here's some info I collected about lexical taggers a month or so back. Miles ************************************************************* Hello all. A while back I put out a call asking about lexical taggers. Here are the responses that I received (-the only pd one I found is at Xerox). Sorry about the delay, I've been out of the country for a while. miles@minster.york.ac.uk "All is vanity". 
Department of Computer Science York University York YO1 5DD ********************************************************************** >From prangana@stern.nyu.edu Mon Mar 22 10:20:07 EST 1993: Ken Church (references abound) has written one. He's at Bell Labs and his email address is kwc@research.att.com. When last I heard, his work was not available outside Bell Labs. The people at UPenn (work on the Penn Treebank) use a version of his tagger which should be freely available to you. I forget the precise names of the people involved. I suppose you are also aware of the work on the LOB corpus, and its associated tagger. I'd appreciate a copy of your findings. --Nicky -- Nicky Ranganathan nicky@rnd.stern.nyu.edu Information Systems Dept. (212) 998-0838 Stern School of Business New York University *********************************************************************** >From David.Elworthy@cl.cam.ac.uk Fri Mar 19 14:39:50 +0100 1993 In article <732234355.4046@minster.york.ac.uk>, you write: |> Does anyone know where I could get hold of a lexical tagger? Hi Miles! I have recently developed a tagger which implements the Viterbi and Forward-Backward algorithms, together with Baum-Welch re-estimation. We (i.e. me and Ted Briscoe) have sold a copy of this already, at \pounds 1200. The sale has largely been orchestrated by Ted, who is away at present, but it is likely we could sell you a copy at a similar price. It is written in ANSI C and is fast - on a HP 9000/710, it tags 700 words per second on a typical corpus (though it can be a lot slower). The program is supplied as source code, so you can modify it if you decide you want to make changes for specific fiddles and tweaks. The copyright is retained by me, however. Drop me a line if you are interested. However, if you do decide you want it, I will have to know by the end of April at the latest (for reasons not worth detailing). 
-- David Elworthy
***********************************************************************
>From suthers+@pitt.edu Fri Mar 19 13:34:00 0500 1993:

I sent a similar query out recently, for an NP extractor/tagger. Replies so far:
---------------------------------------------------------------------------
From: KROVETZ@cs.umass.EDU

Hi Dan,

As far as I know, this doesn't exist in the public domain. Ken Church's tagger has a noun-phrase bracketer (which is a pre-requisite for extracting them). Unfortunately his tagger isn't even being licensed anymore. It's also not in Lisp. There are a number of taggers out there, but I'm not sure which of them are in Lisp (I believe Wendy's group is developing one in Lisp though).

If you hear about anything, please let me know.

Bob
---------------------------------------------------------------------------
From: cardie%ren@cs.umass.edu (Claire Cardie)

Dan,

Hmm. To do any kind of text segmenting, you usually need either:

1) a dictionary/lexicon of the words in your domain that contains part of speech information for each word, or
2) a big corpus (> 1,000,000 words) of examples on which to perform statistical analyses.

Option 1) still also requires some kind of parser to actually put together the noun phrases and 2) requires access to a statistical part of speech tagger (like Ken Church's, for example). Unfortunately, there is a whole lot of overhead associated with getting the part of speech tagger (in 2) working for your "corpus" and in setting up the dictionary entries (for 1).

Wendy just wrote a part of speech tagger for our system that might work for you, however (if it's combined with the syntactic analysis part of our parser or with some very simple post-processor of your own). It requires training on a couple hundred sentences from your corpus, but then will assign parts of speech to any novel examples.
Finding noun phrases from tagged sentences is pretty easy as long as you don't care about conjunctions or appositives or prepositional phrase attachment. (To get those right requires a lot of semantic knowledge --- and even then it's still really hard to get consistently right unless the noun phrases in your domain tend to be REALLY regular).

If you send me a few examples, I can run them through our part of speech tagger (and parser, for that matter) and show you the kind of output you'd get from each. The tagger would do its work based on the training we've given it (300 sentences worth) for OUR corpus, not yours, so I'm not sure how well it will perform on the sample sentences. (It still has some trouble with our own corpus....) On the other hand, if you only want NP's then it may be perfectly adequate.

I hope all is well in Pittsburgh!

Claire
---------------------------------------------------------------------------
From: lewis@research.att.com (David Lewis)

Hi Dan,

John Brolio and I used a parser John built to do some partial parsing of sentences to extract noun phrases, and this stuff was all in Lisp. I could try to dig this code up if you don't find something else. One problem is that we used the Longman dictionary for the syntactic classes, and my impression was we didn't have the right to distribute a lexicon derived from Longman. I think there are other dictionaries around that you could use though. Probably you will find something better than this, though.

Another possibility would be to get or build a syntactic tagger, and use some heuristics to pull out noun phrases. See Ken Church's paper in the Applied ACL proceedings from 1988. BBN has a syntactic tagger in LISP which you might be able to get if you know someone there. You might also drop a line to Penni Sibun (sibun@parc.xerox.com) and ask her for advice, as she did a lot of work on tagging and NP extraction.
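As a rough illustration of pulling simple noun phrases out of already tagged text with a heuristic, here is a minimal sketch. It is not the code of any system mentioned in these replies. It assumes Penn Treebank style tags (DT, JJ, NN...), and the function name np_chunks is hypothetical. It accepts only the pattern "optional determiner, adjectives, one or more nouns", so it ignores conjunctions, appositives and prepositional phrase attachment, as cautioned above.

```python
def np_chunks(tagged_words):
    """Return simple noun phrases (DT? JJ* NN+) from (word, tag) pairs."""
    chunks, current, has_noun = [], [], False
    # The sentinel pair at the end flushes the last open chunk.
    for word, tag in list(tagged_words) + [(None, "END")]:
        is_noun = tag.startswith("NN")
        is_premodifier = tag in ("DT", "JJ")
        # A noun always extends the chunk; an adjective only before the nouns.
        extends = is_noun or (tag == "JJ" and not has_noun)
        if not extends:
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
        if is_noun or is_premodifier:
            current.append(word)
            has_noun = has_noun or is_noun
    return chunks


sentence = [("the", "DT"), ("tagger", "NN"), ("assigns", "VBZ"),
            ("parts", "NNS"), ("of", "IN"), ("speech", "NN"),
            ("to", "TO"), ("novel", "JJ"), ("examples", "NNS")]
print(np_chunks(sentence))
```

Note that "parts of speech" comes out as two separate chunks, which is exactly the prepositional phrase attachment limitation described above.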
--dave
---------------------------------------------------------------------------
From: Penni Sibun

i have such a thing, though i'm not sure of its current legal status (it's supp. to be pd, but....) or poss bitrot. it doesn't do nesting or conjs, but you can modify the finite state grammar as you please. if y'r further interested, call me--(415) 813-7772.

--penni
---------------------------------------------------------------------------
Bergler Sabine

Dan?!?

... (deletions) ...

I am still working on NL processing, newspaper texts. I have a position at Concordia in Montreal since last June and am just settling in. I am building up my research environment and if you do hear about any NP extractors, please let me know!

... (deletions) ...

Sabine

P.S.: Sorry I can't be of more technical help but I am pondering the very question right now. My temporary solution is to work on texts that have been parsed by the Penn Treebank project (available through ACL/DCI). The parse trees are quite reasonable and NPs are of course tagged... Penn uses Fidditch for that (I think the output requires some manual smoothing out) and maybe you could get a copy of Fidditch (Hindle's parser. Said to be robust) Again, if you should contact them (for the ACL/DCI contact Liberman, myl@unagi.cis.penn.edu) and hear whether they do distribute Fidditch, please let me know. That is on my list for summer, I'll take any hint...
---------------------------------------------------------------------------
From: Didier.Bourigault@der.edf.fr ( Didier BOURIGAULT )

Dear colleague,

Sorry, but my NP-extractor (LEXTER, coling'92, eacl'93) :
- is not public-domain,
- is not in common-lisp but in C,
- is a TERMINOLOGICAL noun phrases extractor and...
- works for french-language texts...

Contact Atro Voutilainen (University of Helsinki, Finland) who has written a system that extracts noun phrases from running english text : avoutila@ling.helsinki.fi

Good luck!
Didier Bourigault
-------------------------------------------------
From: amsler@bellcore.com (Robert A Amsler)

You realize it isn't quite as simple. The noun phrase as a syntactic unit cannot truly be recognized unless you completely parse the text, e.g., ``flying planes can be dangerous'' might have `flying planes' or `planes' as the noun phrase; in general -ing forms are tricky to classify. Furthermore, noun phrases can be arbitrarily long, ``I saw the man on the hill with a telescope'' type noun phrases, or `A long-winded exemplar of the common noun phrase'.

A trivial method for finding some objects in text quite like noun compounds (not the same thing as syntactic noun phrases, mind you), is to follow a heuristic based on the old joke about how to carve a statue of an elephant out of a block of ice---you cut away everything that doesn't look like an elephant. In this case, everything that doesn't look like a noun compound constituent. This consists of a finite list of prepositions, common verb forms (e.g. be, do, have), conjunctions, pronouns, numbers, etc. AND to eliminate any compounds which end in an -ed form. If one breaks on any element of this set (as well as on punctuation) one could get, for example, the following list out of this paragraph.

trivial method, finding, objects, text, noun compounds, thing, syntactic noun phrases, mind, follow, heuristic, old joke, carve, statue, elephant, block, ice, cut, look, elephant, case, noun compound constituent, consists, finite list, prepositions, common verb forms, conjunctions, pronouns, numbers, eliminate, compounds, end, -ed form, breaks, element, set, punctuation, example, following list, paragraph.
----------------
...........................................................................
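The "cut away everything that doesn't look like an elephant" heuristic described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the break list below is a hypothetical, heavily abbreviated stand-in for the "finite list" of prepositions, common verb forms, conjunctions and pronouns, and the function name candidate_compounds is invented here.

```python
import re

# Hypothetical, abbreviated break list; a usable one would be much longer.
BREAK_WORDS = set("""
a an the some any this that these those
of in on at by for with from to into out like as about
be is are was were been being do does did have has had can could
and or but if not quite
i you he she it we they one
""".split())


def candidate_compounds(text, break_words=BREAK_WORDS):
    """Split text on punctuation, break words and numbers.

    Runs of the remaining words are returned as candidate noun compounds,
    except runs whose last word ends in -ed, which are discarded.
    """
    candidates = []
    # Punctuation always ends a run, so cut into punctuation-free segments.
    for segment in re.split(r"[^\w\s'-]+", text.lower()):
        run = []
        for word in segment.split() + [None]:  # None flushes the last run
            if word is None or word in break_words or word.isdigit():
                if run and not run[-1].endswith("ed"):
                    candidates.append(" ".join(run))
                run = []
            else:
                run.append(word)
    return candidates


print(candidate_compounds(
    "A trivial method for finding some objects in text, "
    "quite like noun compounds."))
```

The -ed test here is a plain suffix check, so it also throws away words such as "need"; a fuller version would want an exception list or a morphological check.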
Dan Suthers            | LRDC, room 505A
suthers+@pitt.edu      | 3939 O'Hara Street
(412) 624-7036 office  | University of Pittsburgh
(412) 624-9149 fax     | Pittsburgh, PA 15260
...........................................................................
***********************************************************************
>From pedersen@parc.xerox.com Wed Apr 14 19:59:16 PDT 1993:

The Common Lisp source code for version 1.0 of the Xerox part-of-speech tagger is available for anonymous FTP from parcftp.xerox.com in the file pub/tagger/tagger-1-0.tar.Z. This code has been tested in the following CL implementations:

. Franz Allegro Common Lisp version 4.1 on SunOS 4.x;
. CMU Common Lisp version 16e on SunOS 4.x; and
. Macintosh Common Lisp 2.0p2.

Enjoy.

Doug Cutting and Jan Pedersen
**********************************************************************
>From R.J.Collingham@ncl.ac.uk Thu Mar 18 14:16:23 1993:

I've been asking this for about a year, nobody answered me (not even from the SALT mailing list!), but I have found 2 lexical taggers:

CLAWS1, Lancaster University
============================
personal licence 150 pounds + vat
group licence 400 pounds + vat
site licence 1000 pounds + vat

This is a statistically based tagger, claims about 96% accuracy. Contact Andrew Wilson for more information: eia018@computing.lancaster.ac.uk

Parser for DOS, Prospero Software
=================================
Probably a statistically based tagger. Claims accuracy in the very high 90's%. Costs 340 pounds + vat. Contact Mike Oakes, 081-7418531, 081-7489344 (FAX) for more details.

If you find anything out that I don't know please let me know! Hope this helps.

Cheers
Russell
--------------------------------------------------------------------
Russell J.
Collingham Artificial Intelligence Systems Research Group Computer Science (SECS) University of Durham email: R.J.Collingham@durham.ac.uk Stockton Road phone: 091-374 2637 (2630 secretary) Durham, England DH1 3LE fax: 091-374 3741 From ted@NMSU.Edu Sun May 16 19:12:13 1993 Received: from NMSU.Edu ([128.123.3.5]) by uxa.cso.uiuc.edu with SMTP id AA27875 (5.67a/IDA-1.5 for ); Sun, 16 May 1993 19:09:52 -050 0 Received: from lole (lole.NMSU.Edu) by NMSU.Edu (4.1/NMSU-1.18) id AA11768; Sun, 16 May 93 18:09:47 MDT Date: Sun, 16 May 93 18:09:47 MDT From: Message-Id: <9305170009.AA11768@NMSU.Edu> Received: by lole (4.1/NMSU) id AA21718; Sun, 16 May 93 18:10:04 MDT To: ycg9915@uxa.cso.uiuc.edu In-Reply-To: Yuangshan Chuang's message of Sat, 15 May 1993 10:25:29 -0500 <1993 05151525.AA27157@uxa.cso.uiuc.edu> Subject: Part-of-speech taggers. Status: R we have just finished a simple part of speech tagger for spanish. send email to jhargrav@nmsu.edu for a quick information sheet. the tagger will likely be distributed through the consortium for lexical research (for info about that send email to lexical@nmsu.edu). From eijk@cecehv.enet.dec.com Mon May 17 04:29:16 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20699 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:29:09 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26699; Mon, 17 May 93 02:29:20 -0700 Received: by vbormc.vbo.dec.com; id AA04257; Mon, 17 May 93 11:27:48 +0200 Date: Mon, 17 May 93 11:27:34 +0200 Message-Id: <9305170927.AA04257@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1127) To: ycg9915@uxa.cso.uiuc.edu Subject: RE: Part-of-speech taggers. Status: R This subject was discussed on this list a couple of months ago. I will send you some of the messages. Regards, Pim van der Eijk. Digital Equipment, Amsterdam. 
From eijk@cecehv.enet.dec.com Mon May 17 04:29:24 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20705 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:29:22 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26718; Mon, 17 May 93 02:29:38 -0700 Received: by vbormc.vbo.dec.com; id AA04276; Mon, 17 May 93 11:28:07 +0200 Date: Mon, 17 May 93 11:27:53 +0200 Message-Id: <9305170928.AA04276@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1128) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"ted@nmsu.edu" "MAIL-11 Daemon" 5-OCT-1992 15:48:58.75 To: corplst@nora.hd.uib.no CC: Subj: Non-English taggers and tagged corpora From: 02-Oct-1992 1358 To: Subject: Non-English taggers and tagged corpora I was wondering whether there are people reading this list who are working on (or have references to) taggers for languages other than English. I am especially interested in German, French and Spanish. we are working on a part of speech tagger for spanish. we should have something to distribute via the consortium for lexical research by the end of the year. we are also very interested in non-english tagged corpora. 
for more information on the consortium, contact lexical@nmsu.edu % ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ====== % Received: by vbormc.vbo.dec.com; id AA09560; Mon, 5 Oct 92 15:43:06 +0100 % Received: by enet-gw.pa.dec.com; id AA23541; Mon, 5 Oct 92 07:44:51 -0700 % From: ted@NMSU.Edu % Received: from lole (lole.NMSU.Edu) by NMSU.Edu (4.1/NMSU-1.18)id AA12311; Mon , 5 Oct 92 08:43:13 MDT % Date: Mon, 5 Oct 92 08:43:12 MDT % Message-Id: <9210051443.AA12311@NMSU.Edu> % Received: by lole (4.1/NMSU)id AA02336; Mon, 5 Oct 92 08:43:10 MDT % To: corplst@nora.hd.uib.no % Cc: % In-Reply-To: " (CORPORA list)"'s message of Fri, 2 Oct 1992 15:11:39 +0100 <19 9210021411.AA06832@nora.hd.uib.no> % Subject: Non-English taggers and tagged corpora % Reply-To: ted@nmsu.edu From eijk@cecehv.enet.dec.com Mon May 17 04:29:29 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20713 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:29:27 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26727; Mon, 17 May 93 02:29:58 -0700 Received: by vbormc.vbo.dec.com; id AA04283; Mon, 17 May 93 11:28:17 +0200 Date: Mon, 17 May 93 11:28:02 +0200 Message-Id: <9305170928.AA04283@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1128) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"barnett@mcc.com" "Jim Barnett" 5-OCT-1992 19:19:35.96 To: corplst@nora.hd.uib.no CC: cecehv::eijk_p Subj: Non-English taggers and tagged corpora The University of Kyoto has a Japanese segmenter/morphological analyzer, called Juman, that is freely available. We tested it on some newspaper stories and found it roughly 93% accurate (that is, 93% of the words it identified were both correctly segmented and correctly tagged.) 
- Jim Barnett % ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ====== % Received: by vbormc.vbo.dec.com; id AA16673; Mon, 5 Oct 92 19:10:44 +0100 % Received: by crl.dec.com; id AA27940; Mon, 5 Oct 92 14:10:48 -0400 % Received: from paintbrush.mcc.com by turtle.mcc.com (4.1/isd-master_920825_x)i d AA17281; Mon, 5 Oct 92 13:10:57 CDT % Received: by paintbrush.mcc.com (4.0/isd-other_920825_17:05)id AA04366; Mon, 5 Oct 92 13:10:44 CDT % Date: Mon, 5 Oct 92 13:10:44 CDT % From: barnett@mcc.com (Jim Barnett) % Message-Id: <9210051810.AA04366@paintbrush.mcc.com> % To: corplst@nora.hd.uib.no % Cc: cecehv::eijk_p % In-Reply-To: " (CORPORA list)"'s message of Fri, 2 Oct 1992 15:11:39 +0100 <19 9210021411.AA06832@nora.hd.uib.no> % Subject: Non-English taggers and tagged corpora From eijk@cecehv.enet.dec.com Mon May 17 04:29:43 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20723 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:29:39 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26747; Mon, 17 May 93 02:30:08 -0700 Received: by vbormc.vbo.dec.com; id AA04298; Mon, 17 May 93 11:28:30 +0200 Date: Mon, 17 May 93 11:28:13 +0200 Message-Id: <9305170928.AA04298@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1128) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"corplst@nora.hd.uib.no" "CORPORA list" 6-OCT-1992 15:52:38.83 To: corpora@nora.hd.uib.no CC: Subj: Re: Non-English taggers and tagged corpora Send-date: Mon, 5 Oct 1992 5:11:13 UTC-0700 From: To: Message-ID: inbox:2332 01GPKT0XCWRM9AMWMI(a)CCIT.ARIZONA.EDU Subject: Re: Non-English taggers and tagged corpora The concordancer Letteratura Amica (or Literary Amiga in English) developed by Raffaele Cocchi of the U of Bologna tags and works in most European languages. He's still working on improving the allophones and algorithms for the speech function (this talks in its 9 [?] languages, too), but the concordancer part is well developed. 
His address: Via Toffano, 6; 40125 Bologna, Italy Macey Taylor maceytay@ccit.arizona.edu % ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ====== % Received: by vbormc.vbo.dec.com; id AA21389; Tue, 6 Oct 92 15:46:32 +0100 % Received: by crl.dec.com; id AA13506; Tue, 6 Oct 92 10:50:54 -0400 % Received: by nora.hd.uib.no (5.65c/1.34) % Date: Tue, 6 Oct 1992 12:29:49 +0100 % From: corplst@nora.hd.uib.no (CORPORA list) % Message-Id: <199210061129.AA05416@nora.hd.uib.no> % To: corpora@nora.hd.uib.no % Subject: Re: Non-English taggers and tagged corpora From eijk@cecehv.enet.dec.com Mon May 17 04:29:59 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20729 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:29:57 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26760; Mon, 17 May 93 02:30:19 -0700 Received: by vbormc.vbo.dec.com; id AA04327; Mon, 17 May 93 11:29:00 +0200 Date: Mon, 17 May 93 11:28:37 +0200 Message-Id: <9305170929.AA04327@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1129) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"ingria@BBN.COM" 7-OCT-1992 01:15:51.68 To: corplst@nora.hd.uib.no CC: Subj: Non-English taggers and tagged corpora From: Subject: Re: Non-English taggers and tagged corpora The concordancer Letteratura Amica (or Literary Amiga in English) developed by Raffaele Cocchi of the U of Bologna tags and works in most European languages. He's still working on improving the allophones and algorithms for the speech function (this talks in its 9 [?] languages, too), but the concordancer part is well developed. Some questions: (1) Does this work for all the EC languages? (2) What sorts of tags does it have for nouns and verbs? e.g. for a language with rich morphological Case, such as German or Modern Greek, one might expect the noun tags to include the Case information, whereas for English and Dutch, say, where only pronouns bear overt Case, the noun tags probably wouldn't. 
Similarly for verbs and aspect, mood, and voice. (3) How large are the lexicons for each language for the tagger functions? (4) How does the tagger deal with unknown words? Is it stochastic? Rule-based? Stochastic with knowledge-based overlay? His address: Via Toffano, 6; 40125 Bologna, Italy Does he have an EMail address? -30- Bob % ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ====== % Received: by vbormc.vbo.dec.com; id AA10321; Wed, 7 Oct 92 01:09:25 +0100 % Received: by enet-gw.pa.dec.com; id AA27734; Tue, 6 Oct 92 17:13:25 -0700 % Received: from IZAR.BBN.COM by nora.hd.uib.no (5.65c/1.34) % Message-Id: <199210062212.AA08519@nora.hd.uib.no> % To: corplst@nora.hd.uib.no % In-Reply-To: CORPORA list's message of Tue, 6 Oct 1992 12:29:49 +0100 <1992100 61129.AA05416@nora.hd.uib.no> % Subject: Non-English taggers and tagged corpora % Reply-To: ingria@BBN.COM % Date: Tue, 6 Oct 92 18:07:59 EDT % From: ingria@BBN.COM % Sender: ingria@BBN.COM From eijk@cecehv.enet.dec.com Mon May 17 04:30:04 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20745 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:30:02 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26755; Mon, 17 May 93 02:30:17 -0700 Received: by vbormc.vbo.dec.com; id AA04317; Mon, 17 May 93 11:28:50 +0200 Date: Mon, 17 May 93 11:28:26 +0200 Message-Id: <9305170928.AA04317@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1129) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"corplst@nora.hd.uib.no" "CORPORA list" 6-OCT-1992 23:55:27.92 To: corpora@nora.hd.uib.no CC: Subj: Non-English taggers and tagged corpora Send-date: Tue, 6 Oct 1992 9:32:24 UTC-0600 From: ted To: Message-ID: corpora:91 9210061532.AA21417(a)NMSU.Edu Subject: Non-English taggers and tagged corpora Date: Tue, 6 Oct 1992 13:24:29 +0100 From: corplst%nora.hd.uib.no (CORPORA list) Send-date: Mon, 5 Oct 1992 13:10:44 UTC-0500 From: (Jim Barnett) To: Cc: Message-ID: corpora:70 
9210051810.AA04366(a)paintbrush.mcc.com
Subject: Non-English taggers and tagged corpora

... kyoto's tagger juman is 93% accurate ...

the first version of juman was pretty slow and not terribly accurate due to its small lexicon. the new version is rumored to remedy both defects.

% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: by vbormc.vbo.dec.com; id AA08065; Tue, 6 Oct 92 23:47:32 +0100
% Received: by crl.dec.com; id AA21798; Tue, 6 Oct 92 18:50:12 -0400
% Received: by nora.hd.uib.no (5.65c/1.34)
% Date: Tue, 6 Oct 1992 22:22:51 +0100
% From: corplst@nora.hd.uib.no (CORPORA list)
% Message-Id: <199210062122.AA08227@nora.hd.uib.no>
% To: corpora@nora.hd.uib.no
% Subject: Non-English taggers and tagged corpora

From eijk@cecehv.enet.dec.com Mon May 17 04:30:19 1993
Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20750 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:30:16 -0500
Received: by inet-gw-2.pa.dec.com; id AA26790; Mon, 17 May 93 02:30:41 -0700
Received: by vbormc.vbo.dec.com; id AA04345; Mon, 17 May 93 11:29:09 +0200
Date: Mon, 17 May 93 11:28:57 +0200
Message-Id: <9305170929.AA04345@vbormc.vbo.dec.com>
From: eijk@cecehv.enet.dec.com (17-May-1993 1129)
To: ycg9915@uxa.cso.uiuc.edu
Status: R

From: VBORMC::"corplst@nora.hd.uib.no" "CORPORA list" 7-OCT-1992 22:04:27.51
To: corpora@nora.hd.uib.no
CC:
Subj: re: Non-English taggers and tagged corpora

Send-date: Wed, 7 Oct 1992 17:16:40 UTC+0100
From: Patrick John Coppock
To:
Message-ID: corpora:107 204*C=no;PRMD=uninett;O=unit;OU=avh;S=patCoppock
Subject: re: Non-English taggers and tagged corpora

To: "eijk p"@cecehv.ENET.dec.com
Subject: Non-English taggers and tagged corpora

You ask:

I was wondering whether there are people reading this list who are working on (or have references to) taggers for languages other than English. I am especially interested in German, French and Spanish.

I assume you have heard of the CHILDES project at Carnegie Mellon university.
If not, then you might be interested in contacting them. They have, amongst other things, a corpus of child language texts, coded according to the CHILDES system (Child Language Data Exchange System). There is also programware available for both Mac and IBM-DOS for working with corpuses tagged using this system.

The coordinator of the project is Jeff McWhinney, and the e-mail address for info. is: childes@andrew.cmu.edu. It is possible to ftp stuff from the CHILDES corpus from poppy.psy.cmu.edu (ip 128.2.298.42)

best wishes

pat coppock
dept of applied linguistics
University of Trondheim
7055 DRAGVOLL
Norway

% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: by vbormc.vbo.dec.com; id AA11349; Wed, 7 Oct 92 21:59:00 +0100
% Received: by crl.dec.com; id AA17842; Wed, 7 Oct 92 17:03:14 -0400
% Received: by nora.hd.uib.no (5.65c/1.34)
% Date: Wed, 7 Oct 1992 18:14:58 +0100
% From: corplst@nora.hd.uib.no (CORPORA list)
% Message-Id: <199210071714.AA12276@nora.hd.uib.no>
% To: corpora@nora.hd.uib.no
% Subject: re: Non-English taggers and tagged corpora

From eijk@cecehv.enet.dec.com Mon May 17 04:30:24 1993
Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20760 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:30:22 -0500
Received: by inet-gw-2.pa.dec.com; id AA26795; Mon, 17 May 93 02:30:44 -0700
Received: by vbormc.vbo.dec.com; id AA04352; Mon, 17 May 93 11:29:13 +0200
Date: Mon, 17 May 93 11:29:05 +0200
Message-Id: <9305170929.AA04352@vbormc.vbo.dec.com>
From: eijk@cecehv.enet.dec.com (17-May-1993 1129)
To: ycg9915@uxa.cso.uiuc.edu
Status: R

From: VBORMC::"corplst@nora.hd.uib.no" "CORPORA list" 8-OCT-1992 01:31:04.49
To: corpora@nora.hd.uib.no
CC:
Subj: Non-English taggers and tagged corpora

Send-date: Wed, 7 Oct 1992 12:18:02 UTC-0600
From: ted
To:
Message-ID: corpora:109 9210071818.AA11745(a)NMSU.Edu
Subject: Non-English taggers and tagged corpora

There is also programware available for both Mac and IBM-DOS for working
with corpuses tagged using this system. what sort of tags are we talking about? i was looking for programs which would accept text as input and produce text+part of speech tags as output. i didn't think that the childes project had anything of that sort at all. % ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ====== % Received: by vbormc.vbo.dec.com; id AA14806; Thu, 8 Oct 92 01:25:49 +0100 % Received: by enet-gw.pa.dec.com; id AA14024; Wed, 7 Oct 92 17:29:57 -0700 % Received: by nora.hd.uib.no (5.65c/1.34) % Date: Wed, 7 Oct 1992 23:58:37 +0100 % From: corplst@nora.hd.uib.no (CORPORA list) % Message-Id: <199210072258.AA13521@nora.hd.uib.no> % To: corpora@nora.hd.uib.no % Subject: Non-English taggers and tagged corpora From eijk@cecehv.enet.dec.com Mon May 17 04:30:45 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20768 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:30:42 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26805; Mon, 17 May 93 02:30:59 -0700 Received: by vbormc.vbo.dec.com; id AA04367; Mon, 17 May 93 11:29:33 +0200 Date: Mon, 17 May 93 11:29:16 +0200 Message-Id: <9305170929.AA04367@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1129) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"corplst@nora.hd.uib.no" "CORPORA list" 8-OCT-1992 12:00:05.81 To: corpora@nora.hd.uib.no CC: Subj: Re: Non-English taggers and tagged corpora Send-date: Thu, 8 Oct 1992 9:05:55 UTC+0100 From: (Helmut Feldweg) To: (CORPORA list) Subject: Re: Non-English taggers and tagged corpora > i was looking for programs which would accept text as input and > produce text+part of speech tags as output. i didn't think that the > childes project had anything of that sort at all. > The CHILDES project has a part of speech tagger for English, a similar system for German is under development. These taggers require the data to be formatted according to the CHILDES transcript conventions. 
- Helmut Feldweg % ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ====== % Received: by vbormc.vbo.dec.com; id AA26899; Thu, 8 Oct 92 11:54:28 +0100 % Received: by enet-gw.pa.dec.com; id AA17567; Thu, 8 Oct 92 03:58:16 -0700 % Received: by nora.hd.uib.no (5.65c/1.34) % Date: Thu, 8 Oct 1992 09:55:43 +0100 % From: corplst@nora.hd.uib.no (CORPORA list) % Message-Id: <199210080855.AA14328@nora.hd.uib.no> % To: corpora@nora.hd.uib.no % Subject: Re: Non-English taggers and tagged corpora From eijk@cecehv.enet.dec.com Mon May 17 04:31:20 1993 Received: from inet-gw-2.pa.dec.com by uxa.cso.uiuc.edu with SMTP id AA20799 (5.67a/IDA-1.5 for ); Mon, 17 May 1993 04:31:17 -050 0 Received: by inet-gw-2.pa.dec.com; id AA26843; Mon, 17 May 93 02:31:26 -0700 Received: by vbormc.vbo.dec.com; id AA04383; Mon, 17 May 93 11:29:49 +0200 Date: Mon, 17 May 93 11:29:36 +0200 Message-Id: <9305170929.AA04383@vbormc.vbo.dec.com> From: eijk@cecehv.enet.dec.com (17-May-1993 1130) To: ycg9915@uxa.cso.uiuc.edu Status: R From: VBORMC::"feldweg@bach.sns.neuphilologie.uni-tuebingen.de" "Helmut Feldweg" 9-OCT-1992 20:31:05.99 To: rda%cogsci.edinburgh.ac.uk@alf.uib.no (Robert Dale) CC: corpora@nora.hd.uib.no, rda%cogsci.edinburgh.ac.uk@alf.uib.no Subj: Re: Non-English taggers and tagged corpora Sounds like CHILDES is not known to everybody in this list, so I'll give a short introduction here: CHILDES (Child Language Data Exchange System) comprises (a) a set of transcript conventions tailored for child language (b) a large collection of computerized language acquisition data, most of the data formated according to (a) (c) a set of computer programs for analyzing (b) (b) and (c) are available on CD-ROM from CMU (write to CHILDES@andrew.cmu.edu or to Brian MacWhinney (brian@andrew.cmu.edu) for a copy) and through anonymous ftp (poppy.psy.cmu.edu). 
Although CHILDES yields free access for research purposes, one is officially required to become a 'CHILDES-member' before using the data (by sending an informal note to Brian MacWhinney). There are also some European centers of CHILDES, one of which is located at the Max-Planck for Psycholinguistics in Nijmegen. Transcript conventions, software and data collection are thoroughly described in: Brian MacWhinney: The CHILDES Project: Tools for Analyzing Talk. Hillsdale: Lawrence Erlbaums, 1991. The ISBN for the paperback is 51-0-8058-1006-4 and the price is $29.95. Not described in that manual are recent software developments. One of the developments is a morphological tagger based on the ECAT parser of Roland Hauser. It takes utterances transcribed according to (a) as input and attaches a so-called mor-tier with full morphological analysis to it. The basic engine of the parser is language independent. Language-specific information is stored in separate files. CHILDES currently supplies such files for English. It is said that German versions are in preparation. Ambiguous forms are 'handled' in two ways: the interactive version of the parser generates a menu of choices and asks the user to select one of it, the non-interactive version writes all possible alternatives to the output and marks the forms as ambiguous. Brian MacWhinney should be able to provide more information on this. A preliminary version of the user manual for the newer programs is available via ftp as a packed MS-Word file for the Mac at poppy.psy.cmu.edu (128.2.248.4), directory clan/macintosh, file update.sit.hqx. Helmut Feldweg (formerly coordinator for the CHILDES and ESF-databases at the MPI for Psycholinguistics, Nijmegen) Seminar f"ur allgemeine Sprachwissenschaft, Universit"at T"ubingen Wilhelmstr. 
113, D-7400 T"ubingen 1, Germany
email: feldweg@mailserv.zdv.uni-tuebingen.de
       feldweg@bach.sns.neuphilologie.uni-tuebingen.de
phone: +31 (0)7071 29-4279

% ====== Internet headers and postmarks (see DECWRL::GATEWAY.DOC) ======
% Received: by vbormc.vbo.dec.com; id AA10130; Fri, 9 Oct 92 20:21:34 +0100
% Received: by crl.dec.com; id AA18857; Fri, 9 Oct 92 15:23:24 -0400
% Received: from mailserv.zdv.uni-tuebingen.de by nora.hd.uib.no (5.65c/1.34)
% Received: from bach.sns.neuphilologie.uni-tuebingen.de by mailserv.zdv.uni-tuebingen.de (4.1/ZDV-Uni-Tuebingen-1.0)id AA19683; Fri, 9 Oct 92 11:47:06 +0100
% Received: by bach.sns.neuphilologie.uni-tuebingen.de (4.1/SNS-1.0 )id AA28204; Fri, 9 Oct 92 11:47:04 +0100
% From: feldweg@bach.sns.neuphilologie.uni-tuebingen.de (Helmut Feldweg)
% Message-Id: <9210091047.AA28204@bach.sns.neuphilologie.uni-tuebingen.de>
% Subject: Re: Non-English taggers and tagged corpora
% To: rda%cogsci.edinburgh.ac.uk@alf.uib.no (Robert Dale)
% Date: Fri, 9 Oct 92 11:47:03 MET
% Cc: corpora@nora.hd.uib.no, rda%cogsci.edinburgh.ac.uk@alf.uib.no
% In-Reply-To: <634.9210090941@scott.cogsci.ed.ac.uk>; from "Robert Dale" at Oct 9, 92 10:41 am
% X-Mailer: ELM [version 2.3 PL11]

From corpora-request@uib.no Tue May 25 00:37:14 1993
From: eharate@strlall.strl.nhk.or.jp
Date: Mon, 24 May 93 15:37:14 +0900
Subject: ATR Dialogue Database

I N T E R O F F I C E M E M O R A N D U M

Date:       24-May-1993 03:36pm JST
From:       Terumasa Ehara (EHARATE)
Dept:       Image Research Division
Tel:        03-5494-2308
To:         Remote Addressee ( _CORPORA@NORA.HD.UIB.NO )
Subject:    ATR Dialogue Database

I would like to introduce a Japanese tagged corpus, the ATR Dialogue Database (ADD), to colleagues. It was collected from simulated telephone and keyboard conversations. The tasks are an international conference task and a tourist agency task. Almost one million words have been collected. Morphological and syntactic analyses have been carried out on them. ADD can be provided by ATR.
Detailed information can be given from Mr. Noriyoshi Uratani ATR Interpreting Telecommunications Research Laboratories 2-2, Hikaridai, Seikacho, Sourakugun, Kyoto, 619-02, Japan tel. 81-7749-5-1357 fax. 81-7749-5-1308 email uratani@itl.atr.co.jp From corpora-request@uib.no Tue May 25 03:23:53 1993 Date: Tue, 25 May 93 07:23:53 EDT From: Eleanor Olds Batchelder Subject: Query: Japanese Tagset To: corpora@hd.uib.no, linguist@tamvm1.tamu.edu As part of a project to develop a stochastic lexical analyzer for Japanese, we are trying to decide on an appropriate set of part-of-speech labels for Japanese text. If you are currently processing Japanese text for any purpose, could you tell us: a) What is the goal of your project? b) What tags are you currently using? c) Are they successful for your purposes? If not, why not? Thanks, Eleanor Olds Batchelder, CUNY From corpora-request@uib.no Tue May 25 15:07:28 1993 From: E S Atwell Date: Tue, 25 May 93 14:07:28 +0100 To: corpora@hd.uib.no Subject: tagged spoken corpus Wim Peters asks: >Could any of you tell me if there exists a tagged version of the >London-Lund corpus? I spoke to various researchers from Lund last week at the ICAME (International Computer Archive of Modern English) conference in Zurich, and asked the same question! I gathered that part of the LLC had been tagged, but NOT all of it. However, some Lund researchers have now switched their attentions to the Spoken English Corpus, also available from ICAME in Bergen, as this IS tagged (with a tagset very similar to those of LOB or Brown corpora). Once the Lancaster/Leeds project to upgrade SEC to MARSEC is finished, the Spoken English Corpus will be available in several parallel (aligned) versions: digitised acoustic signal, phonetic/phonemic transcription, graphemic transcription (ie `normal' ascii), prosodic annotations (ie stress boundaries pitch movements etc), wordtags, syntax trees. 
For more details of MARSEC contact Peter Roach, peter@psyc.leeds.ac.uk

Eric

&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
Eric Steven Atwell
National Coordinator, Higher Education Funding Councils' KBS Initiative
Director, Centre for Computer Analysis of Language And Speech (CCALAS)
Artificial Intelligence Division,
School of Computer Studies          phone: +44 532 335761
Leeds University                    FAX:   +44 532 335468
Leeds LS2 9JT                       Email: eric@scs.leeds.ac.uk
England
&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&

From corpora-request@uib.no Thu May 27 15:54:03 1993
Date: Thu, 27 May 1993 14:54:03 +0100
From: renie@cfdvax.univ-bpclermont.fr
To: corpora@hd.uib.no
Subject: joining

To: the CORPORA list
Subject: subscription

I would like to subscribe to your list. This is my address:
renie@cfdvax.univ-bpclermont.fr

Thanks.

Delphine Renie
Departement de Linguistique
Universite Clermont 2
34 avenue Carnot
63037 Clermont-Fd Cedex.
FRANCE

From corpora-request@uib.no Thu May 27 18:06:09 1993
Date: Thu, 27 May 93 16:06:09 +0200
From: ide@grtc.cnrs-mrs.fr (Nancy Ide)
To: empiricists@CSLI.Stanford.EDU, corpora@hd.uib.no
    tei-l@uicvm.bitnet,lexical@NMSU.Edu,linguist@tamsun.tamu.edu,\
    humanist@brownvm.brown.edu,humbul@mail.rl.ac.uk,hcf1dahl@UCSBUXA.BITNET,\
    weischedel@bbn.com,ln@frmop11.bitnet,nl-kr@cs.rpi.edu,tsipanel
Subject: For publication: Text Software Initiative

The Text Software Initiative
----------------------------

An international effort to promote the development and use of free
text software

The widespread availability of large amounts of electronic text and
linguistic data in recent years has dramatically increased the need for
generally available, flexible text software. Commercial software for
text analysis and manipulation covers only a fraction of research
needs, and it is often expensive and hard to adapt or extend to fit a
particular research problem.
Software developed by individual researchers and labs is often
experimental and hard to get, hard to install, under-documented, and
sometimes unreliable. Above all, most of this software is incompatible.
As a result, it is not at all uncommon for researchers to develop
tailor-made systems that replicate much of the functionality of other
systems and in turn create programs that cannot be re-used by others,
and so on in an endless software waste cycle. The reusability of data
is a much-discussed topic these days; similarly, we need "software
reusability", to avoid the re-inventing of the wheel characteristic of
much language-analytic research in the past three decades.

The Text Software Initiative (TSI) is committed to solving this problem
by working to

  o establish and publish guidelines and standards for the development
    of text software;
  o promulgate and coordinate the development of free TSI-conformant
    software.

The scope of the TSI covers all areas of analysis and manipulation of
all kinds of texts (written or spoken, mono-lingual or multi-lingual
parallel, etc.), including markup of physical and logical text
features, linguistic analysis and annotation, browsing and retrieval,
statistical analysis, and other text-related tasks in research in
computational linguistics, humanities computing, terminology and
lexicography, speech, etc.

The TSI software development effort is distributed, that is, anyone can
contribute on a voluntary basis. This means that tools will be
developed according to the contributors' priorities; however, the TSI
is ultimately working towards the development of a comprehensive text
handling system.
To ensure software compatibility and reusability and enable distributed
development, the TSI is committed to:

  o design and publish program interface conventions
  o determine and publish guidelines for programming style and
    documentation
  o stress separation of code and linguistic data to ensure (natural)
    language independence
  o emphasize breaking high-level text-handling tasks into more
    primitive, reusable functions
  o provide a library of primitive text-handling tools
  o maintain a task list and set priorities
  o circulate information such as progress reports, revisions to the
    standard, availability of new software, etc.
  o set up a mechanism for testing and evaluation
  o maintain mailing lists for comments, bug reports, suggestions, etc.

The TSI works in cooperation with other standardization groups, notably
the Text Encoding Initiative and the Expert Advisory Group on Language
Engineering Standards (EAGLES).

All TSI software is free in the sense defined in the Free Software
Foundation's General Public License, which guarantees the freedom to
copy, redistribute, and modify software, and protects this freedom by
requiring those who pass on the software to include the rights to
further redistribute it and to see and change the code.

TSI software is distributed in cooperation with other dissemination
groups such as the Free Software Foundation, RELATOR, and the
Linguistic Data Consortium. The TSI does not provide technical support,
but organizes a network of voluntary consultants and support people.
PROJECT COORDINATORS

Nancy Ide, Vassar College, Poughkeepsie, New York, USA
ide@cs.vassar.edu

Jean Veronis, Universite de Provence/CNRS, Aix-en-Provence, France
veronis@grtc.cnrs-mrs.fr

GENERAL ADVISORY BOARD

Susan Armstrong, ISSCO, Geneva
Mark Liberman, Linguistic Data Consortium, University of Pennsylvania
Makoto Nagao, Kyoto University
Mark Olsen, ARTFL Project, University of Chicago
Richard Stallman, Free Software Foundation, Cambridge, Massachusetts
Donald Walker, Bellcore, Morristown, New Jersey
Antonio Zampolli, Istituto di Linguistica Computazionale, Pisa

The TSI also includes a TECHNICAL ADVISORY BOARD of software developers.