The Consortium for Lexical Research (CLR)

Rio Grande Research Corridor
Computing Research Laboratory
New Mexico State University
Box 30001, Las Cruces, NM 88003
lexical@crl.nmsu.edu
(505) 646-5466
Fax: (505) 646-6218

Work in computational linguistics has reached the point where the performance of many natural language processing systems is limited by a ``lexical bottleneck''. That is, such systems could handle much more text and produce much more impressive application results were it not for the fact that their lexicons are too small.

The Association for Computational Linguistics proposed that a Consortium for Lexical Research (CLR) be established, with funding from DARPA. The CLR was set up in July of 1991 and is sited at the Computing Research Laboratory, New Mexico, USA, under its Director, Yorick Wilks, Associate Director Louise Guthrie, and an ACL committee consisting of Roy Byrd, Ralph Grishman, Mark Liberman and Don Walker.

Any individual or organization wishing to make initial contact with the CLR and review its procedures, holdings, agreements, etc. should send email to the address above (or write to the mailing address or fax).

OBJECTIVE

The objective of the Consortium for Lexical Research is to act as a clearing house, in the US and internationally, for lexical data and software. It shares lexical data and tools used to perform research on machine-readable dictionaries and lexicons, and communicates the results of that research, thus increasing the scale and speed of the development of natural language understanding programs via standard lexicons and software.

A basic premise of the proposal for cooperation on lexical research is that the research must be ``precompetitive''. That is, the CLR does not have as its goal the creation of commercial products. The goal of precompetitive research is to augment our understanding of what lexicons contain and, specifically, to build computational lexicons having those contents.
Members of the Consortium contribute to a repository and withdraw resources from it in order to perform their research. There is no requirement that withdrawals be compensated by contributions in kind. Members are charged an annual fee to help support the cost of running the CLR.

The task of the CLR is primarily to facilitate research, making available to the whole natural language processing community certain resources now held only by a few groups that have special relationships with companies or dictionary publishers. There is also an underlying theoretical assumption or hope: that the contents of major lexicons are very similar, and that some neutral, or ``polytheoretic,'' form of the information they contain can be at least a research goal, and would be a great boon if it could be achieved. The CLR, as far as is practically possible, accepts contributions from any source, regardless of theoretical orientation, and makes them available as widely as possible for research.

APPROACH

We have set up publicity networks to attract interested donors of materials and members. From there we have defined agreements for donors and members, established a fee structure, and set up computer networking facilities to carry out donations and withdrawals of materials.

A major activity of the CLR is to negotiate agreements with ``providers'' on terms that are reassuring and advantageous to both suppliers and researchers. Major funders of work in this area in the US have indicated interest in making participation in the CLR a condition for financial support of research.

CLR RESOURCES AND SERVICES

The Computing Research Lab (CRL) has a range of machines appropriate for advanced computing on dictionaries (including the construction of large-scale matrices): DARPA-supported access to a Connection Machine, a Sequent Symmetry, and an IBM-ACE parallel machine, as well as a network of UNIX workstations.
The Consortium has access to an appropriate range of large-scale storage machines, and capacities for accepting and providing materials by network, tape and CD.

The CLR archives include two main areas: a public area and one for members only. These are the repository for such lexical items as word lists; published dictionaries; specialized terminology; statistical data; synonyms, antonyms, hypernyms, pertainyms, etc.; and phrase lists. They also include tools for lexical data base management, lexical query, text analysis, dictionary encoding, and dictionary definition sense taggers.

Repository management involves cataloging and storing material in disparate formats, and providing for their retransmission (with conversion, where appropriate tools exist). In addition, the CLR maintains a library of documentation that describes the repository's contents and contains research papers resulting from projects that use the material.

A brief description of the services provided is as follows:

a. CLR provides a catalog of, and acts as a clearinghouse for, utility programs which have been written for existing online lexical data.

b. It also provides information for access to repositories of corpus-manipulation tools held elsewhere.

c. CLR compiles a list of known mistakes, misprints, etc. that occur in each of the major published sources (dictionaries etc.).

d. CRL distributes a monthly newsletter highlighting materials available in the consortium's archives.

PROGRESS

Progress during the year has been achieved in the three areas that correspond to these data bases:

* Building the contact base
* Finalizing membership and provider agreements
* Setting up the online database archives of the consortium

Building the contact base has entailed publicizing the Consortium and its purposes to the research community. A list of addresses was compiled. Printed and email announcements were composed and distributed, with an email address for responses.
The announcement was posted in relevant newsletters and journals. To date there have been three large-scale mailouts. Response has been enthusiastic and continuous. Conference presentations and personal contacts concerning the Consortium have included ACH/ALLC and others, in the US, Europe, Japan, and elsewhere. Particular attention has been directed to reaching core researchers, building the current mailing list to over 500. The mail directed to lexical@crl.nmsu.edu or to the Consortium staff themselves has been answered individually, with queries about what the correspondents are interested in and what they might like to contribute to the archives. The responses indicate that there is a great variety of lexically related software needed and available.

Setting up procedures for the receiving of materials and their legal protections has led to the formulation of draft membership and provider agreements. The agreements have been finalized and legally approved at NMSU, and memberships have been accepted (see below). The major problem, which has meant an enormous amount of negotiation with major publishers, has been creating a general form of provider agreement that captures the interests of the major dictionary publishers in a general way, not tailored to each. As we report below, there has been substantial progress.

Facilities for receiving and providing archival materials have also been set up. The directories and file transfer procedures are in place. Besides online access and deposit of materials, transfer by tape and diskette has been anticipated, all in a number of formats. Heavy security has been set up for heavily encumbered materials.

Software for handling and classifying correspondence has been written which will permit cross-classification and sorting of member entries to match user needs and user offerings. Written in-house reports of CLR operations have been made regularly.
Other software which will enable handling of materials in varied scripts is also under development, so that materials with a variety of orthographies can be transmitted, etc. (Scripts include Japanese, Chinese, Korean, Cyrillic, some Indic, and others. This is an item of current interest in the lexical research community.)

One of the goals of the Consortium is to make electronic versions of dictionaries and thesauri available within the research community, and discussions and visits have continued with Oxford University Press, Collins Publishers, and Longman Group Limited. In brief, we now have arrangements with Harper-Collins which facilitate the purchase of their machine-readable dictionaries by members. We expect to reach that stage soon with Longman and Oxford. All this has been slower than we hoped. As these negotiations are in their final stages, we are turning to the major US publishers.

Materials in the CLR archives are secured by a ``protection in depth'' scheme. On the most public level, freely distributable materials are available via anonymous ftp. We maintain only a log of recent contacts for these materials. For the protection of lightly encumbered materials, we provide members of the Consortium with individual ftp accounts by which they can access archived material. These materials are kept separate from the publicly accessible materials, and are protected by standard ftp accounting and permission software. These accounts are only valid for ftp transfer, and their passwords are changed regularly by the Consortium. On the highest level of security, members who have received permission from the supplier of heavily encumbered material are given a special temporary ftp account which allows them access to encrypted versions of the heavily encumbered material. To obtain the material illegally, not only would the normal file permissions scheme have to be subverted, but a highly secure cryptographic system would also have to be defeated.
As an additional security measure, the files are periodically re-encrypted with freshly generated random passwords. At no time is an unencrypted version of the material stored in the ftp-accessible archives.

CLR is now the center of distribution of the data required by participants in the Fifth Message Understanding Conference (MUC-5). These research groups are all working on Information Extraction (IE) from actual texts. The performance of their systems will be evaluated in August 1993. CLR's facilities provided a secure and carefully monitored means of distributing the large volumes of data, such as gazetteers, rules, training texts, etc., required to build the IE systems.

The Consortium for Lexical Research currently has in its public archives contributions of 145 different packages for lexical use. Of these 145, eighty are restricted to MUC-5, eleven are restricted to ``members-only'' (lightly encumbered) access, and the remainder are available to anyone (unencumbered). We have placed in the heavily encumbered category of the archive a recent version of the Alvey tools. JUMAN, the segmenter and part-of-speech tagger for Japanese, and the Xerox part-of-speech tagger have both been placed in the lightly encumbered category, for members-only access. The materials in the MUC-5 area include a database and tools for assessing the message understanding software designed and used by MUC members.
Contributions include thesauri (Roget's Thesaurus and WordNet), dictionaries (Collins English-Spanish Bilingual Dictionary and Longman Dictionary of Contemporary English), wordlists (Gazetteer, Proper Names, and the Standard Industrial Classification Manual), technical reports (all software and lexica include technical reports relevant to the materials being investigated), morphological analyzers (ENGLEX), lexical parsers (SGML parser for text processing), typesetting software (Indian, Arabic, Korean, Vietnamese, Japanese, Chinese and Tibetan fonts), dictionary interface tools (BYU's Morphogen), text analyzers (Interlinear text processor from SIL), and a phonological programming language.

Approximately 50 other contributions are in various stages of negotiation at this time, including the text of German, Greek, and Latin Vulgate Bibles, the American Heritage Dictionary, POPX, a Russian-English dictionary of political terms, and a large Chinese-English dictionary.

The CLR currently has 48 member organizations which include 22 domestic universities, 4 foreign universities, 8 government agencies and 18 commercial companies (including Apple, Microsoft, Xerox). MUC-5 participants have all joined the CLR, providing valuable software and materials. Currently 4 more memberships are pending signature by NMSU, and 8 organizations have indicated that they would like to join, but we have not yet received membership agreements from them.

Our most recent and major events are two workshops. The first was a CLR workshop in January of 1992 which brought together publishers, researchers, funders and users of lexical materials for three days in New Mexico, with the support of the ACL and the NSF.
The major issues of lexicon reusability and the problems of copyright and ownership of materials were discussed extensively, and full notes of the discussions and presentations at the workshop will be distributed very soon as a technical memorandum (The First Workshop at the Consortium for Lexical Research. MCCS-92-243 from Computing Research Laboratory, Box 3CRL, NMSU, Las Cruces, NM 88001, price $7).

The second workshop, entitled U.S./European Cooperation, took place in January 1993. The workshop was sponsored jointly by NSF and the European Commission to discuss international cooperation on lexical computation. Twenty-five researchers participated in the workshop. (MCCS-93-254 from Computing Research Laboratory, Box 3CRL, NMSU, Las Cruces, NM 88001, price $5).

FUTURE PLANS

Our plans for the coming year are to expand membership and holdings steadily over the year, and to progress toward our long-term goal of establishing the Consortium as a self-supporting entity. We hope to do this by signing agreements with other dictionary publishers to make their products available through the CLR and by more actively seeking contributions of software or data from researchers. Our membership drive will focus on obtaining more international members, and members from the community of researchers and language specialists who may not have everyday access to the internet.

The Consortium for Lexical Research
ftp site: clr.nmsu.edu [128.123.1.12]
----------------------------------------
The Consortium for Lexical Research is housed in the Computing Research Laboratory of New Mexico State University, Las Cruces, New Mexico.
Please email suggestions and questions to lexical@crl.nmsu.edu

The Consortium for Lexical Research
===================================

The Consortium for Lexical Research is designed to serve as a repository for software and resources of importance to the natural language processing research community.
Sharable resources, and the task of centralizing lexical data and tools, are of foremost concern in lexical research and computational linguistics. It is our objective to help alleviate the repeated re-creation of basic software tools, and to assist in making essential data sources more generally available.

CLR maintains a public ftp site, and a separate library of materials only for members of CLR. Currently CLR has about 60 members, mostly academic institutions, and almost every major natural language processing center in the U.S. belongs. Access to the members-only materials is strictly regulated by password and userid.

Archive Information
===================

* What does CLR house at this FTP site?

The easiest way to become familiar with CLR is to use the 'get' command to retrieve our catalog. At the top level you can find the file "catalog.ps" for a postscript version, or "catalog" for a simple ascii version. The catalog lists holdings alphabetically and has a paragraph-long description of each item. After the paragraph description the catalog tells how to find additional information for that item in an "info" file. For example:

     More Info: info/0114.

To obtain this information, change to the "info" directory and 'get' the file "0114".

* What is membership in the CLR?

CLR archives certain materials exclusively for members of the Consortium for Lexical Research. Members are researchers in the computational linguistics and natural language processing community. There are three membership fee categories: academic, corporate and government. However, it is possible to provide materials in exchange for membership. If your company or organization has data or software that it is willing to deposit in the CLR, membership fees can be waived partially or completely. Please request information about becoming a member from lexical@crl.nmsu.edu.

* How to get more information about CLR

The file "README.whatis.clr" has a description of our activities and goals.
The newsletter directory has past issues of newsletters.

* How to upload files to the archive

To deposit software in the CLR, wrap up the package or files using the program most suitable for the target audience: on Unix use tar, on PCs use ZIP, and on Macs use Stuffit or Compactor. Send email to lexical@crl.nmsu.edu informing us of the arrival. Then ftp the file to the directory incoming/.

FTP ACCESS
==========

* Anonymous ftp:
     host: ftp clr.nmsu.edu [128.123.1.12]
     user: anonymous
     pass: your email address
     type: binary

* Members ftp:
     host: ftp clr.nmsu.edu [128.123.1.12]
     user: assigned userid (e.g., clrnmm)
     pass: assigned password (e.g., abc09j)
     type: binary

FTP COMPRESSION CONVENTIONS
===========================

Files with the suffix .Z are compressed with Unix compress. Those with the suffix .z or .gz are compressed with GNU gzip. Gzip compresses better than compress.

This ftp server allows file conversions and transfer of directories in the following manner:

Original      Requested
Name          Name           Result
------------------------------------------------------------------
File.Z        File           File is decompressed before sending
File          File.Z         File is compressed
DIR           DIR.tar        Directory DIR is tar'ed
DIR           DIR.tar.z      Directory DIR is tar'ed and gzip'ed

CLR Mailing List
================

The Archive here is growing rapidly. If you are interested in receiving the CLR newsletter then please email your request to lexical@crl.nmsu.edu.

Here is a brief layout of the CLR site. Not every sub-directory is listed; rather, this synopsis will help you in "exploring" the site. However, the best way to locate materials is still to 'get' the catalog.

WITHIN THE DIRECTORY CLR

catalog           | Catalog and catalog.ps are identical,
catalog.ps        | except for .ps being a postscript version.
catalog-short     | Catalog-short lists file names and contents.

info/             | The info files are a helpful auxiliary to
                  | the paragraph descriptions of the catalog.

ls-lR             | These are the conventional ls -lR files.
ls-lR.Z           |

lexica/           | This directory contains various types of
                  | wordlists, WordNet, dictionaries, etc.

members-only/     |
   corpora        | This area of the site is reserved for
   lexica         | members of the CLR. The lexica directory
   resources      | has dictionaries, wordlists, the gazetteer,
   tools          | the SICM terms manual, etc. The tools
                  | area has the sub-directory ling-analysis
                  | which holds parsers and morphology
                  | programs. Resources is a service to
                  | members with pointers to lexical
                  | resources: sites, publications, etc.

multiling/        |
   General        | The multi-lingual section has a wide
   arabic         | variety of text processing tools, including
   chinese        | word processors, fonts, transliteration
   french         | programs, etc. Also, there are language
   gaelic         | instruction aids and tutorials.
   indian         |
   italian        |
   japanese       |
   korean         |
   russian        |
   tibetan        |
   vietnamese     |

newsletter/       | All past copies of newsletters, in ascii
                  | and postscript format.

tools/            | This is a large directory with tools for
                  | concordances, dictionary maintenance, sgml,
                  | etc. The ling-analysis area has grammar
                  | builders, parsers, and morphology programs.
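
The anonymous ftp procedure and the file-name conversion conventions described under FTP ACCESS and FTP COMPRESSION CONVENTIONS can be sketched in a short script. This is only an illustration, not a tool supplied by the CLR: the helper names requested_name, info_path and fetch are hypothetical, and the sketch assumes the server behaves exactly as the conversion table states. The host name and the "More Info" line format are taken from the text above.

```python
from ftplib import FTP

# Host name as given in this README; everything else here is illustrative.
CLR_HOST = "clr.nmsu.edu"


def requested_name(original: str, action: str) -> str:
    """Return the name to request so the server applies `action` to `original`.

    Mirrors the four rows of the conversion table:
    "decompress", "compress", "tar" and "tar+gzip".
    """
    if action == "decompress":
        if not original.endswith(".Z"):
            raise ValueError("only .Z files are decompressed on request")
        return original[:-2]          # File.Z -> File
    if action == "compress":
        return original + ".Z"        # File -> File.Z
    if action == "tar":
        return original.rstrip("/") + ".tar"      # DIR -> DIR.tar
    if action == "tar+gzip":
        return original.rstrip("/") + ".tar.z"    # DIR -> DIR.tar.z
    raise ValueError(f"unknown action: {action}")


def info_path(catalog_line: str) -> str:
    """Extract the info-file path from a catalog line like 'More Info: info/0114.'"""
    prefix = "More Info:"
    if not catalog_line.startswith(prefix):
        raise ValueError("not a 'More Info' line")
    return catalog_line[len(prefix):].strip().rstrip(".")


def fetch(remote_name: str, local_name: str, email: str) -> None:
    """Anonymous binary download; the email address is sent as the password."""
    with FTP(CLR_HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        with open(local_name, "wb") as out:
            ftp.retrbinary(f"RETR {remote_name}", out.write)
```

For instance, fetch(requested_name("catalog", "compress"), "catalog.Z", "you@example.org") would ask the server for a compressed copy of the ascii catalog, and fetch(info_path("More Info: info/0114."), "0114", "you@example.org") would retrieve the info file used as the example above. Member accounts would pass the assigned userid and password to login instead of "anonymous".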