This is a description of the corpora held by the Arbeitsbereich Linguistik, University of Münster. The data follow the survey questionnaire sent out by the Center for Electronic Texts in the Humanities. If you need further information, please contact:

Prof. Dr. Wolf Paprotté
University of Münster - Arbeitsbereich Linguistik
Hüfferstr. 27
W-4400 Münster
e-mail: paprott@dmswwu1a.bitnet

or, for technical details,

Lothar Lemnitzer (same address)
e-mail: lothar@morrison.uni-muenster.de


DATA STRUCTURE

Each entry below gives the following fields, one per line, in this order:

language
source
text type
size in running words
size in MB
representation format
availability
conversion tools used
current uses


THE GERMAN CORPUS

German
FAZ
newspaper texts
ca. 80 mill.
ca. 210
SGML, DTD
In general, NO. Access for specific tasks is possible.
Conversion from typesetting tape to SGML representation. Implemented in C and lex.
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

German
ZEIT
newspaper texts
ca. 14 mill.
ca. 90
SGML, according to own DTD
In general, NO. Access for specific tasks is possible.
Conversion from typesetting tape to SGML representation. Implemented in C and lex.
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

German
VDI-Nachrichten
newspaper texts
192 000
1.5
SGML, according to own DTD
YES
Conversion from typesetting tape to SGML representation. Implemented in C and lex. Incomplete.
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

German
Die Bibel - Elberfelder Ausgabe
prose
780 000
4.5
plain ASCII files
YES
none

German
Siemens
technical documentation
100 000
0.5
plain ASCII files
YES
none
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

Furthermore, we collect scientific works and papers. Up to now, we hold:

1 dissertation in medicine
2 M.A. theses
some linguistic articles


THE ENGLISH CORPUS

English
Oxford Text Archive
Brown Corpus, misc.
1 mill.
?
plain ASCII files
YES
none
none

English
Longman
Longman Lancaster corpus, misc.
ca. 10 mill.
?
plain ASCII files
YES
none
none

English
King James Bible
prose
832 000
4.5
plain ASCII files
YES
none
none

English
Dylan texts
prose, interviews
10 000
0.06
plain ASCII files
YES
none
none

English
Hansard corpus (English part)
discussion protocols, legal/political texts
?
?
plain ASCII files
YES
none
none

English
Usenet - discussion about Japanese politics
scientific prose
200 000
1.3
plain ASCII files
YES
none
none

English
Usenet - discussions - politics and recreation
prose
180 000
1.2
plain ASCII files
YES
none
none

English
Usenet - science fiction discussion
prose
4.15 mill.
26
plain ASCII files
YES
none
none

English
Usenet - discussions - telecommunications
prose
3.5 mill.
22.5
plain ASCII files
YES
none
none

English
ACL/DCI CD-ROM
newspaper texts, abstracts, dictionary
65 mill.
?
almost all texts in SGML
YES
none
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)


THE SPANISH CORPUS

Spanish
El Diario Vasco
newspaper texts
830 000
6.5
plain ASCII files
YES
none
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

Spanish
El País
newspaper texts
ca. 24 mill.
ca. 140
SGML-like
NO
Conversion from typesetting tape to SGML representation. Implemented in C and lex. Incomplete.
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

Spanish
ABC
newspaper texts
?
3
plain ASCII files
NO
none
none

Spanish
private
scientific prose
40 000
0.22
plain ASCII files
YES
none
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)

Spanish
private
fiction (contemporary)
750 000
5
plain ASCII files
YES
none
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)


THE FRENCH CORPUS

French
ECI
Hansard corpus (French part)
ca. 20 mill.
?
plain ASCII files
YES
none
statistical analyses, lexicographic tasks (lemma selection, extraction of example sentences)


THE ITALIAN CORPUS

Italian
Univ. of Pisa
newspaper texts, prose
370 000
1.4
plain ASCII files
YES
none
lexicographic tasks (lemma selection)

Material from two Italian newspapers (Il Messaggero, 24 Ore) is currently being processed and is not yet available.