Return-Path:
Received: from GLINDA.OZ.CS.CMU.EDU by A.GP.CS.CMU.EDU id aa12420;
          28 Mar 94 12:12:23 EST
Received: from sneetches.vantage.gte.com by GLINDA.OZ.CS.CMU.EDU id aa16470;
          28 Mar 94 12:11:40 EST
Received: from Seuss.Vantage.GTE.COM by sneetches.vantage.gte.com (4.1/SMI-4.1)
          id AA00938; Mon, 28 Mar 94 12:09:43 EST
Received: by Seuss.Vantage.GTE.COM (4.1/SMI-4.1)
          id AA04296; Mon, 28 Mar 94 12:11:59 EST
Date: Mon, 28 Mar 94 12:11:59 EST
From: "Sigurd P. Crossland"
Message-Id: <9403281711.AA04296@Seuss.Vantage.GTE.COM>
To: Mark.Kantrowitz@GLINDA.OZ.CS.CMU.EDU

From: Sig@Seuss.Vantage.GTE.COM
Subject: American language standardized dictionary for text compression
Newsgroups: comp.ai,comp.ai.nat-lang,comp.compression,comp.compression.research,sci.crypt
Organization: GTE

As an aid to those involved in natural language parsing, dictionary
compression, or textual encryption, I have been collecting and compiling
a lengthy list of words. It is expected that a comprehensive standardized
dictionary will eventually result. This dictionary should contain most
common American words, abbreviations, hyphenations, and even incorrect
spellings.

The word lists are compiled from a number of sources: commercial news
services, Usenet news postings, existing dictionaries, name lists,
company lists, UNIX man pages, Project Gutenberg's E-texts, the WordNet
project, received mailings, etc.

The texts are parsed and the words are sorted by length into files for
storage efficiency, and by ASCII collating sequence within each file for
retrieval performance. By definition, 'words' must begin and end with an
alphabetic character and may contain one of the characters "/-&+'248"
embedded within the string. The words are supposed to be normalized to
lower case except where an unusual capitalization occurs. There is a bug
in the parser at the moment which allows both upper-case and lower-case
variations of some words.
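The word rule and the per-length files described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the parser actually used to build the dictionary. In particular, it reads "may contain one of the characters" as "interior characters may be letters or any of / - & + ' 2 4 8"; the stricter reading (at most one such character per word) would need a different pattern. The names is_word, bucket_by_length, and bucket_filename are hypothetical, and case normalization is left out because the posting does not define "unusual capitalization".

```python
import re
from collections import defaultdict

# Assumption: any number of interior characters drawn from letters or the
# listed set / - & + ' 2 4 8. The first and last characters must be letters,
# so the shortest possible word has two characters.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z/\-&+'248]*[A-Za-z]")


def is_word(token):
    """Return True if the token satisfies the (assumed) word rule."""
    return WORD_RE.fullmatch(token) is not None


def bucket_by_length(tokens):
    """Group valid, de-duplicated words by length.

    Each bucket is sorted by code point, which for ASCII-only words is
    the ASCII collating sequence (upper case sorts before lower case).
    """
    buckets = defaultdict(set)
    for token in tokens:
        if is_word(token):
            buckets[len(token)].add(token)
    return {length: sorted(words) for length, words in sorted(buckets.items())}


def bucket_filename(length):
    """Name a bucket file following the lengthNN.txt pattern in the listing."""
    return f"length{length:02d}.txt"


if __name__ == "__main__":
    sample = ["at", "AT&T", "rock'n'roll", "-dash", "x", "B2B", "e-mail", "3com"]
    for length, words in bucket_by_length(sample).items():
        print(bucket_filename(length), words)
```

Tokens such as "-dash", "x", and "3com" are rejected because they do not begin and end with a letter or are too short. The parser bug mentioned above would show up here as both "Zoo" and "zoo" landing in the same bucket, since this sketch does no case folding.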

An anonymous FTP server has been set up on wocket.vantage.gte.com, which
contains the following files in the pub/standard_dictionary directory:

              words    bytes
-r--r--r--              4497 Feb 24 11:00 README
-r--r--r--           8552448 Jan 28 12:00 dic-0194.tar
-r--r--r--           4058075 Jan 28 12:02 dic-0194.tar.Z
-r--r--r--           8880128 Feb 24 10:39 dic-0294.tar
-r--r--r--           4220442 Feb 24 10:41 dic-0294.tar.Z
-r--r--r--           3285891 Feb 28 12:45 dic-0294.tar.gz
-r--r--r--          10403840 Mar 28 10:43 dic-0394.tar
-r--r--r--           4950681 Mar 28 10:45 dic-0394.tar.Z
-r--r--r--           3846113 Mar 28 11:18 dic-0394.tar.gz
-r--r--r--           3818781 Mar 28 11:05 dic-0394.zip
-r--r--r--           1269760 Aug 16  1993 dic-0893.tar
-r--r--r--            523393 Aug 16  1993 dic-0893.tar.Z
-r--r--r--            421239 Aug 16  1993 dic-0893.zip
-r--r--r--           3186688 Sep 17  1993 dic-0993.tar
-r--r--r--           1503561 Sep 17  1993 dic-0993.tar.Z
-r--r--r--           7479296 Oct 26 17:29 dic-1093.tar
-r--r--r--           3516519 Oct 26 17:32 dic-1093.tar.Z
-r--r--r--           8273920 Dec 17 11:58 dic-1293.tar
-r--r--r--           3918385 Dec 17 11:59 dic-1293.tar.Z
-r--r--r--     1067     4268 Mar 28 10:40 length02.txt
-r--r--r--    22790   113950 Mar 28 10:40 length03.txt
-r--r--r--    59156   354934 Mar 28 10:40 length04.txt
-r--r--r--    96155   673082 Mar 28 10:40 length05.txt
-r--r--r--   130085  1040743 Mar 28 10:40 length06.txt
-r--r--r--   141446  1273007 Mar 28 10:41 length07.txt
-r--r--r--   152579  1525780 Mar 28 10:41 length08.txt
-r--r--r--   110207  1212268 Mar 28 10:41 length09.txt
-r--r--r--    87648  1051762 Mar 28 10:41 length10.txt
-r--r--r--    65937   857170 Mar 28 10:41 length11.txt
-r--r--r--    47946   671243 Mar 28 10:41 length12.txt
-r--r--r--    32891   493352 Mar 28 10:41 length13.txt
-r--r--r--    21969   351504 Mar 28 10:41 length14.txt
-r--r--r--    14385   244545 Mar 28 10:41 length15.txt
-r--r--r--     9126   164268 Mar 28 10:41 length16.txt
-r--r--r--     5853   111207 Mar 28 10:41 length17.txt
-r--r--r--     3721    74420 Mar 28 10:41 length18.txt
-r--r--r--     2435    51135 Mar 28 10:41 length19.txt
-r--r--r--     1545    33990 Mar 28 10:41 length20.txt
-r--r--r--     1027    23621 Mar 28 10:41 length21.txt
-r--r--r--      690    16560 Mar 28 10:41 length22.txt
-r--r--r--      455    11375 Mar 28 10:41 length23.txt
-r--r--r--      292     7592 Mar 28 10:41 length24.txt
-r--r--r--      193     5211 Mar 28 10:41 length25.txt
-r--r--r--      121     3388 Mar 28 10:41 length26.txt
-r--r--r--       83     2407 Mar 28 10:41 length27.txt
-r--r--r--        1       30 Mar 28 10:41 length28.txt
-r--r--r--        0        0 Mar 28 10:41 length29.txt
-r--r--r--        0        0 Mar 28 10:41 length30.txt
-r--r--r--        0        0 Mar 28 10:41 length31.txt
-r--r--r--        1       34 Mar 28 10:41 length32.txt
            1009804 Total
-r--r--r--             11521 Aug 13  1993 tarread.com

The most recent compilation, dic-0394.tar, is composed of the 31 text
files and may be restored on an MS-DOS computer using the tarread.com
utility program.

Any words for inclusion in future dictionaries should be submitted
directly to my e-mail address or placed in the /pub/incoming directory.
Please compare your dictionaries with the standard Unix 'words' file and
submit only the differences.

Many thanks to those who have submitted the 140,000 words during the
last month.

Take care.
- Sig

Sigurd P. Crossland
Advanced Technology Lab          Telephone: (703) 818-8504
GTE                              Facsimile: (703) 802-3110
15000 Conference Center Drive    Internet:  sig@seuss.vantage.gte.com
Chantilly, VA 22021              Home:      (703) 818-8942