Newsgroups: sci.lang
Path: cantaloupe.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!news.tc.cornell.edu!news3.cac.psu.edu!howland.erols.net!netcom.com!wilbaden
From: wilbaden@netcom.com (W.Baden)
Subject: Re: 1000 commonest words
Message-ID: <wilbadenE3pC0A.8x5@netcom.com>
Organization: NETCOM On-line Communication Services (408 261-4700 guest)
X-Newsreader: TIN [version 1.2 PL1]
References: <851126512.348@dejanews.com> <NEWTNews.852005868.25395.eli_perl@dialup.netvision.net.il> <5ah000$mcf@lyra.csx.cam.ac.uk>
Date: Wed, 8 Jan 1997 18:14:34 GMT
Lines: 52
Sender: wilbaden@netcom16.netcom.com

: In article <NEWTNews.852005868.25395.eli_perl@dialup.netvision.net.il>,
:  <eli_perl@netvision.net.il> wrote:
: . . .
: >> Does anyone know where I can get a simple list of the 1,000 most 
: >> frequently used words in English or some Western European language?  I 
: >> think it would be quite helpful in the initial stages of learning a 
: >> language.  
: . . .
: >The most common (frequent) 1000 words are not necessarily the most useful
: >ones for learning a language. In order to get a list of the most frequent
: >words, one can scan a text with a computer program (I have constructed such
: >a program myself) which lists all the words appearing in the text along 
: >with the number of occurrences. However, in order for the results to have
: >any statistical significance beyond a very small number of basic words, 
: >one needs an enormous body of text to analyse. Furthermore, beyond a small 
: >number of basic words, the most common words would vary greatly depending
: >on the subject of the text.
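
The kind of counting program described in the quoted text can be
sketched in a few lines of Python.  This is only a minimal sketch,
not the quoted poster's program: the function name top_words, the
rule for what counts as a word (a run of letters, optionally with
internal apostrophes, folded to lower case), and the sample
sentence are all my own assumptions.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Assumption: a "word" is a run of letters, optionally with
    # internal apostrophes (so "don't" stays one word); case is folded.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words).most_common(n)


# Hypothetical sample text, far too small to be statistically meaningful.
sample = "The dog saw the cat. The cat saw the bird."
for word, count in top_words(sample, 3):
    print(f"{count:5d} {word}")
```

On a text this small only "the" stands out, which illustrates the
quoted point: past a handful of basic words, the ranking depends
heavily on how much text is scanned and what it is about.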


At vaxsar.vassar.edu, in directory pub/nlp, pub/nlp.dir, or something
like that, there's a file brown.top, which lists the 5000 (plus a
few) most frequent words in the untagged "Brown Corpus".

The Brown Corpus is described in Francis and Kucera, _Frequency
Analysis of English Usage_, 1982, ISBN 0-395-32250-2.

"The data base from which the frequency list and other material
in this book are derived is the tagged version of the Standard
Corpus of Present-Day American English, commonly known as the
Brown Corpus.  This corpus was compiled and prepared for
computer use at Brown University in 1963-1964, under the
direction of W. Nelson Francis, on a grant from the U.S. Office
of Education.

"The corpus consists of approximately 1,014,000 graphic words of
running text, all of which was first printed in the United
States in the year 1961."

(File brown.top has at least one error: the entry 507 WEPT
duplicates 507 WENT.)

In the same directory there are tagged and untagged lists
of the 8000 (plus a few) most frequent words from the
LOB Corpus.  The LOB Corpus was compiled in Britain at about the
same time (I think) as the Brown Corpus.

Grab all of these files and check them out.  They will make you
appreciate the first response.
--
Wil Baden Costa Mesa, California
"The dog does not thank the master for the bone."
