Newsgroups: alt.usage.english,sci.lang
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!newsflash.concordia.ca!news.nstn.ca!ott.istar!istar.net!van.istar!west.istar!n1van.istar!van-bc!unixg.ubc.ca!info.ucla.edu!agate!howland.erols.net!torn!sq!lee
From: lee@sq.com (Liam R. E. Quin)
Subject: Re: English word frequency
Message-ID: <1996Aug11.025246.7455@sq.com>
Organization: SoftQuad Inc., Toronto, Canada
References: <4uf2cv$716@yama.mcc.ac.uk> <4ugdnb$p6p@netnews.upenn.edu> <4uhjvq$4fh@news.inforamp.net>
Date: Sun, 11 Aug 1996 02:52:46 GMT
Lines: 89

Clive Young (clive.young@umist.ac.uk) wrote:
> I'm looking for frequency-based English vocabulary lists.
> What are the 1000/2000/3000 most commonly used English words?
> Are such lists available on the Web somewhere? 

hughett@galton.psycha.upenn.edu (Paul Hughett) wrote:
> I don't know of any such lists on-line but Nation (1990) cites several
> such lists in printed form and evaluates their relative utility.  He
> also describes some of the hazards of taking such lists too literally.
>   Nation, I. S. P., ``Teaching and Learning Vocabulary'',
>   Newbury House Publishers, 1990

You might also like to see
    Susan Armstrong (Ed.), ``Using Large Corpora'', MIT Press 1993
which discusses some of the issues relating to generating and using such
lists and other word-related information from sources of written or
transcribed texts.

Tom Collins  <tcollins@inforamp.net> wrote:
> For some years I have been using the Collins COBUILD series of 
> dictionaries, which give you information on word frequency. 
> These dictionaries are based on a 200 million word computer 
> database and focus on the most commonly-used words -- in all their 
> various forms -- which is why I find them so useful. The latest 
> version of the dictionary uses a 1-5 rating method to show 
> relative frequency. 

I don't want to say anything against COBUILD, which by all accounts is
excellent work, but I should say that if you're going to go as far as
the 3,000 most common words, you really do have to choose your domain
very carefully.  Decide whether you want `bland, unrecognisable English',
`Literary English', `Newspaper English', `Technical English' and so forth.
Decide whether you want American English reflected in your corpus, and if
so to what extent.  What about computer terms?

In the King James Bible, `shalt' is a fairly common word; although it was
probably common in 17th C. English generally, it isn't common today.

With a 200 million word corpus, you'll probably start to find that idioms
common in spoken English become statistically significant; compare `strong
man', which is a statistically significant collocation (t=2.0, but only 6
occurrences) in the 47 million word AP 1991 corpus [Armstrong, op. cit.,
Table 8 p. 19, and text on p. 18].
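
The t-score for a candidate collocation can be worked out directly: compare
the observed bigram count with the count you'd expect if the two words were
independent.  A sketch in awk, where the unigram counts are invented for
illustration (I don't have the AP figures in front of me), picked so the
bigram comes out near the quoted t=2.0:

```shell
# t-score for a candidate collocation: t = (O - E) / sqrt(O),
# where E = f1 * f2 / N is the count expected if the two words
# occurred independently.  All counts below are hypothetical.
awk 'BEGIN {
    N  = 47000000    # corpus size in words (AP 1991 scale)
    f1 = 7000        # invented count for "strong"
    f2 = 7390        # invented count for "man"
    O  = 6           # observed count for "strong man"
    E  = f1 * f2 / N
    printf "t = %.2f\n", (O - E) / sqrt(O)
}'
```

A t of around 2 is the usual rough threshold for treating a pair as a real
collocation rather than chance co-occurrence.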

The best thing to do may be to measure the word frequencies in the actual
data in which one is most interested.

Text manipulation programs such as WordCruncher and TACT, and programming
tools such as Unix shell, awk and perl are often used for this sort of work.
There's a simple shell script given by Doug McIlroy of AT&T Bell Labs that
looks something like (from memory)
    tr -cs '[a-zA-Z]' '\012' < input  | # words one per line
    tr '[A-Z]' '[a-z]' | # convert to lower case
    sort | # collate repetitions together
    uniq -c | # count multiple occurrences
    sort -nr | # sort by frequency
    sed 400q > output  # take the 400 most common

I'll leave the proper handling of words such as can't and o'clock to the
interested reader :-)
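
For what it's worth, one rough way to handle them is to add the apostrophe
to the character class the first tr keeps, so it counts as a word character
(this also keeps stray quotes at word boundaries, so it's a sketch, not a
solution):

```shell
# like the pipeline above, but keeping apostrophes inside words,
# so "can't" and "o'clock" survive as single tokens
printf "I can't say I can't can't\n" > input
tr -cs "A-Za-z'" '\n' < input |  # words one per line, ' kept
tr A-Z a-z |                     # fold to lower case
sort | uniq -c | sort -rn        # count, most frequent first
# `can't' comes out as a single word, counted 3 times
```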

This approach does not attempt any morphological analysis or stemming.
Another approach would be to use Porter's Algorithm (see any book on
information retrieval, e.g. Salton or Frakes et al.) to bring word forms
more or less together (but without any textual analysis or etymological
exactitude).
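
To give a flavour of it -- this is nothing like Porter's actual algorithm,
just a crude suffix-stripper showing where such a step would slot into the
pipeline, between the case-folding and the first sort:

```shell
# strip a few common suffixes so that e.g. jumped, jumping and
# jumps all collate as "jump" -- crude, and wrong for many words
# ("classes" becomes "classe"), but it shows the idea
printf 'jumping\njumped\njumps\n' |
sed -e 's/ing$//' -e 's/ed$//' -e 's/s$//' |
sort | uniq -c
```

Porter's real algorithm applies its rules in ordered steps with conditions
on what remains of the word, which is what keeps it from mangling short
words the way the three bare substitutions above will.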

Finally, note that word lists like this are generally not very useful for
spelling checkers (which is what many people want them for).  Study the
publicly available ispell program instead...  although short lists are
usually _much_ better for spelling checkers than long ones!  E.g. the
Shorter Oxford Wordlist (don't ask me for it, do a web search, it's out
there) contains every word used by Milton... including the ones he used by
mistake, as misspellings... :-)

Commercial spelling checkers usually do have fairly large vocabularies --
typically in excess of 10,000 words -- but a lot of that is proper nouns,
such as American cities, names of competing firms -- e.g. `Interleaf' :-) --
common first names, and often a great quantity of domain-specific terms,
such as names of illnesses (Influenza, Percival, Priscilla), computing
terms, and animals.  Such as Fruitbats and Meerkats these days :-)

Lee

-- 
Liam Quin, SoftQuad Inc    | lq-text freely available Unix text retrieval
lee@sq.com +1 416 239 4801 | FAQs: Metafont fonts, OPEN LOOK UI, OpenWindows
SGML: http://www.sq.com/   |`Consider yourself... one of the family...
The barefoot programmer    | consider yourself... At Home!' [the Artful Dodger]
