Newsgroups: comp.ai.nat-lang
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!nntp.sei.cmu.edu!news.cis.ohio-state.edu!math.ohio-state.edu!howland.erols.net!news.mathworks.com!rill.news.pipex.net!pipex!uunet!in3.uu.net!142.77.1.4!news.uunet.ca!torfree!af137
From: af137@torfree.net (Al Aab)
Subject: word frequencies in French, Spanish, German
Message-ID: <E66Mz3.F41.0.queen@torfree.net>
Organization: Toronto Free-Net
X-Newsreader: TIN [version 1.2 PL2]
Date: Tue, 25 Feb 1997 23:37:50 GMT
Lines: 53

[ Article crossposted from alt.comp.editors.batch ]
[ Author was Al Aab ]
[ Posted on Mon, 24 Feb 1997 23:10:15 GMT ]

[ Article crossposted from comp.ai.nat-lang,sci.lang ]
[ Author was Liam R. E. Quin ]
[ Posted on Sun, 23 Feb 1997 01:10:28 GMT ]

Jacques Guy  <j.guy@trl.telstra.com.au> wrote:
>Jennifer Hodgdon wrote:
> 
>> I am working for a small company that has a need for word lists
>> with information on how frequent the words are in French, Spanish,
>> and German
>
>This sort of question occurs time and again. The answer is simple,
>and valid for all languages.

This depends very heavily on the sample and its size.  For example,
in the 5 megabytes or so of the King James Bible, words like
"shall", "god" and "lord" occur fairly frequently (I think "shall" is
in the top ten, from memory).
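
To check a claim like this against a particular corpus, a minimal
sketch in Python might look like the following.  The function names
(word_frequencies, top_words) are made up for illustration, and the
tokenizer is a deliberately crude assumption: it lowercases the text
and treats any run of letters as a word.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word occurs in text (case-insensitive)."""
    # Assumption: a "word" is any unbroken run of Unicode letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

def top_words(text, n=10):
    """Return the n most frequent words with their counts."""
    return word_frequencies(text).most_common(n)
```

Running top_words over two different samples (say, a Bible text and
a newspaper archive) is a quick way to see how much the ranking
shifts with the corpus.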

Even with a large sample, the particular words that are the most
frequent are no doubt broadly similar across languages, but
you must expect differences.  For example, "a", "and" and "the"
would not be individually so common in languages that spell such
words differently depending on the noun they accompany (e.g. le/la
in French).
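
As a rough illustration of that splitting effect, the sketch below
counts several French definite-article forms separately and then
together.  The set ARTICLE_FORMS and the function names are
assumptions made up for this example; the set is not exhaustive, and
the count ignores the fact that "le", "la" and "les" can also be
pronouns.

```python
import re
from collections import Counter

# Hypothetical, non-exhaustive grouping of French definite-article forms.
# "l" stands for the elided form l' once the apostrophe is split off.
ARTICLE_FORMS = {"le", "la", "les", "l"}

def tokenize(text):
    # Any run of letters is a token, so "l'ami" yields "l" and "ami".
    return re.findall(r"[^\W\d_]+", text.lower())

def article_counts(text):
    """Return (count per article form, combined count of all forms)."""
    counts = Counter(tokenize(text))
    per_form = {form: counts[form] for form in sorted(ARTICLE_FORMS)}
    return per_form, sum(per_form.values())
```

Each form on its own can rank well below English "the", even when
the combined count is comparable.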

If this is for compression, adaptive techniques such as those
of Lempel & Ziv are often more effective anyway.
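
A small sketch of the adaptive approach, using Python's standard
zlib module as a stand-in: its DEFLATE format combines LZ77-style
matching with Huffman coding, so it belongs to the Lempel-Ziv family
but is not the only member.  The function name compression_ratio is
made up for this example.

```python
import zlib

def compression_ratio(text):
    """Return compressed size divided by original size for text."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress
    packed = zlib.compress(raw, 9)  # 9 = highest compression level
    # Sanity check: decompression must give back the original bytes.
    assert zlib.decompress(packed) == raw
    return len(packed) / len(raw)
```

No word list is supplied in advance; the compressor learns the
repeated strings from the input itself, which is why it works the
same way on French, Spanish or German text.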

However, the most common 10% of words represent anywhere from 25% to
90% of text, depending on the complexity of the text.  The distribution
is highly skewed: frequency falls off roughly as a power law of rank,
as Zipf showed in the 1930s and Mandelbrot later refined.
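
The coverage figure can be measured directly for any token list.
The sketch below is one way to do it; the name coverage_of_top and
the choice to round the number of word types up (so that at least
one type is always counted) are assumptions of this example.

```python
import math
from collections import Counter

def coverage_of_top(words, fraction=0.10):
    """Share of all tokens accounted for by the most common word types.

    words is an iterable of tokens; fraction is the share of distinct
    word types to take from the top of the frequency ranking.
    """
    counts = Counter(words)
    if not counts:
        return 0.0
    # Number of distinct types to keep, rounded up, never fewer than one.
    k = max(1, math.ceil(len(counts) * fraction))
    covered = sum(count for _, count in counts.most_common(k))
    return covered / sum(counts.values())
```

Fed with the tokens of a real corpus, this gives the percentage for
that corpus rather than a general rule.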

Lee

-- 
Liam Quin, lee@sq.com         | lq-text freely available Unix text retrieval
Senior Technical Consultant   | FAQs: Metafont fonts, OPEN LOOK UI, OpenWindows
SoftQuad Inc. +1 416 544-9000 | xfonttool (Unix xfontsel in XView)
http://www.softquad.com/      | the barefoot programmer
-- 
=-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-
al aab, seders moderator                                      sed u soon 
               it is not zat we do not see the  s o l u t i o n          
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-+
