Newsgroups: comp.lang.basic.visual,comp.lang.c++,comp.lang.c,comp.lang.pascal,comp.lang.tcl,comp.lang.ada,comp.lang.smalltalk,comp.lang.perl,comp.lang.asm.x86,comp.lang.fortran,comp.lang.postscript,comp.lang.java,comp.lang.clipper,comp.lang.forth,comp.lang.cobol,comp.lang.rexx,comp.lang.eiffel,comp.lang.python,comp.lang.lisp,comp.lang.scheme
Path: cantaloupe.srv.cs.cmu.edu!rochester!udel!news.mathworks.com!newsfeed.internetmci.com!in2.uu.net!brighton.openmarket.com!decwrl!waikato!comp.vuw.ac.nz!actrix.gen.nz!dkenny
From: dkenny@atlantis.actrix.gen.nz (Des Kenny)
Subject: Re: Language "ranking" based on posts to users groups
Keywords: raw data, statistics, meaning, illusion, confusion
Distribution: inet
Message-ID: <DJvMs5.Ht1@actrix.gen.nz>
Sender: Des Kenny
Summary: What does this really mean?
Organization: Actrix - Internet Services
Date: Wed, 20 Dec 1995 08:40:53 GMT
References: <4at4t0$j5j@garden.csc.calpoly.edu>
X-Nntp-Posting-Host: atlantis.actrix.gen.nz
Lines: 161
Xref: glinda.oz.cs.cmu.edu comp.lang.c++:165853 comp.lang.c:167972 comp.lang.tcl:40078 comp.lang.ada:38905 comp.lang.smalltalk:32441 comp.lang.asm.x86:15085 comp.lang.fortran:35963 comp.lang.postscript:38151 comp.lang.java:11394 comp.lang.clipper:9046 comp.lang.forth:25367 comp.lang.cobol:7003 comp.lang.rexx:13669 comp.lang.eiffel:12196 comp.lang.python:7530 comp.lang.lisp:20257 comp.lang.scheme:14617

In article <4at4t0$j5j@garden.csc.calpoly.edu>,
Dan Stubbs <dstubbs@garden.csc.calpoly.edu> wrote:
> The following table shows the number of posts to the "top 20" groups in
> comp.lang.*. Note that these numbers reflect adding up the posts for
> the various pascal and basic.visual groups. The java group has been moving
> up the list rather rapidly, and is undoubtedly in 11th place by now.
> 
>      1   73,453   basic.visual
>      2   71,298   c++
>      3   55,148   c
>      4   43,430   pascal
>      5   19,758   tcl
>    ----------------------------
>      6   16,033   ada
>      7   15,419   perl
>      8   15,132   smalltalk
>      9   14,390   asm.x86
>     10   13,039   fortran
>    ----------------------------
>     11   10,438   postscript
>     12   10,388   java
>     13    8,758   clipper
>     14    7,310   forth
>     15    6,832   cobol
>    ----------------------------
>     16    6,012   rexx
>     17    5,412   eiffel
>     18    5,380   python
>     19    5,212   lisp
>     20    4,262   scheme
>    ----------------------------
> 

  You have clearly put a lot of work into producing this raw data.

  We should all thank you for your efforts.

  Now do you think you are brave enough to tell us what this all means?

  This is very very raw data. 

  If you can not supply any meaning or interpretation of the data, then it 
  is practically useless to anyone.

  If you do not classify the contents of the posts in more detail, the data 
  has no great value.

  I already suggested a very simple classification scheme:-
  There are three classes of post {capability, help, noise}.

  You can probably go into more detailed classification if you think
  the data warrants more detailed classification.

  Or invent a better classification scheme of your own to break the content
  of the posts down into meaningful classes.
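  To make the suggestion concrete, here is a minimal sketch of the
  {capability, help, noise} scheme in Python. The keyword lists are
  invented for illustration; real newsgroup posts would need far
  better rules (or a human reader) to classify reliably.

```python
# Hypothetical keyword lists -- assumptions, not derived from real posts.
CLASSES = {
    "capability": ["can it", "does it support", "feature", "benchmark"],
    "help": ["how do i", "error", "help", "doesn't work"],
}

def classify(post_text):
    """Return 'capability', 'help', or 'noise' for one post."""
    text = post_text.lower()
    for label, keywords in CLASSES.items():
        # First matching class wins; order of checks is a design choice.
        if any(kw in text for kw in keywords):
            return label
    return "noise"  # anything unmatched counts as noise
```

  Anything the keyword lists miss falls through to "noise", which is
  exactly the weakness of such a crude scheme.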

  If data is not interpreted and used to refute a hypothesis, it is of 
  no real value; it simply gives the illusion of informing people of 
  something that may be of use in their decision making processes.

  Such illusions lead to confusion and poor decision making.

  If the data is no use in decision making it is no use at all!
  It just becomes noise to be filtered out.

  Unless you consider that this data belongs to the entertainment category
  of information. There is always a market for entertainment!

  You may not have realised when you started this counting exercise that
  you have opened a very old and smelly can of worms!

  The prospect of scanning thousands of fragments of natural language text 
  to classify their meaning to human beings takes you into the mysterious 
  and forbidding realms of "Artificial Intelligence",  and you will notice 
  that the typical languages used for natural language interpretation come
  right near the bottom of your list!   Languages such as Lisp and its 
  variants have been used to attempt to wring some meaning out of text 
  fragments, with varying degrees of success. Human language is still too 
  weird for the average computer architecture and a whole battery of AI 
  languages. You have not even mentioned declarative AI languages such as 
  Prolog. Nor have you mentioned functional languages such as Miranda.

  Even artificial neural networks struggle to discover what
  we humans call 'meaning' on anything but very small domains. 

  So by opening this can of worms, to mix metaphors, you are on the horns 
  of a dilemma!

  You can supply masses of raw data to people using some counting
  algorithms, but the probability of your discovering any meaning 
  in this data is vanishingly small! 
  
  I have just realised that I may classify this data in the amusement category.
  
  You have provided us with some good amusement for the holiday season.

  Thank you!


  If you can seriously solve this problem you should probably get a dozen
  PhDs. So I hope you succeed; we will all benefit from your discoveries.

  Some AI people have been trying to extract meaning out of raw text for 
  twenty years or more. Philosophers have been trying for thousands of 
  years, but that is another story!   

  
   I have seen one pattern in your data that may amuse you.

   Let me propose a hypothesis:
-------------------------------------------------------------------------
   The intelligence of the language is inversely proportional to the 
   number of posts made for that language.
-------------------------------------------------------------------------

   Now there is a hypothesis/theory that you or someone else might like 
   to refute.

   You might like to think about a corollary to this hypothesis:
   hint: it involves the words 'intelligence', 'computer language(s)' and 
   'people'.
 
   Have fun! 
----------------------------------------------------------------------------

  PS. On a more sombre note, you could attempt some sort of statistical 
  sampling techniques on the data and use the humble human eyes and 
  brains as prime (primal) technologies.

  You will need some sampling criteria to stratify the data.
  You are going to need to do a lot of experiments until you discover
  a reasonably repeatable way of selecting a representative sample of 
  your data.

  Then you might try to read individual posts and classify them according
  to the {capability, help, noise} classification, or a better one.
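  A rough sketch of what the stratified sampling step might look like,
  assuming each group's posts are available as a list of message texts
  (the function name and parameters here are my own invention). Fixing
  the random seed gives you the repeatable selection mentioned above.

```python
import random

def stratified_sample(posts_by_group, per_group=50, seed=1):
    """Draw the same number of posts from every group (stratum)."""
    rng = random.Random(seed)  # fixed seed => repeatable selection
    sample = {}
    for group, posts in posts_by_group.items():
        # A group with fewer posts than per_group is taken whole.
        k = min(per_group, len(posts))
        sample[group] = rng.sample(posts, k)
    return sample
```

  Each stratum (newsgroup) contributes equally, so the small groups at
  the bottom of the table are not swamped by basic.visual and c++.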

  It may be possible to use some AI techniques to do some filtering.

  There are lots of texts on statistical sampling theory, and probably
  numerous maths graduates could help you.

  If you take this seriously you may discover some very interesting
  things along the journey.

  Happy holidays


  My Best Regards

  Des Kenny
  Director
  Objective Methods Ltd
  PO Box 17356
  Wellington
  New Zealand
  Phone: 64 4 21 610 220
  Fax:   64 4 476 9237
  Email: dkenny@swell.actrix.gen.nz
   
  
