Newsgroups: sci.lang
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!solaris.cc.vt.edu!insosf1.infonet.net!internet.spss.com!markrose
From: markrose@spss.com (Mark Rosenfelder)
Subject: Re: Optimal Artificial Languages   
Message-ID: <D3L9vM.A0r@spss.com>
Sender: news@spss.com
Organization: SPSS Inc
References: <9502011612.A25302@pwinet.upj.com>
Date: Mon, 6 Feb 1995 17:28:33 GMT
Lines: 29

In article <9502011612.A25302@pwinet.upj.com>,
GHSTEELE <ghsteele@pwinet.upj.com> wrote:
> markrose@spss.com (Mark Rosenfelder)                                     
[MR] Redundancy *is* a form of error correction.  The information content of
[MR] English text has been estimated at 1 bit per word, which makes it possible 
[MR] to understand evn f th sgnl s sgnfcnty dgradd. 
                                                                              
[TM] What is your definition here of 'word' and 'bit'?

[MR] Bit: same as everyone else's; what's the problem? 
[MR] Word: Probably Shannon took words as the things divided by white space.

>Not to quibble too much, but not "everyone else's."  As a computer scientist 
>not a linguist my definition of bit is a binary digit (either a 1 or a 0) [...]

>I infer from the previous discussion that the linguist's definition cannot 
>match mine.  It would be not be meaningful that each English word carried at 
>most the information "yes" or "no."  My curiosity is piqued.  Just what _is_ 
>the linguist's definition of "bit" and "word".

"One bit per word" was a typo; I meant "one bit per letter", and I didn't
even notice the mistake when Tim asked for clarification.  Sorry for the
confusion.

The reference is to Claude Shannon's experiments, where subjects were given
a text a letter at a time, and asked to guess the next letter.  Very often
the next letter is certain (and so adds no new information); other times 
there are a number of alternatives.  The average turned out to be around
1 bit of new information per letter.  
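The flavor of the estimate can be sketched in a few lines of Python.  This
isn't Shannon's actual procedure (he used human guessers, not a statistical
model), and the sample text below is just a stand-in: train a crude bigram
model on some text, then average -log2 of the probability it assigns to each
next letter.  A model this crude, scored on its own training text, will give
a number that shouldn't be taken seriously -- the point is only the shape of
the calculation.

```python
import math
from collections import Counter, defaultdict

def bits_per_letter(text):
    # Count, for each character, which characters follow it (bigram model).
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    # Average surprise: -log2 of the model's probability for each next letter.
    # A letter the model finds certain contributes ~0 bits, matching the
    # "adds no new information" case in the text.
    total_bits = 0.0
    for prev, nxt in zip(text, text[1:]):
        followers = counts[prev]
        p = followers[nxt] / sum(followers.values())
        total_bits += -math.log2(p)
    return total_bits / (len(text) - 1)

# Hypothetical sample text; any English string would do.
sample = "the quick brown fox jumps over the lazy dog " * 50
print(round(bits_per_letter(sample), 2))
```

For comparison, if the 26 letters were equally likely and independent, each
would carry log2(26), about 4.7 bits; the gap between that and ~1 bit per
letter is the redundancy that lets you read the degraded text quoted above.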
