Newsgroups: alt.usage.english,sci.lang
Path: cantaloupe.srv.cs.cmu.edu!nntp.club.cc.cmu.edu!goldenapple.srv.cs.cmu.edu!rochester!cornellcs!newsstand.cit.cornell.edu!portc01.blue.aol.com!newsxfer3.itd.umich.edu!cpk-news-hub1.bbnplanet.com!news.bbnplanet.com!rill.news.pipex.net!pipex!blackbush.xlink.net!news-ge.switch.ch!swidir.switch.ch!CERN.ch!hpplus04.cern.ch!flavell
From: "Alan J. Flavell" <flavell@mail.cern.ch>
Subject: Re: Character sets (was: A.D.)
In-Reply-To: <ctQFqCj030n@sktb.demon.co.uk>
X-Sender: flavell@hpplus04.cern.ch
X-Nntp-Posting-Host: hpplus04.cern.ch
Content-Type: TEXT/PLAIN; charset=US-ASCII
Message-ID: <Pine.HPP.3.95a.970401112543.4123A-100000@hpplus04.cern.ch>
Sender: news@news.cern.ch (USENET News System)
Organization: speaking for myself and not for CERN
Comment: I hate unsolicited commercial email - boycott companies that use it - and reserve the right to bill for use of resources.
References: <E7s2K4.8q2@acli.interlog.com> <cTMbbaj030n@sktb.demon.co.uk> <5hk32l$glq@thoth.portal.ca> <cTrh1Wj030n@sktb.demon.co.uk> <5hkrsn$2ps@thoth.portal.ca> <ct9nUvj030n@sktb.demon.co.uk> <Pine.A41.3.95a.970331153847.25756A-100000@sp065> <ctQFqCj030n@sktb.demon.co.uk>
Mime-Version: 1.0
Date: Tue, 1 Apr 1997 10:21:33 GMT
Lines: 122

On Mon, 31 Mar 1997, Paul L. Allen wrote:

> > > In fact, on web pages it is entirely valid to enter top-bit-set characters
> > > directly as HTTP is 8-bit clean.  
> > 
> > Well, that statement is accurate on one of two conditions:
> 
> No, that statement is unconditionally accurate.

I'm sorry: when you said "to enter top-bit-set characters" I thought you
were talking about entering them from the platform's native keyboard,
using the platform's native editing tools etc. 

> > - either: the platform uses a code that contains at least iso-8859-1
> > (that could be e.g. unix, MS-Windows, etc., in a Latin-1 locale at least)
> 
> That is a requirement for any user agent which conforms to RFC 1866 - HTML
> Levels 1 and 2.

I'm sorry, I thought you were talking about a provider who was entering
WWW material for serving out via a server.  Well, it seems to me that
you have done no more than restate what I said, in different words,
although you have now shifted to considering the client platform whereas
I was considering the provider's platform - but the same principles
apply to both.

> > - or: the platform uses a code that is different but fully compatible
> > with iso-8859-1 (e.g. DOS CP850, EBCDIC CECP1047)
> 
> It is entirely up to the browser *how* it gets the printable characters of
> ISO 8859/1. 

No disagreement there either.  And it's entirely up to the server
*how* it gets the iso-8859-1 codes onto the 'net.

>  it *is* entirely
> valid to enter top-bit-set characters directly into web pages

But only if the entry platform has the appropriate arrangements for
serving out the characters of iso-8859-1.  In other words, I was trying
to restate the practical conditions for a server to be compliant with
RFC1866.  I'm not sure where you think we are in disagreement, because
(at the points where we are addressing the same topic) I believe we're
actually saying the same thing.

> A browser which does not display web pages using the ISO 8859/1 character
> set (whether by using an ISO 8859/1 font or remapping) is *broken* and
> unusable.  It makes no difference if you use &Ntilde; &#209; or type in
> an N-tilde (from ISO 8859/1) directly - a browser must display all three
> as N-tilde.

But if your platform is using CP850 or EBCDIC CECP 1047, then the last
thing that you should do is to "type in an N-tilde from iso-8859-1
directly".  You would type it in using your platform's own native code,
and your server would map it into iso-8859-1 when sending it to the
'net.
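
To make that concrete, here is a minimal Python sketch.  It is only an
illustration, not taken from any real server: the helper name to_wire
is made up, and Python's built-in "cp850" codec is assumed to stand for
the platform's native code.  It shows that the byte stored on the
platform differs from the byte sent to the 'net, while all three ways
of writing N-tilde arrive as the same character.

```python
import html


def to_wire(native_bytes: bytes, native_codec: str) -> bytes:
    """Hypothetical helper: re-code platform-native text as iso-8859-1."""
    return native_bytes.decode(native_codec).encode("iso-8859-1")


# N-tilde as stored on a CP850 platform is byte 0xA5 ...
native_n_tilde = "\u00d1".encode("cp850")

# ... but on the 'net it must be the iso-8859-1 code 0xD1 (decimal 209).
wire_n_tilde = to_wire(native_n_tilde, "cp850")

# The three spellings that a browser must treat alike.
from_entity = html.unescape("&Ntilde;")
from_numeric = html.unescape("&#209;")
from_raw_byte = wire_n_tilde.decode("iso-8859-1")

# Entering the iso-8859-1 byte 0xD1 "directly" on a CP850 platform would
# show up there as some other character, not as N-tilde.
misread = b"\xd1".decode("cp850")
```

The point of the last line is that the native editor and the 'net
disagree about what 0xD1 means, so the mapping step cannot be skipped.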

It's quite impractical, for example, to work in iso-8859-1 coding on a
VM/CMS (EBCDIC) platform, and quite unnecessary to attempt it, since
VM/CMS HTTPDs are designed to convert the native code to iso-8859-1 for
sending to the 'net, and VM/CMS browsers are designed to convert
iso-8859-1 from the 'net to EBCDIC for display on that platform's
screens. 
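
A rough Python sketch of that round trip follows.  The function names
are invented for illustration, and "cp500" (one of the EBCDIC Latin-1
codecs that Python ships) is used as a stand-in for the VM/CMS native
code; that is an assumption, not the CECP 1047 table itself.

```python
NATIVE = "cp500"       # assumption: stand-in EBCDIC Latin-1 codec
WIRE = "iso-8859-1"    # the coding expected on the 'net


def httpd_send(native_document: bytes) -> bytes:
    """Hypothetical server side: native EBCDIC in, iso-8859-1 out."""
    return native_document.decode(NATIVE).encode(WIRE)


def browser_receive(wire_document: bytes) -> bytes:
    """Hypothetical client side: iso-8859-1 in, native EBCDIC out."""
    return wire_document.decode(WIRE).encode(NATIVE)


# A document as its author would store it on the EBCDIC platform.
stored = "Espa\u00f1a, \u00d1and\u00fa".encode(NATIVE)

on_the_net = httpd_send(stored)            # what travels over HTTP
displayed = browser_receive(on_the_net)    # what an EBCDIC client shows
```

Neither end ever has to handle iso-8859-1 as its working code; it
exists only in transit.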

> > _AND_ the documents are served out by a server that understands how to map
> > the platform's own code into iso-8859-1 for transfer over the 'net.
> 
> This requirement is an invention of yours.

Tell that to Rick Troth.

>  The server does no re-mapping
> whatsoever 

I think it would be helpful if you would take a look at some successful
VM/CMS-based WWW servers.  They certainly don't expect their documents
to be supplied in iso-8859-1 coding, and they certainly do send
iso-8859-1 coding to the 'net.

> > The Mac can be used for this purpose by using a modified version
> > of the Mac native code (a modification that's also used for Mac <-> 'net
> > code mapping in fine Internet programs like Fetch).  Fourteen characters
> > of the Mac native code are swapped in favour of fourteen that are needed
> > to make up the iso-8859-1 repertoire.
> 
> That is entirely up to the authors of browsers for the Mac. 

Of course.

>  Of course, if you use a text editor on a Mac to write
> web pages then you're going to have problems with the character set.

Yes, that was my point.  But these "problems" are understood, and
solutions are already in place.

> BTW, if you can mail me details of the Mac native encoding I'd appreciate
> it.

Might I suggest starting here:
http://ppewww.ph.gla.ac.uk/~flavell/iso8859/iso8859-mac.html

and following the link to Andre' Pirard's article.  The Mac <->
iso-8859-1 mappings that are tabulated there (from 1992) have become
a de-facto Internet standard, incorporated into many fine Internet
apps, such as Fetch, several newsreaders, NCSA Mac Mosaic, MacWeb,
etc.
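
As a quick way to see why the unmodified Mac code falls short, the
Python sketch below lists the iso-8859-1 characters from 0xA0 upward
that a given codec cannot encode.  Note the assumptions: "mac_roman" is
Apple's standard table as shipped with Python, not the modified mapping
tabulated in the article, and the helper name is made up.

```python
def missing_from(codec: str) -> list:
    """Hypothetical helper: iso-8859-1 chars 0xA0-0xFF absent from codec."""
    missing = []
    for code in range(0xA0, 0x100):
        char = bytes([code]).decode("iso-8859-1")
        try:
            char.encode(codec)
        except UnicodeEncodeError:
            # The codec has no position for this character.
            missing.append(char)
    return missing


not_in_mac = missing_from("mac_roman")

# Print the iso-8859-1 code positions that have no Mac equivalent.
print(" ".join(f"{ord(c):#04x}" for c in not_in_mac))
```

The positions it reports are the ones for which a modified mapping has
to find room by giving up some Mac-only characters.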

A much stronger "proof by example" is of course EBCDIC-based WWW
software, as I said above.  I'm merely stating the obvious (and
reinforcing what you yourself said) when I say that the HTML and HTTP
standards relate _ONLY_ to what is exchanged over the 'net.  They
prescribe nothing about how it shall be handled on the actual platforms
at each end, other than requiring that those platforms have at least
_some_ way of representing every displayable character of the iso-8859-1
repertoire.

I'm sorry, I hadn't intended to post a vast response, particularly as I
think we're both saying the same thing, albeit expressed in different
ways - but I did want to clear up what seem to be a couple of
misunderstandings. 


