Newsgroups: alt.usage.english,sci.lang
Path: cantaloupe.srv.cs.cmu.edu!nntp.club.cc.cmu.edu!goldenapple.srv.cs.cmu.edu!das-news2.harvard.edu!news.dfci.harvard.edu!camelot.ccs.neu.edu!news.mathworks.com!EU.net!CERN.ch!hpplus06.cern.ch!flavell
From: "Alan J. Flavell" <flavell@mail.cern.ch>
Subject: Re: Character sets (was: A.D.)
In-Reply-To: <cUa1Qlj030n@sktb.demon.co.uk>
X-Sender: flavell@hpplus06.cern.ch
X-Nntp-Posting-Host: hpplus06.cern.ch
Content-Type: TEXT/PLAIN; charset=US-ASCII
Message-ID: <Pine.HPP.3.95a.970402102358.17875B-100000@hpplus06.cern.ch>
Sender: news@news.cern.ch (USENET News System)
Organization: speaking for myself and not for CERN
Comment: I hate unsolicited commercial email - boycott companies that use it - and reserve the right to bill for use of resources.
References: <5hk32l$glq@thoth.portal.ca> <cTrh1Wj030n@sktb.demon.co.uk> <5hkrsn$2ps@thoth.portal.ca> <ct9nUvj030n@sktb.demon.co.uk> <Pine.A41.3.95a.970331153847.25756A-100000@sp065> <Pine.HPP.3.95a.970401112543.4123A-100000@hpplus04.cern.ch> <cUa1Qlj030n@sktb.demon.co.uk>
Mime-Version: 1.0
Date: Wed, 2 Apr 1997 09:04:25 GMT
Lines: 85

On Tue, 1 Apr 1997, Paul L. Allen wrote:

>  Obviously if you're not using ISO 8859/1
> in your editor there are problems with entering characters directly - it's
> still not invalid though, just a pain.

Well, from having worked with such platforms I think I can say that it
is second nature for text files (and HTML files _are_ text files) to
get character-code-mapped between their storage encoding and their net
encoding.  The Internet software used on such platforms (VM/CMS,
Mac) usually takes this for granted when dealing with text files; the
exception would be DOS and its wretched Code Pages, but that's normally
considered a weakness in the early design of DOS FTP software, rather
than a deliberate design intention.  You'll see that MS Kermit
understands the problem, and has settings to transfer text files
"correctly" (in my sense).

It was a problem at first that there was no agreed EBCDIC counterpart
of iso-8859-1, and no agreed Mac counterpart either; but both of
those issues were resolved some years back, as documented in Pirard's
note.

>  Or not depending if FTP upload to
> the server preserves code position or does a translation.

FTP of text files on those platforms (Mac and IBM mainframe,
respectively) _does_ involve a character code translation.  It would
actually be perverse (take it from me, I've worked on such platforms!)
to attempt to work on such a platform in a non-native character code
(i.e. iso-8859-1 in this context).  Certainly you need some way to
represent the iso-8859-1 _repertoire_, but you don't have to use the
actual code values of iso-8859-1.
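To make the repertoire-versus-code-values point concrete, here's a small sketch in Python.  I'm using cp500 (an EBCDIC code page that happens to be in Python's standard codec set) purely as a stand-in for whatever EBCDIC page a given mainframe actually uses - the point is only that the same character carries a different byte value in each code.

```python
# One character from the iso-8859-1 repertoire, viewed in two codes.
# cp500 here is an illustrative EBCDIC code page, not necessarily the
# one any particular site uses.
ch = "é"
assert ch.encode("latin-1") == b"\xe9"            # its iso-8859-1 code value
assert ch.encode("cp500") != b"\xe9"              # EBCDIC stores a different byte...
assert ch.encode("cp500").decode("cp500") == ch   # ...for the very same character
```

Same repertoire, different code values - which is all that's needed, so long as the translation happens somewhere.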

If you're only looking at the mainframe as a data repository, for
arbitrary data files that you aren't proposing to work on, process, view
etc. on the mainframe, then your agenda could well be feasible, I don't
dispute that.  But for an actual user of a mainframe (or a Mac, possibly
even MS-DOS) I'd generally recommend working in their own native code,
and resolving the discrepancies precisely at the interface to the 'net. 

> I really hope not.  Truly.  I would hope that the conversion would be
> done at the FTP stage, where the protocols already allow for EBCDIC
> conversion, rather than putting the burden on the server.

Some misunderstanding here - conversion _is_ done at the FTP stage: if
you upload a text file onto the mainframe it would be normal to convert
it into EBCDIC for storage!  The server would convert it again to
iso-8859-1 for serving it out.

And correspondingly for text files transferred to a Mac (take a 
look at what Fetch does, for example).

This byte-to-byte mapping is entirely trivial, I assure you.  The
resources used are very small, in comparison to the work needed to
send the data out over TCP/IP.
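Just how trivial can be seen from a sketch of the two conversions, again in Python with cp500 standing in for the site's actual EBCDIC code page: the whole translation is a single 256-entry table lookup per byte, built once.

```python
# Build the two 256-entry translation tables once; thereafter each
# conversion is one table lookup per byte.
to_ebcdic = bytes(range(256)).decode("latin-1").encode("cp500")
to_latin1 = bytes(range(256)).decode("cp500").encode("latin-1")

wire = "<p>façade</p>\n".encode("latin-1")  # text file as it travels the net
stored = wire.translate(to_ebcdic)          # FTP upload: stored as EBCDIC
served = stored.translate(to_latin1)        # server: translated back for the net
assert served == wire                       # the round trip is lossless
```

Negligible work next to what the TCP/IP stack does for the same bytes.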

>  In fact if
> you're using EBCDIC then you cannot type in some ASCII characters directly
> either.

There are many EBCDIC "code pages" in existence, unfortunately -
CECP1047 (see Pirard's writeup) has proper assignments for the entire
iso-8859-1 repertoire.  A fortiori it certainly does include all the
"ASCII" (in the sense of US-ASCII, 7-bit) repertoire.

The Mac is a harder nut to crack, but the replacement of the 14
problematical characters in its native code (those not needed for
the 8859-1 repertoire) by the missing 8859-1 characters has
been documented since 1992, as I showed you, and is in widespread use
on the 'net.
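The Mac situation can be sketched the same way, using Python's "mac_roman" codec as an approximation of the Mac's native storage code (it won't reflect the documented 14-character substitution, but it shows why translation at the net interface is needed at all):

```python
# The same 8859-1 text carries different byte values in the Mac's
# native code than on the net; translating at the interface (as Fetch
# does for text files) reconciles the two.
text = "naïveté"
mac = text.encode("mac_roman")   # bytes as stored natively on the Mac
net = text.encode("latin-1")     # bytes as sent over the net
assert mac != net                                         # code values differ...
assert mac.decode("mac_roman") == net.decode("latin-1")   # ...the text is the same
```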

Admittedly, in the case of the Mac, some users prefer to use specialised
iso-8859-1 fonts, and work in iso-8859-1 code in the Mac, but this
then causes them problems when they try to interchange material with
other applications, and with other Mac users who are working with the
native Mac storage code.  The mass-market browsers have exacerbated
this problem by failing to conform to the existing de-facto Internet
standard (i.e. the one documented by Pirard: a pity they didn't follow
the good example set by NCSA Mac Mosaic, which incorporates this
de-facto standard.) 

Anyhow, thanks for reaching a reasonable agreement on this topic, and
I hope the discussion has been useful for at least a few of the
bystanders. 

