Mail Archives: cygwin/2009/09/21/06:38:19

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Mon, 21 Sep 2009 12:37:58 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: The C locale
Message-ID: <20090921103758.GE20981@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <416096c60908300959i1e0084b1xc8f6e65e792b035d AT mail DOT gmail DOT com> <20090831005258 DOT GG2068 AT ednor DOT casa DOT cgf DOT cx> <416096c60909012329l2f25e735yc07145b8d6698cda AT mail DOT gmail DOT com> <3f0ad08d0909020656v7d9fce6ft4afea63ed363b9a9 AT mail DOT gmail DOT com> <416096c60909071308qc5ff057sbe9cb1dbc270554f AT mail DOT gmail DOT com> <20090908193456 DOT GC17515 AT calimero DOT vinschen DOT de> <416096c60909081449r1fe024dbm7b82a3719be05e9e AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <416096c60909081449r1fe024dbm7b82a3719be05e9e@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Sep  8 22:49, Andy Koppe wrote:
> ps:
> > Maximum 1.5 compatibility (what for and how long?)  vs. maximum
> > default usability in the long run (at least I hope so).
> 
> Compatibility for users upgrading to 1.7, who are used to being able to
> use the non-ASCII chars in their ANSI codepage, which is usually all
> they care about. And who have files encoded in that codepage, while
> being blissfully unaware what stuff like "LC_CTYPE" or "CP1251" means.
> And who are therefore going to complain about Cygwin 1.7 breaking
> their files.
> 
> Using UTF-8 throughout is a worthwhile aim of course, but it's a bumpy
> road to get there, with lots of apps not yet ready. Moreover, is there
> actually any other OS where the "C" locale uses UTF-8? Afaik, Linuxes
> just set LANG to *.UTF-8 somewhere in the startup scripts.

Back from vacation, I have now re-read this thread, and I have to say I
just don't know what the best course of action is here.

The idea behind using UTF-8 for filename and console operations by
default was to minimize problems when converting from UTF-16 to
multibyte, so that readdir() always returns a valid filename.  Since the
filename is supposed to be just a NUL-terminated stream of bytes, the
application shouldn't care what the filename looks like; it should just
always use it as is.  In contrast to Linux filesystems, where the
filename actually *is* a simple byte stream, we have to convert the
filename back and forth between multibyte and UTF-16.

As for the conversion of filenames, you get the same problem on Linux if
the filename contains non-ASCII bytes and these bytes are not a valid
multibyte character in the current locale.
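
To illustrate the point about filenames that are not valid in the
current locale, here is a minimal sketch.  The helper name
name_valid_in_locale() is made up for this example; it only uses the
standard mbrtowc() call and checks whether every byte sequence in a name
decodes under the current LC_CTYPE setting.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not part of Cygwin or newlib: returns 1 if every
   byte sequence in `name` is a valid multibyte character in the current
   LC_CTYPE locale, 0 otherwise. */
int name_valid_in_locale(const char *name)
{
    mbstate_t st;
    size_t left;

    assert(name != NULL);
    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    left = strlen(name);

    while (left > 0) {
        size_t r = mbrtowc(NULL, name, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return 0;                   /* invalid or truncated sequence */
        if (r == 0)
            break;                      /* embedded NUL; cannot happen here */
        name += r;
        left -= r;
    }
    return 1;
}
```

An application that treats the name as an opaque byte string never needs
such a check.  It only matters once the name is converted to wide
characters, and the answer for bytes >= 0x80 depends on the locale and on
the C library in use.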

Referring to another of your mails in this thread:

> A user with such a setup who upgrades to 1.7 will find that things
> will no longer work as before, since filenames are translated to UTF-8
> whereas the console now seems to use ISO-8859-1 (presumably via the mb
> functions) by default. Hence a file called 'b\344h' in Explorer (with
> a-umlaut in the middle), will show as 'bÃ¤h' instead.

That's because the console uses the ASCII conversion by default, which
in the newlib implementation just passes all bytes through unconverted,
even the >=0x80 ones.  Coincidentally, that's equivalent to ISO-8859-1.
However, it means the console uses the same conversion as the
application.  Only the filename conversion uses UTF-8.
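
The reason a pass-through conversion behaves like ISO-8859-1 is that the
first 256 Unicode code points are numbered exactly like the ISO-8859-1
bytes.  A minimal sketch of such a conversion follows; the function
passthrough_to_wide() is a made-up stand-in for illustration, not
newlib's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for a pass-through multibyte conversion: each
   byte becomes the wide character with the same numeric value.  Returns
   the number of wide characters written, not counting the terminator. */
size_t passthrough_to_wide(const char *src, wchar_t *dst, size_t dst_len)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_len > 0);
    while (src[n] != '\0' && n + 1 < dst_len) {
        /* Go through unsigned char so bytes >= 0x80 do not sign-extend. */
        dst[n] = (wchar_t)(unsigned char)src[n];
        n++;
    }
    dst[n] = L'\0';
    return n;
}
```

With this, the byte 0xE4 comes out as the wide character 0xE4, which is
a-umlaut in Unicode, so the result looks like an ISO-8859-1 decode even
though no charset table is involved.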

> And if you try to create 'b\344h' in Cygwin 1.7, you actually get a file
> called 'b', because the '\344' (0xE4) in ISO-8859-1 turns into an
> encoding error when interpreted as UTF-8, and the name simply seems to
> be truncated at that point.

Yes, that *is* a problem.
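
To make the failure concrete: a lone 0xE4 byte is a UTF-8 lead byte
that must be followed by two continuation bytes, and 'h' is not one.
The sketch below uses a made-up helper, utf8_valid_prefix(), to show
where a decoder that stops at the first error would cut the name.  It is
a simplified structural check, not the conversion code Cygwin uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns how many leading bytes of the
   NUL-terminated string `s` form structurally valid UTF-8 sequences.
   Simplified: it checks lead and continuation bytes only and does not
   reject every overlong or surrogate encoding. */
size_t utf8_valid_prefix(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    assert(s != NULL);
    while (p[i] != 0) {
        size_t need, k;

        if (p[i] < 0x80)
            need = 0;                        /* plain ASCII */
        else if (p[i] >= 0xC2 && p[i] <= 0xDF)
            need = 1;                        /* 2-byte sequence */
        else if (p[i] >= 0xE0 && p[i] <= 0xEF)
            need = 2;                        /* 3-byte sequence */
        else if (p[i] >= 0xF0 && p[i] <= 0xF4)
            need = 3;                        /* 4-byte sequence */
        else
            break;                           /* stray continuation or bad lead */

        /* Each following byte must look like 10xxxxxx; the terminating
           NUL fails this test, so we never read past the string. */
        for (k = 1; k <= need; k++)
            if ((p[i + k] & 0xC0) != 0x80)
                break;
        if (k <= need)
            break;                           /* sequence cut short */

        i += need + 1;
    }
    return i;
}
```

For "b\344h" the valid prefix is just the single byte 'b', which matches
the truncated name reported above.  The UTF-8 form "b\303\244h" is
accepted in full.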

> I see two good solutions:
> - Use the default Windows codepage for filenames, console, and
> multibyte functions. This is what happens already if you specify a
> locale with a language but no charset, e.g. "en". Maximum 1.5
> compatibility.

Hmm, yes, that might be an option.  Allowing the C.UTF-8 locale
could work around the remaining problems.
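
For anyone who wants to see which charset a given locale setting
actually selects, a small sketch along these lines may help.  The helper
name codeset_for() is made up; setlocale() and nl_langinfo(CODESET) are
the standard POSIX calls.  Whether a name such as "en" or "C.UTF-8" is
accepted, and what it maps to, depends on the system.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: switches LC_CTYPE to `locale_name` and returns
   the name of the charset that locale uses, or NULL if the system does
   not accept the locale.  Pass "" to follow LC_ALL/LC_CTYPE/LANG. */
const char *codeset_for(const char *locale_name)
{
    assert(locale_name != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;                 /* locale not available here */
    return nl_langinfo(CODESET);     /* string owned by the C library */
}
```

Printing codeset_for("C"), codeset_for("en") and codeset_for("C.UTF-8")
side by side shows how the choices discussed here play out on a
particular installation.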

> - Use UTF-8 throughout. Full Unicode support out-of-the box.

What does "throughout" mean?  Do you want the ASCII multibyte conversion
to use UTF-8 as well?  Of course, that will still result in problems if
a shell script has a filename hardcoded in, say, CP1252.

> And a cheap'n'nasty one:
> - Restrict the multibyte functions and console to 7-bit ASCII. Still
> means it's inconsistent with the filename conversions, but at least
> non-ASCII characters wouldn't show up wrongly. Instead, they wouldn't
> show at all.

I remember having seen this on Linux as well in some GUI applications.

Apart from that, the fourth solution is to stick with the current
implementation, which uses UTF-8 for filenames by default and relaxed
ASCII (ISO-8859-1), as provided by newlib, for everything else.

The problem is, I don't know for sure what the best approach is, and it
seems nobody except you and Iwamuro is actually interested in discussing
this.  And the two of you hold opposing opinions on the matter.

Personally I have no problem with the current approach.  I understand
the potential problems, but, as usual, solving it one way results in
problems in another scenario and vice versa.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
