Mail Archives: cygwin/2009/09/21/17:20:55

Date: Mon, 21 Sep 2009 22:20:40 +0100
Message-ID: <416096c60909211420g4ac8ea93l80fc1f00dcd5c0f3@mail.gmail.com>
Subject: Re: The C locale
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

2009/9/21 Corinna Vinschen:
> Back from vacation I re-read this thread now and I have to say I just
> don't know what is the best course of action here.

I'm afraid I can only reiterate what I said previously.

Let's use the Windows "ANSI" codepage as the character set for the C
locale, for both the conversion functions and filenames. This means
CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
ones, and so on.
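
To illustrate what "ANSI codepage as the C locale charset" implies, here is a small Python sketch (Python's codec names `cp1252`, `cp1251` and `cp932` stand in for the Windows codepages; nothing here is Cygwin code). The same byte decodes to a different character depending on which codepage the system uses:

```python
# One byte, three interpretations, depending on the ANSI codepage.
raw = b"\xe4"

western = raw.decode("cp1252")   # Western European
cyrillic = raw.decode("cp1251")  # Cyrillic

try:
    japanese = raw.decode("cp932")
except UnicodeDecodeError:
    # In CP932, 0xE4 is only the lead byte of a double-byte character,
    # so on its own it is not a complete character.
    japanese = None

print(repr(western), repr(cyrillic), repr(japanese))
```

So a file written under one ANSI codepage is only read back correctly where the same codepage is in effect.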

This way, the non-ASCII needs of most users are covered out of the
box, and compatibility with Cygwin 1.5 and users' ANSI-encoded files
is ensured. Applications that still assume that a byte and a character
are the same thing work correctly (except that they'll treat East
Asian double-byte chars as two characters, but a different default
charset won't cure that).
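
The byte-versus-character point can be checked with a short Python sketch. The helper name `byte_and_char_counts` is made up for this example; the codecs are Python's, used as stand-ins for the Windows codepages:

```python
def byte_and_char_counts(text, charset):
    """Return (number of bytes, number of characters) for text in charset."""
    encoded = text.encode(charset)
    return len(encoded), len(text)

# Single-byte codepage: a byte-oriented application counts correctly.
print(byte_and_char_counts("Käse", "cp1252"))

# UTF-8: the umlaut takes two bytes, so the counts diverge.
print(byte_and_char_counts("Käse", "utf-8"))

# Double-byte codepage: each character takes two bytes here.
print(byte_and_char_counts("日本", "cp932"))
```

An application that equates bytes with characters gets the first case right and the other two wrong, whichever default charset is chosen.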

Filenames created on the Cygwin side show up correctly in Explorer.
Windows filenames show up correctly in Cygwin as long as they're
limited to the ANSI codepage. The ^N encoding nevertheless ensures
that UTF-16 characters outside that codepage are uniquely represented.
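
A rough sketch of how such an escape scheme can keep filenames unique, written in Python rather than Cygwin's C. It assumes that a character missing from the target charset is written as the ^N byte (0x0E) followed by the character's UTF-8 bytes; that layout is an assumption made for illustration, and the function names are hypothetical, so the real implementation may differ in its details:

```python
SO = 0x0E  # ASCII "shift out", i.e. Ctrl-N

def to_charset(name, charset):
    """Encode name in charset, escaping unrepresentable characters."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            out.append(SO)              # assumed escape marker
            out += ch.encode("utf-8")   # assumed payload: UTF-8 bytes
    return bytes(out)

def utf8_length(lead):
    """Length of a UTF-8 sequence, judged from its lead byte."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    if lead >= 0xC0:
        return 2
    return 1

def from_charset(data, charset):
    """Reverse to_charset: decode plain runs, unescape ^N sequences."""
    parts, plain, i = [], bytearray(), 0
    while i < len(data):
        if data[i] == SO:
            parts.append(plain.decode(charset))
            plain.clear()
            n = utf8_length(data[i + 1])
            parts.append(data[i + 1:i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            plain.append(data[i])
            i += 1
    parts.append(plain.decode(charset))
    return "".join(parts)

name = "naïve-日本.txt"
encoded = to_charset(name, "cp1252")
print(encoded)
print(from_charset(encoded, "cp1252") == name)
```

The sketch relies on 0x0E never occurring inside an ordinary filename, which holds for Windows filenames since they cannot contain control characters.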

Beyond that, encourage maintainers to make their applications
UTF-8-capable and encourage users to choose a UTF-8 locale. Consider
adding a locale setting to setup.exe that gets written to cygwin.bat.


> The idea to use UTF-8 for filename and console operations by default was
> to get the least problems converting from UTF-16 to multibyte, so that
> readdir() always returns a valid filename.

But the ^N scheme does ensure that for any charset anyway, doesn't it?


> As for the conversion of filenames, you get the same problem on Linux if
> the filename contains non-ASCII bytes and these bytes are not a valid
> multibyte character in the current locale.

Yes, but Cygwin does actually have a big advantage here. Unlike Linux,
where the filename encoding is basically undefined, we *know* that
Windows filenames are always encoded as UTF-16. Therefore, the Cygwin
file functions do have the chance to always translate filenames
correctly into the application's locale.

And with any locale except "C" and "POSIX", this is working very
well, due to your great work implementing all the difficult bits such
as the ^N and 0xDC?? encodings and UTF-16 surrogates (and
notwithstanding the issue with translating 0xDC??s to charsets other
than UTF-8).
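
For readers unfamiliar with the 0xDC?? idea: assuming it means that a byte which is invalid in the current charset is carried through UTF-16 as the code unit 0xDC00 plus the byte value, Python's `surrogateescape` error handler (PEP 383) is a close analogue. The sketch below shows Python's behaviour, not Cygwin's:

```python
# A Latin-1 encoded name that is not valid UTF-8.
raw = b"caf\xe9.txt"

# The invalid byte 0xE9 is smuggled through as the lone surrogate U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")
print(hex(ord(name[3])))

# Encoding with the same handler restores the original bytes exactly.
back = name.encode("utf-8", errors="surrogateescape")
print(back == raw)
```

The round trip works because the target is UTF-8 with the escape handler; mapping such surrogates into other charsets is the harder case mentioned above.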


>> I see two good solutions:
>> - Use the default Windows codepage for filenames, console, and
>> multibyte functions. This is what happens already if you specify a
>> locale with a language but no charset, e.g. "en". Maximum 1.5
>> compatibility.
>
> Hmm, yes, that might be an option.  Allowing the C.UTF-8 locale
> could workaround the remaining problems.

Not sure that the C.UTF-8 locale is necessary for that, but it would
be nice to have, and it's easy to implement.


>> - Use UTF-8 throughout. Full Unicode support out-of-the box.
>
> What means "throughout"?  Do you want ASCII multibyte conversion to
> use UTF-8 as well?

Yep, that was the idea, but later on I realised that it's not a good
one, because too many applications still assume that a byte and a
character are the same thing. For example, start nano in a UTF-8
locale, enter a few umlauts, and move the cursor around, and you'll
see some weird effects. Similarly, filenames with non-ASCII chars will
corrupt Midnight Commander's display.
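
The cursor effect can be reproduced in a few lines of Python. This is only a model of a byte-oriented editor, not nano's actual code, and `next_boundary` is a hypothetical helper showing what a UTF-8-aware cursor step would need to do:

```python
line = "äöü".encode("utf-8")  # 3 characters, 6 bytes

# A byte-oriented application moves the cursor one byte to the right.
cursor = 1
try:
    line[:cursor].decode("utf-8")
    splits_cleanly = True
except UnicodeDecodeError:
    # The cursor now sits in the middle of the two-byte sequence for 'ä'.
    splits_cleanly = False

def next_boundary(data, pos):
    """Advance pos to the start of the next UTF-8 character."""
    pos += 1
    # Continuation bytes have the bit pattern 10xxxxxx; skip over them.
    while pos < len(data) and data[pos] & 0xC0 == 0x80:
        pos += 1
    return pos

print(splits_cleanly, next_boundary(line, 0))
```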

Andy
