Mail Archives: cygwin/2009/05/13/10:39:56

Date: Wed, 13 May 2009 16:39:24 +0200
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
From: "Matthias Andree" <matthias DOT andree AT gmx DOT de>
MIME-Version: 1.0
References: <3f0ad08d0905121029j119c8a7ep41d3a261d8bea338 AT mail DOT gmail DOT com> <20090512173741 DOT GZ21324 AT calimero DOT vinschen DOT de> <20090513142953 DOT GI21324 AT calimero DOT vinschen DOT de>
Message-ID: <op.utvhnyxl1e62zd@balu>
In-Reply-To: <20090513142953.GI21324@calimero.vinschen.de>
User-Agent: Opera Mail/9.64 (Win32)

On 13.05.2009 at 16:29, Corinna Vinschen
<corinna-cygwin AT cygwin DOT com> wrote:

> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>> > I propose that the filename encoding in C locale uses UTF-8 instead
>> > of SO/UTF-8.
>> >
>> > There are three reasons:
>>
>> That's an interesting thought.  Do you have a patch and, if so, did you
>> try it?  Does it, for instance, help for the issue reported in the
>> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at
> a loss how to solve this problem in a generic way.
>
> The problem is that the filename changes dependent on the character
> set used in $LANG.  The reason is that every time a multibyte filename
> has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, take one of the filenames from Lenik's example.  It's
> stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
> LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>  0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the application, the idea
> of the filename differs.  That's not exactly helpful for interoperability
> between different applications.
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk.  That results in unchangeable
>     filenames which are always valid.
>     But what if an application sets LANG="xxxx.SJIS" and tries to create
>     a file using SJIS character encoding?  Should the file be created
>     using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
>     That's not good.

Why would it have to be interpreted at all? Aren't filenames just opaque
strings to UNIX kernels, with exceptions for, say, / and NUL?
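
To make the example above concrete, here is a minimal Python sketch that
reproduces the three byte sequences from the one UTF-16 name. The helper
names (so_utf8, multibyte_name) are hypothetical, and the SO/UTF-8 rule is
only inferred from the LANG=C bytes quoted above; the real Cygwin
conversion may differ in cases the example does not cover.

```python
SO = b"\x0e"  # ASCII "shift out", the prefix seen in the LANG=C example


def so_utf8(name: str) -> bytes:
    """Assumed SO/UTF-8 rule: ASCII passes through unchanged, every other
    character becomes 0x0e followed by its UTF-8 bytes."""
    out = bytearray()
    for ch in name:
        encoded = ch.encode("utf-8")
        out += encoded if len(encoded) == 1 else SO + encoded
    return bytes(out)


def multibyte_name(name: str, charset: str) -> bytes:
    """Convert the on-disk (UTF-16) name to the bytes an application sees."""
    if charset == "C":
        return so_utf8(name)
    return name.encode(charset)


on_disk = "\u684c\u9762"  # the UTF-16 sequence from the quoted example

for charset in ("utf-8", "gbk", "C"):
    # bytes.hex() with a separator needs Python 3.8 or later
    print(charset, multibyte_name(on_disk, charset).hex(" "))
```

If the name is stored as UTF-16, the byte string an application sees is
always the result of a conversion, so it cannot be fully opaque: the same
file shows up under three different byte strings here.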

>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the
>     environment,
>     Cygwin uses that to convert pathnames.  If the application uses
>     setlocale, Cygwin uses that setting to convert pathnames.
>
>     One problem can't be solved this way:  If an application fetches
>     and stores a filename, then switches the locale, and then tries
>     to use the filename in another system call, the filename is
>     potentially broken.
>
> Any better ideas?
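
The locale-switch problem in (2) can be sketched in a few lines of Python.
This is only an illustration using Python's codecs in place of Cygwin's
conversion; the variable names are made up, and the None result merely
stands in for a system call failing with something like EILSEQ.

```python
on_disk = "\u684c\u9762"  # name as stored in UTF-16

# Step 1: the application runs under a GBK locale and keeps the bytes
# it was handed for this file.
stored = on_disk.encode("gbk")

# Step 2: the locale is switched to UTF-8 and the stored bytes are passed
# to another call, which converts them with the new charset.
try:
    reopened = stored.decode("utf-8")
except UnicodeDecodeError:
    reopened = None  # stand-in for the call failing

# The reverse switch is worse: UTF-8 bytes read as GBK do not fail,
# they silently name a different file.
silently_wrong = on_disk.encode("utf-8").decode("gbk", errors="replace")

print(reopened is None)           # conversion failed
print(silently_wrong != on_disk)  # a different name came out
```

So depending on the direction of the switch, the stored name either stops
being valid or quietly refers to something else.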

Just questions to kindle some brainstorming:

- why do you need to touch the filename at all? I haven't read all of it.
Is the name stored as UTF-16 on disk, so that we need to work around UTF-16
being intractable as a C string?

- some applications in the GNOME ballpark, for instance Gnumeric, do
something like "treat as Unicode" and fall back to the encoding specified
in SOME_ENVIRONMENT_VARIABLE (perhaps as a colon-separated list - not
sure)

- adding to my interspersed comment above: isn't the issue more about  
*presentation* of filenames to the user than internal workings? To me the  
main issue appears to be that filenames should look alike in a Cygwin  
application and in a native Windows application. I'd assume that  
applications can get really confused if you change file names behind their  
back.

- if you speak of UTF-8, do you want to normalize file names? (I'd think
you do.) Which normalization form will you choose? NFC (canonical
composition) or NFD (canonical decomposition)? (The compatibility forms
would be NFKC and NFKD.)
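
To illustrate why the normalization question matters, here is a small
Python sketch using the standard unicodedata module. The sample names are
made up; the point is only that two names that look identical can have
different UTF-8 byte strings unless one form is chosen.

```python
import unicodedata

name = "caf\u00e9"  # "e with acute" as one precomposed code point

nfc = unicodedata.normalize("NFC", name)  # canonical composition
nfd = unicodedata.normalize("NFD", name)  # canonical decomposition

print(nfc.encode("utf-8").hex(" "))  # 63 61 66 c3 a9
print(nfd.encode("utf-8").hex(" "))  # 63 61 66 65 cc 81

# Both forms display the same, but they compare unequal as byte strings,
# so they would be two distinct filenames without normalization.
print(nfc == nfd)                                # False
print(unicodedata.normalize("NFC", nfd) == nfc)  # True

# The compatibility forms go further and change the characters
# themselves, e.g. the "fi" ligature U+FB01 becomes plain "fi".
ligature = "\ufb01le"
print(unicodedata.normalize("NFC", ligature) == ligature)  # True
print(unicodedata.normalize("NFKC", ligature))             # file
```

Whichever form is picked, it would have to be applied consistently on
every conversion, or lookups of an existing name could miss.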

-- 
Matthias Andree

