From: "Matthias Andree"
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Date: Wed, 13 May 2009 16:39:24 +0200
In-Reply-To: <20090513142953.GI21324@calimero.vinschen.de>

On 13.05.2009 at 16:29, Corinna Vinschen wrote:

> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>> > I propose that the filename encoding in C locale uses UTF-8
>> > instead of SO/UTF-8.
>> >
>> > There are three reasons:
>>
>> That's an interesting thought.  Do you have a patch and, if so, did you
>> try it?  Does it, for instance, help for the issue reported in the
>> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at
> a loss how to solve this problem in a generic way.
>
> The problem is that the filename changes dependent on the character
> set used in $LANG.  The reason is that every time a multibyte filename
> has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, taking one of the filenames from Lenik's example.  It's
> stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
> LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
>   0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>   0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>   0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the application, the idea
> of the filename differs.  That's not exactly helpful for interoperability
> between different applications.
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>     is the way files are stored on disk.  That results in unchangeable
>     filenames which are always valid.
>     But what if an application sets LANG="xxxx.SJIS" and tries to create
>     a file using SJIS character encoding?  Should the file be created
>     using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
>     That's not good.

Why would it have to be interpreted at all? Aren't filenames just opaque
strings - with exceptions, say, for / and NUL to UNIX kernels?

> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>     Cygwin uses the LC_CTYPE setting which corresponds to the current
>     codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the
>     environment, Cygwin uses that to convert pathnames.  If the
>     application uses setlocale, Cygwin uses that setting to convert
>     pathnames.
>
>     One problem can't be solved this way:  If an application fetches
>     and stores a filename, then switches the locale, and then tries
>     to use the filename in another system call, the filename is
>     potentially broken.
>
> Any better ideas?

Just questions to kindle some brainstorming:

- why do you need to touch the filename at all? I haven't read all of
  it. Is the UTF-16 on disk, and do we need to work around UTF-16 being
  intractable as a C string?
- some applications in the GNOME ballpark, for instance Gnumeric, do
  something like "treat as Unicode" and fall back to a
  SOME_ENVIRONMENT_VARIABLE specified encoding (perhaps as a
  colon-separated list - not sure)
- adding to my interspersed comment above: isn't the issue more about
  *presentation* of filenames to the user than internal workings? To me
  the main issue appears to be that filenames should look alike in a
  Cygwin application and in a native Windows application. I'd assume
  that applications can get really confused if you change file names
  behind their back.
- if you speak of UTF-8, do you want to normalize file names? (I'd think
  you do.) Which normalization form will you choose? NFC (canonical
  composition) or NFD (canonical decomposition)?

-- 
Matthias Andree
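
For readers who want to reproduce the byte sequences quoted in this message, here is a minimal Python sketch. It is not Cygwin code. The helper `so_utf8` is hypothetical: it assumes that "SO/UTF-8" means each non-ASCII character is emitted as a 0x0e (SO) byte followed by its UTF-8 bytes, which is inferred only from the LANG=C example above. The last lines show how NFC and NFD give different bytes for the same visible character, which is why the normalization question matters for filenames.

```python
import unicodedata

# The two UTF-16 code units quoted in the message: U+684C U+9762.
name = "\u684c\u9762"


def so_utf8(text: str) -> bytes:
    """Hypothetical helper (an assumption, not Cygwin's implementation):
    prefix every non-ASCII character with 0x0e (SO), then its UTF-8 bytes."""
    out = bytearray()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:   # non-ASCII characters need more than one byte
            out.append(0x0E)   # SO marker, as seen in the LANG=C example
        out += encoded
    return bytes(out)


def hexdump(data: bytes) -> str:
    """Format bytes the way the message writes them, e.g. '0xe6 0xa1'."""
    return " ".join(f"0x{b:02x}" for b in data)


utf8_name = name.encode("utf-8")  # what LANG=en_US.UTF-8 would yield
gbk_name = name.encode("gbk")     # what LANG=en_US.GBK would yield
so_name = so_utf8(name)           # what LANG=C would yield, under the assumption above

# Same visible character (e with acute accent), two normalization forms.
e_acute_nfc = unicodedata.normalize("NFC", "e\u0301")  # composed: U+00E9
e_acute_nfd = unicodedata.normalize("NFD", "\u00e9")   # decomposed: U+0065 U+0301

print("UTF-8   :", hexdump(utf8_name))
print("GBK     :", hexdump(gbk_name))
print("SO/UTF-8:", hexdump(so_name))
print("NFC     :", hexdump(e_acute_nfc.encode("utf-8")))
print("NFD     :", hexdump(e_acute_nfd.encode("utf-8")))
```

The three encodings of the same name share no common byte string, so a filename fetched under one locale cannot simply be passed back under another. That is the interoperability problem described in the quoted text.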