Mail Archives: cygwin/2009/05/13/10:39:56
On 13.05.2009 at 16:29, Corinna Vinschen
<corinna-cygwin AT cygwin DOT com> wrote:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>> > I propose that the filename encoding in C locale uses UTF-8 instead
>> > of SO/UTF-8.
>> >
>> > There are three reasons:
>>
>> That's an interesting thought. Do you have a patch and, if so, did you
>> try it? Does it, for instance, help for the issue reported in the
>> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at
> a loss how to solve this problem in a generic way.
>
> The problem is that the filename changes depending on the character
> set used in $LANG. The reason is that every time a multibyte filename
> has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, take one of the filenames from Lenik's example. It's
> stored on the filesystem as the UTF-16 sequence \u684c \u9762. If I set
> LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
> 0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
> 0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
> 0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, depending on the character set setting in the application, the idea
> of the filename differs. That's not exactly helpful for interoperability
> between different applications.
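The three byte sequences above can be reproduced with a short sketch. This is Python rather than Cygwin's C, and the SO-prefix rule in `so_utf8` is only inferred from the LANG=C example (one 0x0e before each non-ASCII character's UTF-8 bytes); the helper names are hypothetical.

```python
# Illustration only: reproduces the three byte sequences quoted above.
# The SO-prefix rule is inferred from the LANG=C example, not from Cygwin source.
SO = b"\x0e"  # ASCII "Shift Out" control character


def hexbytes(data: bytes) -> str:
    """Format bytes the way the mail shows them, e.g. '0xe6 0xa1'."""
    return " ".join(f"0x{b:02x}" for b in data)


def so_utf8(name: str) -> bytes:
    """Prefix the UTF-8 form of each non-ASCII character with SO."""
    out = bytearray()
    for ch in name:
        encoded = ch.encode("utf-8")
        out += encoded if ord(ch) < 0x80 else SO + encoded
    return bytes(out)


name = "\u684c\u9762"  # the name as stored on disk (UTF-16 code units)

utf8_name = name.encode("utf-8")  # what LANG=en_US.UTF-8 yields
gbk_name = name.encode("gbk")     # what LANG=en_US.GBK yields
c_name = so_utf8(name)            # what LANG=C yields

print(hexbytes(utf8_name))
print(hexbytes(gbk_name))
print(hexbytes(c_name))
```

One on-disk name thus maps to three unrelated byte strings, which is why two applications running with different charsets cannot exchange the name as raw bytes.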
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
> is the way files are stored on disk. That results in unchangeable
> filenames which are always valid.
> But what if an application sets LANG="xxxx.SJIS" and tries to create
> a file using SJIS character encoding? Should the file be created
> using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
> That's not good.
Why would it have to be interpreted at all? Aren't filenames just opaque
strings - with exceptions, say, for / and NUL to UNIX kernels?
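A possible reason why some interpretation is hard to avoid here: if the on-disk form is UTF-16, the bytes an application passes have to be converted, and the outcome depends on the charset assumed. The sketch below is Python, not Cygwin code, and the katakana name is a made-up example.

```python
# Illustration only: the same bytes give different outcomes depending on
# the charset assumed when converting to UTF-16. The name is a made-up example.
name = "\u30c7\u30b9\u30af"            # hypothetical katakana filename
sjis_bytes = name.encode("shift_jis")  # what an SJIS application would pass

# Choice A: honour the application's charset (SJIS -> UTF-16).
utf16_from_sjis = sjis_bytes.decode("shift_jis").encode("utf-16-le")

# Choice B: insist on UTF-8. The SJIS bytes are not valid UTF-8, so the
# conversion fails; this is roughly where open() would return EILSEQ.
try:
    sjis_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(utf16_from_sjis.hex(), utf8_ok)
```

On a filesystem that stores raw bytes, neither choice would be needed, which is the situation the question above has in mind.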
>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
> Cygwin uses the LC_CTYPE setting which corresponds to the current
> codepage. If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
> Cygwin uses that to convert pathnames. If the application uses
> setlocale, Cygwin uses that setting to convert pathnames.
>
> One problem can't be solved this way: If an application fetches
> and stores a filename, then switches the locale, and then tries
> to use the filename in another system call, the filename is
> potentially broken.
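The locale-switch problem in (2) can be sketched as follows. This is Python, not Cygwin code; `charset_from_env` is a hypothetical helper, and the lookup order LC_ALL, LC_CTYPE, LANG is the usual POSIX precedence, assumed here rather than taken from the proposal.

```python
# Illustration only: charset_from_env is a hypothetical helper, not Cygwin code.
def charset_from_env(env: dict, fallback: str) -> str:
    """Pick the charset from LC_ALL, LC_CTYPE, LANG (assumed POSIX order)."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if "." in value:
            # "en_US.GBK" -> "GBK"; modifiers such as "@euro" are ignored here.
            return value.split(".", 1)[1]
    return fallback  # stands in for "charset of the current codepage"


on_disk = "\u684c\u9762"  # UTF-16 name on the filesystem

# The application fetches the name while the locale uses GBK ...
first = charset_from_env({"LANG": "en_US.GBK"}, "ASCII")
stored = on_disk.encode(first)

# ... then the locale switches to UTF-8 and the stored bytes are handed back.
second = charset_from_env({"LC_ALL": "en_US.UTF-8", "LANG": "en_US.GBK"}, "ASCII")
try:
    round_trip = stored.decode(second)
except UnicodeDecodeError:
    round_trip = None  # the stored bytes no longer name any file

print(first, second, round_trip)
```

The stored GBK bytes are not valid UTF-8, so after the switch the conversion back to UTF-16 fails and the name cannot be resolved.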
>
> Any better ideas?
Just questions to kindle some brainstorming:
- why do you need to touch the filename at all? I haven't read all of it.
Is it that the name is UTF-16 on disk and we need to work around UTF-16
being intractable as a C string?
- some applications in the GNOME ballpark, for instance Gnumeric, do
something like "treat as Unicode" and fall back to
SOME_ENVIRONMENT_VARIABLE specified encoding (perhaps as a colon-separated
list - not sure)
- adding to my interspersed comment above: isn't the issue more about
*presentation* of filenames to the user than internal workings? To me the
main issue appears to be that filenames should look alike in a Cygwin
application and in a native Windows application. I'd assume that
applications can get really confused if you change file names behind their
back.
- if you speak of UTF-8, do you want to normalize file names? (I'd think
you do.) Which normalization form will you choose? NFC (canonical
composition) or NFD (canonical decomposition)?
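
Why the normalization question matters for UTF-8 filenames, as a small Python sketch using the standard `unicodedata` module. The character is a made-up example: the same visible "e with acute" has two different byte forms.

```python
import unicodedata

# Illustration only: "e with acute" is a made-up example character.
composed = unicodedata.normalize("NFC", "e\u0301")   # one code point, U+00E9
decomposed = unicodedata.normalize("NFD", "\u00e9")  # U+0065 followed by U+0301

nfc_bytes = composed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")

print(nfc_bytes.hex(), nfd_bytes.hex())
```

Without an agreed form, two names that look identical to the user would compare as different byte strings.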
--
Matthias Andree