Date: Thu, 14 May 2009 01:03:59 +0900
From: IWAMURO Motonori
To: cygwin AT cygwin DOT com
Subject: Re: [1.7] Proposal: the filename encoding in C locale uses UTF-8 instead of SO/UTF-8
Message-ID: <3f0ad08d0905130903o5cf0330enc8025bc92e94225c@mail.gmail.com>
In-Reply-To: <20090513142953.GI21324@calimero.vinschen.de>
References: <3f0ad08d0905121029j119c8a7ep41d3a261d8bea338 AT mail DOT gmail DOT com> <20090512173741 DOT GZ21324 AT calimero DOT vinschen DOT de> <20090513142953 DOT GI21324 AT calimero DOT vinschen DOT de>

Hi.

My idea is as follows:

1) Separate the mbtowc/wctomb function entries into library usage and
   system usage (__mbtowc/__wctomb and __sys_mbtowc/__sys_wctomb).
2) If setlocale(LC_CTYPE) is called with locale != "C", then lib == sys.
3) If setlocale(LC_CTYPE) is called with locale == "C", then sys is set
   from LC_ALL/LC_CTYPE/LANG. If none of LC_ALL/LC_CTYPE/LANG is set,
   the UTF-8 converter is used.

Cygwin startup calls setlocale(LC_CTYPE, "C") in winsup/cygwin/dcrt0.cc.

I think the result is as follows:

1) LANG=C:
   lib = ascii converter, sys = UTF-8 converter.
2) LANG=xx_XX.ENCODING and setlocale is not called:
   lib = ascii converter, sys = ENCODING converter.
3) LANG=xx_XX.ENCODING and setlocale(LC_ALL, "") is called:
   lib = ENCODING converter, sys = ENCODING converter.

I think that [cat `read_dir_entry_and_print_app`] works correctly in all
of the above cases.
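
The selection rules above can be sketched roughly as follows. This is only an illustration in Python, not the patch itself: the function names (charset_of, env_locale, select_converters) are made up for this sketch, and treating a locale name without a charset part (for example "en_US") the same as "C" is my simplifying assumption.

```python
def charset_of(locale_name):
    """Return the charset part of 'xx_XX.CHARSET', or None if there is none.

    Assumption for this sketch: 'C', 'POSIX' and names without a
    '.CHARSET' part are all treated as carrying no charset.
    """
    if not locale_name or locale_name in ("C", "POSIX"):
        return None
    _, dot, charset = locale_name.partition(".")
    charset = charset.split("@")[0]  # drop a modifier such as '@euro'
    return charset if dot and charset else None


def env_locale(env):
    """Return the first non-empty value of LC_ALL, LC_CTYPE, LANG, or None."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return None


def select_converters(env, setlocale_arg="C"):
    """Return (lib, sys) converter names.

    setlocale_arg models the last setlocale(LC_CTYPE, ...) argument.
    The default "C" models the call made at startup; "" means
    "take the locale from the environment".
    """
    if setlocale_arg == "":
        requested = env_locale(env) or "C"
    else:
        requested = setlocale_arg

    requested_charset = charset_of(requested)
    if requested_charset is not None:
        # Rule 2: locale != "C", so lib == sys.
        return (requested_charset, requested_charset)

    # Rule 3: locale == "C". lib stays ASCII, sys follows the
    # environment and falls back to UTF-8 when nothing usable is set.
    sys_charset = charset_of(env_locale(env)) or "UTF-8"
    return ("ASCII", sys_charset)


if __name__ == "__main__":
    print(select_converters({"LANG": "C"}))
    print(select_converters({"LANG": "ja_JP.SJIS"}))
    print(select_converters({"LANG": "ja_JP.SJIS"}, ""))
```

The three calls at the bottom correspond to the three result cases listed above, in the same order.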

I am writing this patch and test code now.

> One problem can't be solved this way:  If an application fetches
> and stores a filename, then switches the locale, and then tries
> to use the filename in another system call, the filename is
> potentially broken.

If the application switches the encoding while processing, I think that
problem is the responsibility of the application.

2009/5/13 Corinna Vinschen:
> On May 12 19:37, Corinna Vinschen wrote:
>> On May 13 02:29, IWAMURO Motonori wrote:
>> > I propose that the filename encoding in C locale uses UTF-8 instead of SO/UTF-8.
>> >
>> > There are three reasons:
>>
>> That's an interesting thought.  Do you have a patch and, if so, did you
>> try it?  Does it, for instance, help for the issue reported in the
>> thread starting at http://cygwin.com/ml/cygwin/2009-05/msg00245.html?
>
> After examining the issue Lenik reported in the above thread, I'm at
> a loss how to solve this problem in a generic way.
>
> The problem is that the filename changes dependent on the character
> set used in $LANG.  The reason is that every time a multibyte filename
> has to be generated, it has to be converted from UTF-16 to multibyte.
>
> For instance, taking one of the filename from Lenik's example.  It's
> stored on the filesystem as the UTF-16 sequence \u684c \u9762.  If I set
> LANG to en_US.UTF-8, a readdir(2) call returns the multibyte sequence
>
>  0xe6 0xa1 0x8c 0xe9 0x9d 0xa2
>
> If I set LANG to en_US.GBK, `ls' returns the filename
>
>  0xd7 0xc0 0xc3 0xe6
>
> And in case LANG=C, `ls' returns
>
>  0x0e 0xe6 0xa1 0x8c 0x0e 0xe9 0x9d 0xa2
>
> So, dependent on the character set setting in the application, the idea
> of the filename differs.  That's not exactly helpful for interoperability
> between different applications.
>
> I can think of two potential solutions to fix this problem:
>
> (1) Always return filenames in UTF-8 encoding and pretend that UTF-8
>    is the way files are stored on disk.
>    That results in unchangeable
>    filenames which are always valid.
>
>    But what if an application sets LANG="xxxx.SJIS" and tries to create
>    a file using SJIS character encoding?  Should the file be created
>    using the SJIS->UTF-16 conversion or should open fail with EILSEQ?
>    That's not good.
>
> (2) If none of $LC_ALL/$LC_CTYPE/$LANG is set in the environment, then
>    Cygwin uses the LC_CTYPE setting which corresponds to the current
>    codepage.  If one of $LC_ALL/$LC_CTYPE/$LANG is set in the environment,
>    Cygwin uses that to convert pathnames.  If the application uses
>    setlocale, Cygwin uses that setting to convert pathnames.
>
>    One problem can't be solved this way:  If an application fetches
>    and stores a filename, then switches the locale, and then tries
>    to use the filename in another system call, the filename is
>    potentially broken.
>
> Any better ideas?
>
>
> Corinna
>
> --
> Corinna Vinschen                  Please, send mails regarding Cygwin to
> Cygwin Project Co-Leader          cygwin AT cygwin DOT com
> Red Hat
>
> --
> Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
> Problem reports:       http://cygwin.com/problems.html
> Documentation:         http://cygwin.com/docs.html
> FAQ:                   http://cygwin.com/faq/
>

-- 
IWAMURO Motonori