Date: Thu, 24 Sep 2009 18:37:09 +0900
Message-ID: <3f0ad08d0909240237s518de248jee409b731711404a@mail.gmail.com>
Subject: Re: The C locale
From: IWAMURO Motonori
To: cygwin AT cygwin DOT com

2009/9/24 Corinna Vinschen:
> On Sep 24 16:03, IWAMURO Motonori wrote:
>> 2009/9/22 Andy Koppe:
>> > Let's use the Windows "ANSI" codepage as the character set for the C
>> > locale, for both the conversion functions and filenames. This means
>> > CP1252 on Western systems, CP1251 on Cyrillic ones, CP932 on Japanese
>> > ones, and so on.
>>
>> I oppose the approach (the ANSI codepage is used at C locale) because
>> CP932 (the codepage for Japanese) is hostile to the UNIX-like tools.
>>
>> The reason is that the CP932 format contains a lot of meta characters
>> as follows.
>>
>> single character of CP932:
>> /[\x00-\x7F\xA0-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/
>
> I don't understand. Are you saying that the single character in CP932
> consists of 12 bytes? As far as I can see, CP932 is S-JIS, which
> is a just a simple double byte character set. What am I missing.

- CP932 (Shift_JIS) has both 1-byte and 2-byte characters.
- A 1-byte character is in the range 0x00-0x7F or 0xA0-0xDF.
- The first byte of a 2-byte character is in the range 0x81-0x9F or
  0xE0-0xFC.
- The second byte of a 2-byte character is in the range 0x40-0x7E or
  0x80-0xFC. This range includes "[", "\", "]", "^", "`", "{", "|", "}".

The last fact causes many problems in tools that ignore the locale and
process escaped strings, globs or regexps byte by byte:

- They can't open a file or directory.
- They corrupt filenames.
- They lose files.

For example:

Case 1: The CP932 byte sequence of "項目表.xls" is
8D 80 96 DA 95 *5C* (== '\') 2E 78 6C 73.
When a tool processes escapes in this string without regard to the
locale, the 0x5C disappears.

Case 2: When I use the regexp /スポット/, I expect it to match strings
that contain "スポット". But tools that ignore the locale treat it as
/ス\x83|ット/, because the byte sequence of "スポット" is
83 58 83 *7C* (== '|') 83 62 83 67.
As a result, unexpected strings are matched.

Case 3: When I use the glob "データ0[0-9].dat", it is treated as
"デ\x81[\x83^0[0-9].dat". As a result, the expected files are not
matched.
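
The three cases can be reproduced with a short script. This is only a
sketch, assuming Python 3 and its built-in "cp932" codec. The helper
names (risky_chars, naive_unescape, bytewise_regex_matches,
bytewise_glob_matches) are made up for illustration, and naive_unescape
merely imitates a tool that ignores the locale; it is not taken from any
particular program.

```python
import fnmatch
import re

# ASCII metacharacters inside the CP932 second-byte range 0x40-0x7E.
META = b"[\\]^`{|}"


def risky_chars(text):
    """List (character, metacharacter) pairs for 2-byte CP932 characters
    whose second byte is one of META."""
    hits = []
    for ch in text:
        raw = ch.encode("cp932")
        if len(raw) == 2 and raw[1] in META:
            hits.append((ch, chr(raw[1])))
    return hits


def naive_unescape(raw):
    """Hypothetical locale-unaware unescape: drop each backslash byte and
    keep the byte after it (a trailing backslash is kept as-is)."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == 0x5C and i + 1 < len(raw):
            i += 1  # skip the backslash itself
        out.append(raw[i])
        i += 1
    return bytes(out)


def bytewise_regex_matches(pattern, text):
    """Case 2: compile the CP932 bytes of `pattern` as a byte regexp."""
    return re.search(pattern.encode("cp932"), text.encode("cp932")) is not None


def bytewise_glob_matches(pattern, name):
    """Case 3: match the CP932 bytes of `name` against those of `pattern`."""
    return fnmatch.fnmatchcase(name.encode("cp932"), pattern.encode("cp932"))


if __name__ == "__main__":
    # Case 1: the 0x5C second byte of the third character is swallowed.
    raw = "項目表.xls".encode("cp932")
    print(raw.hex(), "->", naive_unescape(raw).hex())
    # Case 2: the pattern splits at 0x7C, so an unrelated word matches.
    print(bytewise_regex_matches("スポット", "セット"))
    # Case 3: 0x5B opens a bracket expression early, so the file is missed.
    print(bytewise_glob_matches("データ0[0-9].dat", "データ05.dat"))
```

Matching the same glob on decoded text instead of raw bytes, for example
fnmatch.fnmatchcase("データ05.dat", "データ0[0-9].dat"), finds the file,
which is the behaviour a locale-aware tool would give.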
-- 
IWAMURO Motonori