Date: Sat, 29 May 2010 06:16:04 +0100
Subject: Re: LANG=ja_JP.Shift_JIS
From: Andy Koppe <andy.koppe@gmail.com>
To: cygwin@cygwin.com
Cc: rushojp <rushojp@gmail.com>

On 22 May 2010 14:27, rushojp wrote:
>> So why do you need to set it to ja_JP.Shift_JIS if ja_JP.CP932 and
>> ja_JP.SJIS do the same thing?
>
> There is no serious reason.
> I think IANA name is more famous.

Fair enough, but I think it would be misleading to use the official
IANA name for what's a (slightly) different charset.
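
To make "(slightly) different" concrete, here is a minimal sketch that
uses Python's built-in `shift_jis` and `cp932` codecs as a stand-in for
iconv. That substitution is an assumption of this example: Python's
tables are its own implementation and may treat the single bytes 0x5C
and 0x7E differently from a given iconv build, so the sketch only looks
at the double-byte sequence 0x81 0x60. The helper name `describe` is
made up for illustration.

```python
import unicodedata

# The double-byte sequence used in the transcripts (0x81 0x60).
WAVE_BYTES = b"\x81\x60"

def describe(data: bytes, codec: str) -> str:
    """Decode data with the given codec and list each code point by name."""
    text = data.decode(codec)
    return ", ".join(f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text)

# Same bytes, two charsets: Shift_JIS proper versus Microsoft's CP932.
for codec in ("shift_jis", "cp932"):
    print(f"{codec:10} -> {describe(WAVE_BYTES, codec)}")
```

With these codecs the same two bytes come out as WAVE DASH under
Shift_JIS and as FULLWIDTH TILDE under CP932, which is the kind of
mismatch that makes reusing the IANA name risky.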


> @centos5.5
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f Shift_JIS -t UTF-16LE|hexdump
> 0000000 00a5 0020 203e 0020 301c
> 000000a
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f SJIS -t UTF-16LE|hexdump
> 0000000 00a5 0020 203e 0020 301c
> 000000a
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f CP932 -t UTF-16LE|hexdump
> 0000000 005c 0020 007e 0020 ff5e
> 000000a
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f Windows-31J -t UTF-16LE|hexdump
> 0000000 005c 0020 007e 0020 ff5e
> 000000a
>
> @cygwin-1.7
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f Shift_JIS -t UTF-16LE|hexdump
> 0000000 00a5 0020 203e 0020 301c
> 000000a
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f SJIS -t UTF-16LE|hexdump
> 0000000 00a5 0020 203e 0020 301c
> 000000a
> $ echo -ne '\x5c ~ \x81\x60'|iconv -f CP932 -t UTF-16LE|hexdump
> 0000000 005c 0020 007e 0020 301c
> 000000a

Looks as expected to me. Iconv's charset names are independent of the
locale charset names, but it is unfortunate that "SJIS" means
"Shift_JIS" to iconv whereas it means "CP932" to the locale system.
That's why I called the SJIS->CP932 mapping "dodgy", but we need to
keep it for compatibility (and convenience). Importantly,
nl_langinfo(CODESET) returns "CP932" both for ja_JP.CP932 and
ja_JP.SJIS, so that programs that use the CODESET string in iconv end
up with the correct encoding.
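
As a rough sketch of that pattern, a program can ask the C library for
the locale's codeset and pass that string on to its converter instead of
parsing $LANG itself. The example below uses Python's
`locale.nl_langinfo` (available on POSIX-like systems) and Python's own
codec lookup in place of iconv; the function name `decode_locale_bytes`
and the explicit `codeset` override are assumptions added for
illustration.

```python
import locale

def decode_locale_bytes(data, codeset=None):
    """Decode bytes using the codeset reported for the current locale.

    If codeset is None, ask the C library via nl_langinfo(CODESET);
    passing it explicitly is just a hook for testing (an assumption of
    this sketch, not part of any real API).
    """
    if codeset is None:
        # Per the message above, this is "CP932" under both
        # ja_JP.CP932 and ja_JP.SJIS on Cygwin 1.7.
        codeset = locale.nl_langinfo(locale.CODESET)
    return data.decode(codeset)

# Simulate what a program would see under ja_JP.SJIS on Cygwin,
# where the reported codeset is "CP932" rather than "SJIS".
print(repr(decode_locale_bytes(b"\x81\x60", "CP932")))
```

Because the converter receives "CP932" rather than the user's spelling
of the locale name, the ambiguity of "SJIS" never reaches it.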


> $ echo -ne '\x5c ~ \x81\x60'|iconv -f Windows-31J -t UTF-16LE|hexdump
> iconv: conversion from Windows-31J unsupported
> iconv: try 'iconv -l' to get the list of supported encodings

I had to look that one up: "Windows-31J" is the official IANA name for
CP932. I guess it should be added to Cygwin's iconv. (But how did they
come up with that name?)

Andy
