Mail Archives: cygwin/2009/06/03/10:28:29

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Wed, 3 Jun 2009 16:27:55 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Message-ID: <20090603142755.GM23519@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <3f0ad08d0905290813m39999f81q918e94e3c960eb3f AT mail DOT gmail DOT com> <4A200287 DOT 8030403 AT sidefx DOT com> <3f0ad08d0905290852xe41338alfda89c622f92f677 AT mail DOT gmail DOT com> <4A200BC0 DOT 9010704 AT sidefx DOT com> <e2480c70905291142o2bcc65ccw2287d175dbd09dd5 AT mail DOT gmail DOT com> <4A204149 DOT 2050009 AT sidefx DOT com> <e2480c70905291337g6c8bcca7xd0baba79c84629db AT mail DOT gmail DOT com> <4A2051E5 DOT 6060600 AT sidefx DOT com> <20090602205440 DOT GF23519 AT calimero DOT vinschen DOT de> <4A26782C DOT 9040207 AT sidefx DOT com>
MIME-Version: 1.0
In-Reply-To: <4A26782C.9040207@sidefx.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Jun  3 09:18, Edward Lam wrote:
> Corinna Vinschen wrote:
>> The question is, what do you expect?  [...]
> [...]
> Wikipedia has several suggestions on how to handle invalid UTF-8 byte  
> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the  
> rule that uses the replacement character.

Chris implemented the invalid code point solution.  The discussion
in http://www.mail-archive.com/linux-utf8 AT nl DOT linux DOT org/msg00080.html
supports this approach.  What's missing so far is the way back, from
a lone second half of a surrogate pair in the 0xDCxx range
back to the original byte value.  I'm just looking into that.
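
To make the round trip concrete, here is a small sketch in Python.  It
uses Python's own "surrogateescape" error handler, which follows the same
idea (an undecodable byte becomes a lone low surrogate in the 0xDCxx
range and is turned back into the byte on the way out).  It is only an
analogy and not the Cygwin code; the sample bytes are made up.

```python
# Illustration only: Python's "surrogateescape" handler, not Cygwin's code.
raw = b"caf\xa9"  # 0xA9 on its own is not a valid UTF-8 sequence

# Forward direction: the invalid byte 0xA9 becomes the lone surrogate U+DCA9.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# The way back: each U+DCxx code point is turned into the original byte again.
restored = text.encode("utf-8", errors="surrogateescape")

print(escaped)
print(restored == raw)
```

The point of the scheme is that the conversion is lossless: bytes that
cannot be interpreted are carried through unchanged instead of being
replaced.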

> > How is anybody supposed to know that the file which consists
> > of the single byte 0xa9 has *any* meaning at all?  Why should it be
> > the copyright sign, of all things?
>
> What I was attempting to do was to have NO conversion. In the
> real case where I ran into this, the "bug.exe" was the one to properly
> interpret what the byte 0xA9 meant from the command line. Yes, I know
> there are several workarounds.

The command line is always converted to UTF-16 when calling a native
Win32 application.  If we don't do it (because we call CreateProcessA),
Windows would do it.  As matters stand, we have to convert ourselves,
because we must call CreateProcessW.  Either way, the problem persists.
We just don't know what the correct conversion is for the given input.
We have to rely on a correct setting of $LC_ALL/$LANG/$LC_CTYPE.
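
The ambiguity can be shown with the single byte 0xa9 from the example
above.  The following Python sketch decodes it under three candidate
encodings and converts the result to UTF-16.  The helper name to_utf16
and the choice of encodings are assumptions for illustration only; this
is not how Cygwin performs the conversion.

```python
raw = b"\xa9"  # the single byte from the discussion

def to_utf16(data, encoding):
    """Decode data with the given encoding and return UTF-16LE bytes.

    Returns None if the bytes are not valid in that encoding.
    """
    try:
        return data.decode(encoding).encode("utf-16-le")
    except UnicodeDecodeError:
        return None

# Same input byte, three different answers depending on the assumed charset.
results = {enc: to_utf16(raw, enc) for enc in ("cp1252", "cp437", "utf-8")}

for enc, wide in results.items():
    print(enc, wide)
```

Under cp1252 the byte is the copyright sign (U+00A9), under cp437 it is
a different character (U+2310), and as UTF-8 it is not a valid sequence
at all.  Nothing in the byte itself says which reading is intended.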

>> If we default to the ANSI codepage, you will have the same problem,
>> just upside down.  In both cases you will have even more problems if
>> you start using characters not available in your default codepage.
>
> This is where I disagreed with Alexey. What we're really arguing here is  
> which default will run into the fewest problems for the most  
> common usage. This is subjective of course.

Definitely.  The "right" solution is always only right for a given value
of right.  What if the user has set LANG to, say, ja_JP.eucJP?  That
user of course expects that the stuff on the command line is converted
to UTF-16 using the eucJP encoding.  Everything else would just be very
surprising.

What's left as questionable is the LANG=C default case.  Following last
month's discussion we now use UTF-8 as the default encoding,
because it's the only encoding which covers all (valid) characters.
Sure, we could also convert the command line using the current ANSI
codepage, as Windows does when CreateProcessA is called in this case.
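
As a rough sketch of the behaviour described above, the following Python
snippet picks the charset from a LANG-style value and uses it to convert
command line bytes to UTF-16, falling back to UTF-8 when no charset is
given (the LANG=C case).  The function names charset_from_locale and
command_line_to_utf16 are hypothetical, and the parsing is deliberately
simplified; it does not reflect the actual Cygwin implementation.

```python
def charset_from_locale(value, default="utf-8"):
    """Return the charset part of a value like 'ja_JP.eucJP@modifier'.

    Values without a charset (for example 'C') yield the default.
    """
    _, dot, rest = value.partition(".")
    if not dot:
        return default
    return rest.split("@", 1)[0] or default

def command_line_to_utf16(arg_bytes, lang):
    """Decode argument bytes per the locale charset, then encode as UTF-16LE."""
    return arg_bytes.decode(charset_from_locale(lang)).encode("utf-16-le")

# A Japanese argument as it would arrive from a eucJP terminal.
arg = "日本語".encode("euc_jp")
wide = command_line_to_utf16(arg, "ja_JP.eucJP")

print(charset_from_locale("ja_JP.eucJP"))
print(charset_from_locale("C"))
print(wide.decode("utf-16-le"))
```

Swapping the default from "utf-8" to the ANSI codepage is the only
change needed to try the alternative mentioned above, which is why the
default is kept as a parameter here.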

Maybe we should do that for testing?  Anybody having a strong opinion
here?


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

