Date: Wed, 3 Jun 2009 13:01:01 -0400
From: Christopher Faylor <cgf-use-the-mailinglist-please AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line
Message-ID: <20090603170101.GB29603@ednor.casa.cgf.cx>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <e2480c70905291142o2bcc65ccw2287d175dbd09dd5 AT mail DOT gmail DOT com> <4A204149 DOT 2050009 AT sidefx DOT com> <e2480c70905291337g6c8bcca7xd0baba79c84629db AT mail DOT gmail DOT com> <4A2051E5 DOT 6060600 AT sidefx DOT com> <20090602205440 DOT GF23519 AT calimero DOT vinschen DOT de> <4A26782C DOT 9040207 AT sidefx DOT com> <20090603142755 DOT GM23519 AT calimero DOT vinschen DOT de> <20090603160225 DOT GA27039 AT ednor DOT casa DOT cgf DOT cx> <20090603161158 DOT GB23419 AT calimero DOT vinschen DOT de> <4A26AB1D DOT 1090404 AT sidefx DOT com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4A26AB1D.1090404@sidefx.com>
User-Agent: Mutt/1.5.19 (2009-01-05)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
Precedence: bulk
Sender: cygwin-owner AT cygwin DOT com

On Wed, Jun 03, 2009 at 12:55:57PM -0400, Edward Lam wrote:
>Corinna Vinschen wrote:
>> On Jun  3 12:02, Christopher Faylor wrote:
>>> On Wed, Jun 03, 2009 at 04:27:55PM +0200, Corinna Vinschen wrote:
>>>> On Jun  3 09:18, Edward Lam wrote:
>>>>> Corinna Vinschen wrote:
>>>>>> The question is, what do you expect?  [...]
>>>>> [...]
>>>>> Wikipedia has several suggestions on how to handle invalid UTF-8 byte  
>>>>> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the  
>>>>> rule that uses the replacement character.
>>>> Chris implemented using the invalid code point solution.  The discussion
>>>> in http://www.mail-archive.com/linux-utf8 AT nl DOT linux DOT org/msg00080.html
>>>> supports this solution.  What's missing so far is the way back, from
>>>> an invalid single second half of a surrogate pair in the 0xDCxx range
>>>> back to the correct byte value.  I'm just looking into that.
>>> The way back was not, AFAIK, needed for Cygwin programs.  I don't think
>>> there is a valid way back for Windows programs.
>> 
>> The way back is not needed for the argv handling in Cygwin, but it
>> becomes necessary if you have converted to UTF-16 in other
>> circumstances.  It's not much of a problem, since the way back is a
>> no-brainer, in contrast to the conversion to UTF-16.
>
>What is the current state of affairs in cygwin 1.7.0-48? Is the invalid 
>code point solution currently being used when converting the command 
>line to UTF-16 when spawning non-cygwin processes? What I'm trying to 
>understand is where the command line truncation is taking place, in the 
>parent or child process.
>
>If the truncation is happening in the child process because of the 
>invalid code point, then perhaps we should consider using the 
>replacement character solution when spawning non-cygwin child processes. 
>IMHO, having a bad character is better than having a truncated command 
>line. At least, the problem (invalid UTF-8) then becomes more obvious.

As Corinna said above: "Chris implemented using the invalid code point
solution."

That's what is in Cygwin's CVS and in the latest snapshot.
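
For anyone following along, here is a rough illustration of the two
approaches being discussed.  It is a sketch in Python, not Cygwin's code:
it relies on Python's built-in "surrogateescape" error handler, which maps
each undecodable byte 0xNN to the lone low surrogate U+DCNN.  Whether
Cygwin uses exactly the same offsets within the 0xDCxx range is an
assumption here, and the sample bytes are made up.

```python
# Hypothetical input: 0xE9 on its own is invalid UTF-8 (a Latin-1 "e acute"),
# while 0xC3 0xA9 is the valid UTF-8 encoding of the same character.
raw = b"caf\xe9 \xc3\xa9"

# Invalid code point approach: each bad byte becomes a lone low surrogate
# in the U+DC80..U+DCFF range (Python's mapping; assumed similar in Cygwin).
escaped = raw.decode("utf-8", errors="surrogateescape")

# Replacement character approach: each bad sequence becomes U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# "The way back": the lone surrogate still carries the original byte value,
# so the exact input bytes can be recovered.
back = escaped.encode("utf-8", errors="surrogateescape")

# Show code points rather than the strings themselves, since a lone
# surrogate cannot be written to a UTF-8 terminal.
print([hex(ord(c)) for c in escaped])
print([hex(ord(c)) for c in replaced])
print(back == raw)
```

The trade-off shows up in the last step: the invalid code point mapping
round-trips, while U+FFFD loses the original byte but is always valid
UTF-16 for whatever consumes it.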

cgf
