Date: Wed, 3 Jun 2009 16:27:55 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com
Subject: Re: 1.7.0-48: [BUG] Passing characters above 128 from bash command line

On Jun  3 09:18, Edward Lam wrote:
> Corinna Vinschen wrote:
>> The question is, what do you expect?  [...]
> [...]
> Wikipedia has several suggestions on how to handle invalid UTF-8 byte
> sequences (http://en.wikipedia.org/wiki/UTF-8). Personally, I favor the
> rule that uses the replacement character.

Chris implemented using the invalid code point solution.  The discussion
in http://www.mail-archive.com/linux-utf8 AT nl DOT linux DOT org/msg00080.html
supports this solution.  What's missing so far is the way back, from
an invalid single second half of a surrogate pair in the 0xDCxx range
back to the correct byte value.  I'm just looking into that.
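To illustrate the round trip described above, here is a minimal sketch
in Python.  It is not Cygwin's code; it relies on Python's own
"surrogateescape" error handler, which happens to use the same idea:
each invalid byte 0xNN is mapped to the lone low surrogate U+DCNN on the
way in and mapped back to the original byte on the way out.  The
function names are hypothetical.

```python
def bytes_to_text(raw: bytes) -> str:
    """Decode UTF-8; map every undecodable byte 0xNN to U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    """Encode UTF-8; map every lone U+DCNN back to the byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    text = bytes_to_text(b"\xa9")      # 0xA9 alone is not valid UTF-8
    print(hex(ord(text)))              # 0xdca9
    print(text_to_bytes(text))         # b'\xa9'
```

Valid UTF-8 input is unaffected by this scheme; only bytes that cannot
be decoded take the detour through the 0xDCxx range.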
> > How is anybody supposed to know that the file which consists
> > of the single byte 0xa9 has *any* meaning at all?  Why should it be
> > the copyright sign, of all things?
>
> What I was attempting to do was to have NO conversion. In the
> real case that I ran into this, the "bug.exe" was the one to properly
> interpret what the byte 0xA9 meant from the command line. Yes, I know
> there are several workarounds.

The command line is always converted to UTF-16 when calling a native
Win32 application.  If we don't do it (because we call CreateProcessA),
Windows would do it.  As matters stand, we have to convert ourselves,
because we must call CreateProcessW.  Either way, the problem persists.
We just don't know what the correct conversion is for the given input.
We have to rely on a correct setting of $LC_ALL/$LANG/$LC_CTYPE.

>> If we default to the ANSI codepage, you will have the same problem,
>> just upside down.  In both cases you will have even more problems if
>> you start using characters not available in your default codepage.
>
> This is where I disagreed with Alexey. What we're really arguing here is
> whether which default will run into the least problems for the most
> common usage. This is subjective of course.

Definitely.  The "right" solution is always only right for a given value
of right.  What if the user has set LANG to, say, ja_JP.eucJP?  That
user of course expects that the stuff on the command line is converted
to UTF-16 using the eucJP encoding.  Everything else would just be very
surprising.

What's left as questionable is the LANG=C default case.  Due to the
discussion from the last month we now use UTF-8 as default encoding,
because it's the only encoding which covers all (valid) characters.
Sure, we could also convert the command line using the current ANSI
codepage as Windows does it when calling CreateProcessA in this case.
Maybe we should do that for testing?  Anybody having a strong opinion
here?
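To make the ambiguity concrete, here is a small Python sketch (again
not Cygwin's code).  The helper name is hypothetical, and the Python
codec names "cp1252" and "euc_jp" merely stand in for an ANSI codepage
and for the charset of a ja_JP.eucJP locale.  The same bytes yield
different UTF-16 command lines depending on which charset is assumed,
and a lone 0xA9 cannot be decoded as strict UTF-8 at all.

```python
def to_utf16_command_line(raw: bytes, charset: str) -> bytes:
    """Decode raw bytes with the assumed charset, re-encode as UTF-16LE."""
    return raw.decode(charset).encode("utf-16-le")


if __name__ == "__main__":
    # 0xA9 is the copyright sign only if a Latin-1-like codepage is assumed.
    print(to_utf16_command_line(b"\xa9", "cp1252"))

    # The same two bytes are one hiragana letter in EUC-JP,
    # but two unrelated symbols in cp1252.
    print(to_utf16_command_line(b"\xa4\xa2", "euc_jp"))
    print(to_utf16_command_line(b"\xa4\xa2", "cp1252"))

    # Strict UTF-8 rejects the lone byte outright.
    try:
        to_utf16_command_line(b"\xa9", "utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err.reason)
```

Nothing in the bytes themselves says which of these readings is meant,
which is why the locale setting has to decide.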

Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat