From: Andy Koppe
To: dbyron AT dbyron DOT com, cygwin AT cygwin DOT com
Date: Tue, 16 Mar 2010 07:19:05 +0000
Subject: Re: filenames with characters that have the high bit set
Message-ID: <416096c61003160019p24e58433x4a969c0f99068fa6@mail.gmail.com>
In-Reply-To: <493F5820D3F64434A76F433604C79D4A@pleaset>

David Byron:
> I've read http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode and
> http://cygwin.com/cygwin-ug-net/setup-locale.html but I'm still stumped.
>
> My cygwin.bat now contains:
>
> @echo off
>
> C:
> chdir C:\utils\cygwin\bin
> set LANG=en_US.UTF-8
> bash --login -I
>
> And my ~/.inputrc contains:
>
> set meta-flag on
> set convert-meta off
> set input-meta on
> set output-meta on

Makes plenty of sense. But note that meta-flag is a synonym for
input-meta, so you can remove one of them.

> $ echo $LC_ALL
> en_US

Hang on, where did that come from? LC_ALL overrides any other locale
variables including LANG. Specifying a locale without a charset means
that Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're on a US
system, this means you're getting CP1252, not UTF-8. (As an aside:
Cygwin 1.7.2 changes to a Linux-compatible scheme for locales without
an explicit charset, where you'd get ISO-8859-1 instead.)
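
To spell out that precedence, here is a minimal sketch. It assumes the
usual POSIX lookup order for the character-set category (LC_ALL first,
then LC_CTYPE, then LANG). pick_ctype is a made-up helper name for
illustration, not something shipped with Cygwin or bash; it only reports
which variable ends up deciding.

```shell
# Hypothetical helper: pick_ctype <LC_ALL> <LC_CTYPE> <LANG>
# Prints the first non-empty setting in POSIX precedence order.
pick_ctype() {
    if [ -n "$1" ]; then
        echo "LC_ALL=$1"
    elif [ -n "$2" ]; then
        echo "LC_CTYPE=$2"
    elif [ -n "$3" ]; then
        echo "LANG=$3"
    else
        echo "C"    # nothing set: the default C locale applies
    fi
}

# The situation from the original post: LANG asks for UTF-8, LC_ALL wins.
pick_ctype "en_US" "" "en_US.UTF-8"    # prints LC_ALL=en_US

# With LC_ALL out of the way, LANG takes effect.
pick_ctype "" "" "en_US.UTF-8"         # prints LANG=en_US.UTF-8

# What the current shell would use.
pick_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

So unsetting LC_ALL (or giving it an explicit .UTF-8 charset) should let
the LANG setting from cygwin.bat apply. Where the POSIX locale utility is
available, "locale charmap" is one way to check the resulting charset.
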
> $ echo $LANG
> en_US.UTF-8
>
> For the rest of this post, assume is "foo" with U+00E9 (e
> with acute accent) at the end.
>
> $ test -f ; echo $?
>
> prints 1 when really does exist....depending on how I try
> to represent U+00E9 on the command line
>
> $ ls foo
>
> adds the actual accented character to the command line (whether set
> show-all-if-ambiguous on is in ~/.inputrc or not).  Then I press return and
> ls prints the filename.  Then if I go through command history and change
> "ls" to "test -f" and add the "; echo $?" I get the right answer from test.
> So far so good.
>
> But, if I try to do what
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> says, the test command always fails, and ls doesn't print the filename.  I'm
> not really sure how to get hex code 0x18 through bash and to
> ls/test/whatever properly.
>
> This is what I tried:
>
> $ ls "foo\x18"
> $ ls "foo\x18\xc3\xa9"
> $ ls "foo\x18\xc3\xa9*"
>
> Note that 0xC3A9 is the UTF-8 encoding of U+00E9.

There's a bunch of things wrong here.

Due to the LC_ALL setting above, the U+00E9 is encoded as \xE9, not
\xC3\xA9.

The \x18 scheme is only used for codepoints that cannot be represented
in the selected character set, yet U+00E9 can be represented in CP1252.

By definition, any Unicode codepoint can be represented in UTF-8, so
the \x18 scheme is never used when that is selected.

Bash does not interpret \x specially when it appears in double quotes
(or single quotes or unquoted):

$ echo "\x18"
\x18

To enable C-style backslash interpretation, you need to use $'...'
quoting.

Finally, it would appear that bash does not complete partial UTF-8
sequences, which makes sense, as it's probably dealing with wide
characters internally.

> But all get me nothing.  Replacing "ls" with "test -f" gives me the same
> nothing.  Replacing \x with \X doesn't change anything either.
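
As a rough illustration of the quoting difference described above, here
is a small bash sketch. byte_len and the scratch directory are my own
illustrative choices, and the file test at the end assumes a UTF-8
locale is actually in effect.

```shell
# Double quotes leave \x alone: this is the 8 characters \xc3\xa9.
plain="\xc3\xa9"

# $'...' (bash ANSI-C quoting) turns each \xHH into a single byte.
utf8=$'\xc3\xa9'     # U+00E9 encoded in UTF-8: two bytes
cp1252=$'\xe9'       # U+00E9 encoded in CP1252 / ISO-8859-1: one byte

# Hypothetical helper: count bytes with wc -c, independent of the locale.
byte_len() { printf '%s' "$1" | wc -c | tr -d ' '; }

byte_len "$plain"    # 8
byte_len "$utf8"     # 2
byte_len "$cp1252"   # 1

# Build a filename ending in U+00E9 and test for it, in a scratch directory.
dir=$(mktemp -d)
touch "$dir/foo$utf8"
test -f "$dir/foo$utf8"; echo $?    # 0
rm -r "$dir"
```

Note there is no \x18 anywhere: with UTF-8 selected, the encoded bytes
of the character are all that is needed.
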
>
> Perhaps interesting is that if I pipe the ls command built with tab
> completion that actually prints the filename to "od -c" I see
> Then for kicks I tried:
>
> $ touch "\x18"; echo $?
> 0

Have a look in your root directory. There should be a file called x18
there.

> Can someone give me a hand coming up with a command line where I can build
> up filenames that contain characters that have the high bit set (as well as
> any non-ascii character really)?

Just type them in. The 'US International' keyboard layout might be
useful here. See
http://en.wikipedia.org/wiki/Keyboard_layout#US-International.

Otherwise, use $'...', and lose the unnecessary \x18s.

Andy