Mail Archives: cygwin/2010/03/16/02:19:21

In-Reply-To: <493F5820D3F64434A76F433604C79D4A@pleaset>
References: <493F5820D3F64434A76F433604C79D4A AT pleaset>
Date: Tue, 16 Mar 2010 07:19:05 +0000
Message-ID: <416096c61003160019p24e58433x4a969c0f99068fa6@mail.gmail.com>
Subject: Re: filenames with characters that have the high bit set
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: dbyron AT dbyron DOT com, cygwin AT cygwin DOT com

David Byron:
> I've read http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode and
> http://cygwin.com/cygwin-ug-net/setup-locale.html but I'm still stumped.
>
> My cygwin.bat now contains:
>
> @echo off
>
> C:
> chdir C:\utils\cygwin\bin
> set LANG=en_US.UTF-8
> bash --login -i
>
> And my ~/.inputrc contains:
>
> set meta-flag on
> set convert-meta off
> set input-meta on
> set output-meta on

Makes plenty of sense. But note that meta-flag is a synonym for
input-meta, so you can remove one of them.

> $ echo $LC_ALL
> en_US

Hang on, where did that come from? LC_ALL overrides any other locale
variables including LANG. Specifying a locale without a charset means
that Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're on a US
system, this means you're getting CP1252, not UTF-8. (As an aside:
Cygwin 1.7.2 switches to a Linux-compatible scheme for locales without
an explicit charset, where you'd get ISO-8859-1 instead.)
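
The precedence described above can be sketched in shell. This is only an
illustration of the usual POSIX lookup order for the character-type
category (LC_ALL, then LC_CTYPE, then LANG); the helper name
effective_ctype is made up, and the fallback to "C" when nothing is set is
an assumption, since that default is implementation-defined.

```shell
# Hypothetical helper: report which setting decides the character-type locale.
# Arguments: the values of LC_ALL, LC_CTYPE and LANG, in that order.
effective_ctype() {
    if [ -n "$1" ]; then
        echo "$1"    # LC_ALL wins whenever it is set and non-empty
    elif [ -n "$2" ]; then
        echo "$2"    # otherwise the category-specific variable
    elif [ -n "$3" ]; then
        echo "$3"    # otherwise LANG
    else
        echo "C"     # assumed default when nothing is set
    fi
}

# Show what the current environment resolves to.
effective_ctype "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```

With the values from the post (LC_ALL=en_US, LANG=en_US.UTF-8) the helper
reports en_US, so LANG never gets a say until LC_ALL is unset.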


> $ echo $LANG
> en_US.UTF-8
>
> For the rest of this post, assume <special_filename> is "foo" with U+00E9 (e
> with acute accent) at the end.
>
> $ test -f <special_filename>; echo $?
>
> prints 1 when <special_filename> really does exist....depending on how I try
> to represent U+00E9 on the command line
>
> $ ls foo<tab>
>
> adds the actual accented character to the command line (whether set
> show-all-if-ambiguous on is in ~/.inputrc or not). Then I press return and
> ls prints the filename. Then if I go through command history and change
> "ls" to "test -f" and add the "; echo $?" I get the right answer from test.
> So far so good.
>
> But, if I try to do what
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> says, the test command always fails, and ls doesn't print the filename. I'm
> not really sure how to get hex code 0x18 through bash and to
> ls/test/whatever properly.
>
> This is what I tried:
>
> $ ls "foo\x18<tab>"
> $ ls "foo\x18\xc3\xa9<tab>"
> $ ls "foo\x18\xc3\xa9*"
>
> Note that 0xC3A9 is the UTF-8 encoding of U+00E9.

There's a bunch of things wrong here.

Due to the LC_ALL setting above, U+00E9 is encoded as \xE9, not \xC3\xA9.
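
The two encodings can be compared directly. This is a rough sketch that
assumes iconv and od are installed and that iconv accepts the charset name
CP1252; hex_bytes is a made-up helper name.

```shell
# Hypothetical helper: print stdin as lowercase hex with no separators.
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# U+00E9 written as its UTF-8 bytes (octal escapes are portable in printf).
utf8_e_acute=$(printf '\303\251' | hex_bytes)

# The same character converted to CP1252, the charset picked here by LC_ALL=en_US.
cp1252_e_acute=$(printf '\303\251' | iconv -f UTF-8 -t CP1252 | hex_bytes)

echo "UTF-8:  $utf8_e_acute"
echo "CP1252: $cp1252_e_acute"
```

The first line should show two bytes and the second a single byte, which is
why a filename typed as \xC3\xA9 does not match under a CP1252 locale.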

The \x18 scheme is only used for codepoints that cannot be
represented in the selected character set, yet U+00E9 can be
represented in CP1252. By definition, any Unicode codepoint can be
represented in UTF-8, so the \x18 scheme is never used when that is
selected.

Bash does not interpret \x specially when it appears in double quotes
(or single quotes or unquoted):

$ echo "\x18"
\x18

To enable C-style backslash interpretation, you need to use $'...' quoting.
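
A small bash sketch of the difference, dumping the bytes each quoting style
actually produces. It assumes bash (the $'...' form is not available in
every sh) and uses a made-up helper name, hex_of.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a string's bytes as hex with no separators.
hex_of() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

literal="foo\xe9"     # double quotes: backslash, x, e, 9 stay as typed
decoded=$'foo\xe9'    # $'...': \xe9 becomes the single byte 0xE9

hex_of "$literal"; echo
hex_of "$decoded"; echo
```

The first dump contains 5c 78 65 39 (the four literal characters), while
the second ends in the single byte e9.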

Finally, it would appear that bash does not complete partial UTF-8
sequences, which makes sense, as it's probably dealing with wide
characters internally.

> But all get me nothing. Replacing "ls" with "test -f" gives me the same
> nothing. Replacing \x with \X doesn't change anything either.
>
> Perhaps interesting is that if I pipe the ls command built with tab
> completion that actually prints the filename to "od -c" I see
> Then for kicks I tried:
>
> $ touch "\x18"; echo $?
> 0

Have a look in your root directory. There should be a file called x18 there,
since Cygwin treats the backslash in "\x18" as a path separator.

> Can someone give me a hand coming up with a command line where I can build
> up filenames that contain characters that have the high bit set (as well as
> any non-ascii character really)?

Just type them in. The 'US International' keyboard layout might be
useful here. See
http://en.wikipedia.org/wiki/Keyboard_layout#US-International.

Otherwise, use $'...', and lose the unnecessary \x18s.
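
As a rough illustration of that last suggestion, the following bash sketch
creates and then tests the file from the original question. It assumes
LC_ALL has been unset so that UTF-8 is the active charset, and that
mktemp -d is available; the temporary directory is only there to keep the
experiment out of the way.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory (mktemp -d is assumed to be available).
dir=$(mktemp -d)

# "foo" followed by U+00E9 as UTF-8 bytes; no \x18 prefix is involved.
name=$'foo\xc3\xa9'

touch "$dir/$name"

# Same check as in the original question: 0 means the file was found.
test -f "$dir/$name"
status=$?
echo "$status"
```

Under a CP1252 locale the name would instead be $'foo\xe9'.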

Andy

