Mail Archives: cygwin/2009/09/21/14:54:26

In-Reply-To: <20090921161014.GI20981@calimero.vinschen.de>
References: <h8bk5a$big$1 AT ger DOT gmane DOT org> <416096c60909101512l6e42ab72l4ba5fd792363eefd AT mail DOT gmail DOT com> <h8p50e$im8$1 AT ger DOT gmane DOT org> <20090921161014 DOT GI20981 AT calimero DOT vinschen DOT de>
Date: Mon, 21 Sep 2009 19:54:07 +0100
Message-ID: <416096c60909211154u5ddd5869v986011aa4ee13d57@mail.gmail.com>
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

2009/9/21 Corinna Vinschen:
>> % cat t.c
>> int main() {
>>     fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>>     fopen("b-\xF6\xE4\xFC\xDFz", "w");
>>     fopen("c-\xF6\xE4\xFC\xDFzz", "w");
>>     fopen("d-\xF6\xE4\xFC\xDFzzz", "w");
>>     fopen("e-\xF6\xE4\xFC\xDF\xF6\xE4\xFC\xDF", "w");
>>     return 0;
>> }
>
> Ok, I see what happens.  The problem is that the mechanism which is
> supposed to handle invalid multibyte sequences handles the first such
> byte, but misses to reset the multibyte shift state after the byte has
> been handled.  Basically, resetting the shift state after such a
> sequence has been encountered fixes that problem.

Great!
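
For reference, a rough sketch of that fix in portable C. This is not the
actual Cygwin source; the function name to_wide_transposing and the buffer
handling are made up for illustration. It walks a byte string with
mbrtowc(), and whenever a byte cannot be converted it stores 0xDC00 | byte
and zeroes the mbstate_t before carrying on:

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert src to wide characters, transposing every
 * byte that mbrtowc() rejects into the 0xDC00 range and resetting the
 * shift state afterwards. Returns the number of wide characters stored. */
size_t to_wide_transposing(const char *src, wchar_t *dst, size_t cap)
{
    mbstate_t st;
    size_t left = strlen(src);
    size_t n = 0;

    memset(&st, 0, sizeof st);              /* start in the initial state */
    while (left > 0 && n < cap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, src, left, &st);

        if (r == (size_t)-1 || r == (size_t)-2) {
            /* Invalid or truncated sequence: keep the single offending
             * byte by or'ing it with 0xDC00. */
            wc = (wchar_t)(0xDC00 | (unsigned char)*src);
            r = 1;
            /* The step that was missing: ISO C leaves the conversion
             * state unspecified after an encoding error, so start over
             * from the initial state. */
            memset(&st, 0, sizeof st);
        } else if (r == 0) {
            break;                          /* embedded NUL, stop here */
        }
        dst[n++] = wc;
        src += r;
        left -= r;
    }
    return n;
}
```

Which bytes count as invalid depends on the locale selected with
setlocale(), so the same input can give different results under different
LANG settings.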


> Unfortunately this is only the first half of a solution.  This is what
> `ls' prints after running t:
>
>  $ ls -l --show-control-chars
>  total 21
>  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 a-öäüß
>  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 c-öäüßzz
>  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 d-öäüßzzz
>  -rw-r--r-- 1 corinna vinschen     0 Sep 21 17:35 e-öäüßöäüß
>
> But this is what ls prints when setting $LANG to something "non-C":
>
>  $ setenv LANG en      (implies codepage 1252)
>  $ ls -l --show-control-chars
>  ls: cannot access a-öäüß: No such file or directory
>  ls: cannot access c-öäüßzz: No such file or directory
>  ls: cannot access d-öäüßzzz: No such file or directory
>  ls: cannot access e-öäüßöäüß: No such file or directory
>  total 21
>  -????????? ? ?       ?            ?                ? a-öäüß
>  -????????? ? ?       ?            ?                ? c-öäüßzz
>  -????????? ? ?       ?            ?                ? d-öäüßzzz
>  -????????? ? ?       ?            ?                ? e-öäüßöäüß

Btw, the same thing will happen with en.C-ISO-8859-1 or C.ASCII too.


> As you might know, invalid bytes >= 0x80 are translated to UTF-16 by
> transposing them into the 0xdc00 - 0xdcff range by just or'ing 0xdc00.
> The problem now is that readdir() will return the transposed characters
> as if they are the original characters.

Yep, that's where the bug is. Those 0xDC?? words represent invalid
UTF-8 bytes. They do not represent CP1252 or ISO-8859-1 characters.

Therefore, when converting a UTF-16 Windows filename to the current
charset, 0xDC?? words should be treated like any other UTF-16 word
that can't be represented in the current charset: it should be encoded
as a ^N sequence.
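
In code, the rule amounts to something like the following sketch. The
function names are hypothetical, ISO-8859-1 stands in for "the current
charset", and the byte layout of the ^N sequence itself is left out:

```c
#include <stdbool.h>
#include <stdint.h>

/* True for the words produced by or'ing an invalid byte >= 0x80
 * with 0xDC00, i.e. 0xDC80 through 0xDCFF. */
bool is_transposed_byte(uint16_t w)
{
    return w >= 0xDC80 && w <= 0xDCFF;
}

/* Try to express one UTF-16 word as a single ISO-8859-1 byte.
 * Returns false when the word has no such representation, in which
 * case the caller would have to emit a ^N sequence instead. */
bool utf16_word_to_latin1(uint16_t w, unsigned char *out)
{
    /* A transposed byte is a marker for an invalid UTF-8 byte, not a
     * character of the charset, so it must never be narrowed back to
     * its low byte. For ISO-8859-1 the range test below already covers
     * it; the explicit check matters for table-driven charsets. */
    if (is_transposed_byte(w) || w > 0xFF)
        return false;
    *out = (unsigned char)w;
    return true;
}
```

With this, U+00F6 comes back as byte 0xF6, while 0xDCF6 is reported as
not representable rather than being returned as 0xF6.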


> ls uses some mbtowc function
> to create a valid widechar string, and then uses the resulting widechar
> string in some wctomb function to call stat().

It's not 'ls' that does that conversion. On the POSIX side, filenames
are simply sequences of bytes, hence 'ls' would be very wrong to do
any conversion between readdir() and stat().

No, it's stat() itself converting the CP1252 sequence "a-öäüß" to
UTF-16, which yields L"a-öäüß". This does not contain the 0xDC??
codepoints that the actual filename contained, hence stat() fails.
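
The mismatch is easy to reproduce outside Cygwin. A small sketch, with
hypothetical helper names, that assumes every byte >= 0x80 in the name was
rejected as invalid UTF-8 when the file was created (true for the names in
t.c) and uses ISO-8859-1 widening in place of CP1252, which is identical
for these four bytes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Name as it ended up on disk: ASCII copied, every high byte transposed
 * into the 0xDC00 range (assumption: all high bytes were invalid UTF-8). */
size_t name_as_stored(const unsigned char *s, size_t n, uint16_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (s[i] < 0x80) ? s[i] : (uint16_t)(0xDC00 | s[i]);
    return n;
}

/* Name as a lookup in an 8-bit locale converts it: each byte is taken
 * as a character of the charset and widened to its code point. */
size_t name_as_looked_up(const unsigned char *s, size_t n, uint16_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = s[i];
    return n;
}

/* Word-for-word comparison of two UTF-16 names. */
bool same_name(const uint16_t *a, size_t na, const uint16_t *b, size_t nb)
{
    return na == nb && memcmp(a, b, na * sizeof *a) == 0;
}
```

For the bytes of "a-\xF6\xE4\xFC\xDF" the stored name contains 0xDCF6
where the looked-up name contains 0x00F6, so same_name() reports a
mismatch, which corresponds to the "No such file or directory" above.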


> So it looks like the current mechanism to handle invalid multibyte
> sequences is too complicated for us.  As far as I can see, it would be
> much simpler and less error prone to translate the invalid bytes simply
> to the equivalent UTF-16 value.  That creates filenames with UTF-16
> values from the ISO-8859-1 range.

This won't work correctly, because different POSIX filenames will map
to the same Windows filename. For example, the filenames "\xC3\xA4"
(valid UTF-8 for a-umlaut) and "\xE4" (invalid UTF-8 sequence that
represents a-umlaut in 8859-1) will both map to Windows filename
"U+00E4", i.e. a-umlaut in UTF-16. Furthermore, after creating a file
called "\xE4", a readdir() would show that file as "\xC3\xA4".
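
To make the collision concrete, here is a rough sketch (not Cygwin's
conversion code; the names are invented, and the decoder only understands
ASCII and two-byte UTF-8 sequences, treating everything else as invalid).
It uses lowercase a-umlaut, U+00E4, whose UTF-8 encoding is C3 A4 and
whose ISO-8859-1 byte is E4:

```c
#include <stddef.h>
#include <stdint.h>

/* What to do with a byte that is not part of a valid sequence. */
enum invalid_policy {
    TRANSPOSE,  /* current scheme: 0xDC00 | byte                  */
    WIDEN       /* proposed scheme: same value as in ISO-8859-1   */
};

/* Simplified UTF-8 to UTF-16 conversion, limited to one- and two-byte
 * sequences for the sake of the example. Returns the word count. */
size_t posix_name_to_utf16(const unsigned char *s, size_t n,
                           uint16_t *out, enum invalid_policy policy)
{
    size_t i = 0, k = 0;

    while (i < n) {
        unsigned char b = s[i];

        if (b < 0x80) {                       /* plain ASCII */
            out[k++] = b;
            i += 1;
        } else if (b >= 0xC2 && b <= 0xDF && i + 1 < n
                   && (s[i + 1] & 0xC0) == 0x80) {
            /* valid two-byte sequence: 110xxxxx 10xxxxxx */
            out[k++] = (uint16_t)(((b & 0x1F) << 6) | (s[i + 1] & 0x3F));
            i += 2;
        } else {                              /* invalid byte */
            out[k++] = (policy == TRANSPOSE) ? (uint16_t)(0xDC00 | b) : b;
            i += 1;
        }
    }
    return k;
}
```

Under WIDEN, the inputs {0xC3, 0xA4} and {0xE4} both come out as the
single word 0x00E4, so two distinct POSIX names share one Windows name.
Under TRANSPOSE the second input becomes 0xDCE4 and the names stay
distinct.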

Note also that invalid UTF-8 sequences would be much less of an issue
if the C locale didn't mix UTF-8 filenames with an ISO-8859-1 console.
They'd still occur e.g. when unpacking a tarball with ISO-encoded
filenames while a UTF-8 locale is active. However, that sort of
situation is not handled well on Linux either.

Regards,
Andy
