Mail Archives: cygwin/2009/09/10/18:12:49

MIME-Version: 1.0
In-Reply-To: <h8bk5a$big$1@ger.gmane.org>
References: <h8bk5a$big$1 AT ger DOT gmane DOT org>
Date: Thu, 10 Sep 2009 23:12:38 +0100
Message-ID: <416096c60909101512l6e42ab72l4ba5fd792363eefd@mail.gmail.com>
Subject: Re: [1.7] Invalid UTF8 while creating a file -> cannot delete?
From: Andy Koppe <andy DOT koppe AT gmail DOT com>
To: cygwin AT cygwin DOT com

2009/9/10 Lapo Luchini:
> But the real problem with that test is not really what shows and how,
> the biggest problem is that it seems that filenames created with a
> "wrong" filename are quite limited in usage and can't seemingly be deleted.
>
> % export LANG=en_EN.UTF-8
> % cat t.c
> #include <stdio.h>
> int main() {
>    fopen("a-\xF6\xE4\xFC\xDF", "w"); //ISO-8859-1
>    fopen("b-\xC3\xB6\xC3\xA4\xC3\xBc\xC3\x9F", "w"); //UTF-8
>    return 0;
> }
> % gcc -o t t.c
> % mkdir test ; cd test ; ../t ; cd ..
> % ls -l test
> ls: cannot access test/a-▒▒▒: No such file or directory
> total 0
> -????????? ? ?    ?    ?                ? a-▒▒▒
> -rw-r--r-- 1 lapo None 0 2009-09-10 21:19 b-öäüß
> % find test
> test
> test/a-???
> test/b-öäüß
> % find test -delete
> find: cannot delete `test/a-\366\344\374': No such file or directory

Hmm, we've lost the \xDF somewhere, and I'd guess it was when the
filename got translated to UTF-16 in fopen(), which would explain what
you're seeing:

'find' reads the filename correctly, invokes remove() on it, which
translates it to UTF-16 again, whereby we lose a second byte, so we're
down to a-\366\344, which can't be deleted because it doesn't exist.

>    remove("a-\xF6\xE4\xFC\xDF");

Now here we start with the full name again, so if we lose the last
byte we get what's actually on disk, hence the call succeeds.
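
To make that reasoning concrete, here is a toy model in C. It assumes
my guess above is right, i.e. that every translation of this name drops
its final byte. lossy_translate() and remove_would_succeed() are
hypothetical names invented for this sketch; they are not Cygwin
functions and nothing here touches the filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the suspected faulty translation:
   copies `name` into `out` but drops its final byte. */
static void lossy_translate(const char *name, char *out, size_t outsz)
{
    assert(outsz > 0);
    size_t n = strlen(name);
    if (n > 0)
        n--;                 /* the byte that goes missing */
    if (n >= outsz)
        n = outsz - 1;       /* never overrun the caller's buffer */
    memcpy(out, name, n);
    out[n] = '\0';
}

/* Toy model of remove(): the argument is translated (lossily) and
   the result must match the on-disk name exactly. */
static int remove_would_succeed(const char *arg, const char *on_disk)
{
    char translated[256];
    lossy_translate(arg, translated, sizeof translated);
    return strcmp(translated, on_disk) == 0;
}
```

In this model the on-disk name is lossy_translate("a-\xF6\xE4\xFC\xDF"),
which is a-\366\344\374. Passing that on-disk name to
remove_would_succeed() fails, as 'find' does, while passing the full
four-byte original succeeds.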

Bytes that don't contribute to valid UTF-8 characters get mapped to a
certain subrange of UTF-16 low surrogates starting at 0xDC80, which is
a clever trick for encoding such bytes into UTF-16 and getting them
back out unchanged when converting back.
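
A minimal sketch of that trick in C, assuming the mapping is simply
0xDC00 plus the byte value, so that bytes 0x80..0xFF land on
0xDC80..0xDCFF. That offset is an assumption for illustration and is
not copied from the Cygwin source; escape_byte() and unescape_unit()
are made-up names. The trick works because well-formed UTF-16 never
contains an unpaired low surrogate, so these units cannot collide with
real text.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed mapping: an undecodable byte b (0x80..0xFF) becomes the
   UTF-16 code unit 0xDC00 + b, i.e. somewhere in 0xDC80..0xDCFF. */
static uint16_t escape_byte(unsigned char b)
{
    assert(b >= 0x80);  /* ASCII bytes are always valid UTF-8 */
    return (uint16_t)(0xDC00u + b);
}

/* Reverse direction: returns 1 and stores the original byte in *out
   if u is one of the escape units, otherwise returns 0. */
static int unescape_unit(uint16_t u, unsigned char *out)
{
    if (u < 0xDC80u || u > 0xDCFFu)
        return 0;
    *out = (unsigned char)(u - 0xDC00u);
    return 1;
}
```

Under that assumption the four bytes \xF6 \xE4 \xFC \xDF would become
0xDCF6 0xDCE4 0xDCFC 0xDCDF, and converting back would have to yield
all four bytes again for the round trip to be lossless.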

I stared at the code for this in sys_cp_mbstowcs for a bit, but
haven't spotted where those missing bytes might have gone.

Andy

