Mail Archives: cygwin/2025/07/24/09:42:13

Message-ID: <b0a32549-77da-4c0f-b118-79617800faea@towo.net>
Date: Thu, 24 Jul 2025 15:41:10 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: readdir() returns inaccessible name if file was created with
invalid UTF-8
To: cygwin AT cygwin DOT com
References: <aFxRfI4NdZ8y5IlK AT calimero DOT vinschen DOT de>
<f78c615c-aefe-b3d0-aada-5f9d0cf73a0a AT t-online DOT de>
<aF5y15iQ840LxLYJ AT calimero DOT vinschen DOT de>
<ca205dbd-907f-4552-9e5c-2cb0050f83a3 AT towo DOT net>
<aH-MtwqARmDmLwoo AT calimero DOT vinschen DOT de>
<91f26856-72b0-483b-8d04-bd90a27b6be0 AT towo DOT net>
<4ab2c1b7-3164-4556-ba36-29814ecf5766 AT towo DOT net>
<68f65634-8f4e-436b-ba6a-d30bdf882aaa AT towo DOT net>
<aICVBQzWUiCYwnL2 AT calimero DOT vinschen DOT de>
<11282182-60d1-4841-bf78-5ef78cf30060 AT towo DOT net>
<aIILWiKsr99DOaI8 AT calimero DOT vinschen DOT de>
Cc: Christian Franke <Christian DOT Franke AT t-online DOT de>,
Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
In-Reply-To: <aIILWiKsr99DOaI8@calimero.vinschen.de>
From: Thomas Wolff via Cygwin <cygwin AT cygwin DOT com>
Reply-To: Thomas Wolff <towo AT towo DOT net>

On 24.07.2025 at 12:30, Corinna Vinschen wrote:
> Hi Thomas, hi Christian,
>
> On Jul 23 17:50, Thomas Wolff via Cygwin wrote:
>> On 23.07.2025 at 09:53, Corinna Vinschen via Cygwin wrote:
>>> On Jul 23 05:44, Thomas Wolff via Cygwin wrote:
>>> What bugs me is that we have the choice between a broken mbrtowc on
>>> one side and a chance to generate broken filenames on the other side.
>> I did not look into those details, but while characters to be handled by a
>> terminal come sequentially as a stream, filenames can be handled as a
>> compound string, isn't that easier to check?
>>
>>> I think we should actually revert fa272e05bbd0 ("wcstombs: also call
>>> __WCTOMB on terminating NUL if output buffer is NULL") and see if we can
>>> fix the filename issue in the Cygwin functions for filename conversion
>>> alone.
>>>
>>> Any ideas appreciated.
> I think I have a fix.  I reverted fa272e05bbd0 so mbrtowc is operating
> as before.  This should fix mintty.
>
> As for the filename problem, I had another look into the _sys_wcstombs
> and _sys_mbstowcs functions.
>
> It occurred to me that the algorithm for handling an invalid MB sequence
> is upside down when it comes to invalid UTF-8 4-byte sequences.
>
> Consider a simple broken 2-byte UTF-8 sequence like 0xc2 0x7f.  This
> sequence is converted to a wide-char sequence using the private use
> area like this:
>
>    0xc2 0x7f -> 0xf0c2 0x007f
>
> So the first byte of the sequence is wrong, so it's converted to 0xf0xx.
> At this point, we reset the mbstate and try the mbtowc conversion again
> with byte 2.  Byte 2 is now a valid single byte.  Hence 0xf0c2 0x007f.
> Also
>
>    0xc2 0xff -> 0xf0c2 0xf0ff
>
> because 0xc2 0xff is not valid and 0xff is not a valid lead byte.
>
> Now consider a broken 3 byte sequence.  Same as above:
>
>    0xe0 0xa0 0x7f -> 0xf0e0 0xf0a0 0x7f
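A minimal sketch of the per-byte mapping described above, in Python rather than Cygwin's C. It is only a model: the handler name "pua-escape" and the helper to_units() are made up for illustration, and Python's built-in UTF-8 decoder stands in for the mbtowc loop. It assumes that every byte of a broken sequence 0xXX becomes the wide char 0xf0XX and that decoding resumes at the first byte after the broken part.

```python
import codecs

PUA_BASE = 0xF000  # assumption from the thread: invalid byte 0xXX -> U+F0XX


def _pua_escape(exc):
    """Error handler: map each undecodable byte to U+F0xx, then resume."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE | b) for b in bad), exc.end


# "pua-escape" is a hypothetical handler name used only in this sketch.
codecs.register_error("pua-escape", _pua_escape)


def to_units(data):
    """Decode bytes as UTF-8 and return the result as 16-bit code units."""
    raw = data.decode("utf-8", errors="pua-escape").encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


for sample in (b"\xc2\x7f", b"\xc2\xff", b"\xe0\xa0\x7f"):
    print(sample.hex(" "), "->", " ".join(f"{u:04x}" for u in to_units(sample)))
```

The three samples are the ones from the message, and the model reproduces the same wide-char sequences for them.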
>
> Now the 4 byte sequence with a broken 4th byte:
>
>    0xf0 0x90 0x80 0x7f -> 0xd800 0xf07f
>
> What's wrong here is the fact that the broken sequence results in
> a valid high surrogate and the trailing 4th byte is treated as the
> broken sequence.
>
> But in fact the leading three bytes are the broken sequence.  The
> current algorithm doesn't catch that, because it's already done
> and handled.  So the innocent 4th byte has to take the punch.
>
> I added a patch to _sys_mbstowcs:
> - note the fact we already got a high surrogate
> - if the next underlying mbtowc call returns an error, backtrack
>    to the high surrogate in the output string and overwrite it with
>    a per-byte sequence in the private use area
> - reset mbstate
> - retry the next byte after the broken sequence
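The backtracking steps listed above can be modelled roughly as follows. This is a Python sketch under stated assumptions, not the actual _sys_mbstowcs patch: the function name sys_mbstowcs_model and its backtrack flag are hypothetical, the model assumes the high surrogate is emitted once the first three bytes of a 4-byte sequence have been consumed, and it simplifies the handling of broken 2- and 3-byte sequences to "escape one byte, retry at the next".

```python
PUA_BASE = 0xF000  # assumption from the thread: invalid byte 0xXX -> 0xf0XX


def is_cont(b):
    """True for a UTF-8 continuation byte (10xxxxxx)."""
    return 0x80 <= b <= 0xBF


def sys_mbstowcs_model(data, backtrack=True):
    """Convert bytes to 16-bit code units; backtrack=False mimics the old flaw."""
    out, i, n = [], 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:  # plain ASCII
            out.append(b0)
            i += 1
            continue
        if 0xF0 <= b0 <= 0xF4 and i + 2 < n and is_cont(data[i + 1]) and is_cont(data[i + 2]):
            partial = ((b0 & 0x07) << 18) | ((data[i + 1] & 0x3F) << 12) | ((data[i + 2] & 0x3F) << 6)
            if 0x10000 <= partial <= 0x10FFFF:
                # The high surrogate is emitted before the 4th byte is seen.
                out.append(0xD800 + ((partial - 0x10000) >> 10))
                i += 3
                if i < n and is_cont(data[i]):
                    out.append(0xDC00 | ((partial | (data[i] & 0x3F)) & 0x3FF))
                    i += 1
                elif backtrack:
                    # New behaviour: overwrite the high surrogate with the three
                    # escaped lead bytes, then retry at the byte that follows.
                    out.pop()
                    out.extend(PUA_BASE | b for b in data[i - 3:i])
                elif i < n:
                    # Old behaviour: the innocent 4th byte takes the punch.
                    out.append(PUA_BASE | data[i])
                    i += 1
                continue
        # 2- and 3-byte sequences, or anything invalid: decode or escape one byte.
        length = 2 if 0xC2 <= b0 <= 0xDF else 3 if 0xE0 <= b0 <= 0xEF else 0
        chunk = data[i:i + length]
        try:
            if length == 0 or len(chunk) < length:
                raise ValueError("invalid or truncated sequence")
            out.append(ord(chunk.decode("utf-8")))
            i += length
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(PUA_BASE | b0)
            i += 1
    return out


broken = b"\xf0\x90\x80\x7f"
print([f"{u:04x}" for u in sys_mbstowcs_model(broken, backtrack=False)])  # ['d800', 'f07f']
print([f"{u:04x}" for u in sys_mbstowcs_model(broken, backtrack=True)])   # ['f0f0', 'f090', 'f080', '007f']
```

With backtrack=False the model reproduces the 0xd800 0xf07f result quoted above. With backtrack=True the three leading bytes are escaped and the 4th byte survives as itself.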
>
> As far as my testing goes, all cases with broken filenames should
> work now.  The upcoming test release 3.7.0-0.261.gf21fbcaf583e
> will contain the patch.
>
> However, there's one problem left.  I added a FIXME comment to
> _sys_wcstombs:
>
>     FIXME? The conversion of invalid bytes from the private use area
>     like we do here is not actually necessary.  If we skip it, the
>     generated multibyte string is not identical to the original multibyte
>     string, but it's equivalent in the sense, that another mbstowcs will
>     generate the same wide-char string.  It would also be identical to
>     the same string converted by wcstombs.  And while the original
>     multibyte string can't be converted by mbstowcs, this string can.
>
> What does that mean?  Consider this UTF8 input string:
>
>    0xf0 0x90 0x80 0x2e
>
>    mbstowcs:     returns -1
>    sys_mbstowcs: f0f0 f090 f080 002e
>
> Let's convert it back to multibyte:
>
>    sys_wcstombs: 0xf0 0x90 0x80 0x2e
>    wcstombs:     0xef 0x83 0xb0 0xef 0x82 0x90 0xef 0x82 0x80 0x2e
>
> So while sys_wcstombs has special code converting the string back to its
> original MB string, wcstombs converts to the CESU-8 representation.
>
> This is transparent.  If we convert this CESU-8 string back to
> wide-char, the resulting wide-char strings are the same:
>
>    mbstowcs:     f0f0 f090 f080 002e
>    sys_mbstowcs: f0f0 f090 f080 002e
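The round trip above can be checked with a short Python sketch. The two function names are hypothetical stand-ins for wcstombs and _sys_wcstombs, and the escape range 0xf000..0xf0ff is an assumption made for this model (the real code may restrict it further). Note that for these BMP code points the representation the message calls CESU-8 coincides with the ordinary 3-byte UTF-8 encoding.

```python
PUA_FIRST, PUA_LAST = 0xF000, 0xF0FF  # assumed escape range: 0xf0XX <-> byte 0xXX


def plain_wcstombs(units):
    """Ordinary UTF-8 encoding of BMP code units (no surrogate handling)."""
    return "".join(chr(u) for u in units).encode("utf-8")


def unescaping_wcstombs(units):
    """Like plain_wcstombs, but turn 0xf0XX back into the raw byte 0xXX."""
    out = bytearray()
    for u in units:
        if PUA_FIRST <= u <= PUA_LAST:
            out.append(u & 0xFF)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)


wide = [0xF0F0, 0xF090, 0xF080, 0x002E]
print(unescaping_wcstombs(wide).hex(" "))  # f0 90 80 2e
print(plain_wcstombs(wide).hex(" "))       # ef 83 b0 ef 82 90 ef 82 80 2e

# Decoding the plain form again yields the same wide-char string.
roundtrip = [ord(c) for c in plain_wcstombs(wide).decode("utf-8")]
print(roundtrip == wide)
```

The plain form is valid UTF-8, so any standard decoder accepts it, whereas the unescaped form restores the original invalid bytes.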
>
> So the question here is, shall we keep the special case converting
> private use area bytes back to their original byte encoding?
>
> Or shall we simply go along with CESU-8 when converting back to multibyte
> to keep the string the same as with wcstombs?
>
> Exempt from this are the characters not valid in a DOS filename.
> These will always be converted if we create wide-char filenames.
Sounds like a fair solution with only minor glitches. Poor 4th byte, but
thanks a lot anyway.
As for the last question: unless there is a strong argument for keeping it,
I'd prefer to drop the special-case conversion of private-use-area
characters back to their original bytes in _sys_wcstombs (but don't count
my vote, I don't care much about that).
Thomas


> Thanks,
> Corinna


