Date: Fri, 27 Jun 2025 12:30:47 +0200
From: Corinna Vinschen via Cygwin
Reply-To: cygwin AT cygwin DOT com
To: cygwin AT cygwin DOT com
Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8

Hi Christian,

On Jun 26 19:07, Christian Franke via Cygwin wrote:
> Corinna Vinschen via Cygwin wrote:
> > On Jun 25 16:59, Christian Franke via Cygwin wrote:
> > > On Sun, 15 Sep 2024 19:47:11 +0200, Christian Franke wrote:
> > > > If a file name contains an invalid (truncated) UTF-8 sequence, open()
> > > > does not refuse to create the file. Later readdir() returns a different
> > > > name which could not be used to access the file.
> > > >
> > > > Testcase with U+1F321 (Thermometer):
> > > >
> > > > $ uname -r
> > > > 3.5.4-1.x86_64
> > > >
> > > > $ printf $'\U0001F321' | od -A none -t x1
> > > >  f0 9f 8c a1
> > > >
> > > > $ touch 'file1-'$'\xf0\x9f\x8c\xa1''.ext'
> > > >
> > > > $ touch 'file2-'$'\xf0\x9f\x8c''.ext'
> > > >
> > > > $ touch 'file3-'$'\xf0\x9f\x8c'
> > > >
> > > > $ ls -1
> > > > ls: cannot access 'file2-.?ext': No such file or directory
> > > > ls: cannot access 'file3-': No such file or directory
> > > > 'file1-'$'\360\237\214\241''.ext'
> > > > file2-.?ext
> > > > file3-
> > > > [...]
> > I don't know exactly where this happens, but the input of the
> > conversion is invalid UTF-8 because it's missing the 4th byte.
> > There's no way to represent these filenames on Windows
> > filesystems storing filenames as UTF-16 values.
> >
> > So the problem here is that the conversion somehow misses that
> > the 4th byte is invalid and just plods forward and converts the
> > leading three bytes into the matching high surrogate value and
> > then stumbles over the conversion for the low surrogate.
> >
> > It would be really helpful to have an STC for this problem.
>
> With some trial and error I found a testcase for this more serious problem
> reported yesterday but not quoted above:
>
> > > In cases like file3-... above, the converted Windows path ends with
> > > 0xF000.
> > > This suggests that this is an accidental conversion of the
> > > terminating null to the 0xF0xx range.
> > >
> > > In some cases, the created Windows file name has random garbage
> > > behind the 0xF000. Then even Cygwin is not able to access or unlink
> > > the file after creation.
>
> Testcase (attached):

Thanks for the testcase! I found the problem in the newlib core function
creating wchar_t from UTF-8 input. In case of 4 byte UTF-8 sequences,
the code created the low surrogate already after reading byte 3, without
checking if byte 4 of the UTF-8 sequence is a valid byte. Hilarity ensues.

Fortunately this bug has only been introduced very recently, to wit, on
2009-03-24, a mere 16 years ago. And it is my bug and mine alone :}

I'm just prep'ing a fix which I'll push in a minute or two.

Thanks,
Corinna
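For illustration only: the check described above (validate byte 4 before producing any surrogate) could look roughly like the sketch below. This is not the newlib code and not the actual fix. The function name utf8_4byte_to_utf16 and its interface are hypothetical, and it handles a single 4-byte sequence with no multibyte state kept between calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the newlib function: decode one 4-byte UTF-8
   sequence at s (n bytes available) into a UTF-16 surrogate pair.
   Returns 4 on success, -1 if the sequence is truncated or invalid.
   Nothing is written to out[] unless all four bytes have been validated. */
static int
utf8_4byte_to_utf16 (const unsigned char *s, size_t n, uint16_t out[2])
{
  assert (s != NULL && out != NULL);

  /* The lead byte of a 4-byte sequence is 0xF0..0xF4. */
  if (n < 4 || s[0] < 0xF0 || s[0] > 0xF4)
    return -1;

  /* Check ALL continuation bytes (10xxxxxx) first, including byte 4. */
  for (int i = 1; i < 4; i++)
    if ((s[i] & 0xC0) != 0x80)
      return -1;

  uint32_t cp = ((uint32_t) (s[0] & 0x07) << 18)
              | ((uint32_t) (s[1] & 0x3F) << 12)
              | ((uint32_t) (s[2] & 0x3F) << 6)
              |  (uint32_t) (s[3] & 0x3F);

  /* Reject overlong encodings and values beyond U+10FFFF. */
  if (cp < 0x10000 || cp > 0x10FFFF)
    return -1;

  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
  out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
  return 4;
}
```

With the thermometer bytes from the testcase, f0 9f 8c a1 gives the pair 0xD83C 0xDF21. The truncated inputs (f0 9f 8c followed by '.' or by the terminating null) are rejected before anything is written to the output.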