Subject: Re: readdir() returns inaccessible name if file was created with invalid UTF-8
From: Christian Franke via Cygwin
To: cygwin AT cygwin DOT com
Date: Sun, 29 Jun 2025 19:47:29 +0200

Christian Franke wrote:
> Corinna Vinschen via Cygwin wrote:
>> On Jun 27 15:32, Christian Franke via Cygwin wrote:
>>> $ touch $'t-\xef\x80\x80'
>>> The name mapping is:
>>> "t-\xEF\x80\x80" -(open, ...)-> L"t-\xDB59" -(readdir)-> "t-"
>> Did you copy/paste this from the old mail, by any chance?
>
> Sorry, I accidentally mixed two cases with same readdir() result:
>
> "t-\xEF\x80\x80" -(open, ...)-> L"t-\xF000" -(readdir)-> "t-"
> "t-\xED\xAD\x99' -(open, ...)-> L"t-\xDB59" -(readdir)-> "t-"
>
> $ touch $'t-\xed\xad\x99'
> $ touch $'t-\xef\x80\x80'
> $ ls | uniq -c
>       2 t-
>
> Does no longer occur in 3.7.0-0.165.g1b60f4861b70 but see below.
> ...
>> ...
>> I'll apply the patch shortly.
>
> $ touch $'t-\xed\xad\x90'
> $ touch $'t-\xed\xad\x91'
> $ touch $'t-\xed\xad\x92'
> $ touch $'t-\xed\xad\x93'
> $ touch $'t-\xed\xad\x94'
> $ ls | uniq -c
>       5 t-
>
> $ ls -s
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> ls: cannot access 't-': No such file or directory
> total 0
> ? t-  ? t-  ? t-  ? t-  ? t-
>
> All results found by several runs with different seeds of the attached
> test program have in common that the Windows path name contains an
> invalid word in UTF-16 High Surrogate range:
>
> $ ./randnames 42
> $'t-\xEC\x9E\xB3\xEF\x82\x80\xEF\x83\xA0': access() failed, errno=2:
> $'t-\xED\xA4\xA8\x80\xE0': original path
> L"t-\xD928\xF080\xF0E0": Windows path
>
> $'t-\xEE\x9E\xB3\xEF\x83\xA1': access() failed, errno=2:
> $'t-\xED\xA6\xB0\xE1': original path
> L"t-\xD9B0\xF0E1": Windows path
> ...
> $'t-\xE7\xBE\xB3\xEF\x82\xB3': access() failed, errno=2:
> $'t-\xED\xA2\x96\xB3': original path
> L"t-\xD896\xF0B3": Windows path
>

A closer look reveals two problems:

1.) A lone high surrogate is not encoded correctly. This could be fixed
with this patch:
https://cygwin.com/pipermail/cygwin-patches/2025q2/014001.html

2.) A high surrogate at the very end of the string is not encoded at all.
A fix would require enhancing the interface between __*_wctomb() and the
outer functions. The outer loop would need to call the function again
after L'\0' has been reached.

BTW: if the file name consists only of a single high surrogate, an
interesting corner case of readdir() is visible:

$ echo foo >$'\uD876' # Windows name: L"\xD876"
$ cat $'\uD876'
foo
$ ls
$ ls -a | uniq -c
      1 .
      2 ..
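
For readers who want to check the byte-to-word mappings quoted above by hand, here is a minimal Python sketch. It is not Cygwin code and does not reproduce the __*_wctomb() logic; the helper names decode_3byte() and lone_high_surrogates() are made up for illustration. It only applies the plain 3-byte UTF-8 bit layout without rejecting surrogates, and then flags high surrogates that are not followed by a low surrogate, which is the pattern all failing names above share.

```python
# Illustrative sketch only; helper names are hypothetical, not Cygwin APIs.

def decode_3byte(b: bytes) -> int:
    """Apply the 3-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx)
    without rejecting values in the surrogate range."""
    if (len(b) != 3 or (b[0] & 0xF0) != 0xE0
            or (b[1] & 0xC0) != 0x80 or (b[2] & 0xC0) != 0x80):
        raise ValueError("not a 3-byte UTF-8-style sequence")
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def lone_high_surrogates(units: list[int]) -> list[int]:
    """Return the indices of high surrogates that are not immediately
    followed by a low surrogate (including one at the very end)."""
    return [
        i for i, u in enumerate(units)
        if is_high_surrogate(u)
        and not (i + 1 < len(units) and is_low_surrogate(units[i + 1]))
    ]


if __name__ == "__main__":
    # The two cases that were mixed up in the earlier mail.
    print(hex(decode_3byte(b"\xed\xad\x99")))
    print(hex(decode_3byte(b"\xef\x80\x80")))
    # First Windows path from the randnames output, as UTF-16 words.
    path = [ord("t"), ord("-"), 0xD928, 0xF080, 0xF0E0]
    print(lone_high_surrogates(path))
```

With this, the words 0xD928, 0xD9B0 and 0xD896 from the randnames output all fall into the high surrogate range and none is followed by a low surrogate, while 0xF000 from the other case lies outside it. A file name consisting only of L"\xD876" is the "high surrogate at the very end" case from problem 2.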
--
Regards,
Christian