| Date: | Mon, 31 Jul 2023 19:46:20 +0200 |
| To: | Bruno Haible <bruno AT clisp DOT org> |
| Subject: | Re: character class "alpha" |
| From: | Corinna Vinschen via Cygwin <cygwin AT cygwin DOT com> |
On Jul 31 16:06, Bruno Haible via Cygwin wrote:
> Corinna Vinschen wrote:
> > I have a problem with the c32isalpha function.
> >
> > c32isalpha fails for the character U+FF11 FULLWIDTH DIGIT ONE,
> > because it expects the character to be an alphabetic character.
>
> This is not a big problem. You can see in the test-c32isalpha.c file
> that this test is disabled for many platforms, in particular glibc.

Which is interesting, because I actually tried that today on glibc, and
iswalpha (0xff11) returns 1 there. So glibc actually behaves as the
testcase expects.

> There's no problem with disabling it on Cygwin as well.

I'd rather make Cygwin do the same as glibc.

> > The Cygwin unicode information is automatically generated from the
> > Unicode data file UnicodeData.txt, fresh from their homepage. iswalpha
> > in newlib is checking for the Unicode categories, using the expression:
> >
> > return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
> > || cat == CAT_Lm || cat == CAT_Lo
> > || cat == CAT_Nl // Letter_Number
> > ;
> >
> > with CAT_foo being equivalent to Unicode category foo.
> >
> > Per UnicodeData.txt, ff11 is of category Nd, so it's a digit, not an
> > alphabetic character.
>
> This is not wrong. However, see the comments in the generator of the
> gnulib tables:
>
> https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=lib/gen-uni-tables.c;h=0dceedc06cd72f886807fd575a2c4dba99cd147a;hb=HEAD#l5789
>
> /* Consider all the non-ASCII digits as alphabetic.
> ISO C 99 forbids us to have them in category "digit",
> but we want iswalnum to return true on them. */
>
> Likewise in the generator of the glibc tables:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/unicode-gen/unicode_utils.py;h=5af03113a2f1f063769752ea426fcaf6f6ba9e95;hb=HEAD#l274
>
> The original comment (from 2000) was:
>
> /* SUSV2 gives us some freedom for the "digit" category, but ISO C 99
> takes it away:
> 7.25.2.1.5:
> The iswdigit function tests for any wide character that corresponds
> to a decimal-digit character (as defined in 5.2.1).
> 5.2.1:
> the 10 decimal digits 0 1 2 3 4 5 6 7 8 9
> */
> return (ch >= 0x0030 && ch <= 0x0039);
>
> The question is: In which category do you put these non-ASCII digits?
> "print" and "graph", sure. But other than that? "punct" or "alnum"?
> "punct" seems wrong. If you, like me, decide to put them in "alnum",
> then they need to be in "alpha" or "digit" (per POSIX
> https://pubs.opengroup.org/onlinepubs/9699919799/functions/iswalnum.html ).
> But ISO C 23 § 7.4.1.5 + § 5.2.1 does not allow them in category "digit".

Thanks for the description. It was clear to me that they don't belong
in the ISO C digit category, but other than that...

So, if we change the expression in iswalpha_l to something like

  return cat == CAT_LC || cat == CAT_Lu || cat == CAT_Ll || cat == CAT_Lt
      || cat == CAT_Lm || cat == CAT_Lo
      || cat == CAT_Nl // Letter_Number
      /* Also all digits not allowed to be called digits per ISO C 99 */
      || (cat == CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'))
      ;

we're good?

Thanks,
Corinna
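
For illustration only, the proposed check can be exercised in isolation
with a minimal sketch. The enum and the names sketch_category,
sketch_is_alpha and sketch_self_check are hypothetical stand-ins, not
newlib's actual identifiers, and the category is passed in by hand here
rather than looked up in tables generated from UnicodeData.txt.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical stand-ins for the CAT_foo constants; not newlib's real enum. */
enum sketch_category {
  SK_CAT_LC, SK_CAT_Lu, SK_CAT_Ll, SK_CAT_Lt, SK_CAT_Lm, SK_CAT_Lo,
  SK_CAT_Nl,  /* Letter_Number */
  SK_CAT_Nd,  /* Decimal_Number */
  SK_CAT_Po   /* Other_Punctuation, used only as a negative example */
};

/* Same shape as the expression proposed for iswalpha_l: letters,
   letter numbers, and every decimal digit outside ASCII '0'..'9'. */
static bool
sketch_is_alpha (enum sketch_category cat, wint_t c)
{
  return cat == SK_CAT_LC || cat == SK_CAT_Lu || cat == SK_CAT_Ll
      || cat == SK_CAT_Lt || cat == SK_CAT_Lm || cat == SK_CAT_Lo
      || cat == SK_CAT_Nl
      || (cat == SK_CAT_Nd && !(c >= (wint_t)'0' && c <= (wint_t)'9'));
}

/* Spot checks; the categories are supplied by the caller (assumption). */
static void
sketch_self_check (void)
{
  /* U+FF11 FULLWIDTH DIGIT ONE is Nd and non-ASCII: counted as alpha. */
  assert (sketch_is_alpha (SK_CAT_Nd, (wint_t) 0xFF11));
  /* ASCII '1' is Nd as well, but stays out of alpha. */
  assert (!sketch_is_alpha (SK_CAT_Nd, (wint_t) '1'));
}
```

With this shape, the non-ASCII digits land in "alpha" and therefore in
"alnum", while '0' to '9' stay the only members of the ISO C "digit"
class.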