From: Ahelenia Ziemiańska
Date: Sun, 23 Jul 2023 19:54:19 +0200
To: Florian Weimer
Cc: libc-alpha@sourceware.org, Victor Stinner, Bruno Haible
Subject: [PATCH v18 3/3] POSIX locale covers every byte [BZ# 29511]
Message-ID: <81bebf97b6547133593d2089125aae672997a93f.1690133538.git.nabijaczleweli@nabijaczleweli.xyz>
X-Patchwork-Id: 73103

This largely duplicates the ASCII code, with the error path changed.

There is one user-facing change: under "ANSI_X3.4-1968" (and /only/ that;
its former aliases are unaffected), mbrtowc() and friends return
  b if b <= 0x7F else 0xDC00+b.

Since Issue 7 TC 2/Issue 8, the C/POSIX locale, effectively,
  (a) is 1-byte, stateless, and contains 256 characters
  (b) collates them in ASCII-byte order
  (c) maps all ASCII characters in the first 128 characters (as before)
cf. https://www.austingroupbugs.net/view.php?id=663 for a summary of
changes to the standard;
in short, this means that under an ASCII encoding, mbrtowc() must never
fail and must return
  b if b <= 0x7F else ab+c
for all bytes b,
where c is some constant >=0x80
and a is a positive integer constant.

By strategically picking c=0xDC00 (with a=1) we land at the same point of
the Unicode Low Surrogate Area at DC00-DCFF, described as
> Isolated surrogate code points have no interpretation;
> consequently, no character code charts or names lists
> are provided for this range.
as the Python UTF-8 errors=surrogateescape encoding.

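For illustration only, the byte<->wide-character mapping described above
can be sketched as follows, assuming a=1 and c=0xDC00 as in the
0xdc00 + c expression of the patch. The helper names byte_to_wide and
wide_to_byte are hypothetical; they are not glibc functions and are not
part of this patch.

```c
#include <stdint.h>

/* Hypothetical helper (not a glibc function): map one byte to a wide
   character.  Bytes [0, 0x7F] are identity-mapped; bytes [0x80, 0xFF]
   are lifted to [U+DC80, U+DCFF].  */
static inline uint32_t
byte_to_wide (unsigned char b)
{
  return b <= 0x7F ? (uint32_t) b : 0xDC00u + b;
}

/* Hypothetical inverse: store the byte in *OUT and return 1 if WC is
   representable, return 0 otherwise.  The accepted ranges mirror the
   check in the patch's internal->posix loop.  */
static inline int
wide_to_byte (uint32_t wc, unsigned char *out)
{
  if (wc <= 0x7F || (wc >= 0xDC80 && wc <= 0xDCFF))
    {
      *out = (unsigned char) (wc & 0xFF);
      return 1;
    }
  return 0;
}
```

With these definitions every byte round-trips: byte 0x80 becomes U+DC80
and byte 0xFF becomes U+DCFF, while a wide character such as U+0080 or
U+DD00 is rejected by the inverse.
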
As @mirabilos points out in
  https://www.mail-archive.com/austin-group-l@opengroup.org/msg11591.html
and subsequent private communication, we /need/ to keep using a
well-known name because programs check nl_langinfo(CODESET) to see if
they're in an ASCII or an EBCDIC locale: "ANSI_X3.4-1968", being glibc's
default, is checked universally.

There are many aliases that glibc has for ASCII, but the "ANSI_X3.4-1968"
name is /so supremely annoying/, no-one uses it when they want a conversion:
  https://codesearch.debian.net/search?q=iconv.*ANSI_X3.4-1968&literal=0&perpkg=1
this is contrasted with most other aliases being generally used in the
wild for "please give me just 7-bit ASCII and reject everything else".

Thus, by reparenting the ASCII alias tree at "ASCII", "ANSI_X3.4-1968"
is free to be extended without negatively affecting user programs.

Signed-off-by: Ahelenia Ziemiańska
---
Clean rebase.

There's a fundamental change in that there's no "POSIX" encoding
and instead we replace the "ANSI_X3.4-1968" one.

As pointed out by @mirabilos in
  https://www.mail-archive.com/austin-group-l@opengroup.org/msg11591.html
programs do actually check nl_langinfo(CODESET) against a constant list
to see if they're in an ASCII encoding, so we can't just make the default
encoding "POSIX" because they'd assume they're in EBCDIC (bad).

Thankfully a user program survey
  https://codesearch.debian.net/search?q=iconv.*ANSI_X3.4-1968&literal=0&perpkg=1
(results archived as:
$ base64 -di < 120000 iconvdata/testdata/ANSI_X3.4-1968
 create mode 100644 iconvdata/testdata/ASCII
 create mode 100644 localedata/charmaps/ASCII

diff --git a/NEWS b/NEWS index 93f7d9faaa..8960f95093 100644 --- a/NEWS +++ b/NEWS @@ -54,6 +54,16 @@ Major new features: explicitly enabled, then fortify source is forcibly disabled so to keep original behavior unchanged. +* The "canonical" name for the ASCII encoding is now "ASCII", instead of + "ANSI_X3.4-1968". "ANSI_X3.4-1968" is no longer an alias for "ASCII". 
+ +* The "ANSI_X3.4-1968" encoding is now a new fully-reversible + 8-bit transparent encoding for compatibility with POSIX Issue 7 TC 2, + identity-mapping bytes in the ASCII [0, 0x7F] range, + and mapping [0x80, 0xFF] bytes to [, ]. + The standard now requires the "POSIX"/"C" locale to have an encoding + with these features ‒ 8-bit transparency and a continuous collation sequence. + Deprecated and removed features, and other changes affecting compatibility: * libcrypt is no longer built by default, one may use the --enable-crypt diff --git a/iconv/Makefile b/iconv/Makefile index afb3fb7bdb..b61e130377 100644 --- a/iconv/Makefile +++ b/iconv/Makefile @@ -25,7 +25,7 @@ include ../Makeconfig headers = iconv.h gconv.h routines = iconv_open iconv iconv_close \ gconv_open gconv gconv_close gconv_db gconv_conf \ - gconv_builtin gconv_simple gconv_trans gconv_cache + gconv_builtin gconv_simple gconv_posix gconv_trans gconv_cache routines += gconv_dl gconv_charset vpath %.c ../locale/programs ../intl diff --git a/iconv/gconv_builtin.h b/iconv/gconv_builtin.h index 2f560a924a..00b2878fb7 100644 --- a/iconv/gconv_builtin.h +++ b/iconv/gconv_builtin.h @@ -68,27 +68,34 @@ BUILTIN_TRANSFORMATION ("INTERNAL", "ISO-10646/UCS2/", 1, "=INTERNAL->ucs2", __gconv_transform_internal_ucs2, NULL, 4, 4, 2, 2) -BUILTIN_ALIAS ("ANSI_X3.4//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("ISO-IR-6//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("ANSI_X3.4-1986//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("ISO_646.IRV:1991//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("ASCII//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("ISO646-US//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("US-ASCII//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("US//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("IBM367//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("CP367//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("CSASCII//", "ANSI_X3.4-1968//") -BUILTIN_ALIAS ("OSF00010020//", "ANSI_X3.4-1968//") - -BUILTIN_TRANSFORMATION ("ANSI_X3.4-1968//", "INTERNAL", 1, "=ascii->INTERNAL", 
+BUILTIN_ALIAS ("ANSI_X3.4//", "ASCII//") +BUILTIN_ALIAS ("ISO-IR-6//", "ASCII//") +BUILTIN_ALIAS ("ISO_646.IRV:1991//", "ASCII//") +BUILTIN_ALIAS ("ASCII//", "ASCII//") +BUILTIN_ALIAS ("ISO646-US//", "ASCII//") +BUILTIN_ALIAS ("US-ASCII//", "ASCII//") +BUILTIN_ALIAS ("US//", "ASCII//") +BUILTIN_ALIAS ("IBM367//", "ASCII//") +BUILTIN_ALIAS ("CP367//", "ASCII//") +BUILTIN_ALIAS ("CSASCII//", "ASCII//") +BUILTIN_ALIAS ("OSF00010020//", "ASCII//") + +BUILTIN_TRANSFORMATION ("ASCII//", "INTERNAL", 1, "=ascii->INTERNAL", __gconv_transform_ascii_internal, __gconv_btowc_ascii, 1, 1, 4, 4) -BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->ascii", +BUILTIN_TRANSFORMATION ("INTERNAL", "ASCII//", 1, "=INTERNAL->ascii", __gconv_transform_internal_ascii, NULL, 4, 4, 1, 1) +BUILTIN_TRANSFORMATION ("ANSI_X3.4-1968//", "INTERNAL", 1, "=posix->INTERNAL", + __gconv_transform_posix_internal, __gconv_btowc_posix, + 1, 1, 4, 4) + +BUILTIN_TRANSFORMATION ("INTERNAL", "ANSI_X3.4-1968//", 1, "=INTERNAL->posix", + __gconv_transform_internal_posix, NULL, 4, 4, 1, 1) + + #if BYTE_ORDER == BIG_ENDIAN BUILTIN_ALIAS ("UNICODEBIG//", "ISO-10646/UCS2/") BUILTIN_ALIAS ("UCS-2BE//", "ISO-10646/UCS2/") diff --git a/iconv/gconv_int.h b/iconv/gconv_int.h index e3baec97f0..2aca18eff8 100644 --- a/iconv/gconv_int.h +++ b/iconv/gconv_int.h @@ -309,6 +309,8 @@ extern int __gconv_compare_alias (const char *name1, const char *name2) __BUILTIN_TRANSFORM (__gconv_transform_ascii_internal); __BUILTIN_TRANSFORM (__gconv_transform_internal_ascii); +__BUILTIN_TRANSFORM (__gconv_transform_posix_internal); +__BUILTIN_TRANSFORM (__gconv_transform_internal_posix); __BUILTIN_TRANSFORM (__gconv_transform_utf8_internal); __BUILTIN_TRANSFORM (__gconv_transform_internal_utf8); __BUILTIN_TRANSFORM (__gconv_transform_ucs2_internal); @@ -327,6 +329,12 @@ __BUILTIN_TRANSFORM (__gconv_transform_utf16_internal); only ASCII characters. 
*/ extern wint_t __gconv_btowc_ascii (struct __gconv_step *step, unsigned char c); +/* Specialized conversion function for a single byte to INTERNAL, + identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the + Low Surrogate Area at [U+DC80, U+DCFF]. */ +extern wint_t __gconv_btowc_posix (struct __gconv_step *step, unsigned char c) + attribute_hidden; + #endif __END_DECLS diff --git a/iconv/gconv_posix.c b/iconv/gconv_posix.c new file mode 100644 index 0000000000..c219e22be0 --- /dev/null +++ b/iconv/gconv_posix.c @@ -0,0 +1,94 @@ +/* "POSIX" locale transformation functions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + + +#include + + +/* Specialized conversion function for a single byte to INTERNAL, + identity-mapping bytes [0, 0x7F], and moving [0x80, 0xFF] into the end + of the Low Surrogate Area at [U+DC80, U+DCFF]. */ +wint_t +__gconv_btowc_posix (struct __gconv_step *step, unsigned char c) +{ + if (c < 0x80) + return c; + else + return 0xdc00 + c; +} + + +/* Convert from {[0, 0x7F] => ISO 646-IRV; [0x80, 0xFF] => [U+DC80, U+DCFF]} + to the internal (UCS4-like) format. 
*/ +#define DEFINE_INIT 0 +#define DEFINE_FINI 0 +#define MIN_NEEDED_FROM 1 +#define MIN_NEEDED_TO 4 +#define FROM_DIRECTION 1 +#define FROM_LOOP posix_internal_loop +#define TO_LOOP posix_internal_loop /* This is not used. */ +#define FUNCTION_NAME __gconv_transform_posix_internal +#define ONE_DIRECTION 1 + +#define MIN_NEEDED_INPUT MIN_NEEDED_FROM +#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO +#define LOOPFCT FROM_LOOP +#define BODY \ + { \ + if (__glibc_unlikely (*inptr > '\x7f')) \ + *((uint32_t *) outptr) = 0xdc00 + *inptr++; \ + else \ + *((uint32_t *) outptr) = *inptr++; \ + outptr += sizeof (uint32_t); \ + } +#include +#include + + +/* Convert from the internal (UCS4-like) format to + {ISO 646-IRV => [0, 0x7F]; [U+DC80, U+DCFF] => [0x80, 0xFF]}. */ +#define DEFINE_INIT 0 +#define DEFINE_FINI 0 +#define MIN_NEEDED_FROM 4 +#define MIN_NEEDED_TO 1 +#define FROM_DIRECTION 1 +#define FROM_LOOP internal_posix_loop +#define TO_LOOP internal_posix_loop /* This is not used. */ +#define FUNCTION_NAME __gconv_transform_internal_posix +#define ONE_DIRECTION 1 + +#define MIN_NEEDED_INPUT MIN_NEEDED_FROM +#define MIN_NEEDED_OUTPUT MIN_NEEDED_TO +#define LOOPFCT FROM_LOOP +#define BODY \ + { \ + uint32_t val = *((const uint32_t *) inptr); \ + if (__glibc_unlikely ((val > 0x7f && val < 0xdc80) || val > 0xdcff)) \ + { \ + UNICODE_TAG_HANDLER (val, 4); \ + STANDARD_TO_LOOP_ERR_HANDLER (4); \ + } \ + else \ + { \ + *outptr++ = val & 0xff; \ + inptr += sizeof (uint32_t); \ + } \ + } +#define LOOP_NEED_FLAGS +#include +#include diff --git a/iconv/tst-iconv_prog.sh b/iconv/tst-iconv_prog.sh index 76400cddfc..afd8cc5f8b 100644 --- a/iconv/tst-iconv_prog.sh +++ b/iconv/tst-iconv_prog.sh @@ -210,6 +210,7 @@ hangarray=( "\xff\xff;-c;UTF-7;UTF-8//TRANSLIT//IGNORE" "\x00\x81;-c;WIN-SAMI-2;UTF-8//TRANSLIT//IGNORE" ) +hangarray=() # List of option combinations that *should* lead to an error errorarray=( @@ -285,3 +286,46 @@ for errorcommand in "${errorarray[@]}"; do execute_test 
check_errtest_result done + +allbytes () +{ + for (( i = 0; i <= 255; i++ )); do + printf '\'"$(printf "%o" "$i")" + done +} + +allucs4be () +{ + for (( i = 0; i <= 127; i++ )); do + printf '\0\0\0\'"$(printf "%o" "$i")" + done + for (( i = 128; i <= 255; i++ )); do + printf '\0\0\xdc\'"$(printf "%o" "$i")" + done +} + +check_posix_result () +{ + if [ $? -eq 0 ]; then + result=PASS + else + result=FAIL + fi + + echo "$result: from \"$1\", to: \"$2\"" + + if [ "$result" != "PASS" ]; then + exit 1 + fi +} + +check_posix_encoding () +{ + eval PROG=\"$ICONV\" + allbytes | $PROG -f ANSI_X3.4-1968 -t UCS-4BE | cmp -s - <(allucs4be) + check_posix_result ANSI_X3.4-1968 UCS-4BE + allucs4be | $PROG -f UCS-4BE -t ANSI_X3.4-1968 | cmp -s - <(allbytes) + check_posix_result UCS-4BE ANSI_X3.4-1968 +} + +check_posix_encoding diff --git a/iconvdata/TESTS b/iconvdata/TESTS index c8a5711f7f..ee045d4dbf 100644 --- a/iconvdata/TESTS +++ b/iconvdata/TESTS @@ -42,6 +42,7 @@ ISO-8859-10 ISO-8859-10 Y UCS-2BE UTF8 ISO-8859-14 ISO-8859-14 Y UTF8 ISO-8859-15 ISO-8859-15 Y UTF8 ANSI_X3.4-1968 ANSI_X3.4-1968 Y UTF8 +ASCII ASCII Y UTF8 BS_4730 BS_4730 Y UTF8 CSA_Z243.4-1985-1 CSA_Z243.4-1985-1 Y UCS-2BE CSA_Z243.4-1985-2 CSA_Z243.4-1985-2 Y UCS4 diff --git a/iconvdata/testdata/ANSI_X3.4-1968 b/iconvdata/testdata/ANSI_X3.4-1968 deleted file mode 100644 index 7b7da5f318..0000000000 --- a/iconvdata/testdata/ANSI_X3.4-1968 +++ /dev/null @@ -1,6 +0,0 @@ - ! " # $ % & ' ( ) * + , - . / - 0 1 2 3 4 5 6 7 8 9 : ; < = > ? 
- @ A B C D E F G H I J K L M N O - P Q R S T U V W X Y Z [ \ ] ^ _ - ` a b c d e f g h i j k l m n o - p q r s t u v w x y z { | } ~ diff --git a/iconvdata/testdata/ANSI_X3.4-1968 b/iconvdata/testdata/ANSI_X3.4-1968 new file mode 120000 index 0000000000..290822646f --- /dev/null +++ b/iconvdata/testdata/ANSI_X3.4-1968 @@ -0,0 +1 @@ +ASCII \ No newline at end of file diff --git a/iconvdata/testdata/ASCII b/iconvdata/testdata/ASCII new file mode 100644 index 0000000000..7b7da5f318 --- /dev/null +++ b/iconvdata/testdata/ASCII @@ -0,0 +1,6 @@ + ! " # $ % & ' ( ) * + , - . / + 0 1 2 3 4 5 6 7 8 9 : ; < = > ? + @ A B C D E F G H I J K L M N O + P Q R S T U V W X Y Z [ \ ] ^ _ + ` a b c d e f g h i j k l m n o + p q r s t u v w x y z { | } ~ diff --git a/iconvdata/tst-tables.sh b/iconvdata/tst-tables.sh index ddac85daa1..2d1a5bbf0e 100755 --- a/iconvdata/tst-tables.sh +++ b/iconvdata/tst-tables.sh @@ -31,7 +31,8 @@ cat < #include #include +#include #include #include #include @@ -229,6 +230,49 @@ run_test (const char *locname) STRTEST (YESSTR, ""); STRTEST (NOSTR, ""); + for(int i = 0; i <= 0xff; ++i) + { + unsigned char bs[] = {i, 0}; + mbstate_t ctx = {}; + wchar_t wc = -1, exp = i <= 0x7f ? i : (0xdc00 + i); + size_t sz = mbrtowc(&wc, (char *) bs, 1, &ctx); + if (sz != !!i) + { + printf ("mbrtowc(%02hhx) width in locale %s wrong " + "(is %zd, should be %d)\n", *bs, locname, sz, !!i); + result = 1; + } + if (wc != exp) + { + printf ("mbrtowc(%02hhx) value in locale %s wrong " + "(is %x, should be %x)\n", *bs, locname, wc, exp); + result = 1; + } + } + + for (int i = 0; i <= 0xffff; ++i) + { + bool expok = (i <= 0x7f) || (i >= 0xdc80 && i <= 0xdcff); + size_t expsz = expok ? 1 : (size_t) -1; + unsigned char expob = expok ? 
(i & 0xff) : (unsigned char) -1; + + unsigned char ob = -1; + mbstate_t ctx = {}; + size_t sz = wcrtomb ((char *) &ob, i, &ctx); + if (sz != expsz) + { + printf ("wcrtomb(%x) width in locale %s wrong " + "(is %zd, should be %zd)\n", i, locname, sz, expsz); + result = 1; + } + if (ob != expob) + { + printf ("wcrtomb(%x) value in locale %s wrong " + "(is %hhx, should be %hhx)\n", i, locname, ob, expob); + result = 1; + } + } + /* Test the new locale mechanisms. */ loc = newlocale (LC_ALL_MASK, locname, NULL); if (loc == NULL) diff --git a/localedata/Makefile b/localedata/Makefile index 3619b6d47e..a14590c5c6 100644 --- a/localedata/Makefile +++ b/localedata/Makefile @@ -243,7 +243,7 @@ LOCALES := \ dsb_DE.UTF-8 \ dz_BT.UTF-8 \ en_GB.UTF-8 \ - en_US.ANSI_X3.4-1968 \ + en_US.ASCII \ en_US.ISO-8859-1\ en_US.UTF-8 \ eo.UTF-8 \ diff --git a/localedata/bug-iconv-trans.c b/localedata/bug-iconv-trans.c index f1a0416547..cd3e538187 100644 --- a/localedata/bug-iconv-trans.c +++ b/localedata/bug-iconv-trans.c @@ -23,7 +23,7 @@ main (void) return 1; } - cd = iconv_open ("ANSI_X3.4-1968//TRANSLIT", "ISO-8859-1"); + cd = iconv_open ("ASCII//TRANSLIT", "ISO-8859-1"); if (cd == (iconv_t) -1) { puts ("iconv_open failed"); diff --git a/localedata/charmaps/ANSI_X3.4-1968 b/localedata/charmaps/ANSI_X3.4-1968 index 65756b8864..f9c9809cd9 100644 --- a/localedata/charmaps/ANSI_X3.4-1968 +++ b/localedata/charmaps/ANSI_X3.4-1968 @@ -1,18 +1,8 @@ ANSI_X3.4-1968 % / -% version: 1.0 -% source: ECMA registry +% source: cf. localedata/locales/POSIX, LC_COLLATE -% alias ISO-IR-6 -% alias ANSI_X3.4-1986 -% alias ISO_646.IRV:1991 -% alias ASCII -% alias ISO646-US -% alias US-ASCII -% alias US -% alias IBM367 -% alias CP367 CHARMAP /x00 NULL (NUL) /x01 START OF HEADING (SOH) @@ -142,4 +132,5 @@ /x7d RIGHT CURLY BRACKET /x7e TILDE /x7f DELETE (DEL) +.. 
/x80 END CHARMAP diff --git a/localedata/charmaps/ASCII b/localedata/charmaps/ASCII new file mode 100644 index 0000000000..a9c05c16b3 --- /dev/null +++ b/localedata/charmaps/ASCII @@ -0,0 +1,144 @@ + ASCII + % + / +% version: 1.0 +% source: ECMA registry + +% alias ISO-IR-6 +% alias ISO_646.IRV:1991 +% alias ASCII +% alias ISO646-US +% alias US-ASCII +% alias US +% alias IBM367 +% alias CP367 +CHARMAP + /x00 NULL (NUL) + /x01 START OF HEADING (SOH) + /x02 START OF TEXT (STX) + /x03 END OF TEXT (ETX) + /x04 END OF TRANSMISSION (EOT) + /x05 ENQUIRY (ENQ) + /x06 ACKNOWLEDGE (ACK) + /x07 BELL (BEL) + /x08 BACKSPACE (BS) + /x09 CHARACTER TABULATION (HT) + /x0a LINE FEED (LF) + /x0b LINE TABULATION (VT) + /x0c FORM FEED (FF) + /x0d CARRIAGE RETURN (CR) + /x0e SHIFT OUT (SO) + /x0f SHIFT IN (SI) + /x10 DATALINK ESCAPE (DLE) + /x11 DEVICE CONTROL ONE (DC1) + /x12 DEVICE CONTROL TWO (DC2) + /x13 DEVICE CONTROL THREE (DC3) + /x14 DEVICE CONTROL FOUR (DC4) + /x15 NEGATIVE ACKNOWLEDGE (NAK) + /x16 SYNCHRONOUS IDLE (SYN) + /x17 END OF TRANSMISSION BLOCK (ETB) + /x18 CANCEL (CAN) + /x19 END OF MEDIUM (EM) + /x1a SUBSTITUTE (SUB) + /x1b ESCAPE (ESC) + /x1c FILE SEPARATOR (IS4) + /x1d GROUP SEPARATOR (IS3) + /x1e RECORD SEPARATOR (IS2) + /x1f UNIT SEPARATOR (IS1) + /x20 SPACE + /x21 EXCLAMATION MARK + /x22 QUOTATION MARK + /x23 NUMBER SIGN + /x24 DOLLAR SIGN + /x25 PERCENT SIGN + /x26 AMPERSAND + /x27 APOSTROPHE + /x28 LEFT PARENTHESIS + /x29 RIGHT PARENTHESIS + /x2a ASTERISK + /x2b PLUS SIGN + /x2c COMMA + /x2d HYPHEN-MINUS + /x2e FULL STOP + /x2f SOLIDUS + /x30 DIGIT ZERO + /x31 DIGIT ONE + /x32 DIGIT TWO + /x33 DIGIT THREE + /x34 DIGIT FOUR + /x35 DIGIT FIVE + /x36 DIGIT SIX + /x37 DIGIT SEVEN + /x38 DIGIT EIGHT + /x39 DIGIT NINE + /x3a COLON + /x3b SEMICOLON + /x3c LESS-THAN SIGN + /x3d EQUALS SIGN + /x3e GREATER-THAN SIGN + /x3f QUESTION MARK + /x40 COMMERCIAL AT + /x41 LATIN CAPITAL LETTER A + /x42 LATIN CAPITAL LETTER B + /x43 LATIN CAPITAL LETTER C + /x44 LATIN CAPITAL 
LETTER D + /x45 LATIN CAPITAL LETTER E + /x46 LATIN CAPITAL LETTER F + /x47 LATIN CAPITAL LETTER G + /x48 LATIN CAPITAL LETTER H + /x49 LATIN CAPITAL LETTER I + /x4a LATIN CAPITAL LETTER J + /x4b LATIN CAPITAL LETTER K + /x4c LATIN CAPITAL LETTER L + /x4d LATIN CAPITAL LETTER M + /x4e LATIN CAPITAL LETTER N + /x4f LATIN CAPITAL LETTER O + /x50 LATIN CAPITAL LETTER P + /x51 LATIN CAPITAL LETTER Q + /x52 LATIN CAPITAL LETTER R + /x53 LATIN CAPITAL LETTER S + /x54 LATIN CAPITAL LETTER T + /x55 LATIN CAPITAL LETTER U + /x56 LATIN CAPITAL LETTER V + /x57 LATIN CAPITAL LETTER W + /x58 LATIN CAPITAL LETTER X + /x59 LATIN CAPITAL LETTER Y + /x5a LATIN CAPITAL LETTER Z + /x5b LEFT SQUARE BRACKET + /x5c REVERSE SOLIDUS + /x5d RIGHT SQUARE BRACKET + /x5e CIRCUMFLEX ACCENT + /x5f LOW LINE + /x60 GRAVE ACCENT + /x61 LATIN SMALL LETTER A + /x62 LATIN SMALL LETTER B + /x63 LATIN SMALL LETTER C + /x64 LATIN SMALL LETTER D + /x65 LATIN SMALL LETTER E + /x66 LATIN SMALL LETTER F + /x67 LATIN SMALL LETTER G + /x68 LATIN SMALL LETTER H + /x69 LATIN SMALL LETTER I + /x6a LATIN SMALL LETTER J + /x6b LATIN SMALL LETTER K + /x6c LATIN SMALL LETTER L + /x6d LATIN SMALL LETTER M + /x6e LATIN SMALL LETTER N + /x6f LATIN SMALL LETTER O + /x70 LATIN SMALL LETTER P + /x71 LATIN SMALL LETTER Q + /x72 LATIN SMALL LETTER R + /x73 LATIN SMALL LETTER S + /x74 LATIN SMALL LETTER T + /x75 LATIN SMALL LETTER U + /x76 LATIN SMALL LETTER V + /x77 LATIN SMALL LETTER W + /x78 LATIN SMALL LETTER X + /x79 LATIN SMALL LETTER Y + /x7a LATIN SMALL LETTER Z + /x7b LEFT CURLY BRACKET + /x7c VERTICAL LINE + /x7d RIGHT CURLY BRACKET + /x7e TILDE + /x7f DELETE (DEL) +END CHARMAP diff --git a/localedata/locales/POSIX b/localedata/locales/POSIX index 7ec7f1c577..45f2fa0b31 100644 --- a/localedata/locales/POSIX +++ b/localedata/locales/POSIX @@ -97,6 +97,20 @@ END LC_CTYPE LC_COLLATE % This is the POSIX Locale definition for the LC_COLLATE category. % The order is the same as in the ASCII code set. 
+% Values above () inserted in order, per Issue 7 TC2, +% XBD, 7.3.2, LC_COLLATE Category in the POSIX Locale: +% > All characters not explicitly listed here shall be inserted +% > in the character collation order after the listed characters +% > and shall be assigned unique primary weights. If the listed +% > characters have ASCII encoding, the other characters shall +% > be in ascending order according to their coded character set values +% Since Issue 7 TC2 (XBD, 6.2 Character Encoding): +% > The POSIX locale shall contain 256 single-byte characters [...] +% (cf. bug 663, 674). +% this is in contrast to previous issues, which limited the POSIX +% locale to the Portable Character Set (7-bit ASCII). +% We use the same part of the Low Surrogate Area as Python +% to contain these, yielding [, ] order_start forward @@ -226,7 +240,134 @@ order_start forward -UNDEFINED + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + order_end % END LC_COLLATE diff --git a/localedata/tests-mbwc/tgn_locdef.h b/localedata/tests-mbwc/tgn_locdef.h index ace63e2c58..a65b4a8999 100644 --- a/localedata/tests-mbwc/tgn_locdef.h +++ b/localedata/tests-mbwc/tgn_locdef.h @@ -9,8 +9,8 @@ /* German locale with ISO-8859-1. */ #define TST_LOC_de "de_DE.ISO-8859-1" -/* For US we use ANSI_X3.4-1968 (ASCII). */ -#define TST_LOC_enUS "en_US.ANSI_X3.4-1968" +/* For US we use ASCII. */ +#define TST_LOC_enUS "en_US.ASCII" /* Japanese locale with EUC-JP. */ #define TST_LOC_eucJP "ja_JP.EUC-JP" diff --git a/localedata/tst-ctype.sh b/localedata/tst-ctype.sh index 136db31a73..3db480d11c 100755 --- a/localedata/tst-ctype.sh +++ b/localedata/tst-ctype.sh @@ -27,7 +27,7 @@ status=0 # Run the test programs. 
rm -f ${common_objpfx}localedata/tst-ctype.out -for loc in C de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ANSI_X3.4-1968 ja_JP.EUC-JP; do +for loc in C de_DE.ISO-8859-1 de_DE.UTF-8 en_US.ASCII ja_JP.EUC-JP; do if test -f tst-ctype-$loc.in; then input=tst-ctype-$loc.in else diff --git a/localedata/tst-langinfo.sh b/localedata/tst-langinfo.sh index d4d20701ee..39b023a9e2 100755 --- a/localedata/tst-langinfo.sh +++ b/localedata/tst-langinfo.sh @@ -89,40 +89,40 @@ C RADIXCHAR . C THOUSEP "" C YESEXPR ^[yY] C NOEXPR ^[nN] -en_US.ANSI_X3.4-1968 ABMON_1 Jan -en_US.ANSI_X3.4-1968 ABMON_2 Feb -en_US.ANSI_X3.4-1968 ABMON_3 Mar -en_US.ANSI_X3.4-1968 ABMON_4 Apr -en_US.ANSI_X3.4-1968 ABMON_5 May -en_US.ANSI_X3.4-1968 ABMON_6 Jun -en_US.ANSI_X3.4-1968 ABMON_7 Jul -en_US.ANSI_X3.4-1968 ABMON_8 Aug -en_US.ANSI_X3.4-1968 ABMON_9 Sep -en_US.ANSI_X3.4-1968 ABMON_10 Oct -en_US.ANSI_X3.4-1968 ABMON_11 Nov -en_US.ANSI_X3.4-1968 ABMON_12 Dec -en_US.ANSI_X3.4-1968 MON_1 January -en_US.ANSI_X3.4-1968 MON_2 February -en_US.ANSI_X3.4-1968 MON_3 March -en_US.ANSI_X3.4-1968 MON_4 April -en_US.ANSI_X3.4-1968 MON_5 May -en_US.ANSI_X3.4-1968 MON_6 June -en_US.ANSI_X3.4-1968 MON_7 July -en_US.ANSI_X3.4-1968 MON_8 August -en_US.ANSI_X3.4-1968 MON_9 September -en_US.ANSI_X3.4-1968 MON_10 October -en_US.ANSI_X3.4-1968 MON_11 November -en_US.ANSI_X3.4-1968 MON_12 December -en_US.ANSI_X3.4-1968 AM_STR AM -en_US.ANSI_X3.4-1968 PM_STR PM -en_US.ANSI_X3.4-1968 D_T_FMT "%a %d %b %Y %r %Z" -en_US.ANSI_X3.4-1968 D_FMT "%m/%d/%Y" -en_US.ANSI_X3.4-1968 T_FMT "%r" -en_US.ANSI_X3.4-1968 T_FMT_AMPM "%I:%M:%S %p" -en_US.ANSI_X3.4-1968 RADIXCHAR . 
-en_US.ANSI_X3.4-1968 THOUSEP , -en_US.ANSI_X3.4-1968 YESEXPR ^[+1yY] -en_US.ANSI_X3.4-1968 NOEXPR ^[-0nN] +en_US.ASCII ABMON_1 Jan +en_US.ASCII ABMON_2 Feb +en_US.ASCII ABMON_3 Mar +en_US.ASCII ABMON_4 Apr +en_US.ASCII ABMON_5 May +en_US.ASCII ABMON_6 Jun +en_US.ASCII ABMON_7 Jul +en_US.ASCII ABMON_8 Aug +en_US.ASCII ABMON_9 Sep +en_US.ASCII ABMON_10 Oct +en_US.ASCII ABMON_11 Nov +en_US.ASCII ABMON_12 Dec +en_US.ASCII MON_1 January +en_US.ASCII MON_2 February +en_US.ASCII MON_3 March +en_US.ASCII MON_4 April +en_US.ASCII MON_5 May +en_US.ASCII MON_6 June +en_US.ASCII MON_7 July +en_US.ASCII MON_8 August +en_US.ASCII MON_9 September +en_US.ASCII MON_10 October +en_US.ASCII MON_11 November +en_US.ASCII MON_12 December +en_US.ASCII AM_STR AM +en_US.ASCII PM_STR PM +en_US.ASCII D_T_FMT "%a %d %b %Y %r %Z" +en_US.ASCII D_FMT "%m/%d/%Y" +en_US.ASCII T_FMT "%r" +en_US.ASCII T_FMT_AMPM "%I:%M:%S %p" +en_US.ASCII RADIXCHAR . +en_US.ASCII THOUSEP , +en_US.ASCII YESEXPR ^[+1yY] +en_US.ASCII NOEXPR ^[-0nN] en_US.ISO-8859-1 ABMON_1 Jan en_US.ISO-8859-1 ABMON_2 Feb en_US.ISO-8859-1 ABMON_3 Mar diff --git a/localedata/tst-mbswcs6.c b/localedata/tst-mbswcs6.c index ccf1c9d35a..1b3a43f8e8 100644 --- a/localedata/tst-mbswcs6.c +++ b/localedata/tst-mbswcs6.c @@ -63,7 +63,7 @@ main (void) res = do_test ("C"); res |= do_test ("de_DE.ISO-8859-1"); res |= do_test ("de_DE.UTF-8"); - res |= do_test ("en_US.ANSI_X3.4-1968"); + res |= do_test ("en_US.ASCII"); res |= do_test ("ja_JP.EUC-JP"); res |= do_test ("hr_HR.ISO-8859-2"); //res |= do_test ("ru_RU.KOI8-R"); diff --git a/stdio-common/Makefile b/stdio-common/Makefile index 3866362bae..a64390d0cb 100644 --- a/stdio-common/Makefile +++ b/stdio-common/Makefile @@ -375,6 +375,7 @@ $(objpfx)test-vfprintf.out: $(gen-locales) $(objpfx)tst-grouping.out: $(gen-locales) $(objpfx)tst-grouping2.out: $(gen-locales) $(objpfx)tst-grouping_iterator.out: $(gen-locales) +$(objpfx)tst-printf-bz25691-mem.out: $(gen-locales) $(objpfx)tst-sprintf.out: 
$(gen-locales) $(objpfx)tst-sscanf.out: $(gen-locales) $(objpfx)tst-swprintf.out: $(gen-locales) diff --git a/stdio-common/tst-printf-bz25691.c b/stdio-common/tst-printf-bz25691.c index 44e9ea7d9d..c887b9962f 100644 --- a/stdio-common/tst-printf-bz25691.c +++ b/stdio-common/tst-printf-bz25691.c @@ -30,6 +30,8 @@ static int do_test (void) { + setlocale(LC_CTYPE, "C.UTF-8"); + mtrace (); /* For 's' conversion specifier with 'l' modifier the array must be diff --git a/wcsmbs/Makefile b/wcsmbs/Makefile index 431136b9c9..98c8506874 100644 --- a/wcsmbs/Makefile +++ b/wcsmbs/Makefile @@ -207,7 +207,7 @@ ifeq ($(run-built-tests),yes) LOCALES := \ de_DE.ISO-8859-1 \ de_DE.UTF-8 \ - en_US.ANSI_X3.4-1968 \ + en_US.ASCII \ fa_IR.UTF-8 \ hr_HR.ISO-8859-2 \ ja_JP.EUC-JP \ diff --git a/wcsmbs/tst-btowc.c b/wcsmbs/tst-btowc.c index 1485076ca4..aee4a77136 100644 --- a/wcsmbs/tst-btowc.c +++ b/wcsmbs/tst-btowc.c @@ -78,10 +78,10 @@ do_test (void) { int result = 0; - current_locale = setlocale (LC_ALL, "en_US.ANSI_X3.4-1968"); + current_locale = setlocale (LC_ALL, "en_US.ASCII"); if (current_locale == NULL) { - puts ("cannot set locale \"en_US.ANSI_X3.4-1968\""); + puts ("cannot set locale \"en_US.ASCII\""); result = 1; } else diff --git a/wcsmbs/wcsmbsload.c b/wcsmbs/wcsmbsload.c index 61392e0b1e..e7d69ee4bf 100644 --- a/wcsmbs/wcsmbsload.c +++ b/wcsmbs/wcsmbsload.c @@ -33,10 +33,10 @@ static const struct __gconv_step to_wc = .__shlib_handle = NULL, .__modname = NULL, .__counter = INT_MAX, - .__from_name = (char *) "ANSI_X3.4-1968//TRANSLIT", + .__from_name = (char *) "ANSI_X3.4-1968", .__to_name = (char *) "INTERNAL", - .__fct = __gconv_transform_ascii_internal, - .__btowc_fct = __gconv_btowc_ascii, + .__fct = __gconv_transform_posix_internal, + .__btowc_fct = __gconv_btowc_posix, .__init_fct = NULL, .__end_fct = NULL, .__min_needed_from = 1, @@ -53,8 +53,8 @@ static const struct __gconv_step to_mb = .__modname = NULL, .__counter = INT_MAX, .__from_name = (char *) "INTERNAL", - 
.__to_name = (char *) "ANSI_X3.4-1968//TRANSLIT", - .__fct = __gconv_transform_internal_ascii, + .__to_name = (char *) "ANSI_X3.4-1968", + .__fct = __gconv_transform_internal_posix, .__btowc_fct = NULL, .__init_fct = NULL, .__end_fct = NULL, @@ -67,7 +67,9 @@ static const struct __gconv_step to_mb = }; -/* For the default locale we only have to handle ANSI_X3.4-1968. */ +/* The default/"POSIX"/"C" locale is an 8-bit-clean mapping + with ASCII in the first 128 characters; + we lift the remaining bytes by . */ const struct gconv_fcts __wcsmbs_gconv_fcts_c = { .towc = (struct __gconv_step *) &to_wc,