From patchwork Sun Feb 18 18:54:09 2024
X-Patchwork-Submitter: Jules Bertholet
X-Patchwork-Id: 85944
From: Jules Bertholet
Subject: [PATCH][v2] localedata: Fix several issues with the set of characters considered 0-width [BZ #31370]
Date: Sun, 18 Feb 2024 18:54:09 +0000 (UTC)
Message-ID: <20240218185326.16663-1-julesbertholet@quoi.xyz>
In-Reply-To: <20240211190202.414300-2-julesbertholet@quoi.xyz>
References: <20240211190202.414300-2-julesbertholet@quoi.xyz>
To: libc-alpha@sourceware.org
Cc: Carlos O'Donnell, Mike Fabian, libc-locales@sourceware.org, Jules Bertholet

This new version of the patch has a more detailed commit message,
and includes one more related fix.

---

Unicode specifies (https://www.unicode.org/faq/unsup_char.html#3) that characters
with the `Default_Ignorable_Code_Point` property

> should be rendered as completely invisible (and non advancing, i.e. “zero width”),
> if not explicitly supported in rendering.

Hence, `wcwidth()` should give them all a width of 0, with two exceptions:

- the soft hyphen (U+00AD SOFT HYPHEN) is assigned width 1 by longstanding precedent
- U+115F HANGUL CHOSEONG FILLER needs a carveout due to the unique behavior of
  the conjoining Korean jamo characters. One composed Hangul "syllable block"
  like 퓛 is made up of two to three individual component characters, or "jamo".
  These are all assigned an `East_Asian_Width` of `Wide` by Unicode, which would
  normally mean they would all be assigned width 2 by glibc; a combination of
  (leading choseong jamo) + (medial jungseong jamo) + (trailing jongseong jamo)
  would then have width 2 + 2 + 2 = 6. However, glibc (and other wcwidth
  implementations) special-cases jungseong and jongseong, assigning them all
  width 0, to ensure that the complete block has width 2 + 0 + 0 = 2 as it
  should. U+115F is meant for use in syllable blocks that are intentionally
  missing a leading jamo; it must be assigned a width of 2 even though it has
  no visible display to ensure that the complete block has width 2.

However, `wcwidth()` currently (before this patch) incorrectly assigns non-zero
width to U+3164 HANGUL FILLER and U+FFA0 HALFWIDTH HANGUL FILLER; this commit
fixes that.

You can read more about Unicode jamo in the Unicode spec, sections 3.12 and 18.6,
and about `Default_Ignorable_Code_Point` in §5.21.
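
The width arithmetic for a syllable block can be checked with a small standalone
model. This is only an illustrative sketch, not glibc's `wcwidth()`: the helper
names `sketch_width` and `block_width` are hypothetical, and it assumes that only
the basic Hangul Jamo block (U+1100..U+11FF) plus the two standalone fillers
need to be modelled.

```python
# Illustrative model only; this is NOT glibc's wcwidth() implementation.
# Assumption: only the basic Hangul Jamo block (U+1100..U+11FF) and the two
# standalone fillers named in the commit message are modelled.


def sketch_width(code_point: int) -> int:
    """Return a toy column width for one code point (hypothetical helper)."""
    if 0x1100 <= code_point <= 0x115F:
        # Leading consonants (choseong), including U+115F CHOSEONG FILLER.
        return 2
    if 0x1160 <= code_point <= 0x11FF:
        # Vowels (jungseong) and trailing consonants (jongseong).
        return 0
    if code_point in (0x3164, 0xFFA0):
        # HANGUL FILLER and HALFWIDTH HANGUL FILLER are default-ignorable.
        return 0
    return 1  # default width


def block_width(text: str) -> int:
    """Sum the per-code-point widths of a string."""
    return sum(sketch_width(ord(ch)) for ch in text)


full_block = "\u1112\u1161\u11ab"    # CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
filler_block = "\u115f\u1161\u11ab"  # CHOSEONG FILLER + JUNGSEONG A + JONGSEONG NIEUN
print(block_width(full_block), block_width(filler_block))  # prints "2 2"
```

Both blocks come out at width 2 because only the leading position contributes
columns, which is why the choseong filler cannot be given width 0.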
---

The Unicode Standard, §5.21 - Characters Ignored for Display says the following:

> A small number of format characters (General_Category = Cf) are also not given the Default_Ignorable_Code_Point property.
> This may surprise implementers, who often assume that all format characters are generally ignored in fallback display.
> The exact list of these exceptional format characters can be found in the Unicode Character Database.
> There are, however, three important sets of such format characters to note:
>
> - prepended concatenation marks
> - interlinear annotation characters
> - Egyptian hieroglyph format controls
>
> The prepended concatenation marks always have a visible display.
> See “Prepended Concatenation Marks” in [*Section 23.2, Layout Controls*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.35858.HeadingBreak.132.Layout.Controls)
> for more discussion of the use and display of these signs.
>
> The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters,
> U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR,
> and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE.
> These characters should have a visible glyph display for fallback rendering, because if they are not displayed,
> it is too easy to misread the resulting displayed text.
> See “Annotation Characters” in [*Section 23.8, Specials*](https://www.unicode.org/versions/Unicode15.1.0/ch23.pdf#M9.21335.Heading.133.Specials),
> as well as [*Section 11.4, Egyptian Hieroglyphs*](https://www.unicode.org/versions/Unicode15.1.0/ch11.pdf#M9.73291.Heading.1418.Egyptian.Hieroglyphs)
> for more discussion of the use and display of these characters.

glibc currently correctly assigns non-zero width to the prepended concatenation marks,
but it incorrectly gives zero width to the interlinear annotation characters (which a generic terminal cannot interpret)
and the Egyptian hieroglyph format controls (which are not widely supported in rendering implementations at present).
This commit fixes both these issues as well.

Signed-off-by: Jules Bertholet
---
 localedata/charmaps/UTF-8          | 21 ++++++----
 localedata/unicode-gen/Makefile    |  2 +
 localedata/unicode-gen/utf8_gen.py | 67 +++++++++++++++++-------------
 3 files changed, 53 insertions(+), 37 deletions(-)

diff --git a/localedata/charmaps/UTF-8 b/localedata/charmaps/UTF-8
index bd8075f20d..f3fcd64fce 100644
--- a/localedata/charmaps/UTF-8
+++ b/localedata/charmaps/UTF-8
@@ -49842,12 +49842,17 @@ END CHARMAP
 
 % Character width according to Unicode 15.0.0.
 % - Default width is 1.
+% - U+115F HANGUL CHOSEONG FILLER has width 2.
+% - Combining jungseong and jongseong Hangul jamo have width 0.
+% - U+00AD SOFT HYPHEN has width 1.
 % - Double-width characters have width 2; generated from
 % "grep '^[^;]*;[WF]' EastAsianWidth.txt"
-% - Non-spacing characters have width 0; generated from PropList.txt or
-% "grep '^[^;]*;[^;]*;[^;]*;[^;]*;NSM;' UnicodeData.txt"
-% - Format control characters have width 0; generated from
-% "grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt"
+% - Non-spacing marks have width 0; generated from
+% "grep '^[^;]*;[^;]*;Mn;' UnicodeData.txt"
+% - Enclosing marks have width 0; generated from
+% "grep '^[^;]*;[^;]*;Me;' UnicodeData.txt"
+% - "Default_Ignorable_Code_Point"s have width 0; generated from
+% "grep '^[^;]*;\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt"
 WIDTH
 ... 0
 ... 0
@@ -50069,7 +50074,9 @@ WIDTH
 ... 0
 ... 2
 ... 2
-... 2
+... 2
+ 0
+... 2
 ... 2
 ... 2
 ... 2
@@ -50124,8 +50131,8 @@ WIDTH
 ... 2
  0
 ... 2
+ 0
 ... 2
-... 0
  0
  0
 ... 0
@@ -50226,7 +50233,7 @@ WIDTH
 ... 0
  0
  0
-... 0
+ 0
 ... 0
 ... 0
 ... 0
diff --git a/localedata/unicode-gen/Makefile b/localedata/unicode-gen/Makefile
index fd0c732ac4..1975065679 100644
--- a/localedata/unicode-gen/Makefile
+++ b/localedata/unicode-gen/Makefile
@@ -1,4 +1,5 @@
 # Copyright (C) 2015-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 
 # The GNU C Library is free software; you can redistribute it and/or
@@ -94,6 +95,7 @@ UTF-8: UnicodeData.txt EastAsianWidth.txt
 UTF-8: utf8_gen.py
 	$(PYTHON3) utf8_gen.py -u UnicodeData.txt \
 	-e EastAsianWidth.txt -p PropList.txt \
+	-d DerivedCoreProperties.txt \
 	--unicode_version $(UNICODE_VERSION)
 
 UTF-8-report: UTF-8 ../charmaps/UTF-8
diff --git a/localedata/unicode-gen/utf8_gen.py b/localedata/unicode-gen/utf8_gen.py
index b48dc2aaa4..eedf6eadb0 100755
--- a/localedata/unicode-gen/utf8_gen.py
+++ b/localedata/unicode-gen/utf8_gen.py
@@ -1,6 +1,7 @@
 #!/usr/bin/python3
 # -*- coding: utf-8 -*-
 # Copyright (C) 2014-2023 Free Software Foundation, Inc.
+# Copyright (C) 2024 The GNU Toolchain Authors.
 # This file is part of the GNU C Library.
 #
 # The GNU C Library is free software; you can redistribute it and/or
@@ -28,7 +29,6 @@ It will output UTF-8 file
 '''
 
 import argparse
-import sys
 import re
 import unicode_utils
 
@@ -203,25 +203,24 @@ def write_header_width(outfile, unicode_version):
     outfile.write('% Character width according to Unicode '
                   + '{:s}.\n'.format(unicode_version))
     outfile.write('% - Default width is 1.\n')
+    outfile.write('% - U+115F HANGUL CHOSEONG FILLER has width 2.\n')
+    outfile.write('% - Combining jungseong and jongseong Hangul jamo have width 0.\n')
+    outfile.write('% - U+00AD SOFT HYPHEN has width 1.\n')
     outfile.write('% - Double-width characters have width 2; generated from\n')
     outfile.write('% "grep \'^[^;]*;[WF]\' EastAsianWidth.txt"\n')
-    outfile.write('% - Non-spacing characters have width 0; '
-                  + 'generated from PropList.txt or\n')
-    outfile.write('% "grep \'^[^;]*;[^;]*;[^;]*;[^;]*;NSM;\' '
-                  + 'UnicodeData.txt"\n')
-    outfile.write('% - Format control characters have width 0; '
-                  + 'generated from\n')
-    outfile.write("% \"grep '^[^;]*;[^;]*;Cf;' UnicodeData.txt\"\n")
-# Not needed covered by Cf
-# outfile.write("% - Zero width characters have width 0; generated from\n")
-# outfile.write("% \"grep '^[^;]*;ZERO WIDTH ' UnicodeData.txt\"\n")
+    outfile.write('% - Non-spacing marks have width 0; generated from\n')
+    outfile.write('% "grep \'^[^;]*;[^;]*;Mn;\' UnicodeData.txt"\n')
+    outfile.write('% - Enclosing marks have width 0; generated from\n')
+    outfile.write('% "grep \'^[^;]*;[^;]*;Me;\' UnicodeData.txt"\n')
+    outfile.write('% - "Default_Ignorable_Code_Point"s have width 0; generated from\n')
+    outfile.write("% \"grep '^[^;]*;\\s*Default_Ignorable_Code_Point' DerivedCoreProperties.txt\"\n")
     outfile.write("WIDTH\n")
 
 
-def process_width(outfile, ulines, elines, plines):
+def process_width(outfile, ulines, elines, dlines):
     '''ulines are lines from UnicodeData.txt, elines are lines from
-    EastAsianWidth.txt containing characters with width “W” or “F”,
-    plines are lines from PropList.txt which contain characters
-    with the property “Prepended_Concatenation_Mark”.
+    EastAsianWidth.txt containing characters with width “W” or “F”.
+    dlines are lines from DerivedCoreProperties.txt which contain
+    characters with the property “Default_Ignorable_Code_Point”.
     '''
     width_dict = {}
@@ -237,12 +236,12 @@ def process_width(outfile, ulines, elines, plines):
 
     for line in ulines:
         fields = line.split(";")
-        if fields[4] == "NSM" or fields[2] in ("Cf", "Me", "Mn"):
+        if fields[4] == "NSM" or fields[2] in ("Me", "Mn"):
             width_dict[int(fields[0], 16)] = 0
 
-    for line in plines:
-        # Characters with the property “Prepended_Concatenation_Mark”
-        # should have the width 1:
+    for line in dlines:
+        # Characters with the property “Default_Ignorable_Code_Point”
+        # should have the width 0:
         fields = line.split(";")
         if not '..' in fields[0]:
             code_points = (fields[0], fields[0])
@@ -250,7 +249,13 @@ def process_width(outfile, ulines, elines, plines):
             code_points = fields[0].split("..")
         for key in range(int(code_points[0], 16),
                          int(code_points[1], 16)+1):
-            del width_dict[key] # default width is 1
+            width_dict[key] = 0
+
+    # special case: U+115F HANGUL CHOSEONG FILLER
+    # combines with other Hangul jamo to form a width-2
+    # syllable block, so treat it as width 2
+    # despite it being a `Default_Ignorable_Code_Point`
+    width_dict[0x115F] = 2
 
     # handle special cases for compatibility
     for key in list((0x00AD,)):
@@ -302,7 +307,7 @@ def process_width(outfile, ulines, elines, plines):
 if __name__ == "__main__":
     PARSER = argparse.ArgumentParser(
         description='''
-        Generate a UTF-8 file from UnicodeData.txt, EastAsianWidth.txt, and PropList.txt.
+        Generate a UTF-8 file from UnicodeData.txt, DerivedCoreProperties.txt, and EastAsianWidth.txt
         ''')
     PARSER.add_argument(
         '-u', '--unicode_data_file',
@@ -319,11 +324,11 @@ if __name__ == "__main__":
         help=('The EastAsianWidth.txt file to read, '
               + 'default: %(default)s'))
     PARSER.add_argument(
-        '-p', '--prop_list_file',
+        '-d', '--derived_core_properties_file',
         nargs='?',
         type=str,
-        default='PropList.txt',
-        help=('The PropList.txt file to read, '
+        default='DerivedCoreProperties.txt',
+        help=('The DerivedCoreProperties.txt file to read, '
               + 'default: %(default)s'))
     PARSER.add_argument(
         '--unicode_version',
@@ -352,11 +357,13 @@ if __name__ == "__main__":
                 continue
             if re.match(r'^[^;]*;[WF]', LINE):
                 EAST_ASIAN_WIDTH_LINES.append(LINE.strip())
-    with open(ARGS.prop_list_file, mode='r') as PROP_LIST_FILE:
-        PROP_LIST_LINES = []
-        for LINE in PROP_LIST_FILE:
-            if re.match(r'^[^;]*;[\s]*Prepended_Concatenation_Mark', LINE):
-                PROP_LIST_LINES.append(LINE.strip())
+    with open(ARGS.derived_core_properties_file, mode='r') as DERIVED_CORE_PROPERTIES_FILE:
+        DERIVED_CORE_PROPERTIES_LINES = []
+        for LINE in DERIVED_CORE_PROPERTIES_FILE:
+            if re.match(r'.*', LINE):
+                continue
+            if re.match(r'^[^;]*;\s*Default_Ignorable_Code_Point', LINE):
+                DERIVED_CORE_PROPERTIES_LINES.append(LINE.strip())
     with open('UTF-8', mode='w') as OUTFILE:
         # Processing UnicodeData.txt and write CHARMAP to UTF-8 file
         write_header_charmap(OUTFILE)
@@ -367,5 +374,5 @@ if __name__ == "__main__":
         process_width(OUTFILE,
                       UNICODE_DATA_LINES,
                       EAST_ASIAN_WIDTH_LINES,
-                      PROP_LIST_LINES)
+                      DERIVED_CORE_PROPERTIES_LINES)
         OUTFILE.write("END WIDTH\n")
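
For anyone who wants to try the new width rule outside the glibc tree, here is a
minimal standalone sketch of the `Default_Ignorable_Code_Point` pass that this
patch adds to `process_width()`. It is an approximation under stated
assumptions, not the patched script itself: `ignorable_widths` is a hypothetical
helper name, and `SAMPLE_LINES` is a short hand-written excerpt in the style of
DerivedCoreProperties.txt rather than data read from the real file.

```python
# Standalone sketch of the Default_Ignorable_Code_Point pass.
# Assumption: SAMPLE_LINES imitates the layout of DerivedCoreProperties.txt;
# it is a hand-written excerpt, not the real file.
SAMPLE_LINES = [
    "00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN",
    "115F..1160    ; Default_Ignorable_Code_Point # Lo   [2] HANGUL CHOSEONG FILLER..HANGUL JUNGSEONG FILLER",
    "200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK",
    "3164          ; Default_Ignorable_Code_Point # Lo       HANGUL FILLER",
]


def ignorable_widths(dlines):
    """Hypothetical helper: build a {code point: width} dict from dlines."""
    width_dict = {}
    for line in dlines:
        # The first ';'-separated field is a code point or a "first..last" range.
        code_field = line.split(";")[0].strip()
        if ".." in code_field:
            first, last = code_field.split("..")
        else:
            first = last = code_field
        for key in range(int(first, 16), int(last, 16) + 1):
            width_dict[key] = 0
    # Carveout: the choseong filler keeps width 2 so syllable blocks stay width 2.
    width_dict[0x115F] = 2
    # Compatibility: the soft hyphen falls back to the default width of 1.
    width_dict.pop(0x00AD, None)
    return width_dict


WIDTHS = ignorable_widths(SAMPLE_LINES)
for cp in (0x00AD, 0x115F, 0x1160, 0x200B, 0x3164):
    # Code points absent from the dict have the default width 1.
    print(f"U+{cp:04X} -> {WIDTHS.get(cp, 1)}")
```

Applying the two overrides after the loop, rather than filtering inside it,
mirrors the order used in the patch: the property data sets the baseline and the
special cases are layered on top.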