Date: Mon, 15 Jun 2009 10:44:43 +0200
From: Corinna Vinschen
To: cygwin AT cygwin DOT com, newlib AT sourceware DOT org
Subject: [PATCH] Add "@cjknarrow" modifier (was Re: [Fwd: [1.7] wcwidth failing configure tests])
Message-ID: <20090615084443.GO5039@calimero.vinschen.de>

On Jun 14 22:18, IWAMURO Motonori wrote:
> 2009/6/13 Corinna Vinschen
> > The problem appears to be that there is no standard for the handling
> > of ambiguous characters.
>
> Yes, but the guideline exists.
> http://cygwin.com/ml/cygwin/2009-05/msg00444.html

A single mail in a single mailing list of a single project.
That's rather a suggestion than a guideline...

> > > Ambiguous characters behave like wide or narrow characters depending
> > > on the context (language tag, script identification, associated
> > > font, source of data, or explicit markup; all can provide the
> > > context). If the context cannot be established reliably, they should
> > > be treated as narrow characters by default.
>
> > Define the default for ja, ko, and zh to use width = 2, with a
> > @cjknarrow (or whatever) modifier to use width = 1.
>
> I think it is good idea.

If everybody agrees to this suggestion, here's the patch.  Tested with
various combinations like

  LANG=ja_JP.UTF-8@cjknarrow
  LANG=ja_JP@cjknarrow
  LANG=ja.UTF-8@cjknarrow
  LANG=ja@cjknarrow


Corinna


        * libc/locale/locale.c (loadlocale): Add handling of "@cjknarrow"
        modifier on _MB_CAPABLE targets.  Add comment to explain.


Index: libc/locale/locale.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.20
diff -u -p -r1.20 locale.c
--- libc/locale/locale.c        3 Jun 2009 19:28:22 -0000       1.20
+++ libc/locale/locale.c        15 Jun 2009 08:40:46 -0000
@@ -397,6 +397,9 @@ loadlocale(struct _reent *p, int categor
   int (*l_wctomb) (struct _reent *, char *, wchar_t, const char *, mbstate_t *);
   int (*l_mbtowc) (struct _reent *, wchar_t *, const char *, size_t,
                    const char *, mbstate_t *);
+#ifdef _MB_CAPABLE
+  int cjknarrow = 0;
+#endif
 
   /* "POSIX" is translated to "C", as on Linux. */
   if (!strcmp (locale, "POSIX"))
@@ -427,10 +430,14 @@ loadlocale(struct _reent *p, int categor
       if (c[0] == '.')
         {
           /* Charset */
-          strcpy (charset, c + 1);
-          if ((c = strchr (charset, '@')))
+          char *chp;
+
+          ++c;
+          strcpy (charset, c);
+          if ((chp = strchr (charset, '@')))
             /* Strip off modifier */
-            *c = '\0';
+            *chp = '\0';
+          c += strlen (charset);
         }
       else if (c[0] == '\0' || c[0] == '@')
         /* End of string or just a modifier */
@@ -442,6 +449,17 @@ loadlocale(struct _reent *p, int categor
       else
         /* Invalid string */
         return NULL;
+#ifdef _MB_CAPABLE
+      if (c[0] == '@')
+        {
+          /* Modifier */
+          /* Only one modifier is recognized right now.  "cjknarrow" is used
+             to modify the behaviour of wcwidth() for East Asian languages.
+             For details see the comment at the end of this function. */
+          if (!strcmp (c + 1, "cjknarrow"))
+            cjknarrow = 1;
+        }
+#endif
     }
   /* We only support this subset of charsets. */
   switch (charset[0])
@@ -604,13 +622,15 @@ loadlocale(struct _reent *p, int categor
       __mbtowc = l_mbtowc;
       __set_ctype (charset);
       /* Check for the language part of the locale specifier.  In case
-         of "ja", "ko", or "zh", assume the use of CJK fonts.  This is
-         stored in lc_ctype_cjk_lang and tested in wcwidth() to figure
-         out the width to return (1 or 2) for the "CJK Ambiguous Width"
-         category of characters. */
-      lc_ctype_cjk_lang = (strncmp (locale, "ja", 2) == 0
-                           || strncmp (locale, "ko", 2) == 0
-                           || strncmp (locale, "zh", 2) == 0);
+         of "ja", "ko", or "zh", assume the use of CJK fonts, unless the
+         "@cjknarrow" modifier has been specified.
+         The result is stored in lc_ctype_cjk_lang and tested in wcwidth()
+         to figure out the width to return (1 or 2) for the "CJK Ambiguous
+         Width" category of characters. */
+      lc_ctype_cjk_lang = !cjknarrow
+                          && ((strncmp (locale, "ja", 2) == 0
+                               || strncmp (locale, "ko", 2) == 0
+                               || strncmp (locale, "zh", 2) == 0));
 #endif
     }
   else if (category == LC_MESSAGES)

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
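
To try the decision logic outside of newlib, the rule from the patch
(width 2 for "ja", "ko", and "zh" unless "@cjknarrow" is given, width 1
otherwise) can be approximated by the standalone sketch below.  The
function names has_cjknarrow_modifier() and ambiguous_width() are
hypothetical and do not exist in newlib.  Unlike loadlocale(), the sketch
does not validate the language, territory, or charset parts of the
string; it only looks at the two-letter prefix and at whatever follows
the first '@'.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not a newlib function): true if the locale string
   carries exactly the "cjknarrow" modifier after the first '@'. */
bool
has_cjknarrow_modifier (const char *locale)
{
  const char *mod = strchr (locale, '@');

  return mod != NULL && strcmp (mod + 1, "cjknarrow") == 0;
}

/* Hypothetical helper (not a newlib function): width to assume for
   "CJK Ambiguous Width" characters under the given locale string.
   Mirrors the lc_ctype_cjk_lang expression in the patch, but skips the
   syntax checks that loadlocale() performs on the locale name. */
int
ambiguous_width (const char *locale)
{
  bool cjk_lang = strncmp (locale, "ja", 2) == 0
                  || strncmp (locale, "ko", 2) == 0
                  || strncmp (locale, "zh", 2) == 0;

  return (cjk_lang && !has_cjknarrow_modifier (locale)) ? 2 : 1;
}
```

With these assumptions, ambiguous_width ("ja_JP.UTF-8") yields 2, while
ambiguous_width ("ja_JP.UTF-8@cjknarrow") and
ambiguous_width ("en_US.UTF-8") both yield 1.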