Mail Archives: cygwin/2009/07/28/08:05:57

X-Recipient: archive-cygwin AT delorie DOT com
X-Spam-Check-By: sourceware.org
Date: Tue, 28 Jul 2009 14:05:31 +0200
From: Corinna Vinschen <corinna-cygwin AT cygwin DOT com>
To: cygwin AT cygwin DOT com
Subject: Re: wchar_t width (was: bug in mbrtowc?)
Message-ID: <20090728120531.GA22988@calimero.vinschen.de>
Reply-To: cygwin AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
References: <416096c60907280437ie8febfme33c238431fa7da8 AT mail DOT gmail DOT com>
MIME-Version: 1.0
In-Reply-To: <416096c60907280437ie8febfme33c238431fa7da8@mail.gmail.com>
User-Agent: Mutt/1.5.19 (2009-02-20)
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Unsubscribe: <mailto:cygwin-unsubscribe-archive-cygwin=delorie DOT com AT cygwin DOT com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On Jul 28 12:37, Andy Koppe wrote:
> 2009/7/28 Corinna Vinschen:
> >> Trouble is, the hack will also only work correctly if the whole UTF-8
> >> sequence for the non-BMP character is passed at once. If you pass the
> >> bytes one-by-one instead, and assuming the bug above wasn't there,
> >> you'd get this:
> >
> > Yes, I know.  The real trouble is, I don't know how that can be fixed
> > in a still sort-of-POSIXy way.
> 
> The way I'd suggested is sort-of-POSIXy, but perhaps not enough,
> because apps that check the mbrtowc() return code (and not the written
> wc) against zero will interpret a low surrogate as string end. An
> alternative might be to just return an error when there's no compliant
> way to return the low surrogate. Do you think either of these is
> worth pursuing?

I'm thinking of faking a valid return of 1 (or 2, or 3) after the third byte
has been read.  Three bytes are sufficient to create the first surrogate
half in wc.  Upon reading the last byte, return 1 and set wc to the second
surrogate half.
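
A minimal sketch of the bit arithmetic behind this, not the actual newlib
code: the function names are made up for illustration, and the bytes are
assumed to have been validated already as a well-formed four-byte UTF-8
sequence.  The first three bytes carry bits 20..6 of the code point, which
covers the bits 20..10 needed for the first surrogate half.  The second
half needs only the low four payload bits of the third byte plus the
fourth byte.

```c
#include <stdint.h>

/* First (high) surrogate half, computed from the first three bytes of a
   four-byte UTF-8 sequence.  Hypothetical helper; input is assumed valid. */
uint16_t high_surrogate(uint8_t b0, uint8_t b1, uint8_t b2)
{
    /* 3 + 6 + 6 payload bits = bits 20..6 of the code point. */
    uint32_t top15 = ((uint32_t)(b0 & 0x07) << 12)
                   | ((uint32_t)(b1 & 0x3F) << 6)
                   |  (uint32_t)(b2 & 0x3F);

    /* Drop bits 9..6 to get bits 20..10, then subtract 0x10000 >> 10. */
    return (uint16_t)(0xD800 + ((top15 >> 4) - 0x40));
}

/* Second (low) surrogate half: bits 9..6 come from the third byte,
   bits 5..0 from the fourth.  Hypothetical helper; input is assumed valid. */
uint16_t low_surrogate(uint8_t b2, uint8_t b3)
{
    return (uint16_t)(0xDC00
                      | ((uint32_t)(b2 & 0x0F) << 6)
                      |  (uint32_t)(b3 & 0x3F));
}
```

For example, U+1F600 is F0 9F 98 80 in UTF-8: high_surrogate(0xF0, 0x9F, 0x98)
gives 0xD83D and low_surrogate(0x98, 0x80) gives 0xDE00.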

> Therefore I think long-term Cygwin's wchar_t will need to change to 32
> bits for Linux compatibility. Of course that would require major,
> ABI-breaking changes:

That's really not an option for now.

> - Introduce a separate type for representing UTF-16, e.g. "vchar_t",
> because 'v' is half a 'w' ;)

There's a proposal to the C standard to add specific Unicode types
along the lines of c_utf[8,16,32]_t or something like that.

That doesn't help us in the least, unfortunately.  For now we have to live
with the wrong decision to make wchar_t 16 bits wide on the Win32 platform.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
