Date: Tue, 28 Jul 2009 06:22:56 +0100
Subject: Re: bug in mbrtowc?
From: Andy Koppe
To: cygwin AT cygwin DOT com

2009/7/28 Pedro Izecksohn:
>> #include <stdio.h>
>> #include <locale.h>
>> #include <wchar.h>
>>
>> int main(void) {
>>   wchar_t wc;
>>   size_t ret;
>>   mbstate_t s = { 0 };
>>   puts(setlocale(LC_CTYPE, "en_GB.UTF-8"));
>>   printf("%i\n", mbrtowc(&wc, "\xe2", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x94", 1, 0));
>>   printf("%i\n", mbrtowc(&wc, "\x84", 1, 0));
>>   printf("%x\n", wc);
>>   return 0;
>> }
>>
>> The sequence E2 94 84 should translate to U+2504. Instead, the second
>> and third calls to mbrtowc report encoding errors. It does work
>> correctly if the three bytes are passed to mbrtowc() in one go:
>
> From the "Linux Programmer’s Manual" (release 3.15 of the Linux man-pages):
> "If the n bytes starting at s do not contain a complete multibyte
> character, mbrtowc() returns (size_t) -2."

Correct. And the first call to mbrtowc() does just that.
The problem is that the second call returns -1, which signals an encoding error, even though E2 94 is a valid but incomplete sequence. It should also return -2 and remember what it has seen so far in its internal state. The third call should then return 1 and write 0x2504 to wc.

Andy
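
For illustration only, here is a minimal sketch of the stateful, byte-at-a-time behaviour described above. It is not the Cygwin/newlib implementation of mbrtowc(): the names utf8_state and utf8_feed are hypothetical, and the sketch only borrows mbrtowc()'s return conventions (-2 for "incomplete so far", -1 for an encoding error, 1 when a one-byte call completes a character). It skips checks a real decoder would need, such as overlong encodings, surrogates and the special return value for a NUL character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state; NOT the real mbstate_t layout. */
typedef struct {
    uint32_t partial;   /* code point bits collected so far */
    int remaining;      /* continuation bytes still expected */
} utf8_state;

/*
 * Feed a single byte to the decoder.
 * Returns  1 if the byte completed a character (stored in *out),
 *         -2 if the byte was accepted but the sequence is incomplete,
 *         -1 if the byte is invalid in the current state (state is reset).
 * Simplified: no overlong/surrogate checks, NUL is reported as 1.
 */
int utf8_feed(utf8_state *st, unsigned char byte, uint32_t *out)
{
    assert(st != NULL && out != NULL);

    if (st->remaining == 0) {
        /* Expecting the first byte of a sequence. */
        if (byte < 0x80) {
            *out = byte;                 /* plain ASCII */
            return 1;
        }
        if ((byte & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte sequence */
            st->partial = byte & 0x1F;
            st->remaining = 1;
        } else if ((byte & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte sequence */
            st->partial = byte & 0x0F;
            st->remaining = 2;
        } else if ((byte & 0xF8) == 0xF0) { /* 11110xxx: 4-byte sequence */
            st->partial = byte & 0x07;
            st->remaining = 3;
        } else {
            return -1;                   /* stray continuation or invalid lead */
        }
        return -2;                       /* valid lead byte, still incomplete */
    }

    /* In the middle of a sequence: only 10xxxxxx is acceptable. */
    if ((byte & 0xC0) != 0x80) {
        st->partial = 0;
        st->remaining = 0;
        return -1;
    }

    st->partial = (st->partial << 6) | (uint32_t)(byte & 0x3F);
    if (--st->remaining > 0)
        return -2;                       /* remembered, but not complete yet */

    *out = st->partial;
    st->partial = 0;
    return 1;
}
```

Starting from a zeroed utf8_state and feeding the bytes E2, 94 and 84 one at a time gives -2, -2 and then 1, with 0x2504 stored in the output. That is the sequence of results the three mbrtowc() calls in the test program ought to produce.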