Mail Archives: cygwin/2009/12/29/08:20:56

X-Recipient: archive-cygwin AT delorie DOT com
X-SWARE-Spam-Status: No, hits=-2.0 required=5.0 tests=AWL,BAYES_00,SPF_SOFTFAIL
X-Spam-Check-By: sourceware.org
Message-ID: <4B3A0246.4050705@byu.net>
Date: Tue, 29 Dec 2009 06:21:10 -0700
From: Eric Blake <ebb9 AT byu DOT net>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.23) Gecko/20090812 Thunderbird/2.0.0.23 Mnenhy/0.7.6.666
MIME-Version: 1.0
To: cygwin AT cygwin DOT com, rodmedina AT cantv DOT net
Subject: Re: gcc4[1.7] printf treats differently a string constant and a character array
References: <380-2200912128193944786 AT cantv DOT net> <416096c60912281437o16aec4cct8b64b7518d9a9a1 AT mail DOT gmail DOT com> <416096c60912282217h57cf311h6af5d98ff9580f0 AT mail DOT gmail DOT com>
In-Reply-To: <416096c60912282217h57cf311h6af5d98ff9580f0@mail.gmail.com>
X-IsSubscribed: yes
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

According to Andy Koppe on 12/28/2009 11:17 PM:
>>> I am using LC_ALL=es_VE.ISO-8859-15.

So you told gcc which charset to use for those non-ASCII characters, which
resulted in raw 8-bit bytes.  puts is required to work transparently on
bytes, but printf is specified as a mix between bytes (arguments matching
%s) and characters (the format string itself, and arguments matching %ls).

> 
> Ah, the problem actually is that your program is missing a call to
> setlocale(LC_CTYPE, "") to switch to the locale and character set
> specified in the environment. In fact, since your program contains
> hard-coded ISO-8859-15 strings, you should probably do
> setlocale(LC_CTYPE, "<whatever>.ISO-8859-15").

Well, as long as you are running it on your machine, with
LC_ALL=es_VE.ISO-8859-15 in the environment, then setlocale(LC_ALL,"")
will pick up the same charset that gcc hard-coded into your app.  But
yes, by using 8-bit bytes in your strings, you have tied your executable
to one particular charset, and it is no longer portable to machines using
a different one.  To be more portable, you would want to use some
iconv() conversions (or look into using gettext() for translation catalogs).

> 
> Without a setlocale call, programs use the "C" locale, and on Cygwin
> 1.7 that implies the UTF-8 character set. Those single accented
> ISO-8859-15 characters are invalid when interpreted as UTF-8, so
> printf halts there. The accented character pairs like "á", meanwhile,
> happen to be valid UTF-8, so they get through.
> 
> I couldn't find specific text about invalid bytes in the POSIX printf
> spec,

http://www.opengroup.org/onlinepubs/9699919799/functions/fprintf.html

"all forms of fprintf() shall fail if:

[EILSEQ]
    [CX] A wide-character code that does not correspond to a valid
character has been detected."

> It's talking about "characters" rather than "bytes" there, which I
> think does leave the behaviour for invalid bytes undefined,

It's actually well-defined: a byte sequence in the format string that does
not form a valid character MUST make printf fail.  What it leaves open is
whether the failure must occur before any output is produced, or whether
printf may fail only on reaching the first invalid character, after earlier
characters and % directives have already been acted upon.  I think the
standard is silent on that point, making it a quality-of-implementation
(QoI) issue.

Remember, POSIX states that any use in a character context of bytes with
the 8th bit set is specifically undefined in the C locale (whether that be
C.ASCII or C.UTF-8).  Using accented characters (which result in bytes
with the 8th bit set, whether you use UTF-8 or ISO-8859-15) falls into
that category, so the bug is in your program for expecting sane results
without first changing the locale away from C.

- --
Don't work too hard, make some time for fun as well!

Eric Blake             ebb9 AT byu DOT net
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Public key at home.comcast.net/~ericblake/eblake.gpg
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAks6AkYACgkQ84KuGfSFAYCcZwCfSqNz9qdjxEBXHMwtPJ+8bx9T
6S4AoJlgfarKywPgDH6TY3Zy16/3jc1K
=YRTJ
-----END PGP SIGNATURE-----

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
