Mail Archives: cygwin/2010/09/19/16:34:01
| In-Reply-To: | <20100918200851.GA5760@calimero.vinschen.de> | 
| References: | <AANLkTikzGH8GUZ5ZUytSJShfYE=KMyphyue83Q8XMm4- AT mail DOT gmail DOT com>	<20100916092458 DOT GB15121 AT calimero DOT vinschen DOT de>	<AANLkTimwcbmxMtfZWbkztef+fxQfKtoM9CsFOd38E2a3 AT mail DOT gmail DOT com>	<20100918092139 DOT GE14602 AT calimero DOT vinschen DOT de>	<20100918200851 DOT GA5760 AT calimero DOT vinschen DOT de> | 
| Date: | Sun, 19 Sep 2010 16:33:45 -0400 | 
| Message-ID: | <AANLkTi=O_VkQEdXfCLsRQa40zM7min2X=cwosFM95oTU@mail.gmail.com> | 
| Subject: | Re: awk gsub problem | 
| From: | Lee <ler762 AT gmail DOT com> | 
| To: | cygwin AT cygwin DOT com | 
On 9/18/10, Corinna Vinschen wrote:
> On Sep 18 11:21, Corinna Vinschen wrote:
>> On Sep 17 22:30, Lee wrote:
>> > On 9/16/10, Corinna Vinschen wrote:
>> > > On Sep 15 18:30, Lee wrote:
>> > >> I don't know if this is just a problem with the cygwin version of
>> > >> awk, me misunderstanding something or what, but it looks like gsub
>> > >> isn't working correctly in awk:
>> > >> $ sh /tmp/test.awk
>> > >> s= ::0::  should = ::S0::
>> > >>
>> > >> $ cat /tmp/test.awk
>> > >> awk '
>> > >> BEGIN {
>> > >>   s="Serial0"
>> > >>   gsub("[a-z]","",s)
>> > >>   printf("s= ::%s::  should = ::S0::\n", s)
>> > >>   exit
>> > >> } '
>> > >>
>> > >> I also tried it with IGNORECASE=0 and with "awk --traditional" - same
>> > >> results.
>> > > Works fine for me:
>> >
>> > Comment out the 'set LANG=' and gsub works fine:
>> > $ echo $LANG
>> > C.UTF-8
>> >
>> > $ sh /tmp/test.awk
>> > s= ::S0::  should = ::S0::
>> >
>> > $ export LANG=en_US.UTF-8
>> >
>> > $ sh /tmp/test.awk
>> > s= ::0::  should = ::S0::
>> >
>> > So awk gsub works for me again - thank you!
>> >
>> > Just out of curiosity, why would setting LANG to en_US break
>> > case-sensitivity in gsub?
>>
>> I don't know either.  I just asked the upstream maintainer.  At least it
>> isn't a Cygwin problem, since it also behaves the same on Linux.
>
> I got a reply from the upstream maintainer.  Case-sensitivity in gsub is
> not broken; rather, it's really a language-dependent difference.
>
> If LANG is "en_US" or "en_US.utf8", then the regular expression "[a-z]"
> does *not* correspond anymore to the ASCII codes.  Rather it corresponds
> to something like "[aAbBcCdD...zZ]", independent of the actual character
> encoding ISO-8859-1 or UTF-8.
Thank you - I appreciate the follow-up.
Was the reply from the upstream maintainer posted to a mailing list
(and if so, which one)?  I'd like to understand the problem they're
solving.  I get the idea of "[[:lower:]]" working regardless of the
collating order of the current character set, but how "[a-z]" gets
translated to something like "[aAbBcCdD...zZ]" boggles my mind.  It
seems like they had to have gone out of their way to translate [a-z]
into a case-insensitive RE.
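A possible explanation, stated here as an assumption and not as the
maintainer's answer: the range a-z may be expanded in the locale's
collating order rather than by character code, and in a dictionary-style
collation the uppercase letters sit between the lowercase ones.  The
sketch below uses sort (not awk's regex engine) only to show how the
ordering of the same four letters can depend on the locale.  c_order is a
made-up variable name, and the second command assumes the en_US.UTF-8
locale is installed.

```shell
# C locale: sort compares byte values, so every uppercase ASCII letter
# comes before every lowercase one.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "$c_order"   # prints ABab

# Assumption: en_US.UTF-8 is installed.  The result depends on the
# system's locale data, so no expected output is given here.
printf '%s\n' b A a B | LC_ALL=en_US.UTF-8 sort | tr -d '\n'; echo
```

If the second command interleaves upper and lower case, then a range
running from "a" to "z" in that ordering would pick up uppercase letters
along the way.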
But regardless, it still seems broken to me.  From the gawk man page:
   The various command line options control how gawk interprets
   characters in regular expressions.
   --traditional
      Traditional Unix awk regular expressions are matched.  The GNU
      operators are not special, interval expressions are not available,
      and neither are the POSIX character classes ([[:alnum:]] and so on).
The way I read it, I can change the line in my .bashrc from
  export AWK="/usr/bin/gawk.exe"
to
  export AWK="/usr/bin/gawk.exe --traditional"
and not have to change any scripts that use $AWK.  If "--traditional"
meant that one could no longer write a case-sensitive RE ("[a-z]" gets
translated into "[aAbB...zZ]" and "[[:lower:]]" isn't interpreted as a
lower case character RE) I'd expect that to be highlighted in the man
page.  But like I said in my initial message, --traditional doesn't fix
the problem:
$ cat test.awk
awk --traditional '
BEGIN {
  s="Serial0"
  gsub("[a-z]","",s)
  printf("s= ::%s::  should = ::S0::\n", s)
  exit
} '
$ export LANG=en_US.UTF-8
$ sh test.awk
s= ::0::  should = ::S0::
> What you really want is this:
s/really want/have to do/
>   BEGIN {
>     s="Serial0"
>     gsub("[[:lower:]]","",s)
>     printf("s= ::%s::  should = ::S0::\n", s)
>     exit
>   }
>
> The "[[:lower:]]" expression always catches all valid lowercase letters,
> independent of the language, territory, and charset used.
At least for the short term, my work-around is not setting LANG.
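A narrower form of the same work-around, as a minimal sketch: set
LC_ALL=C for the awk process only, so the rest of the session keeps
whatever LANG it has.  The result variable and the shortened script are
made up for illustration.

```shell
# LC_ALL=C applies to this one awk process only; LANG in the calling
# shell is left untouched.
result=$(LC_ALL=C awk '
BEGIN {
  s = "Serial0"
  gsub("[a-z]", "", s)   # in the C locale, a-z covers only ASCII lowercase
  print s
}')
printf 's= ::%s::  should = ::S0::\n' "$result"   # s= ::S0::  should = ::S0::
```

If scripts expand $AWK unquoted, something like
export AWK="env LC_ALL=C /usr/bin/gawk.exe" might achieve the same thing
without editing them, but that is an untested assumption.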
Thanks again,
Lee