Mail Archives: cygwin/2010/09/19/18:01:43
X-Recipient: archive-cygwin AT delorie DOT com
X-SWARE-Spam-Status: No, hits=-1.3 required=5.0 tests=AWL,BAYES_00,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FROM,RCVD_IN_DNSWL_NONE,RCVD_IN_SORBS_WEB,SARE_LWSHORTT,T_TO_NO_BRKTS_FREEMAIL
X-Spam-Check-By: sourceware.org
X-DKIM: Sendmail DKIM Filter v2.8.2 mail-in-03.arcor-online.net CFD8CD8517
Message-ID: <4C968836.3000903@arcor.de>
Date: Mon, 20 Sep 2010 00:01:26 +0200
From: Dirk Fassbender <dirk DOT fassbender AT arcor DOT de>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.9) Gecko/20100915 Lightning/1.0b2 Thunderbird/3.1.4
MIME-Version: 1.0
To: cygwin AT cygwin DOT com
Subject: Re: awk gsub problem
References: <AANLkTikzGH8GUZ5ZUytSJShfYE=KMyphyue83Q8XMm4- AT mail DOT gmail DOT com> <20100916092458 DOT GB15121 AT calimero DOT vinschen DOT de> <AANLkTimwcbmxMtfZWbkztef+fxQfKtoM9CsFOd38E2a3 AT mail DOT gmail DOT com> <20100918092139 DOT GE14602 AT calimero DOT vinschen DOT de> <20100918200851 DOT GA5760 AT calimero DOT vinschen DOT de> <AANLkTi=O_VkQEdXfCLsRQa40zM7min2X=cwosFM95oTU AT mail DOT gmail DOT com>
In-Reply-To: <AANLkTi=O_VkQEdXfCLsRQa40zM7min2X=cwosFM95oTU@mail.gmail.com>
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

On 19.09.2010 22:33, Lee wrote:
>
> Thank you - I appreciate the follow-up.
>
> Was the reply from the upstream maintainer answered on a mailing list?
> (& if so, which one?) I'd like to understand the problem they're
> solving.. I get the idea of "[[:lower:]]" working regardless of
> collating order of the current char set, but how "[a-z]" gets
> translated to something like "[aAbBcCdD...zZ]" boggles my mind. It
> seems like they had to have gone out of their way to translate [a-z]
> into a case-insensitive RE.
>
> But regardless, it still seems broken to me. From the gawk man page:
>
> The various command line options control how gawk interprets
> characters in regular expressions.
>
> --traditional
> Traditional Unix awk regular expressions are matched. The GNU
> operators are not special, interval expressions are not available, and
> neither are the POSIX character classes ([[:alnum:]] and so on).
>
> The way I read it, I can change the line in my .bashrc from
> export AWK="/usr/bin/gawk.exe"
> to
> export AWK="/usr/bin/gawk.exe --traditional"
> and not have to change any scripts that use $AWK. If "--traditional"
> meant one no longer was able to do a case-sensitive RE ("[a-z]" gets
> translated into "[aAbB...zZ]" and "[[:lower:]]" isn't interpreted as a
> lower case character RE) I'd expect that to be high-lighted in the man
> page. But like I said in my initial msg, --traditional doesn't fix
> the problem:
>
> $ cat test.awk
> awk --traditional '
> BEGIN {
> s="Serial0"
> gsub("[a-z]","",s)
> printf("s= ::%s:: should = ::S0::\n", s)
> exit
> } '
>
> $ export LANG=en_US.UTF-8
>
> $ sh test.awk
> s= ::0:: should = ::S0::
>
>
>> What you really want is this:
> s/really want/have to do/
>
>> BEGIN {
>> s="Serial0"
>> gsub("[[:lower:]]","",s)
>> printf("s= ::%s:: should = ::S0::\n", s)
>> exit
>> }
>>
>> The "[[:lower:]]" expression always catches all valid lowercase letters,
>> independent of the language, territory, and charset used.
> At least for the short term, my work-around is not setting LANG.
>
> Thanks again,
> Lee
>
> --
> Problem reports: http://cygwin.com/problems.html
> FAQ: http://cygwin.com/faq/
> Documentation: http://cygwin.com/docs.html
> Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
>
>
>
Hello Lee,

you have hit a well-known problem with different character sets.
It usually goes unnoticed, because the standard character sets on
UNIX, Linux, and Windows systems place the characters
"abcdefghijklmnopqrstuvwxyz" in one contiguous sequence. That is not
true of every character set; EBCDIC is one example.
Differences between character sets are a major problem when porting
programs from one system to another.

The gawk man page is not complete. Many GNU programs keep their full
(or better) documentation in the info pages.
The documentation for your problem is available with the following
command:

    info gawk character list

The relevant text is the first paragraph of that info page:

2.4 Using Character Lists
=========================

   Within a character list, a "range expression" consists of two
characters separated by a hyphen. It matches any single character that
sorts between the two characters, using the locale's collating sequence
and character set. For example, in the default C locale, `[a-dx-z]' is
equivalent to `[abcdxyz]'. Many locales sort characters in dictionary
order, and in these locales, `[a-dx-z]' is typically not equivalent to
`[abcdxyz]'; instead it might be equivalent to `[aBbCcDdxXyYz]', for
example. To obtain the traditional interpretation of bracket
expressions, you can use the C locale by setting the `LC_ALL'
environment variable to the value `C'.

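As a minimal sketch of that last suggestion, applied to your test case:
the snippet below sets LC_ALL=C for the single awk call only, so the
rest of the session can keep LANG=en_US.UTF-8. It assumes a POSIX shell
and an awk on the PATH; the variable name "result" is only for
illustration.

```shell
# Force the C locale for this one awk invocation, so that [a-z] means
# exactly the 26 ASCII lowercase letters regardless of LANG.
result=$(LC_ALL=C awk 'BEGIN {
    s = "Serial0"
    gsub("[a-z]", "", s)   # strip lowercase letters only
    print s
}')
printf 's= ::%s:: should = ::S0::\n' "$result"
```

In the C locale this should print "s= ::S0:: should = ::S0::". If your
scripts call awk through $AWK, prefixing the call the same way
(LC_ALL=C $AWK ...) might be less intrusive than unsetting LANG
globally.
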
Regards
Dirk