From: Dirk Fassbender
Date: Mon, 20 Sep 2010 00:01:26 +0200
To: cygwin AT cygwin DOT com
Subject: Re: awk gsub problem
Message-ID: <4C968836.3000903@arcor.de>

On 19.09.2010 22:33, Lee wrote:
>
> Thank you - I appreciate the follow-up.
>
> Was the reply from the upstream maintainer answered on a mailing list?
> (& if so, which one?)  I'd like to understand the problem they're
> solving..  I get the idea of "[[:lower:]]" working regardless of
> collating order of the current char set, but how "[a-z]" gets
> translated to something like "[aAbBcCdD...zZ]" boggles my mind.  It
> seems like they had to have gone out of their way to translate [a-z]
> into a case-insensitive RE.
>
> But regardless, it still seems broken to me.  From the gawk man page:
>
>    The various command line options control how gawk interprets
>    characters in regular expressions.
>
>    --traditional
>        Traditional Unix awk regular expressions are matched.  The GNU
>        operators are not special, interval expressions are not available, and
>        neither are the POSIX character classes ([[:alnum:]] and so on).
>
> The way I read it, I can change the line in my .bashrc from
>    export AWK="/usr/bin/gawk.exe"
> to
>    export AWK="/usr/bin/gawk.exe --traditional"
> and not have to change any scripts that use $AWK.  If "--traditional"
> meant one no longer was able to do a case-sensitive RE ("[a-z]" gets
> translated into "[aAbB...zZ]" and "[[:lower:]]" isn't interpreted as a
> lower case character RE) I'd expect that to be high-lighted in the man
> page.  But like I said in my initial msg, --traditional doesn't fix
> the problem:
>
> $ cat test.awk
> awk --traditional '
> BEGIN {
>         s="Serial0"
>         gsub("[a-z]","",s)
>         printf("s= ::%s:: should = ::S0::\n", s)
>         exit
> } '
>
> $ export LANG=en_US.UTF-8
>
> $ sh test.awk
> s= ::0:: should = ::S0::
>
>
>> What you really want is this:
> s/really want/have to do/
>
>>   BEGIN {
>>           s="Serial0"
>>           gsub("[[:lower:]]","",s)
>>           printf("s= ::%s:: should = ::S0::\n", s)
>>           exit
>>   }
>>
>> The "[[:lower:]]" expression always catches all valid lowercase letters,
>> independent of the language, territory, and charset used.
> At least for the short term, my work-around is not setting LANG.
>
> Thanks again,
> Lee
>
> --
> Problem reports:       http://cygwin.com/problems.html
> FAQ:                   http://cygwin.com/faq/
> Documentation:         http://cygwin.com/docs.html
> Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
>

Hello Lee,

you have hit a well-known problem with different character sets.
Normally it goes unnoticed, because the standard character sets on
Unix, Linux and Windows systems have the characters
"abcdefghijklmnopqrstuvwxyz" in one contiguous sequence. But this is
not the case for all character sets; EBCDIC is one example. Differing
character sets are a great problem when porting programs from one
system to another.

The gawk documentation in the man page is not complete. Many GNU
programs have their full (or better) documentation in the info pages.
The documentation for your problem is accessible with the following
command:

    info gawk character list

The relevant text is the first paragraph of that info page:

    2.4 Using Character Lists
    =========================

    Within a character list, a "range expression" consists of two
    characters separated by a hyphen.  It matches any single character
    that sorts between the two characters, using the locale's collating
    sequence and character set.  For example, in the default C locale,
    `[a-dx-z]' is equivalent to `[abcdxyz]'.  Many locales sort
    characters in dictionary order, and in these locales, `[a-dx-z]' is
    typically not equivalent to `[abcdxyz]'; instead it might be
    equivalent to `[aBbCcDdxXyYz]', for example.  To obtain the
    traditional interpretation of bracket expressions, you can use the C
    locale by setting the `LC_ALL' environment variable to the value `C'.

Regards
Dirk
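
The LC_ALL suggestion from the info page excerpt above can be tried with a minimal sketch like the following. It assumes a POSIX shell and any awk on the PATH; the function name strip_lower_c is hypothetical and does not come from the thread. LC_ALL=C is set for the single awk call only, so LANG stays as it is for everything else.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): delete the characters matched
# by the range expression [a-z] from the first argument.
# LC_ALL=C applies to this one awk invocation only, so the range is
# interpreted in the C locale instead of the session locale's
# collating sequence.
strip_lower_c() {
    LC_ALL=C awk -v s="$1" 'BEGIN { gsub("[a-z]", "", s); print s }'
}

strip_lower_c "Serial0"    # prints S0
```

Setting the variable per command rather than exporting it should leave other programs, and other scripts that use $AWK, running under the locale chosen by LANG.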