In-Reply-To: | <4BF55F87.4060407@towo.net> |
References: | <20100520123926 DOT GA1432 AT onderneming10 DOT xs4all DOT nl> <AANLkTilpbuyiJIswTZGQN5jsHsK793ITUP9pcB95Hf1l AT mail DOT gmail DOT com> <4BF55F87 DOT 4060407 AT towo DOT net> |
Date: | Thu, 20 May 2010 20:46:05 +0300 |
Message-ID: | <AANLkTilkEU-LI1jINJ2j4CmwJJjILml2m_zYdyMGhdUV@mail.gmail.com> |
Subject: | Re: sed doesn't like LANG= anymore |
From: | Andy Koppe <andy DOT koppe AT gmail DOT com> |
To: | "cygwin AT cygwin DOT com" <cygwin AT cygwin DOT com> |
On Thursday, May 20, 2010, Thomas Wolff:
> With LANG=anything-unknown, the charmap is set to ASCII, so it works
> (as there is at least no multibyte character then).

Anything above 0x7F is invalid with charset ASCII though (since 1.7.2). But perhaps sed skips the multibyte conversion functions when in the C locale.

> Considering the described effect, I doubt that a UTF-8 decoder should
> swallow an ASCII byte after an incomplete UTF-8 sequence; it should
> rather stop at the last UTF-8 sequence byte, and consider any
> subsequent initial UTF-8 or ASCII byte as a new character.

0xE5 is a valid initial byte of a UTF-8 sequence, hence mbtowc returns -2 ("incomplete") after that and -1 ("invalid") on encountering the following ASCII byte. I think it would be wrong to ignore the encoding error, and it's up to the application to back up and feed in the same byte again if it wants to.

Andy
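
To illustrate the behaviour discussed above, here is a minimal sketch in C. It is not Cygwin's or newlib's code: `utf8_feed`, `decode_lenient`, `utf8_state` and the `DEC_*` constants are hypothetical names made up for this example, and the decoder is simplified (it does not reject overlong forms or surrogates). It only shows a byte-at-a-time decoder that reports "incomplete" after 0xE5 and "invalid" on the following ASCII byte, plus an application-side loop that backs up and feeds that same byte again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes, mirroring the -2 / -1 values from the message. */
#define DEC_INCOMPLETE (-2)
#define DEC_INVALID    (-1)

typedef struct {
    unsigned long cp;  /* code point bits accumulated so far */
    int remaining;     /* continuation bytes still expected */
} utf8_state;

/* Feed one byte to the decoder.
   Returns 1 when a character is complete (stored in *out),
   DEC_INCOMPLETE when more bytes are needed,
   DEC_INVALID on an encoding error. The state is reset on error,
   so the caller is free to feed the offending byte again. */
static int utf8_feed(utf8_state *st, unsigned char b, unsigned long *out)
{
    assert(st != NULL && out != NULL);
    if (st->remaining > 0) {
        if ((b & 0xC0) != 0x80) {      /* not a continuation byte */
            st->remaining = 0;
            return DEC_INVALID;
        }
        st->cp = (st->cp << 6) | (b & 0x3F);
        if (--st->remaining > 0)
            return DEC_INCOMPLETE;
        *out = st->cp;
        return 1;
    }
    if (b < 0x80) { *out = b; return 1; }
    if ((b & 0xE0) == 0xC0) { st->cp = b & 0x1F; st->remaining = 1; return DEC_INCOMPLETE; }
    if ((b & 0xF0) == 0xE0) { st->cp = b & 0x0F; st->remaining = 2; return DEC_INCOMPLETE; }
    if ((b & 0xF8) == 0xF0) { st->cp = b & 0x07; st->remaining = 3; return DEC_INCOMPLETE; }
    return DEC_INVALID;                /* stray continuation or 0xF8..0xFF */
}

/* Application-side policy: emit U+FFFD for each broken sequence and,
   if the error interrupted a pending sequence, retry the same byte
   as the start of a new character. Returns the number of code points
   written to out (at most cap). */
static size_t decode_lenient(const unsigned char *buf, size_t len,
                             unsigned long *out, size_t cap)
{
    utf8_state st = {0, 0};
    size_t n = 0, i = 0;
    while (i < len && n < cap) {
        int pending = st.remaining > 0;
        unsigned long cp = 0;
        int r = utf8_feed(&st, buf[i], &cp);
        if (r == 1) {
            out[n++] = cp;
            i++;
        } else if (r == DEC_INCOMPLETE) {
            i++;
        } else {
            out[n++] = 0xFFFD;
            if (!pending)
                i++;                   /* stray byte: nothing to retry */
            /* otherwise leave i alone so the same byte is fed again */
        }
    }
    if (st.remaining > 0 && n < cap)
        out[n++] = 0xFFFD;             /* input ended mid-sequence */
    return n;
}
```

With this sketch, feeding the two bytes 0xE5 and 'a' to `utf8_feed` gives `DEC_INCOMPLETE` and then `DEC_INVALID`; a third call with 'a' again succeeds. `decode_lenient` applies that retry itself, so the ASCII byte is not swallowed, while the decoder still reports the encoding error rather than hiding it.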