From: Michael Shay via Cygwin
Reply-To: Michael Shay
To: cygwin AT cygwin DOT com
Subject: Re: Trouble with output character sets from Win32 applications running under mintty
Date: Mon, 3 Aug 2020 18:05:18 -0400
References: <1314865780 DOT 20200803204249 AT yandex DOT ru>
List-Id: General Cygwin discussions and problem reports

Michael

From: "Brian Inglis"
To: cygwin AT cygwin DOT com
Date: 08/03/2020 05:23 PM
Subject: Re: Trouble with output character sets from Win32 applications running under mintty
Sent by: "Cygwin"

On 2020-08-03 11:42, Andrey Repin wrote:
>> Doesn't help. I tried 65001 (UTF-8):
>
> Because you're confusing things.
> chcp has nothing to do with LANG or LC_*.
> Et vice versa.
>
> chcp sets console code page for native console applications. Only for those
> supporting it. Many do not.
> LANG sets output parameters for Cygwin applications (and other programs that
> look for it, but these are few).

You cut the significant statement at the top of the OP:

>> I'm having a problem with Cygwin 3.1.4, changing the character set on the
>> fly. It seems to work with Cygwin applications, but not with Win32
>> applications.

He has problems with invalid characters only running win32 console
applications: I changed the subject to hopefully better reflect the issue.

I am unsure where Cygwin 3.1.4 comes into Win32 applications - you have to use
the Windows codepage conversion routines.

You can only change input character sets on the fly; output character sets
will depend on mintty support of xterm-compatible character set support and
switching escape sequences; if you set up UCS16LE console output, Windows and
mintty should handle it.

Perhaps a better description of your environment, build tools, what you are
trying to do, what you expect as output, and what you are getting as output,
could help us better understand and help with the issue you see.

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada

This email may be disturbing to some readers as it contains
too much technical detail. Reader discretion is advised.
[Data in IEC units and prefixes, physical quantities in SI.]
--
Problem reports:      https://cygwin.com/problems.html
FAQ:                  https://cygwin.com/faq/
Documentation:        https://cygwin.com/docs.html
Unsubscribe info:     https://cygwin.com/ml/#unsubscribe-simple

The script I sent changes the locale information, i.e. LANG and LC_ALL are
set to en_US.CP1252:

    export LANG="en_US.CP1252"
    export LC_ALL=en_US.CP1252

Then it runs a simple Win32 program that takes a single input argument, ZÇ,
the second character being C-cedilla, an 8-bit character with hex value 0xc7.
The Win32 program transcodes the input Unicode argument, using the Cygwin
character set to determine the codepage, 1252. It then prints the transcoded
characters to stdout, and the result should be ZÇ, identical to the input
argument. This works fine using Cygwin 1.7.28.

Cygwin 3.1.4 is launching the Win32 application, and is responsible for
transcoding the arguments passed to it by mksh, in this case the CP1252
characters ZÇ, into Unicode. That means Cygwin has to use the mb-to-wc
function for transcoding codepage 1252 to Unicode. It does not. It uses the
UTF-8 to Unicode function (I've seen this using gdb). That function flags the
Ç as an invalid UTF-8 sequence, not surprisingly, since a lone 0xc7 byte is
not a valid UTF-8 sequence.

No matter what character set I use in 'export LANG...' and
'export LC_ALL...', Cygwin 3.1.4 always uses the utf8-to-wc transcoding
function in sys
1.7.28 uses the correct function.

I'm not using mintty, I'm using mksh, a requirement since our software uses
lots of shell scripts, and for legacy support, that means using a Korn shell.
I could understand it if 1.7.28 didn't do the proper transcoding, but it does.

I used:

    gdb mksh

to load mksh into the debugger, then started it with

    start -c 'cygtest.exe ZÇ'

That allowed me to step into child_info_spawn::worker() and stop at the call
to CreateProcess(), where the command line (cygtest.exe) and argument (ZÇ) are
translated into Unicode.

This is the code to which I'm referring, in strfuncs.cc, which is supposed to
translate the command line and arguments from CP 1252 into Unicode:

    size_t __reg3
    sys_mbstowcs (wchar_t * dst, size_t dlen, const char *src, size_t nms)
    {
      mbtowc_p f_mbtowc = __MBTOWC;
      if (f_mbtowc == __ascii_mbtowc)
        {
          /* <<<< THE CODE CHANGES THE '__ascii_mbtowc' TO '__utf8_mbtowc'
             EVERY TIME, REGARDLESS OF THE CODEPAGE. */
          f_mbtowc = __utf8_mbtowc;
        }
      return sys_cp_mbstowcs (f_mbtowc, dst, dlen, src, nms);
    }

So 'f_mbtowc' is set to __ascii_mbtowc, the default.

You said:

> You can only change input character sets on the fly;

The input character set to Cygwin should have been changed to CP 1252, as it
was in 1.7.28. At least, that's what I would expect to happen. If it does not,
or if mintty is required, then that's a regression from 1.7.28.

Mike Shay