Mail Archives: cygwin/2009/09/24/11:31:45

X-Recipient: archive-cygwin AT delorie DOT com
X-SWARE-Spam-Status: No, hits=-2.3 required=5.0 tests=AWL,BAYES_00
X-Spam-Check-By: sourceware.org
Date: Thu, 24 Sep 2009 17:31:23 +0200 (CEST)
Message-Id: <200909241531.n8OFVNj3010906@mail.bln1.bf.nsn-intra.net>
From: Thomas Wolff <towo AT towo DOT net>
To: cygwin AT cygwin DOT com
Subject: Re: Encoding of German 'umlauts' - please explain
References: <loom DOT 20090924T100848-137 AT post DOT gmane DOT org>
MIME-Version: 1.0
Mailing-List: contact cygwin-help AT cygwin DOT com; run by ezmlm
List-Id: <cygwin.cygwin.com>
List-Subscribe: <mailto:cygwin-subscribe AT cygwin DOT com>
List-Archive: <http://sourceware.org/ml/cygwin/>
List-Post: <mailto:cygwin AT cygwin DOT com>
List-Help: <mailto:cygwin-help AT cygwin DOT com>, <http://sourceware.org/ml/#faqs>
Sender: cygwin-owner AT cygwin DOT com
Mail-Followup-To: cygwin AT cygwin DOT com
Delivered-To: mailing list cygwin AT cygwin DOT com

Ronald Fischer wrote:
> Maybe someone could enlighten me about the following:
> ...
> That means, the German letter ü has encoding 0xFC. If I do the same on CMD shell
> (the 'od' used here comes from the Gnu Utilities for Windows), I see:
> ...
> That is, ü is encoded as 0x81. Why is this different?

> I am aware that, for historic reason, different encodings exist (the old
> DOS encoding, Windows ANSI encoding etc.).
So you answered your question yourself :)
> I wouldn't have expected those
> differences, however, when comparing bash.exe vs. cmd.exe.

The encoding is applied by the terminal, not the application. For bash,
the letter ü is only a sequence of one or two bytes. It is the terminal
that decides which bytes your keyboard sends to the application when you
enter ü, and what to display when your program outputs those bytes. (That
is the traditional picture, at least; in the age of locales things may
sometimes get more complicated :( ).

Having said this, I also need to adjust the following response:

Matthias Andree wrote:
> Because the code pages differ. 0xFC is ISO-8859-1 ("Latin 1") or -15 ("Latin 9")
> or CP1252/Windows-1252 (Latin 1 Extended; the latter allocates 0x80...0x9f
> differently than ISO-8859-1) and CMD uses CP437 or CP850.

This is not really correct; like bash, CMD does not use a codepage itself.
If you start CMD from Windows, it will implicitly be embedded in a Windows 
console which uses CP437 (American), CP850 (Western European) or some other 
default of your system configuration.
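
To make the byte values concrete, here is a small Python sketch (not part of
the original exchange) that encodes ü under the codepages named in this
thread. The codec names are those of Python's standard codecs; UTF-8 is
included only to illustrate the "two bytes" case and is an addition of mine.

```python
# Sketch: the byte sequence for the same character depends on the codepage.
# "\u00fc" is the letter u-umlaut, written as an escape so that the encoding
# of this source file does not matter.
CODEPAGES = ["latin-1", "iso8859-15", "cp1252", "cp437", "cp850", "utf-8"]

def encode_all(ch):
    """Return a dict mapping each codepage name to the bytes it uses for ch."""
    return {name: ch.encode(name) for name in CODEPAGES}

table = encode_all("\u00fc")
for name, raw in table.items():
    # Print the bytes as two-digit hex values, similar to what od shows.
    print(name.ljust(11), " ".join(f"{b:02x}" for b in raw))
```

The ISO-8859 and Windows-1252 codecs yield the single byte fc, the two DOS
codepages yield 81, and UTF-8 yields the two bytes c3 bc.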

However, you could also run CMD from a cygwin bash. In this case, maximising 
the confusion, there are two different situations:
* Run mintty, start CMD from bash there: CMD will see the same codepage as 
  bash since it is the one configured for mintty. So echo ü would produce 
  0xFC even in CMD (assuming mintty runs one of the codepages which map 
  ü to 0xFC).
* Run cygwin console, observe this: The cygwin console is a hybrid. Unlike
  all other terminals, its encoding is emulated by the cygwin DLL within a
  Windows console, so the effective "codepage" varies with the application:
  A cygwin application will use the encoding configured for the cygwin session,
  while any non-cygwin application will use the native Windows console codepage.
  So you may echo ü from bash, then start CMD from there, echo ü again, and
  get different codes for the same key!
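
The second situation can be imitated in a few lines of Python. This is only
a rough model, not how the cygwin DLL actually works: the two encoding names
are assumptions chosen for illustration (a Latin-1 cygwin session and a
CP850 Windows console), and echo() is a hypothetical helper.

```python
# Rough model of the cygwin console case: the same key yields different
# bytes depending on which encoding applies to the receiving application.
CYGWIN_SESSION = "latin-1"   # assumed encoding of the cygwin session
WINDOWS_CONSOLE = "cp850"    # assumed native console codepage

def echo(ch, encoding):
    """Hypothetical helper: bytes an application receives for the key ch."""
    return ch.encode(encoding)

from_bash = echo("\u00fc", CYGWIN_SESSION)   # cygwin application
from_cmd = echo("\u00fc", WINDOWS_CONSOLE)   # non-cygwin application

# What the byte produced under the cygwin encoding turns into if it is
# interpreted with the Windows console codepage instead.
shown = from_bash.decode(WINDOWS_CONSOLE)

print(from_bash.hex(), from_cmd.hex(), repr(shown))
```

Under these assumptions bash sees fc and CMD sees 81 for the same key, and
a byte written under one encoding but interpreted under the other no longer
shows up as ü.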

Kind regards,
Thomas

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple
