Message-ID: <4B8A6069.4030008@towo.net>
Date: Sun, 28 Feb 2010 13:24:09 +0100
From: Thomas Wolff
To: cygwin AT cygwin DOT com
Subject: Re: Non-canonical mode input via tcsetattr(), under mintty console
In-Reply-To: <513288.14252.qm@web19014.mail.hk2.yahoo.com>

Dave Lee wrote:
> Hi all,
>
> I was testing a program that uses non-canonical mode input via
> tcsetattr().
>
> ...
> Specifically, I entered the chinese character "例" (which means "rule"
> or "example"). It occupies 3 bytes in UTF-8 representation: E4, BE, 8B.
>
> On standard console, the read() call returned THREE bytes (n == 3), and
> (not surprisingly) E4, BE and 8B were returned to buf[].
>
> On mintty console, the read() call returned ONE byte (n == 1), and only
> E4 were returned to buf[]. I could grab the other two bytes if I did
> additional calls to read().
>

This is absolutely in line with the specified interface of read(),
whether or not you apply any tcsetattr() settings, and whether or not
the Cygwin console and mintty behave differently. read() is a
traditional byte-oriented function with no knowledge or handling of
character encoding, and there is no guarantee that a multi-byte
character arrives in one piece.
(Even if mintty were changed to try to feed them in one piece, there
would still be no guarantee that you receive them in one piece.)

You have four options (two each, depending on whether you want UTF-8 or
Unicode words in your program):

* Read bytes and decode UTF-8 yourself. Basically simple as long as you
  are careful to avoid errors.
* Read bytes and transform them with one of the mbtowc (multi-byte to
  wide-character) functions (provided you want characters as Unicode
  words, not UTF-8 sequences, in your program). The interface of those
  functions is a little bit tricky, though.
* Use wide-character input functions (e.g. from the ncursesw library)
  (provided... see above). They may not be completely flexible with
  respect to specific interaction requirements (tcsetattr settings...),
  though; I'm not sure.
* Use wide-character input functions and transform back to UTF-8 with
  the wctomb functions, if you need to.

Thomas
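
As an illustration of the first option in the list above (read bytes and decode UTF-8 yourself), here is a minimal sketch in C. It is not taken from the thread: the names utf8_decoder, utf8_feed and utf8_read_char are hypothetical, and the only system call assumed is the POSIX read(). The decoder keeps its state between bytes, so it does not matter whether the three bytes of "例" arrive in one read() or in three.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical incremental decoder state; not part of any library. */
typedef struct {
    uint32_t cp;   /* code point assembled so far */
    uint32_t min;  /* smallest code point allowed for this sequence length */
    int need;      /* continuation bytes still expected */
} utf8_decoder;

enum { UTF8_ERROR = -1, UTF8_MORE = 0, UTF8_DONE = 1 };

/* Feed one byte. Returns UTF8_DONE and stores the code point in *out
 * when a sequence completes, UTF8_MORE when more bytes are needed, and
 * UTF8_ERROR on malformed input (the offending byte is dropped and the
 * decoder is ready for a new sequence). */
static int utf8_feed(utf8_decoder *d, unsigned char b, uint32_t *out)
{
    assert(d != NULL && out != NULL);
    if (d->need == 0) {
        if (b < 0x80) { *out = b; return UTF8_DONE; }
        if ((b & 0xE0) == 0xC0) { d->cp = b & 0x1F; d->need = 1; d->min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { d->cp = b & 0x0F; d->need = 2; d->min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { d->cp = b & 0x07; d->need = 3; d->min = 0x10000; }
        else return UTF8_ERROR;          /* stray continuation or invalid lead byte */
        return UTF8_MORE;
    }
    if ((b & 0xC0) != 0x80) {            /* expected a 10xxxxxx byte */
        d->need = 0;
        return UTF8_ERROR;
    }
    d->cp = (d->cp << 6) | (b & 0x3F);
    if (--d->need > 0)
        return UTF8_MORE;
    /* Reject overlong forms, surrogates and values beyond U+10FFFF. */
    if (d->cp < d->min || d->cp > 0x10FFFF || (d->cp >= 0xD800 && d->cp <= 0xDFFF))
        return UTF8_ERROR;
    *out = d->cp;
    return UTF8_DONE;
}

/* Read one code point from fd, one byte per read() call.
 * Returns 1 on success, 0 on end of input (also if input ends in the
 * middle of a sequence), -1 on a read error or malformed UTF-8.
 * Simplification: an interrupted read() (EINTR) is reported as an error. */
static int utf8_read_char(int fd, uint32_t *out)
{
    utf8_decoder d = {0, 0, 0};
    unsigned char b;
    for (;;) {
        ssize_t n = read(fd, &b, 1);
        if (n == 0) return 0;
        if (n < 0) return -1;
        int r = utf8_feed(&d, b, out);
        if (r != UTF8_MORE)
            return r == UTF8_DONE ? 1 : -1;
    }
}
```

Feeding E4, BE, 8B to utf8_feed() yields UTF8_MORE twice and then UTF8_DONE with the code point U+4F8B. Reading one byte per call keeps the sketch short; a real program would more likely read into a larger buffer and pass each received byte to utf8_feed() in turn.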