

The recode reference manual


5.1 Universal Character Set, 2 bytes

One surface of UCS is usable for the subset defined by its first sixty thousand or so characters (in fact, 31 * 2^11 = 63,488 codes), and uses exactly two bytes per character. It is a mere dump of the internal memory representation that is natural for this subset and, as such, carries endianness problems with it.
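The arithmetic and the endianness issue can be checked with a short Python sketch. It is only an illustration and not part of recode; the sample code value 0x00E9 is an arbitrary choice made here.

```python
import struct

# Size of the subset quoted above: 31 * 2^11 codes.
subset_size = 31 * 2**11

# The same 16-bit code, dumped in the two possible byte orders.
code = 0x00E9  # arbitrary sample value (LATIN SMALL LETTER E WITH ACUTE)
big_endian = struct.pack(">H", code)     # high-order byte first
little_endian = struct.pack("<H", code)  # low-order byte first

print(subset_size)
print(big_endian.hex(), little_endian.hex())
```

A reader that does not know which of the two orders was used cannot tell 0x00E9 from 0xE900, which is the problem the byte order mark addresses.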

A non-empty UCS-2 file normally begins with a so-called byte order mark, having value 0xFEFF. The value 0xFFFE, which is what a byte order mark looks like when its two bytes are swapped, is not a UCS character, so if this value is seen at the beginning of a file, recode reacts by swapping all pairs of bytes. The library also reacts properly to occurrences of 0xFEFF or 0xFFFE elsewhere than at the beginning, because concatenation of UCS-2 files should stay a simple matter, but this might trigger a diagnostic about non-canonical input.
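The byte order mark logic can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not recode's actual implementation: the function name `read_ucs2` is hypothetical, the input is assumed to start in high-byte-first order, and byte order marks are assumed to be dropped from the result wherever they occur.

```python
BOM = 0xFEFF          # byte order mark read in the expected order
SWAPPED_BOM = 0xFFFE  # byte order mark read with its bytes swapped

def read_ucs2(data: bytes) -> list:
    """Return the list of 16-bit codes in data, honoring marks anywhere."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    swap = False  # assumption: high-order byte first until told otherwise
    codes = []
    for i in range(0, len(data), 2):
        hi, lo = data[i], data[i + 1]
        if swap:
            hi, lo = lo, hi
        value = (hi << 8) | lo
        if value == SWAPPED_BOM:
            # The following pairs are in the other byte order.
            swap = not swap
            continue
        if value == BOM:
            # Byte order confirmed; the mark itself is not data.
            continue
        codes.append(value)
    return codes
```

Because the swap state is toggled whenever a swapped mark is met, two files written in opposite byte orders can simply be concatenated: `read_ucs2(b"\xfe\xff\x00A" + b"\xff\xfeB\x00")` yields the codes for `A` and `B`.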

By default, when producing a UCS-2 file, recode always outputs the high order byte before the low order byte. But this can easily be overridden through the 21-Permutation surface (see section 13.1 Permuting groups of bytes). For example, the command:

 
recode u8..u2/21 < input > output

asks for a UTF-8 to UCS-2 conversion, with swapped byte output.
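For readers without recode at hand, the effect on the bytes can be approximated in Python. This is a rough sketch, not what recode does internally: the function name `utf8_to_ucs2` and its `swapped` parameter are hypothetical, and it assumes that a byte order mark is written in front of any non-empty output.

```python
def utf8_to_ucs2(data: bytes, swapped: bool = False) -> bytes:
    """Convert UTF-8 bytes to UCS-2 bytes, optionally swapping each pair."""
    if not data:
        return b""  # assumption: an empty input yields an empty output
    text = "\ufeff" + data.decode("utf-8")  # prepend the byte order mark
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the two-byte subset")
    # Within the two-byte subset, UTF-16 and UCS-2 produce the same bytes.
    return text.encode("utf-16-le" if swapped else "utf-16-be")

print(utf8_to_ucs2(b"A").hex())                # default: high-order byte first
print(utf8_to_ucs2(b"A", swapped=True).hex())  # comparable to the /21 surface
```

In the swapped output the mark appears as the bytes FF FE, which is how a reader can recognize the byte order.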

Use UCS-2 as a genuine charset. This charset is available in recode under the name ISO-10646-UCS-2. Accepted aliases are UCS-2, BMP, rune and u2.

The recode library is able to combine some sequences of UCS-2 codes into single code characters, to represent a few diacriticized characters, ligatures or diphthongs which have been included to ease mapping with other existing charsets. It is also able to explode such single code characters into the corresponding sequence of codes. The request syntax for triggering such operations is rudimentary and temporary. The combined-UCS-2 pseudo character set is a special form of UCS-2 in which known combining sequences have been replaced by the simpler single code. Using combined-UCS-2 instead of UCS-2 in an after position of a request forces a combining step, while using combined-UCS-2 instead of UCS-2 in a before position of a request forces an exploding step. For the time being, one has to resort to advanced request syntax to achieve other effects. For example:

 
recode u8..co,u2..u8 < input > output

copies a UTF-8 input to output, still in UTF-8, while merging combining sequences into single codes whenever possible.
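The general idea of combining and exploding can be illustrated in Python with Unicode normalization. This is only an analogy: NFC and NFD follow the Unicode standard's composition rules, which are an assumption made here for illustration and are not the same thing as recode's own table of known combinings. The function names `combine` and `explode` are hypothetical.

```python
import unicodedata

def combine(data: bytes) -> bytes:
    """UTF-8 in, UTF-8 out, merging combining sequences where NFC allows."""
    text = data.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")

def explode(data: bytes) -> bytes:
    """UTF-8 in, UTF-8 out, splitting single codes into sequences (NFD)."""
    text = data.decode("utf-8")
    return unicodedata.normalize("NFD", text).encode("utf-8")

# "e" followed by COMBINING ACUTE ACCENT (U+0301), as UTF-8 bytes.
sequence = b"e\xcc\x81"
single = combine(sequence)  # one code, U+00E9, as UTF-8 bytes
print(single.hex(), explode(single).hex())
```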



  Copyright © 2003   by The Free Software Foundation     Updated Jun 2003  
