recode reference manual
One surface of UCS is usable for the subset defined by its first
sixty thousand characters (in fact, 31 * 2^11 codes), and uses
exactly two bytes per character. It is a mere dump of the internal
memory representation which is natural for this subset and, as such,
carries endianness problems with it.
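
As a quick check of the figures above, the following sketch (Python, used here only for illustration and not part of recode) computes 31 * 2^11 and shows the two possible byte layouts of one two-byte code. Reading the count as 2^16 minus the 2048 surrogate codes is an assumption; the manual does not spell out the derivation. The helper name two_byte_dump is hypothetical.

```python
import struct

# 31 * 2^11 codes, as stated in the text.
SUBSET_SIZE = 31 * 2**11

# Assumption (not stated in the manual): the figure equals the 2^16
# two-byte values minus the 2048 surrogate codes 0xD800..0xDFFF.
SURROGATES = 0xDFFF - 0xD800 + 1


def two_byte_dump(code, big_endian=True):
    """Return the raw two-byte memory image of one 16-bit code."""
    return struct.pack(">H" if big_endian else "<H", code)


print(SUBSET_SIZE)                           # 63488
print(two_byte_dump(0x0041).hex())           # 0041 (high order byte first)
print(two_byte_dump(0x0041, False).hex())    # 4100 (low order byte first)
```

The same code therefore has two different on-disk images, which is the endianness problem mentioned above.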
A non-empty UCS-2 file normally begins with a so-called byte
order mark, having value 0xFEFF. The value 0xFFFE is not a
UCS character, so if this value is seen at the beginning of a file,
recode reacts by swapping all pairs of bytes. The library also
properly reacts to other occurrences of 0xFEFF or 0xFFFE
elsewhere than at the beginning, because concatenation of UCS-2
files should stay a simple matter, but it might trigger a diagnostic
about non-canonical input.
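
The reaction to a leading byte order mark can be sketched as follows. This is a minimal Python illustration of the behaviour described above, not recode's actual implementation; the names swap_pairs and normalize_ucs2 are hypothetical, and the sketch only looks at the start of the data, not at marks occurring later in the stream.

```python
BOM = b"\xfe\xff"          # 0xFEFF stored high order byte first
SWAPPED_BOM = b"\xff\xfe"  # 0xFFFE, i.e. the mark seen with bytes swapped


def swap_pairs(data):
    """Swap the two bytes of every 16-bit unit."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    out = bytearray(data)
    out[0::2], out[1::2] = data[1::2], data[0::2]
    return bytes(out)


def normalize_ucs2(data):
    """Return high-byte-first UCS-2 with the leading mark removed."""
    if data[:2] == SWAPPED_BOM:
        # 0xFFFE is not a UCS character, so the input must be swapped.
        data = swap_pairs(data)
    if data[:2] == BOM:
        data = data[2:]
    return data


print(normalize_ucs2(b"\xff\xfeA\x00").hex())  # 0041
print(normalize_ucs2(b"\xfe\xff\x00A").hex())  # 0041
```

Both inputs carry the letter A and end up as the same two bytes once the mark has been interpreted.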
By default, when producing a UCS-2 file, recode always
outputs the high order byte before the low order byte. But this can be
easily overridden through the 21-Permutation surface
(see section 13.1 Permuting groups of bytes). For example, the command:
recode u8..u2/21 < input > output
asks for a UTF-8 to UCS-2 conversion, with swapped byte
output.
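
To make the effect of the 21 permutation concrete, here is a rough Python emulation of that request. It is a sketch under stated assumptions, not a description of what recode does internally: the function name utf8_to_swapped_ucs2 is hypothetical, and writing a leading 0xFEFF mark before the text is an assumption based on the statement that UCS-2 files normally begin with one.

```python
def utf8_to_swapped_ucs2(utf8_bytes):
    """Convert UTF-8 to two-byte codes, then swap each byte pair."""
    text = utf8_bytes.decode("utf-8")
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the two-byte subset")
    # Assumption: a leading 0xFEFF mark is written before the text.
    big_endian = ("\ufeff" + text).encode("utf-16-be")
    # The 21 permutation: byte 2 of each pair comes out first.
    swapped = bytearray(len(big_endian))
    swapped[0::2] = big_endian[1::2]
    swapped[1::2] = big_endian[0::2]
    return bytes(swapped)


print(utf8_to_swapped_ucs2(b"A").hex())  # fffe4100
```

Note that the mark itself is swapped along with the text, so a reader that applies the byte order mark rule can still recover the original order.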
Use UCS-2 as a genuine charset. This charset is available in
recode under the name ISO-10646-UCS-2. Accepted aliases
are UCS-2, BMP, rune and u2.
The recode library is able to combine some UCS-2 sequences
of codes into single code characters, to represent a few diacriticized
characters, ligatures or diphthongs which have been included to ease
mapping with other existing charsets. It is also able to explode
such single code characters into the corresponding sequence of codes.
The request syntax for triggering such operations is rudimentary and
temporary. The combined-UCS-2 pseudo character set is a special
form of UCS-2 in which known combining sequences have been replaced by the
equivalent single code. Using combined-UCS-2 instead of UCS-2 in an
after position of a request forces a combining step, while using
combined-UCS-2 instead of UCS-2 in a before position
of a request forces an exploding step. For the time being, one has to
resort to advanced request syntax to achieve other effects. For example:
recode u8..co,u2..u8 < input > output
copies a UTF-8 input to output, still in
UTF-8, yet merging combining characters into single codes whenever
possible.
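
The idea of combining and exploding can be illustrated with Unicode normalization in Python. This is a stand-in technique, not recode's own tables: the NFC and NFD forms used below follow the Unicode standard, and the set of sequences recode knows about may differ. The function names combine and explode are hypothetical.

```python
import unicodedata


def combine(text):
    """Merge base character plus combining mark into one code (NFC)."""
    return unicodedata.normalize("NFC", text)


def explode(text):
    """Split a precomposed character into its sequence of codes (NFD)."""
    return unicodedata.normalize("NFD", text)


decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE

print(combine(decomposed) == composed)          # True
print(explode(composed) == decomposed)          # True
print(len(decomposed.encode("utf-8")),
      len(combine(decomposed).encode("utf-8")))  # 3 2
```

The UTF-8 text stays UTF-8 in both directions; only the number of codes used for the accented letter changes.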
Copyright © 2003 by The Free Software Foundation. Updated Jun 2003.