recode reference manual
Standard ISO 10646 defines a universal character set, intended to encompass in the long run all languages written on this planet. It is based on wide characters, and offers room for about two billion characters (2^31).
This charset was to become available in recode under the name
UCS, with many external surfaces for it. But in the current
version, only surfaces of UCS are offered, each presented as a
genuine charset rather than a surface. Such surfaces are only meaningful
for the UCS charset, so it is not that useful to draw a line
between the surfaces and the only charset to which they may apply.
UCS stands for Universal Character Set. UCS-2 and
UCS-4 are fixed-length encodings, using two or four bytes per
character respectively. UTF stands for UCS Transformation
Format; the UTF encodings are variable-length encodings dedicated to UCS.
UTF-1 was based on ISO 2022 and did not succeed(9). UTF-2
replaced it; it has been called UTF-FSS (File System Safe) in
Unicode or Plan9 contexts, but is better known today as UTF-8.
To complete the picture, there is UTF-16, based on 16-bit units,
and UTF-7, which is meant for transmissions limited to 7-bit bytes.
Most often, one might see UTF-8 used for external storage, and
UCS-2 used for internal storage.
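The difference between the fixed-length and variable-length encodings can be seen by counting bytes per character. The following is a minimal sketch in Python, not part of recode. The codec names are Python's and are used here only as stand-ins: Python has no UCS-2 codec, so `utf-16-be` is assumed in its place (it produces the same bytes as UCS-2 for characters up to U+FFFF), and `utf-32-be` is assumed for UCS-4. The helper name `encoded_lengths` is hypothetical.

```python
# Byte counts per character under several UCS encodings.
# Assumption: Python codec names stand in for the names used in this manual.
# Python has no "ucs-2" codec; "utf-16-be" matches UCS-2 for U+0000..U+FFFF.
CODECS = {
    "UTF-8": "utf-8",
    "UTF-16": "utf-16-be",
    "UCS-4": "utf-32-be",
    "UTF-7": "utf-7",
}


def encoded_lengths(ch):
    """Return {encoding name: number of bytes} for a single character."""
    return {name: len(ch.encode(codec)) for name, codec in CODECS.items()}


if __name__ == "__main__":
    # ASCII letter, Latin-1 letter, euro sign, and a character beyond U+FFFF.
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

UCS-4 always uses four bytes, while UTF-8 grows from one byte for ASCII to four bytes for characters beyond U+FFFF. Those characters also need four bytes in UTF-16, which is why they have no UCS-2 form.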
When recode is producing any representation of UCS,
it uses the replacement character U+FFFD for any valid
character which is not representable in the goal charset(10).
This happens, for example, when UCS-2 cannot represent a
wide UCS-4 character or, for a similar reason, a UTF-8
sequence using more than three bytes. The replacement character is
meant to represent an existing character. So, it is never produced to
represent an invalid sequence or ill-formed character in the input text.
In such cases, recode just gets rid of the noise, while taking note
of the error in its usual ways.
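The policy described above can be sketched as follows. This is an illustration in Python under stated assumptions, not recode's actual implementation: the function name `to_ucs2` is hypothetical, input is taken as a list of integer code values, and "invalid" is assumed to mean a value outside 0..2^31-1 or inside the surrogate range U+D800..U+DFFF.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, the replacement character


def to_ucs2(code_values):
    """Map UCS-4 code values to UCS-2 code values.

    Returns (output values, number of errors noted).
    Assumption: 'invalid' means out of the 31-bit range or a surrogate.
    """
    out = []
    errors = 0
    for cp in code_values:
        if cp < 0 or cp > 0x7FFFFFFF or 0xD800 <= cp <= 0xDFFF:
            # Invalid input: dropped from the output and counted as an error.
            errors += 1
        elif cp > 0xFFFF:
            # Valid character with no UCS-2 form: replaced by U+FFFD.
            out.append(REPLACEMENT)
        else:
            out.append(cp)
    return out, errors


if __name__ == "__main__":
    values, errors = to_ucs2([0x41, 0x1F600, 0xD800, 0x20AC])
    print([f"U+{v:04X}" for v in values], "errors:", errors)
```

The two cases are kept separate on purpose: U+FFFD always stands for a real character that could not be carried over, whereas invalid input leaves no trace in the output and only shows up in the error count.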
Even if UTF-8 is an encoding, really, it is the encoding of a single
character set, and nothing else. It is useful to distinguish between an
encoding (a surface within recode) and a charset, but only
when the surface may be applied to several charsets. Specifying a charset
is a bit simpler than specifying a surface in a recode request.
There would be no practical advantage in imposing a more complex syntax
on recode users, when it is simple to treat UTF-8 as
a charset. Similar considerations apply for UCS-2, UCS-4,
UTF-16 and UTF-7. These are all considered to be charsets.
Copyright © 2003 by The Free Software Foundation. Updated Jun 2003.