recode reference manual
Although UTF-8 did not originally come from the IETF, there is now
RFC 2279 to describe it. In letters sent on 1995-01-21 and 1995-04-20,
Markus Kuhn writes:
UTF-8 is an ASCII compatible multi-byte encoding of the ISO 10646 universal character set (UCS). UCS is a 31-bit superset of all other character set standards. The first 256 characters of UCS are identical to those of ISO 8859-1 (Latin-1). The UCS-2 encoding of UCS is a sequence of bigendian 16-bit words, the UCS-4 encoding is a sequence of bigendian 32-bit words. The UCS-2 subset of ISO 10646 is also known as "Unicode". As both UCS-2 and UCS-4 require heavy modifications to traditional ASCII oriented system designs (e.g. Unix), the UTF-8 encoding has been designed for these applications.
In UTF-8, only ASCII characters are encoded using bytes below 128. All other non-ASCII characters are encoded as multi-byte sequences consisting only of bytes in the range 128-253. This avoids critical bytes like NUL and / in UTF-8 strings, which makes the UTF-8 encoding suitable for being handled by the standard C string library and being used in Unix file names. Other properties include the preserved lexical sorting order and that UTF-8 allows easy self-synchronisation of software receiving UTF-8 strings.
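The byte-range claim in the letter can be checked with a short Python sketch. The helper name `only_high_bytes` and the sample path are invented for this illustration. Python's built-in codec implements the later, narrower form of UTF-8 (at most four bytes per character), which stays inside the ranges quoted here.

```python
def only_high_bytes(text: str) -> bool:
    """True if every byte of every non-ASCII character lies in 128..253."""
    return all(
        128 <= byte <= 253
        for ch in text
        if ord(ch) >= 128
        for byte in ch.encode("utf-8")
    )


# Hypothetical Unix path containing non-ASCII characters.
path = "/home/zoë/données/été.txt"
raw = path.encode("utf-8")

# Because "/" (byte 47) can only ever stand for itself, splitting the raw
# bytes on b"/" gives the same components as splitting the decoded text.
components = [part.decode("utf-8") for part in raw.split(b"/")]
```

Since no multi-byte sequence contains a byte below 128, byte-oriented code that looks for NUL or `/` needs no knowledge of the encoding.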
UTF-8 is the most common external surface of UCS. Each
character uses from one to six bytes, and the encoding is able to
represent all 2^31 characters of the UCS. It is implemented as a
charset, with the following properties:
ASCII is completely invariant under UTF-8,
and those are the only one-byte characters. UCS values and
ASCII values coincide. No multi-byte character ever contains a byte
less than 128. NUL is NUL. A multi-byte character
always starts with a byte of 192 or more, and is always followed by
one or more bytes in the range 128 to 191. That means that you may read at
random on disk or in memory, and easily discover the start of the current,
next or previous character. You can count, skip or extract characters
with this knowledge alone.
UTF-8, or to safely state that it is not.
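The resynchronisation and counting properties above can be sketched in a few lines of Python. The function names are hypothetical and only illustrate the byte-range rule (bytes 128 to 191 never start a character); they are not part of recode.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes in 128..191, which never start a character."""
    return 128 <= byte <= 191


def char_start(data: bytes, pos: int) -> int:
    """Back up from an arbitrary offset to the first byte of the current character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


def count_chars(data: bytes) -> int:
    """Count characters by counting every byte that is not a continuation byte."""
    return sum(1 for byte in data if not is_continuation(byte))


# 7 characters stored in 10 bytes: "ï" takes 2 bytes and "€" takes 3.
sample = "naïve €".encode("utf-8")
```

Here `char_start(sample, 9)` lands in the middle of the euro sign and backs up to offset 7, its first byte. Neither function needs to decode anything. Both assume the input is well-formed UTF-8.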
These properties also have a few nice consequences:
UCS representation. Here, N is a number between
1 and 6. So, UTF-8 is most economical when mapping ASCII (1 byte),
followed by UCS-2 (1 to 3 bytes) and UCS-4 (1 to 6 bytes).
The lexical sorting order of UCS strings is preserved.
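To make the byte counts and the sorting claim concrete, here is a minimal Python encoder for the original one-to-six-byte layout described in RFC 2279. The name `utf8_encode` is an assumption of this sketch, not recode's implementation. Python's own codec stops at four bytes, so the longer forms are built by hand.

```python
def utf8_encode(value: int) -> bytes:
    """Encode one UCS value (0 .. 2**31 - 1) using the 1-to-6-byte UTF-8 layout."""
    if not 0 <= value < 2**31:
        raise ValueError("UCS value out of range")
    if value < 0x80:
        return bytes([value])  # ASCII maps to itself
    # Exclusive upper limits for sequences of 2, 3, 4, 5 and 6 bytes.
    limits = (0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    nbytes = next(n for n, limit in enumerate(limits, start=2) if value < limit)
    # Lead byte marker: nbytes one-bits followed by a zero bit.
    marker = (0xFF << (8 - nbytes)) & 0xFF
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (value & 0x3F))  # continuation byte, 128..191
        value >>= 6
    return bytes([marker | value] + tail[::-1])


# Boundary values for each sequence length, deliberately out of order.
values = [0x7FFFFFFF, 0x41, 0x20AC, 0x80, 0xFFFF, 0x10000, 0x4000000, 0x200000]
sorted_by_bytes = sorted(values, key=utf8_encode)
```

Sorting by encoded bytes gives the same order as sorting by numeric value, because longer sequences always begin with a larger lead byte.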
In some cases, when little processing is done on a lot of strings, one may
choose for efficiency reasons to handle UTF-8 strings directly even
though they are variable length, as it is easy to find the start of each
character. Character insertion or replacement might require moving the
remainder of the string in either direction. In most cases, it is faster
and easier to convert from UTF-8 to UCS-2 or UCS-4 prior to processing.
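The convert-then-process approach might look like the following Python sketch, which uses a list of integers as a stand-in for a fixed-width UCS-4 buffer. The helper names are hypothetical, and Python's codec only covers values up to U+10FFFF rather than the full 31-bit range.

```python
def to_code_points(data: bytes) -> list[int]:
    """Decode UTF-8 bytes into one integer per character (fixed-width form)."""
    return [ord(ch) for ch in data.decode("utf-8")]


def from_code_points(points: list[int]) -> bytes:
    """Re-encode a list of code points as UTF-8."""
    return "".join(map(chr, points)).encode("utf-8")


utf8_text = "Grüße".encode("utf-8")  # 5 characters stored in 7 bytes
points = to_code_points(utf8_text)

# Replacing "ü" (2 bytes in UTF-8) by "u" (1 byte) is a plain slot
# assignment here; in the UTF-8 buffer the tail would have to shift.
points[2] = ord("u")
points.insert(3, ord("e"))

result = from_code_points(points)
```

Working on `points` keeps every character at a predictable index, at the cost of one decode and one encode pass.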
This charset is available in recode under the name UTF-8.
Accepted aliases are UTF-2, UTF-FSS, FSS_UTF,
TF-8 and u8.
| Copyright © 2003 by The Free Software Foundation | Updated Jun 2003 |