recode reference manual
Character entities were introduced by SGML and made widely popular through HTML, the markup language in use for the World Wide Web (Web or WWW for short). To represent unusual characters, HTML texts use special sequences that begin with an ampersand & and end with a semicolon ;. The sequence may start with a number sign # followed by digits, forming a numeric character reference, or it may be an alphabetic identifier, forming a character entity reference.
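The two kinds of reference described above can be told apart mechanically. The following is a minimal Python sketch, not code taken from recode; the names REFERENCE and classify are hypothetical, and the pattern only covers the decimal numeric form mentioned in this paragraph.

```python
import re

# One pattern for both kinds of reference: "&#digits;" or "&name;".
# Assumption: only the decimal numeric form is recognised here.
REFERENCE = re.compile(
    r"&(?:#(?P<number>[0-9]+)|(?P<name>[A-Za-z][A-Za-z0-9]*));"
)


def classify(reference):
    """Return ('numeric', code) or ('entity', name) for one complete reference."""
    match = REFERENCE.fullmatch(reference)
    if match is None:
        raise ValueError(f"not a character reference: {reference!r}")
    if match.group("number") is not None:
        return ("numeric", int(match.group("number")))
    return ("entity", match.group("name"))
```

For instance, classify("&#233;") yields a numeric character reference with code 233, while classify("&eacute;") yields a character entity reference named eacute.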
The HTML standards have been revised into different HTML levels over time,
and the list of allowable character entities differs among them. The later XML,
meant to simplify many things, has an option (`standalone=yes') which
greatly restricts that list. The recode library is able to convert
character references between their mnemonic form and their numeric form,
depending on the targeted HTML standard level. It can also, of course, convert
between HTML and various other charsets.
Here is a list of those HTML variants which recode supports.
Some notes have been provided by François Yergeau <yergeau@alis.com>.
XML-standalone
This charset is available in recode under the name
XML-standalone, with h0 as an acceptable alias. It is
documented in section 4.1 of http://www.w3.org/TR/REC-xml.
It only knows `&amp;', `&gt;', `&lt;', `&quot;'
and `&apos;'.
HTML_1.1
This charset is available in recode under the name HTML_1.1,
with h1 as an acceptable alias. HTML 1.0 was never really documented.
HTML_2.0
This charset is available in recode under the name HTML_2.0,
and has RFC1866, 1866 and h2 for aliases. HTML 2.0
entities are listed in RFC 1866. Basically, there is an entity for
each alphabetical character in the right part of ISO 8859-1.
In addition, there are four entities for syntax-significant ASCII characters:
`&amp;', `&gt;', `&lt;' and `&quot;'.
HTML-i18n
This charset is available in recode under the name
HTML-i18n, and has RFC2070 and 2070 for
aliases. RFC 2070 added entities to cover the whole right
part of ISO 8859-1. The list is conveniently accessible at
http://www.alis.com:8085/ietf/html/html-latin1.sgml. In addition,
four i18n-related entities were added: `&zwnj;' (`&#8204;'),
`&zwj;' (`&#8205;'), `&lrm;' (`&#8206;') and `&rlm;'
(`&#8207;').
HTML_3.2
This charset is available in recode under the name
HTML_3.2, with h3 as an acceptable alias.
HTML 3.2 took up the full
Latin-1 list but not the i18n-related entities from RFC 2070.
HTML_4.0
This charset is available in recode under the name HTML_4.0,
and has h4 and h for aliases. Beware that the particular
alias h is not tied to HTML 4.0, but to the highest HTML
level supported by recode; so it might later represent HTML level
5 if this is ever created. HTML 4.0 has the whole Latin-1 list, a set of entities for
symbols, mathematical symbols, and Greek letters, and another set for
markup-significant and internationalization characters comprising the
4 ASCII entities, the 4 i18n-related entities from RFC 2070, plus some more.
See http://www.w3.org/TR/REC-html40/sgml/entities.html.
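Some of the claims in the list above can be checked against an independent copy of the HTML 4 entity table. The sketch below uses Python's standard html.entities module, which documents its name2codepoint and codepoint2name dictionaries as holding the HTML 4 definitions; it is not recode's own table, and the helper names are hypothetical.

```python
from html.entities import codepoint2name, name2codepoint

# The four i18n-related entities introduced by RFC 2070, with the
# decimal codes quoted in the HTML-i18n entry.
I18N = {"zwnj": 8204, "zwj": 8205, "lrm": 8206, "rlm": 8207}

# The four entities for syntax-significant ASCII characters.
ASCII_SPECIALS = {"amp": "&", "gt": ">", "lt": "<", "quot": '"'}


def latin1_right_part_is_covered():
    """True when every code from 160 to 255 has an entity name."""
    return all(code in codepoint2name for code in range(160, 256))


def i18n_entities_match():
    """True when the HTML 4 table agrees with the codes listed above."""
    return all(name2codepoint.get(name) == code for name, code in I18N.items())


def ascii_specials_match():
    """True when the four ASCII entities map to the expected characters."""
    return all(
        name2codepoint.get(name) == ord(char)
        for name, char in ASCII_SPECIALS.items()
    )
```

All three helpers are expected to return True with the table shipped in Python's standard library.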
Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, and partly because people want to transmit HTML texts over channels that are not 8-bit clean while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters when they fall outside ASCII (that is, when they have the 8th bit set).
When you recode from another charset to HTML, beware that all
occurrences of double quotes, ampersands, and left or right angle brackets
are translated into special sequences. However, in practice, people often
use ampersands and angle brackets in the other charset for introducing
HTML commands, compromising it: it is not pure HTML, nor is it pure
other charset. Since these particular translations can be rather inconvenient,
they may be specifically inhibited through the command option `-d'
(see section 3.7 Using mixed charset input).
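To see why the translation is inconvenient for mixed input, consider the following Python sketch. It only imitates the effect described above and is not recode's code; the function escape_specials and its inhibit parameter are hypothetical stand-ins for the default behaviour and for what `-d' is described as inhibiting.

```python
# The four characters that are translated when recoding to HTML.
SPECIALS = {'"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_specials(text, inhibit=False):
    """Translate markup-significant characters unless inhibited."""
    if inhibit:
        # Assumed analogue of `-d': leave existing markup untouched.
        return text
    return "".join(SPECIALS.get(char, char) for char in text)
```

With the default, an input such as <a href="x">A&B</a> comes out as &lt;a href=&quot;x&quot;&gt;A&amp;B&lt;/a&gt;, so markup that was already present no longer acts as markup. With inhibit=True the same input is returned unchanged.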
Codes not having a mnemonic entity are output by recode using the
`&#nnn;' notation, where nnn is a decimal representation
of the UCS code value. When there is an entity name for a character, it
is always preferred over a numeric character reference. ASCII printable
characters are always generated directly. So is the newline. While reading
HTML, recode accepts numeric character references as alternate
writings, even when written as hexadecimal numbers, as in `&#xfffd;'.
This is documented in:
http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
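The output rules above can be sketched in a few lines of Python. This is an approximation under stated assumptions, not recode's implementation: entity names come from Python's html.entities table (HTML 4), the function name encode_text is hypothetical, and passing ASCII control characters other than the newline through unchanged is an assumption the text does not confirm. For the reading side, the standard html.unescape is used as a stand-in; it follows HTML5 rules but accepts both decimal and hexadecimal numeric references.

```python
import html
from html.entities import codepoint2name


def encode_text(text):
    """Prefer an entity name, fall back to decimal &#nnn;, keep ASCII as is."""
    pieces = []
    for char in text:
        code = ord(char)
        if code < 128 and char not in '"&<>':
            # ASCII printable characters and the newline go out directly.
            # Assumption: other ASCII control characters do too.
            pieces.append(char)
            continue
        name = codepoint2name.get(code)
        pieces.append(f"&{name};" if name else f"&#{code};")
    return "".join(pieces)


def decode_text(text):
    """Stand-in reader: accepts names, decimal and hexadecimal references."""
    return html.unescape(text)
```

Here U+00E9 has an HTML 4 name and is written as &eacute;, whereas U+0151 has none in that table and is written as &#337;. On input, &#233; and &#xE9; both decode to the same character.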
When recode translates to HTML, the translation occurs according to
the HTML level as selected by the goal charset. When translating from
HTML, recode not only accepts the character entity references known at
that level, but also those of all other levels, as well as a few alternative
special sequences, to be forgiving to files using other HTML standards.
The recode program can be used to normalise an HTML file using
oldish conventions. For example, it accepts `&AE;', as this once was a
valid writing, somewhere. However, it should always produce `&AElig;'
instead of `&AE;'. Yet, this is not completely true. If one does:
     recode h3..h3 < input
the operation will be optimised into a mere copy, and you can get `&AE;' this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:
     recode h3..u2,u2..h3 < input
then `&AE;' should be normalised into `&AElig;' by the operation.
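The effect of such a normalisation can be imitated with a short Python sketch. Note that it takes a shortcut: recode is described as decoding to an intermediate charset and re-encoding, while this sketch merely rewrites entity names in place. The alias table is hypothetical and contains only the one old writing named in the text.

```python
import re

# Hypothetical alias table: only `&AE;' is mentioned in the text.
LEGACY_ALIASES = {"AE": "AElig"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def normalise(text):
    """Rewrite known legacy entity names to their current spelling."""

    def fix(match):
        name = match.group(1)
        return f"&{LEGACY_ALIASES.get(name, name)};"

    return ENTITY.sub(fix, text)
```

Applied to `&AE;sop &amp; &eacute;', this returns `&AElig;sop &amp; &eacute;', leaving the entities that are already current untouched.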
Copyright © 2003 by The Free Software Foundation. Updated Jun 2003.