The recode reference manual


12.1 World Wide Web representations

Character entities were introduced by SGML and made widely popular through HTML, the markup language in use for the World Wide Web, or Web or WWW for short. For representing unusual characters, HTML texts use special sequences, beginning with an ampersand & and ending with a semicolon ;. The sequence may itself start with a number sign # and be followed by digits, so forming a numeric character reference, or else be an alphabetic identifier, so forming a character entity reference.
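The two forms described above can be told apart mechanically. The following Python fragment is only an illustrative sketch, not part of recode: the function name classify and the regular expression are assumptions made for this example, and only the decimal form mentioned in this paragraph is handled.

```python
import re

# "&", then either "#" plus decimal digits or an alphanumeric identifier, then ";".
REFERENCE = re.compile(r"&(#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def classify(reference):
    """Return 'numeric', 'entity', or None for one candidate special sequence."""
    match = REFERENCE.fullmatch(reference)
    if match is None:
        return None
    # A leading number sign marks a numeric character reference.
    return "numeric" if match.group(1).startswith("#") else "entity"


if __name__ == "__main__":
    for sample in ("&#233;", "&eacute;", "&eacute"):
        print(sample, classify(sample))
```

The last sample lacks its closing semicolon, so this sketch rejects it.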

The HTML standards have been revised into different HTML levels over time, and the list of allowable character entities differs among them. The later XML, meant to simplify many things, has an option (`standalone=yes') which much restricts that list. The recode library is able to convert character references between their mnemonic form and their numeric form, depending on the targeted HTML standard level. It also can, of course, convert between HTML and various other charsets.

Here is a list of those HTML variants which recode supports. Some notes have been provided by François Yergeau <yergeau@alis.com>.

XML-standalone
This charset is available in recode under the name XML-standalone, with h0 as an acceptable alias. It is documented in section 4.1 of http://www.w3.org/TR/REC-xml. It only knows `&amp;', `&gt;', `&lt;', `&quot;' and `&apos;'.

HTML_1.1
This charset is available in recode under the name HTML_1.1, with h1 as an acceptable alias. HTML 1.0 was never really documented.

HTML_2.0
This charset is available in recode under the name HTML_2.0, and has RFC1866, 1866 and h2 for aliases. HTML 2.0 entities are listed in RFC 1866. Basically, there is an entity for each alphabetical character in the right part of ISO 8859-1. In addition, there are four entities for syntax-significant ASCII characters: `&amp;', `&gt;', `&lt;' and `&quot;'.

HTML-i18n
This charset is available in recode under the name HTML-i18n, and has RFC2070 and 2070 for aliases. RFC 2070 added entities to cover the whole right part of ISO 8859-1. The list is conveniently accessible at http://www.alis.com:8085/ietf/html/html-latin1.sgml. In addition, four i18n-related entities were added: `&zwnj;' (`&#8204;'), `&zwj;' (`&#8205;'), `&lrm;' (`&#8206;') and `&rlm;' (`&#8207;').

HTML_3.2
This charset is available in recode under the name HTML_3.2, with h3 as an acceptable alias. HTML 3.2 took up the full Latin-1 list but not the i18n-related entities from RFC 2070.

HTML_4.0
This charset is available in recode under the name HTML_4.0, and has h4 and h for aliases. Beware that the particular alias h is not tied to HTML 4.0, but to the highest HTML level supported by recode; so it might later represent HTML level 5 if this is ever created. HTML 4.0 has the whole Latin-1 list, a set of entities for symbols, mathematical symbols, and Greek letters, and another set for markup-significant and internationalization characters comprising the 4 ASCII entities, the 4 i18n-related entities from RFC 2070, plus some more. See http://www.w3.org/TR/REC-html40/sgml/entities.html.

Printable characters from Latin-1 may be used directly in an HTML text. However, partly because people have deficient keyboards, partly because people want to transmit HTML texts over non 8-bit clean channels while not using MIME, it is common (yet debatable) to use character entity references even for Latin-1 characters, when they fall outside ASCII (that is, when they have the 8th bit set).

When you recode from another charset to HTML, beware that all occurrences of double quotes, ampersands, and left or right angle brackets are translated into special sequences. However, in practice, people often use ampersands and angle brackets in the other charset for introducing HTML commands, compromising it: it is not pure HTML, nor is it pure other charset. Because these particular translations can be rather inconvenient, they may be specifically inhibited through the command option `-d' (see section 3.7 Using mixed charset input).
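To see why these translations get in the way of mixed input, consider the following Python sketch. It is not recode's code: the table MARKUP, the function escape_markup and its inhibit parameter are hypothetical names for this illustration, with inhibit only mimicking the effect described for `-d'.

```python
# The four markup-significant characters named above and their special sequences.
MARKUP = {'"': "&quot;", "&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_markup(text, inhibit=False):
    """Translate markup-significant characters unless inhibit is set."""
    if inhibit:
        # Hypothetical switch: leave HTML commands already present untouched.
        return text
    return "".join(MARKUP.get(ch, ch) for ch in text)


if __name__ == "__main__":
    mixed = "<b>R&D</b>"
    print(escape_markup(mixed))                # markup is escaped as well
    print(escape_markup(mixed, inhibit=True))  # markup is preserved
```

Without inhibition, the `<b>' command in the sample is itself turned into `&lt;b&gt;', which is rarely what the author of such mixed text intended.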

Codes not having a mnemonic entity are output by recode using the `&#nnn;' notation, where nnn is a decimal representation of the UCS code value. When there is an entity name for a character, it is always preferred over a numeric character reference. ASCII printable characters are always generated directly. So is the newline. While reading HTML, recode supports numeric character references as alternate writings, even when written as hexadecimal numbers, as in `&#xfffd;'. This is documented in:

 
http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
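The output rules just stated may be sketched in Python as follows. This is an approximation under stated assumptions, not recode's implementation: it borrows the HTML 4 entity table from Python's standard html.entities module, the function name to_html is invented for the example, and entity names are tested first, so the markup-significant ASCII characters are escaped as well.

```python
from html import unescape
from html.entities import codepoint2name  # HTML 4 entity names, from Python's stdlib


def to_html(text):
    """Encode text, preferring entity names over numeric character references."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            # An entity name exists: always prefer it.
            out.append("&" + codepoint2name[code] + ";")
        elif ch == "\n" or 0x20 <= code <= 0x7E:
            # Newline and printable ASCII are generated directly.
            out.append(ch)
        else:
            # No mnemonic entity: decimal UCS value in &#nnn; notation.
            out.append("&#%d;" % code)
    return "".join(out)


if __name__ == "__main__":
    print(to_html("caf\u00e9 \u0101\n"), end="")
    # Reading accepts the name, the decimal form and the hexadecimal form alike.
    print(unescape("&eacute; &#233; &#xe9;"))
```

Here U+0101 has no HTML 4 entity name and so falls back to `&#257;', while the three spellings passed to unescape all read back as the same character.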

When recode translates to HTML, the translation occurs according to the HTML level as selected by the goal charset. When translating from HTML, recode not only accepts the character entity references known at that level, but also those of all other levels, as well as a few alternative special sequences, to be forgiving to files using other HTML standards.

The recode program can be used to normalise an HTML file using oldish conventions. For example, it accepts `&AE;', as this once was a valid writing, somewhere. However, it should always produce `&AElig;' instead of `&AE;'. Yet, this is not completely true. If one does:

 
recode h3..h3 < input

the operation will be optimised into a mere copy, and you can get `&AE;' this way, if you had some in your input file. But if you explicitly defeat the optimisation, like this maybe:

 
recode h3..u2,u2..h3 < input

then `&AE;' should be normalised into `&AElig;' by the operation.
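The effect of going through an intermediate charset can be pictured with the Python sketch below: each entity name is decoded to its code value, then written back under its preferred name. Everything here is an assumption made for illustration. The table LEGACY_ALIASES is hypothetical (only `&AE;' comes from the text above), the function name normalise is invented, and the entity tables are those of Python's html.entities module, not recode's own.

```python
import re
from html.entities import codepoint2name, name2codepoint

# Hypothetical table of obsolete spellings; only "AE" is taken from the text.
LEGACY_ALIASES = {"AE": "AElig"}

ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def normalise(text):
    """Decode each entity name to a code value, then re-encode it canonically."""
    def replace(match):
        name = LEGACY_ALIASES.get(match.group(1), match.group(1))
        code = name2codepoint.get(name)
        if code is None:
            return match.group(0)  # unknown name: leave the sequence as it is
        return "&" + codepoint2name[code] + ";"
    return ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(normalise("&AE;sop and &AElig;sop"))
```

A mere copy never performs the decode step, which is why the optimised `h3..h3' request leaves `&AE;' untouched.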


  Copyright © 2003   by The Free Software Foundation     Updated Jun 2003  
