The recode reference manual

3.7 Using mixed charset input

In practice, text files often mix several charsets: some parts of a file are encoded in one charset, other parts in another, and so forth. Usually, a file does not switch between more than two or three charsets. The means to tell which charset is used at a given place is not always available. The recode program is able to handle only a few simple cases of mixed input.

By default, recode expects pure charset files, to be recoded as other pure charset files. However, the following options handle a few specific kinds of mixed charset files.

`-d'
`--diacritics-only'

While converting to or from the HTML or LaTeX charset, limit conversion to a subset of all characters. For HTML, limit conversion to the subset of all non-ASCII characters. For LaTeX, limit conversion to the subset of all non-English letters. This is particularly useful when a file would be valid HTML, TeX or LaTeX if only its author had used the provided sequences for applying diacritics instead of typing the diacriticised characters directly in the underlying character set.

While converting to the HTML or LaTeX charset, this option assumes that characters outside that subset are already properly coded or protected, so recode transmits them literally. While converting the other way, this option prevents coded or protected versions of characters outside that subset from being translated back. See section 12.1 World Wide Web representations. See section 12.2 LaTeX macro calls.
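The HTML case described above can be illustrated with a minimal Python sketch. It is not recode's implementation: the function names are hypothetical, and the entity names come from Python's `html.entities` tables, which are only assumed to resemble the ones recode uses. It shows that ASCII characters (including `<`, `>` and `&`) pass through untouched in one direction, and that entities for ASCII characters are not translated back in the other.

```python
import re
from html.entities import codepoint2name, name2codepoint


def to_html_diacritics_only(text: str) -> str:
    """Replace only non-ASCII characters with entities (hypothetical helper)."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # ASCII is assumed to be properly coded or protected already.
            pieces.append(ch)
        elif code in codepoint2name:
            pieces.append(f"&{codepoint2name[code]};")
        else:
            # No named entity known: fall back to a numeric reference.
            pieces.append(f"&#{code};")
    return "".join(pieces)


_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def from_html_diacritics_only(text: str) -> str:
    """Translate back named entities for non-ASCII characters only."""
    def decode(match: re.Match) -> str:
        code = name2codepoint.get(match.group(1))
        if code is None or code < 128:
            # Unknown entity, or a protected ASCII character such as &lt;.
            return match.group(0)
        return chr(code)

    return _ENTITY.sub(decode, text)


if __name__ == "__main__":
    print(to_html_diacritics_only("<b>caf\u00e9</b> & co"))
    print(from_html_diacritics_only("caf&eacute; &amp; &lt;tag&gt;"))
```

In this sketch the markup in the first call is copied as is, and `&amp;` and `&lt;` in the second call stay encoded, which mirrors the behaviour the paragraphs above describe.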

`-S[language]'
`--source[=language]'

The bulk of the input file is expected to be written in ASCII, except for some parts, like comments and string constants, which are written in a charset other than ASCII. When language is `c', the recoding will proceed only with the contents of comments or strings, while everything else will be copied without recoding. When language is `po', the recoding will proceed only within translator comments (those having whitespace immediately following the initial `#') and with the contents of msgstr strings.

For the above things to work, the non-ASCII encoding of the comment or string should be such that an ASCII scan will successfully find where the comment or string ends.
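To make the ASCII scan requirement concrete, here is a minimal Python sketch of a scanner for C-like sources. It is an illustration under assumptions, not recode's actual algorithm: the function names are hypothetical, Latin-1 and UTF-8 are arbitrary example charsets, and details such as backslash-continued line comments are ignored. The scanner finds the end of each comment or string by looking for plain ASCII bytes (`*/`, a newline, a closing quote), which is why the embedded charset must never produce those bytes by accident. Shift_JIS, for instance, can use the byte 0x5C (the ASCII backslash) as the second byte of a character, which would mislead a scan like this one.

```python
def _literal_end(data: bytes, start: int) -> int:
    """Return the index just past the quoted literal starting at `start`."""
    quote = data[start]
    j = start + 1
    while j < len(data):
        if data[j] == 0x5C:          # backslash: skip the escaped byte
            j += 2
            continue
        if data[j] == quote or data[j] == 0x0A:
            return j + 1             # closing quote, or unterminated line
        j += 1
    return len(data)


def recode_c_source(data: bytes, src: str = "latin-1", dst: str = "utf-8") -> bytes:
    """Recode only comments and string literals; copy everything else."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        two = data[i:i + 2]
        one = data[i:i + 1]
        if two == b"/*":
            end = data.find(b"*/", i + 2)
            end = n if end < 0 else end + 2
        elif two == b"//":
            end = data.find(b"\n", i)
            end = n if end < 0 else end
        elif one == b'"':
            end = _literal_end(data, i)
        elif one == b"'":
            # Character constants are scanned so that '"' is not mistaken
            # for the start of a string, but they are copied unchanged.
            end = _literal_end(data, i)
            out += data[i:end]
            i = end
            continue
        else:
            out.append(data[i])
            i += 1
            continue
        out += data[i:end].decode(src).encode(dst)
        i = end
    return bytes(out)


if __name__ == "__main__":
    source = b'int x = 1; /* caf\xe9 */\nchar *s = "na\xefve"; // \xe9\n'
    print(recode_c_source(source))
```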

Even if ASCII is the usual charset for writing programs, some compilers can directly read other charsets, such as UTF-8. There is currently no provision in recode for reading mixed charset sources which are not based on ASCII. The need for mixed recoding is probably less pressing in such cases.

For example, after one does:

recode -Spo pc/..u8 < input.po > output.po

file `output.po' holds a copy of `input.po' in which only translator comments and the contents of msgstr strings have been recoded from the IBM-PC charset to pure UTF-8, without attempting conversion of end-of-lines. Machine-generated comments and original msgid strings are left untouched by this recoding.
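The effect of the `-Spo' example can be approximated with the following Python sketch. It rests on several assumptions: the function name is hypothetical, Python's `cp437` codec stands in for the IBM-PC charset, and whole lines are recoded rather than just the string contents, which gives the same result here only because the keywords and quotes are ASCII and ASCII maps to itself. It is a simplified line-based filter, not a PO parser.

```python
def recode_po(data: bytes, src: str = "cp437", dst: str = "utf-8") -> bytes:
    """Recode translator comments and msgstr strings only (sketch)."""
    out = []
    in_msgstr = False
    for line in data.split(b"\n"):
        if line.startswith(b"#"):
            # Translator comments have whitespace right after the '#';
            # '#:', '#,', '#.' and '#|' comments are machine generated.
            recode_line = line[1:2].isspace()
        elif line.startswith(b"msgstr"):
            in_msgstr = True
            recode_line = True
        elif line.startswith(b'"'):
            # Continuation line: belongs to whichever keyword came last.
            recode_line = in_msgstr
        else:
            # msgid, msgid_plural, msgctxt, blank lines, anything else.
            in_msgstr = False
            recode_line = False
        out.append(line.decode(src).encode(dst) if recode_line else line)
    return b"\n".join(out)


if __name__ == "__main__":
    sample = (
        b'# Traduit par Ren\x82\n'
        b'#: main.c:3\n'
        b'msgid "coffee"\n'
        b'msgstr ""\n'
        b'"caf\x82\\n"\n'
    )
    print(recode_po(sample).decode("utf-8"))
```

Splitting on the byte `\n` and joining with the same byte leaves end-of-lines exactly as they were, in line with the "without attempting conversion of end-of-lines" remark above.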

If language is not specified, `c' is assumed.

Copyright 2003 by The Free Software Foundation. Updated Jun 2003.