recode reference manual
In practice, text files are often made up of several charsets
at once. Some parts of the file are encoded in one charset, while other
parts are encoded in another, and so forth. Usually, a file does not toggle
between more than two or three charsets. The means to tell
which charset is used at each place is not always available.
The recode program is able to handle only a few simple cases
of mixed input.
The default recode behaviour is to expect pure charset files, to
be recoded as other pure charset files. However, the following options
allow for a few precise kinds of mixed charset files.
`-d'
`--diacritics'
While converting to or from one of the HTML or LaTeX
charsets, limit conversion to some subset of all characters.
For HTML, limit conversion to the subset of all non-ASCII
characters. For LaTeX, limit conversion to the subset of all
non-English letters. This is particularly useful, for example, when
people create what would be valid HTML, TeX or LaTeX
files, if only they were using provided sequences for applying
diacritics instead of using the diacriticised characters directly
from the underlying character set.
While converting to the HTML or LaTeX charset, this option
assumes that characters not in the said subset are already properly coded
or protected; recode then transmits them literally.
While converting the other way, this option prevents translating back
coded or protected versions of characters not in the said subset.
See section 12.1 World Wide Web representations. See section 12.2 LaTeX macro calls.
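
As an illustration of the HTML case only, the following Python sketch mimics the behaviour described above. It is not recode's implementation, and the function names `to_html_diacritics_only` and `from_html_diacritics_only` are hypothetical. It assumes that "properly coded or protected already" means that every ASCII character, including `<', `>' and `&', is copied as is.

```python
import re
from html.entities import codepoint2name, name2codepoint

def to_html_diacritics_only(text: str) -> str:
    """Replace only non-ASCII characters by HTML character references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII, including <, > and &, is assumed to be coded already.
            out.append(ch)
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")
        else:
            out.append(f"&#{cp};")
    return "".join(out)

# Matches named (&eacute;) and decimal (&#233;) character references.
_REFERENCE = re.compile(r"&(#\d+|[A-Za-z][A-Za-z0-9]*);")

def from_html_diacritics_only(text: str) -> str:
    """Translate back only the references that stand for non-ASCII characters."""
    def replace(match):
        ref = match.group(1)
        cp = int(ref[1:]) if ref.startswith("#") else name2codepoint.get(ref)
        if cp is None or cp < 128 or cp > 0x10FFFF:
            # Leave &lt;, &amp;, unknown names and invalid numbers untouched.
            return match.group(0)
        return chr(cp)
    return _REFERENCE.sub(replace, text)
```

In this sketch, existing markup such as `<b>' and `&amp;' survives the conversion to HTML unchanged, and `&lt;' is not turned back into `<' on the way back.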
`-S[language]'
`--source[=language]'
The bulk of the input file is expected to be written in ASCII,
except for parts, like comments and string constants, which are written
using a charset other than ASCII. When language is `c',
the recoding will proceed only with the contents of comments or strings,
while everything else will be copied without recoding. When language
is `po', the recoding will proceed only within translator comments
(those having whitespace immediately following the initial `#')
and with the contents of msgstr strings.
For the above things to work, the non-ASCII encoding of the comment
or string should be such that an ASCII scan will successfully find
where the comment or string ends.
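
The condition above can be made concrete with a small check. The Python sketch below is not how recode scans its input. It only tests, for a given text and encoding, whether some byte of an encoded non-ASCII character collides with a byte that an ASCII scanner would treat as significant in C source. The function name `ascii_scan_safe` and the chosen delimiter set are assumptions made for this illustration.

```python
# Bytes an ASCII scanner of C source would act upon: string quote,
# escape character, comment delimiters and newline (an assumed set).
DELIMITERS = b'"\\*/\n'

def ascii_scan_safe(text: str, encoding: str) -> bool:
    """Return True if no non-ASCII character of `text`, once encoded,
    contains a byte that looks like one of DELIMITERS."""
    for ch in text:
        if ord(ch) < 128:
            continue  # genuine ASCII is what the scanner expects to see
        if any(byte in DELIMITERS for byte in ch.encode(encoding)):
            return False
    return True
```

For instance, with Python's `shift_jis' codec the character 表 is encoded as the bytes 0x95 0x5C, and 0x5C is the ASCII backslash, so `ascii_scan_safe("表", "shift_jis")` is false. In UTF-8, every byte of a multi-byte sequence has its high bit set, so the same call with `"utf-8"` is true.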
Even if ASCII is the usual charset for writing programs, some
compilers are able to directly read other charsets, like UTF-8, say.
There is currently no provision in recode for reading mixed charset
sources which are not based on ASCII. It is probable that the need
for mixed recoding is not as pressing in such cases.
For example, after one does:
    recode -Spo pc/..u8 < input.po > output.po
file `output.po' holds a copy of `input.po' in which
only translator comments and the contents of msgstr strings
have been recoded from the IBM-PC charset to pure UTF-8,
without attempting conversion of end-of-lines. Machine-generated comments
and original msgid strings are left untouched by this recoding.
If language is not specified, `c' is assumed.
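
The selection rule for `po' files described above can be sketched in Python as follows. This is only a rough model of the documented behaviour, not recode's own code. The function name `recode_po` is hypothetical, and Python's `cp437' codec is assumed here as a stand-in for the IBM-PC charset. Since ASCII bytes are the same in both charsets, recoding a whole selected line is equivalent to recoding only the comment text or string contents.

```python
def recode_po(data: bytes, src: str = "cp437", dst: str = "utf-8") -> bytes:
    """Recode translator comments and msgstr strings, copy the rest as is."""
    out = []
    in_msgstr = False
    # keepends=True keeps each end-of-line exactly as found in the input.
    for line in data.splitlines(keepends=True):
        if line.startswith(b"#"):
            in_msgstr = False
            # Translator comments have whitespace right after the `#'.
            selected = line[1:2].isspace()
        elif line.startswith(b"msgstr"):
            in_msgstr = True
            selected = True
        elif line.startswith(b'"'):
            # Continuation line: belongs to whichever keyword came last.
            selected = in_msgstr
        else:
            # msgid, msgctxt, blank lines and anything else.
            in_msgstr = False
            selected = False
        out.append(line.decode(src).encode(dst) if selected else line)
    return b"".join(out)
```

Given a file whose translator comment and msgstr contain the byte 0x82 (the letter é in code page 437), this sketch rewrites those two places as the UTF-8 bytes 0xC3 0xA9, and leaves `#:' reference comments and msgid lines byte for byte identical.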
| Copyright © 2003 by The Free Software Foundation | Updated Jun 2003 |