recode reference manual
When the request is written simply as before..after, before and after specify the start charset and the goal charset for the recoding.
For recode, charset names may contain any character except a
comma, a forward slash, or two periods in a row. In practice, however, charset
names are currently limited to alphabetic letters (upper or lower case),
digits, hyphens, underlines, periods, colons, and parentheses.
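The two rules above (what the syntax forbids, and what names use in practice) can be sketched as a pair of checks. This is only an illustration: the function names `is_valid_charset_name` and `is_practical_charset_name` are hypothetical and are not part of recode.

```python
import string

# Characters that charset names are limited to in practice, per the text above.
PRACTICAL_CHARS = set(string.ascii_letters + string.digits + "-_.:()")


def is_valid_charset_name(name: str) -> bool:
    """True if the name avoids the separators reserved by the request syntax."""
    return "," not in name and "/" not in name and ".." not in name


def is_practical_charset_name(name: str) -> bool:
    """True if the name is non-empty and uses only the characters seen in practice."""
    return bool(name) and is_valid_charset_name(name) and set(name) <= PRACTICAL_CHARS


if __name__ == "__main__":
    for candidate in ("ISO_8859-1:1987", "a..b", "latin/1"):
        print(candidate, is_valid_charset_name(candidate))
```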
The complete syntax for a valid request allows for unusual things, which might be surprising at first. (Do not pay too much attention to these facilities on first reading.) For example, a request may also contain intermediate charsets, as in the following example:
before..interim1..interim2..after
meaning that recode should internally produce the interim1
charset from the start charset, then work from this interim1
charset to internally produce interim2, and from there towards the
goal charset. In fact, recode internally combines recipes and
automatically uses interim charsets, when there is no direct recipe for
transforming before into after. But there might be many ways
to do it. When many routes are possible, the above chaining syntax
may be used to more precisely force the program towards a particular route,
which it might not have naturally selected otherwise. On the other hand,
because recode tries to choose good routes, chaining is only needed
to achieve some rare, unusual effects.
Moreover, many such requests (sub-requests, more precisely) may be separated with commas (but no spaces at all), indicating a sequence of recodings, where the output of one has to serve as the input of the following one. For example, the two following requests are equivalent:
before..interim1..interim2..after
before..interim1,interim1..interim2,interim2..after
In this example, the charset input for any recoding sub-request is identical
to the charset output by the preceding sub-request. But it does not have
to be so in the general case. One might wonder what it would mean
for a sub-request to declare an input charset that differs from the
charset output by the preceding sub-request, when recodings are chained
in this way. Such strange usage might have a meaning and be useful for
the recode expert, but it is quite uncommon in practice.
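The equivalence between a chained request and a comma-separated sequence of sub-requests can be illustrated with a small string rewrite. This is a minimal sketch, assuming every charset in the chain is named explicitly; `expand_chain` is a hypothetical helper and does not reflect how recode parses requests internally.

```python
def expand_chain(request: str) -> str:
    """Rewrite each chained sub-request 'a..b..c' as the sequence 'a..b,b..c'."""
    expanded = []
    for sub_request in request.split(","):
        charsets = sub_request.split("..")
        if len(charsets) < 3:
            # Already a single before..after step (or a lone charset).
            expanded.append(sub_request)
            continue
        # Pair each charset with its successor to form one step per pair.
        expanded.extend(f"{src}..{dst}" for src, dst in zip(charsets, charsets[1:]))
    return ",".join(expanded)


if __name__ == "__main__":
    print(expand_chain("before..interim1..interim2..after"))
```

Requests that are already in comma-separated form pass through unchanged.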
More useful is the distinction between the concept of charset, and the concept of surfaces. An encoded charset is represented by:
pure-charset/surface1/surface2...
using slashes to introduce surfaces, if any. The order in which surfaces are applied usually matters; they cannot be freely commuted. In the given example, surface1 is first applied over the pure-charset, then surface2 is applied over the result. Given this request:
before/surface1/surface2..after/surface3
the recode program will understand that the input files should
have surface2 removed first (because it was applied last), then
surface1 should be removed. The next step will be to translate the
codes from charset before to charset after, prior to applying
surface3 over the result.
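The ordering described above (remove input surfaces last-applied-first, translate, then apply output surfaces in order) can be sketched as a function that lists the steps for a simple before..after request. The name `plan_steps` and the wording of the step strings are hypothetical, and the sketch only considers surfaces written explicitly in the request.

```python
def plan_steps(request: str) -> list[str]:
    """List the recoding steps for a single 'before/s1/s2..after/s3' request."""
    before, after = request.split("..")
    src, *src_surfaces = before.split("/")
    dst, *dst_surfaces = after.split("/")

    # Input surfaces are removed in reverse order of application.
    steps = [f"remove {s}" for s in reversed(src_surfaces) if s]
    steps.append(f"translate {src} to {dst}")
    # Output surfaces are applied in the order written.
    steps += [f"apply {s}" for s in dst_surfaces if s]
    return steps


if __name__ == "__main__":
    for step in plan_steps("before/surface1/surface2..after/surface3"):
        print(step)
```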
Some charsets have one or more implied surfaces. In this case, the
implied surfaces are automatically handled merely by naming the charset,
without any explicit surface to qualify it. Let's take an example to
illustrate this feature. The request `pc..l1' will indeed decode MS-DOS
end of lines prior to converting IBM-PC codes to Latin-1, because `pc'
is the name of a charset(3) which has CR-LF for its usual surface.
The request `pc/..l1' will not decode end of lines, since
the slash introduces surfaces, and even if the surface list is empty, it
effectively defeats the automatic removal of surfaces for this charset.
So, empty surfaces are useful, indeed!
Both charsets and surfaces may have predefined alternate names, or aliases. However, and this is rather important to understand, implied surfaces are attached to individual aliases rather than to genuine charsets. Consequently, the official charset name and all of its aliases do not necessarily share the same implied surfaces. The charset and its aliases may each have their own set of implied surfaces.
Charset names, surface names, or their aliases may always be abbreviated
to any unambiguous prefix. Internally in recode, disambiguating
tables are kept separate for charset names and surface names.
While recognising a charset name or a surface name (or aliases thereof),
recode ignores all characters other than letters and digits, so, for
example, the hyphens and underlines that are part of an official charset
name may safely be omitted (no need to un-confuse them!). There is also
no distinction between upper and lower case for charset or surface names.
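The matching rules above (ignore everything but letters and digits, ignore case, accept any unambiguous prefix) can be sketched as follows. This is an illustration under stated assumptions: `normalize` and `resolve` are hypothetical names, the list of known names is made up for the example, letters are taken to mean ASCII letters, and letting an exact match win over a longer name that shares the same prefix is an assumption rather than documented behaviour.

```python
def normalize(name: str) -> str:
    """Keep only ASCII letters and digits, folded to lower case."""
    return "".join(ch for ch in name if ch.isascii() and ch.isalnum()).lower()


def resolve(name: str, known_names: list[str]) -> str:
    """Return the known name matched by an exact or unambiguous-prefix lookup."""
    key = normalize(name)
    by_key = {normalize(known): known for known in known_names}
    if key in by_key:
        # Assumption: an exact match is never treated as ambiguous.
        return by_key[key]
    matches = [known for norm, known in by_key.items() if norm.startswith(key)]
    if len(matches) == 1:
        return matches[0]
    raise LookupError(f"{name!r} matches {len(matches)} names")


if __name__ == "__main__":
    names = ["ISO-8859-1", "ISO-8859-15", "UTF-8"]  # illustrative list only
    print(resolve("iso_8859_15", names))
    print(resolve("utf", names))
```

A separate table of this kind would be kept for surface names, in line with the text above.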
One of the before or after keywords may be omitted. If the double dot separator is omitted too, then the charset is interpreted as the before charset.(4)
When a charset name is omitted or left empty, the value of the
DEFAULT_CHARSET variable in the environment is used instead. If this
variable is not defined, the recode library uses the current locale's
encoding. On POSIX-compliant systems, this depends on the first non-empty
value among the environment variables LC_ALL, LC_CTYPE, LANG, and can be
determined through the command `locale charmap'.
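The lookup order described above can be sketched as a function that reports which environment variable would decide the default. This is a rough illustration, not the recode library's code: `default_charset_source` is a hypothetical name, treating an empty DEFAULT_CHARSET the same as an undefined one is an assumption, and the sketch returns the raw locale value rather than deriving the encoding from it (`locale charmap` remains the way to find that out).

```python
import os
from typing import Mapping, Optional, Tuple


def default_charset_source(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Tuple[str, str]]:
    """Return (variable, value) for the setting that selects the default charset."""
    env = os.environ if env is None else env

    # DEFAULT_CHARSET takes priority when it is set (assumed: and non-empty).
    explicit = env.get("DEFAULT_CHARSET")
    if explicit:
        return ("DEFAULT_CHARSET", explicit)

    # Otherwise the locale decides: first non-empty of LC_ALL, LC_CTYPE, LANG.
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable)
        if value:
            return (variable, value)
    return None


if __name__ == "__main__":
    print(default_charset_source())
```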
If the charset name is omitted but followed by surfaces, the surfaces then qualify the usual or default charset. For example, the request `../x' is sufficient for applying a hexadecimal surface to the input text(5).
The allowable values for before or after charsets, and various surfaces, are described in the remainder of this document.
Copyright © 2003 by The Free Software Foundation. Updated Jun 2003.