The recode reference manual


3.2 The request parameter

When the request is written simply as before..after, before and after specify the start charset and the goal charset for the recoding.

For recode, charset names may contain any character except a comma, a forward slash, or two periods in a row. In practice, however, charset names are currently limited to alphabetic letters (upper or lower case), digits, hyphens, underlines, periods, colons, and round parentheses.
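The two levels of restriction described above can be sketched as a pair of checks. This is only an illustration of the stated rules, not code taken from recode; the function names are hypothetical, and treating an empty string as "not a name" is an assumption of this sketch.

```python
import re

# Characters that charset names are limited to in practice, per the text:
# letters, digits, hyphens, underlines, periods, colons, round parentheses.
_PRACTICAL_ALPHABET = re.compile(r"[A-Za-z0-9\-_.:()]+")


def is_syntactically_allowed(name: str) -> bool:
    """Hypothetical check: no comma, no slash, no two periods in a row."""
    if not name:  # assumption: an empty string is not a charset name
        return False
    return "," not in name and "/" not in name and ".." not in name


def uses_practical_alphabet(name: str) -> bool:
    """Hypothetical check: allowed by the syntax and by the practical alphabet."""
    return (is_syntactically_allowed(name)
            and _PRACTICAL_ALPHABET.fullmatch(name) is not None)
```

A name containing a space, for instance, would pass the first check but not the second.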

The complete syntax for a valid request allows for unusual things, which might be surprising at first. (Do not pay too much attention to these facilities on first reading.) For example, a request may also contain intermediate charsets, as in the following example:

 
before..interim1..interim2..after

meaning that recode should internally produce the interim1 charset from the start charset, then work out of this interim1 charset to internally produce interim2, and from there towards the goal charset. In fact, recode internally combines recipes and automatically uses interim charsets, when there is no direct recipe for transforming before into after. But there might be many ways to do it. When many routes are possible, the above chaining syntax may be used to more precisely force the program towards a particular route, which it might not have naturally selected otherwise. On the other hand, because recode tries to choose good routes, chaining is only needed to achieve some rare, unusual effects.

Moreover, several such requests (sub-requests, more precisely) may be separated with commas (but no spaces at all), indicating a sequence of recodings in which the output of each one serves as the input of the next. For example, the two following requests are equivalent:

 
before..interim1..interim2..after
before..interim1,interim1..interim2,interim2..after

In this example, the input charset of each recoding sub-request is identical to the output charset of the preceding sub-request, but this does not have to be so in the general case. One might wonder what it would mean to declare the input charset of a sub-request to be of a different nature than the output charset of the preceding sub-request when recodings are chained in this way. Such strange usage might be meaningful and useful for the recode expert, but it is quite uncommon in practice.
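The equivalence between the chained form and the comma-separated form can be checked with a small sketch that expands a request into its individual recoding steps. The function name expand_request is hypothetical, and the sketch ignores surfaces and omitted charsets; it is not how recode itself parses requests.

```python
def expand_request(request: str) -> list[tuple[str, str]]:
    """Expand a request into a flat list of (before, after) recoding steps.

    Commas separate sub-requests; within a sub-request, double dots
    separate the successive charsets of a chain.
    """
    steps = []
    for sub_request in request.split(","):
        charsets = sub_request.split("..")
        # Pair each charset with the one that follows it in the chain.
        for before, after in zip(charsets, charsets[1:]):
            steps.append((before, after))
    return steps


chained = expand_request("before..interim1..interim2..after")
separated = expand_request(
    "before..interim1,interim1..interim2,interim2..after")
```

Both calls yield the same three steps, which is the sense in which the two requests are equivalent.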

More useful is the distinction between the concept of charset, and the concept of surfaces. An encoded charset is represented by:

 
pure-charset/surface1/surface2...

using slashes to introduce surfaces, if any. The order of application of surfaces is usually important: they cannot be freely commuted. In the given example, surface1 is first applied over the pure-charset, then surface2 is applied over the result. Given this request:

 
before/surface1/surface2..after/surface3

the recode program will understand that the input files should have surface2 removed first (because it was applied last), then surface1 should be removed. The next step will be to translate the codes from charset before to charset after, prior to applying surface3 over the result.
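The ordering just described (surfaces removed in reverse order of application, then the charset translation, then the goal surfaces applied in order) can be sketched as follows. The names split_surfaces and plan_steps are hypothetical, and the sketch only handles a simple before..after request with both charsets present.

```python
def split_surfaces(encoded: str) -> tuple[str, list[str]]:
    """Split 'charset/surface1/surface2' into ('charset', [surfaces])."""
    charset, *surfaces = encoded.split("/")
    return charset, surfaces


def plan_steps(request: str) -> list[str]:
    """Describe, in order, the steps implied by a simple request."""
    before, after = request.split("..")
    before_charset, before_surfaces = split_surfaces(before)
    after_charset, after_surfaces = split_surfaces(after)

    # The surface applied last on the input is removed first.
    steps = [f"remove {surface}" for surface in reversed(before_surfaces)]
    steps.append(f"translate {before_charset} to {after_charset}")
    # Goal surfaces are applied in the order they are written.
    steps += [f"apply {surface}" for surface in after_surfaces]
    return steps


steps = plan_steps("before/surface1/surface2..after/surface3")
```

For the request above, the plan lists the removal of surface2, then of surface1, then the translation, then the application of surface3.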

Some charsets have one or more implied surfaces. In this case, the implied surfaces are automatically handled merely by naming the charset, without any explicit surface to qualify it. Let's take an example to illustrate this feature. The request `pc..l1' will indeed decode MS-DOS end of lines prior to converting IBM-PC codes to Latin-1, because `pc' is the name of a charset(3) which has CR-LF for its usual surface. The request `pc/..l1' will not decode end of lines, since the slash introduces surfaces, and even if the surface list is empty, it effectively defeats the automatic removal of surfaces for this charset. So, empty surfaces are useful, indeed!
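The difference between `pc' and `pc/' can be illustrated with a sketch in which a slash, even one followed by nothing, switches off the implied surfaces. The table IMPLIED_SURFACES and the function effective_surfaces are hypothetical; the single entry merely restates the `pc' example from the text and is not recode's actual table.

```python
# Hypothetical table, keyed by the name as written in the request.
# The only entry restates the example above: `pc' implies CR-LF.
IMPLIED_SURFACES = {"pc": ["CR-LF"]}


def effective_surfaces(encoded: str) -> list[str]:
    """Return the surfaces that qualify an encoded charset.

    Any slash makes the surface list explicit, even when the list is
    empty, so the implied surfaces are used only when no slash appears.
    """
    if "/" in encoded:
        _charset, *surfaces = encoded.split("/")
        return [surface for surface in surfaces if surface]
    return list(IMPLIED_SURFACES.get(encoded, []))
```

Under these assumptions, effective_surfaces("pc") gives the CR-LF surface, while effective_surfaces("pc/") gives an empty list.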

Both charsets and surfaces may have predefined alternate names, or aliases. However, and this is rather important to understand, implied surfaces are attached to individual aliases rather than to genuine charsets. Consequently, the official charset name and all of its aliases do not necessarily share the same implied surfaces. The charset and each of its aliases may have its own, different set of implied surfaces.

Charset names, surface names, or their aliases may always be abbreviated to any unambiguous prefix. Internally in recode, disambiguating tables are kept separate for charset names and surface names.

While recognising a charset name or a surface name (or aliases thereof), recode ignores all characters besides letters and digits, so for example, the hyphens and underlines being part of an official charset name may safely be omitted (no need to un-confuse them!). There is also no distinction between upper and lower case for charset or surface names.
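The matching rules of the last two paragraphs (non-alphanumeric characters ignored, case ignored, any unambiguous prefix accepted) can be sketched as below. The function names and the list of known names passed in are hypothetical, and letting an exact match win over longer names that share the same prefix is an assumption of this sketch, not something the text states.

```python
def normalize(name: str) -> str:
    """Keep only letters and digits, and fold case."""
    return "".join(ch for ch in name if ch.isalnum()).lower()


def resolve(abbreviation: str, known_names: list[str]) -> str:
    """Resolve an abbreviation against a list of known names."""
    key = normalize(abbreviation)
    # Assumption: an exact match (after normalization) always wins.
    for name in known_names:
        if normalize(name) == key:
            return name
    matches = [name for name in known_names
               if normalize(name).startswith(key)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous name: {abbreviation!r}")
    return matches[0]


# Hypothetical list of known names, for illustration only.
names = ["Latin-1", "Latin-2", "UTF-8"]
```

With this list, "u" and "utf8" both resolve to UTF-8, "latin_1" resolves to Latin-1, and "lat" is rejected as ambiguous. Charset names and surface names would each get their own list, since the text says their disambiguating tables are kept separate.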

One of the before or after keywords may be omitted. If the double dot separator is omitted too, then the charset is interpreted as the before charset.(4)

When a charset name is omitted or left empty, the value of the DEFAULT_CHARSET variable in the environment is used instead. If this variable is not defined, the recode library uses the current locale's encoding. On POSIX-compliant systems, this depends on the first non-empty value among the environment variables LC_ALL, LC_CTYPE, and LANG, and can be determined through the command `locale charmap'.
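The lookup order described above can be sketched as a function over an environment mapping. The function name is hypothetical, and the sketch only reports which variable decides and what its value is; turning a locale name into an actual encoding is the job of the system's locale database (what `locale charmap' reports) and is not attempted here.

```python
from typing import Mapping, Optional


def default_charset_source(
        env: Mapping[str, str]) -> Optional[tuple[str, str]]:
    """Return (variable, value) for the variable that decides the default.

    DEFAULT_CHARSET is used whenever it is defined; otherwise the first
    non-empty variable among LC_ALL, LC_CTYPE and LANG selects the locale.
    Returns None when none of them is usable.
    """
    if "DEFAULT_CHARSET" in env:
        return "DEFAULT_CHARSET", env["DEFAULT_CHARSET"]
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        if env.get(variable):  # skip undefined and empty values
            return variable, env[variable]
    return None
```

In a real program, os.environ could be passed as the mapping.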

If the charset name is omitted but followed by surfaces, the surfaces then qualify the usual or default charset. For example, the request `../x' is sufficient for applying a hexadecimal surface to the input text(5).
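How omitted parts of a simple request fall back to the default charset can be sketched as follows. The function name parse_simple_request is hypothetical, the default charset is passed in explicitly rather than looked up, and chained or comma-separated requests are out of scope for this sketch.

```python
def parse_simple_request(request: str, default_charset: str):
    """Split a simple request into (before, after) endpoints.

    Each endpoint is a (charset, surfaces) pair. A missing or empty
    charset is replaced by default_charset; a request without the double
    dot separator is taken as naming the before charset only.
    """
    before, _separator, after = request.partition("..")

    def endpoint(text: str):
        charset, *surfaces = text.split("/")
        return charset or default_charset, surfaces

    return endpoint(before), endpoint(after)


hexadecimal = parse_simple_request("../x", "UTF-8")
```

Here the before endpoint is the default charset with no surfaces, and the after endpoint is the same default charset qualified by the `x' surface.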

The allowable values for before or after charsets, and various surfaces, are described in the remainder of this document.


Copyright © 2003 by The Free Software Foundation. Updated Jun 2003.