The recode reference manual


3.3 Asking for various lists

Many options control listing output generated by recode itself; they are not meant to accompany actual file recodings. These options are:

`--version'
The program merely prints its version numbers on standard output, and exits without doing anything else.

`--help'
The program merely prints a page of help on standard output, and exits without doing any recoding.

`-C'
`--copyright'
Given this option, all other parameters and options are ignored. The program briefly prints the copyright and copying conditions. See the file `COPYING' in the distribution for the full statement of the copyright and copying conditions.

`-h[language/][name]'
`--header[=[language/][name]]'
Instead of recoding files, recode writes a language source file on standard output and exits. This source is meant to be included in a regular program written in the same programming language: its purpose is to declare and initialise an array, named name, which represents the requested recoding. The only acceptable values for language are `c' and `perl', either of which may be abbreviated. If language is not specified, `c' is assumed. If name is not specified, it defaults to `before_after'. The strings before and after are cleaned according to the syntax of language before being used.

Even though recode tries its best, this option does not always succeed in producing the requested source table. It does succeed, however, provided the recoding can be internally represented by only one step after the optimisation phase, and this merged step conveys a one-to-one or a one-to-many explicit table. Also, when attempting to produce source tables, recode relaxes its checking a tiny bit: it ignores the algorithmic part of some tabular recodings, and it avoids the processing of implied surfaces. But this is all fairly technical. Better try and see!

Beware that other options might affect the produced source tables: `-d', `-g' and, particularly, `-s'.
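
The kind of one-to-one table that `-h' aims to emit can be pictured with a short Python sketch. This is neither recode's code nor its output format: it assumes Python's bundled `cp437' and `latin-1' codecs stand in for a before and an after charset, and `build_table' is a hypothetical helper name introduced here for illustration.

```python
def build_table(before, after):
    """Return a 256-entry list mapping each before code to an after code.

    Entries are None where the character has no equivalent in `after`.
    Hypothetical helper: it only illustrates a one-to-one explicit table.
    """
    table = []
    for code in range(256):
        try:
            char = bytes([code]).decode(before)
            table.append(char.encode(after)[0])
        except (UnicodeDecodeError, UnicodeEncodeError):
            table.append(None)
    return table


table = build_table("cp437", "latin-1")
print(table[130])  # code 130 in cp437 is e-acute, which is 233 in Latin-1
```

A program including such a table would index it with a before code to obtain the after code. The gaps (None here) correspond to characters that cannot be represented in the after charset.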

`-k pairs'
`--known=pairs'
This option is meant to help identify an unknown charset, using as hints some already identified characters of the charset. Some examples will help introduce the idea.

Let's presume here that recode is run in an ISO-8859-1 locale, and that DEFAULT_CHARSET is unset in the environment. Suppose you have guessed that code 130 (decimal) of the unknown charset represents a lower case `e' with an acute accent. That is to say that this code should map to code 233 (decimal) in the usual charset. By executing:

 
recode -k 130:233

you should obtain a listing similar to:

 
AtariST atarist
CWI cphu cwi cwi2
IBM437 437 cp437 ibm437
IBM850 850 cp850 ibm850
IBM851 851 cp851 ibm851
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
IBM860 860 cp860 ibm860
IBM861 861 cp861 cpis ibm861
IBM863 863 cp863 ibm863
IBM865 865 cp865 ibm865

You can give more than one clue at once, to restrict the list further. Suppose you have also guessed that code 211 of the unknown charset represents an upper case `E' with diaeresis, that is, code 203 in the usual charset. By requesting:

 
recode -k 130:233,211:203

you should obtain:

 
IBM850 850 cp850 ibm850
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857

The usual charset may be overridden by specifying one non-option argument. For example, to request the list of charsets for which code 130 maps to code 142 for the Macintosh, you may ask:

 
recode -k 130:142 mac

and get:

 
AtariST atarist
CWI cphu cwi cwi2
IBM437 437 cp437 ibm437
IBM850 850 cp850 ibm850
IBM851 851 cp851 ibm851
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
IBM860 860 cp860 ibm860
IBM861 861 cp861 cpis ibm861
IBM863 863 cp863 ibm863
IBM865 865 cp865 ibm865

which, of course, is identical to the result of the first example, since the code 142 for the Macintosh is a small `e' with acute.
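
The filtering idea behind these three examples can be imitated in Python. The sketch below is an approximation under stated assumptions, not recode's implementation: it uses the codecs bundled with Python instead of recode's tables, the candidate list is an arbitrary selection, and `known' is a hypothetical function name.

```python
# Candidate "before" charsets: an arbitrary selection of codecs bundled
# with Python, chosen for illustration only (recode knows many more).
CANDIDATES = ["cp437", "cp850", "cp852", "cp857", "cp860",
              "cp861", "cp863", "cp865", "cp1252", "mac_roman"]


def known(pairs, after="latin-1", candidates=CANDIDATES):
    """List the candidates for which every (before, after) pair holds."""
    kept = []
    for before in candidates:
        try:
            agrees = all(
                bytes([b]).decode(before) == bytes([a]).decode(after)
                for b, a in pairs
            )
        except UnicodeDecodeError:
            agrees = False  # code undefined in one of the charsets
        if agrees:
            kept.append(before)
    return kept


print(known([(130, 233)]))
print(known([(130, 233), (211, 203)]))
print(known([(130, 142)], after="mac_roman"))
```

With Python's codec tables, the second call should narrow the result to `cp850', `cp852' and `cp857', mirroring the shorter listing above. Charsets for which Python ships no codec, such as AtariST or CWI, cannot appear in this sketch.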

More formally, option `-k' lists all possible before charsets for the after charset given as the sole non-option argument to recode, but subject to restrictions given in pairs. If there is no non-option argument, the after charset is taken to be the default charset for this recode.

The restrictions are given as a comma-separated list of pairs, each pair consisting of two numbers separated by a colon. The numbers are taken as decimal when the initial digit is between `1' and `9'; `0x' starts a hexadecimal number, or else `0' starts an octal number. The first number is a code in any before charset, while the second number is a code in the specified after charset. If the first number would not be transformed into the second number by recoding from some before charset to the after charset, then this before charset is rejected. A before charset is listed only if it is not rejected by any pair. The program only tests those before charsets having a tabular style internal description (see section 7. Tabular sources (RFC 1345)); the selected after charset must have such a description as well.
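
The number syntax described above can be sketched in Python as follows. The helper names `parse_number' and `parse_pairs' are hypothetical, and the sketch does no error reporting beyond what Python raises on malformed input.

```python
def parse_number(text):
    """Read a number using the decimal, `0x' hexadecimal or `0' octal rule."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if lowered.startswith("0") and len(lowered) > 1:
        return int(lowered[1:], 8)
    return int(lowered, 10)


def parse_pairs(spec):
    """Split 'a:b,c:d' into a list of (before_code, after_code) tuples."""
    pairs = []
    for item in spec.split(","):
        before, after = item.split(":")
        pairs.append((parse_number(before), parse_number(after)))
    return pairs


print(parse_pairs("130:233,0xd3:0313"))
```

Here `0xd3' is 211 in hexadecimal and `0313' is 203 in octal, so the string describes the same two clues as `130:233,211:203'.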

The produced list is in fact a subset of the list produced by the option `-l'. As for option `-l', the non-option argument is interpreted as a charset name, possibly abbreviated to any unambiguous prefix.

`-l[format]'
`--list[=format]'
This option asks for information about all charsets, or about one particular charset. No file will be recoded.

If there are no non-option arguments, recode ignores the format value of the option and writes a sorted list of charset names on standard output, one per line. When a charset name has aliases or synonyms, they follow the true charset name on its line, sorted from left to right. Each charset or alias is followed by its implied surfaces, if any. This list is over two hundred lines long. It is best used with `grep -i', as in:

 
recode -l | grep -i greek

There might be one non-option argument, in which case it is interpreted as a charset name, possibly abbreviated to any unambiguous prefix. This particular usage of the `-l' option is obeyed only for charsets having a tabular style internal description (see section 7. Tabular sources (RFC 1345)). Even if most charsets have this property, some do not, and the option `-l' cannot be used to detail these particular charsets. To find out whether a particular charset can be listed this way, simply try it and see if it works. The format value of the option is a keyword from the following list. Keywords may be abbreviated by dropping suffix letters, and even reduced to the first letter only:

`decimal'
This format asks for the production on standard output of a concise tabular display of the charset, in which character code values are expressed in decimal.

`octal'
This format uses octal instead of decimal in the concise tabular display of the charset.

`hexadecimal'
This format uses hexadecimal instead of decimal in the concise tabular display of the charset.

`full'
This format requests an extensive display of the charset on standard output, using one line per character showing its decimal, hexadecimal, octal and UCS-2 code values, and also a descriptive comment which should be the 10646 name for the character.

The descriptive comment is given in English and ASCII, yet if the English description is not available but a French one is, then the French description is given instead, using Latin-1. However, if the LANGUAGE or LANG environment variable begins with the letters `fr', then listing preference goes to French when both descriptions are available.
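
The information carried by one line of the `full' format can be approximated in Python. The column layout below is an assumption made for illustration and is not recode's actual output; `full_row' is a hypothetical name, and the character names come from Python's `unicodedata' module rather than from recode's own tables.

```python
import unicodedata


def full_row(code, charset="latin-1"):
    """Format one character: decimal, hexadecimal, octal, UCS-2 and name."""
    char = bytes([code]).decode(charset)
    name = unicodedata.name(char, "")  # empty when Python has no name
    return f"{code:3d}  {code:02x}  {code:03o}  {ord(char):04X}  {name}"


print(full_row(233))
```

For code 233 of Latin-1, the row gives 233, e9, 351 and 00E9, followed by the name LATIN SMALL LETTER E WITH ACUTE.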

When option `-l' is used together with a charset argument, the format defaults to decimal.
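
Keyword abbreviation and the decimal default can be sketched as a prefix match over the four format names. This is an illustration only: `resolve_format' is a hypothetical name, and how recode itself reports an unknown keyword is not covered here.

```python
FORMATS = ["decimal", "octal", "hexadecimal", "full"]


def resolve_format(keyword=None):
    """Expand an abbreviated format keyword; None means no keyword given."""
    if not keyword:
        return "decimal"  # default when a charset argument is present
    matches = [name for name in FORMATS if name.startswith(keyword)]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous format: {keyword!r}")
    return matches[0]


print(resolve_format("h"))
print(resolve_format())
```

Since the four keywords all begin with different letters, a single letter is always enough to select one of them.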

`-T'
`--find-subsets'
This option is a maintainer tool for evaluating the redundancy of those charsets, in recode, which are internally represented by a UCS-2 data table. After the listing has been produced, the program exits without doing any recoding. The output is meant to be sorted, like this: `recode -T | sort'. The option makes recode compare all pairs of charsets, seeking those which are subsets of others. The concept and results are better explained through a few examples. Consider these three sample lines from `-T' output:

 
[  0] IBM891 == IBM903
[  1] IBM1004 < CP1252
[ 12] INVARIANT < CSA_Z243.4-1985-1

The first line means that IBM891 and IBM903 are completely identical as far as recode is concerned, so one is fully redundant to the other. The second line says that IBM1004 is wholly contained within CP1252, yet there is a single character which is in CP1252 without being in IBM1004. The third line says that INVARIANT is wholly contained within CSA_Z243.4-1985-1, but twelve characters are in CSA_Z243.4-1985-1 without being in INVARIANT. The whole output could probably be reduced and made more significant through a transitivity study.
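
The pairwise comparison can be sketched in Python by treating each charset as a set of characters. This is a simplified model, not recode's implementation: `find_subsets' is a hypothetical name, the four toy charsets are made up, and the line format merely imitates the samples above.

```python
from itertools import combinations


def find_subsets(charsets):
    """Compare all pairs of charsets, given as {name: set of characters}."""
    lines = []
    for a, b in combinations(sorted(charsets), 2):
        set_a, set_b = charsets[a], charsets[b]
        if set_a == set_b:
            lines.append(f"[{0:3d}] {a} == {b}")
        elif set_a < set_b:
            lines.append(f"[{len(set_b - set_a):3d}] {a} < {b}")
        elif set_b < set_a:
            lines.append(f"[{len(set_a - set_b):3d}] {b} < {a}")
    return sorted(lines)


toy = {"A": {"x", "y"}, "B": {"x", "y"}, "C": {"x", "y", "z"}, "D": {"w"}}
for line in find_subsets(toy):
    print(line)
```

The bracketed number is the count of characters present in the larger charset only. Charset D shares nothing with the others, so no line mentions it.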


  Copyright © 2003   by The Free Software Foundation     Updated Jun 2003  
