recode reference manual
Many options control listing output generated by recode itself;
they are not meant to accompany actual file recodings. These options are:
recode writes a language source
file on standard output and exits. This source is meant to be included
in a regular program written in the same programming language:
its purpose is to declare and initialise an array, named name,
which represents the requested recoding. The only acceptable values for
language are `c' or `perl', either of which may be abbreviated.
If language is not specified, `c' is assumed. If name
is not specified, then it defaults to `before_after'.
Strings before and after are cleaned before being used according
to the syntax of language.
Even if recode tries its best, this option does not always succeed in
producing the requested source table. It will succeed, however, provided that
the recoding can be internally represented by only one step after the
optimisation phase, and that this merged step conveys a one-to-one or a
one-to-many explicit table. Also, when attempting to produce source tables,
recode relaxes its checking a tiny bit: it ignores the algorithmic part of
some tabular recodings, and it also avoids the processing of implied surfaces.
But this is all fairly technical. Better try and see!
Beware that other options might affect the produced source tables; these are `-d', `-g' and, particularly, `-s'.
Let's presume here that recode is run in an ISO-8859-1 locale, and
that DEFAULT_CHARSET is unset in the environment.
Suppose you have guessed that code 130 (decimal) of the unknown charset
represents a lower case `e' with an acute accent. That is to say
that this code should map to code 233 (decimal) in the usual charset.
By executing:
recode -k 130:233
you should obtain a listing similar to:
AtariST atarist
CWI cphu cwi cwi2
IBM437 437 cp437 ibm437
IBM850 850 cp850 ibm850
IBM851 851 cp851 ibm851
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
IBM860 860 cp860 ibm860
IBM861 861 cp861 cpis ibm861
IBM863 863 cp863 ibm863
IBM865 865 cp865 ibm865
You can give more than one clue at once, to restrict the list further. Suppose you have also guessed that code 211 of the unknown charset represents an upper case `E' with diaeresis, that is, code 203 in the usual charset. By requesting:
recode -k 130:233,211:203
you should obtain:
IBM850 850 cp850 ibm850
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
The usual charset may be overridden by specifying one non-option argument. For example, to request the list of charsets for which code 130 maps to code 142 for the Macintosh, you may ask:
recode -k 130:142 mac
and get:
AtariST atarist
CWI cphu cwi cwi2
IBM437 437 cp437 ibm437
IBM850 850 cp850 ibm850
IBM851 851 cp851 ibm851
IBM852 852 cp852 ibm852
IBM857 857 cp857 ibm857
IBM860 860 cp860 ibm860
IBM861 861 cp861 cpis ibm861
IBM863 863 cp863 ibm863
IBM865 865 cp865 ibm865
which, of course, is identical to the result of the first example, since the code 142 for the Macintosh is a small `e' with acute.
More formally, option `-k' lists all possible before
charsets for the after charset given as the sole non-option
argument to recode, but subject to restrictions given in
pairs. If there is no non-option argument, the after
charset is taken to be the default charset for this recode.
The restrictions are given as a comma-separated list of pairs, each pair consisting of two numbers separated by a colon. The numbers are taken as decimal when the initial digit is between `1' and `9'; `0x' starts a hexadecimal number, or else `0' starts an octal number. The first number is a code in any before charset, while the second number is a code in the specified after charset. If the first number would not be transformed into the second number by recoding from some before charset to the after charset, then this before charset is rejected. A before charset is listed only if it is not rejected by any pair. The program will only test those before charsets having a tabular style internal description (see section 7. Tabular sources (RFC 1345)); the selected after charset should have such a description as well.
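The pair syntax and the rejection rule above can be illustrated with a short Python sketch. This is not how recode is implemented: it borrows Python's own codec tables in place of recode's tabular charsets, the function names `parse_number', `parse_pairs' and `known' are made up for the illustration, and the candidate list is a small hand-picked assumption. It also assumes single-byte charsets, so every code must lie between 0 and 255.

```python
def parse_number(text):
    """Parse one code: `0x' starts hexadecimal, a leading `0' starts octal, else decimal."""
    lowered = text.strip().lower()
    if lowered.startswith("0x"):
        return int(lowered[2:], 16)
    if lowered.startswith("0") and len(lowered) > 1:
        return int(lowered, 8)
    return int(lowered, 10)


def parse_pairs(spec):
    """Split a comma-separated list of before:after pairs into integer tuples."""
    pairs = []
    for item in spec.split(","):
        before_code, after_code = item.split(":")
        pairs.append((parse_number(before_code), parse_number(after_code)))
    return pairs


def known(spec, candidates, after="latin_1"):
    """Keep each candidate before charset that is not rejected by any pair."""
    pairs = parse_pairs(spec)
    kept = []
    for before in candidates:
        for before_code, after_code in pairs:
            try:
                # Recode one byte from the candidate charset to the after charset.
                recoded = bytes([before_code]).decode(before).encode(after)
            except ValueError:
                # Undefined in `before', not representable in `after',
                # or a code outside 0..255: the candidate is rejected.
                break
            if recoded != bytes([after_code]):
                break
        else:
            kept.append(before)
    return kept


# Hypothetical candidate list, using Python codec names rather than recode names.
CANDIDATES = ["cp437", "cp850", "cp852", "cp857", "cp860"]

print(known("130:233", CANDIDATES))
print(known("130:233,211:203", CANDIDATES))
print(known("0x82:0216", CANDIDATES, after="mac_roman"))
```

The last call writes the pair 130:142 in hexadecimal and octal to exercise the number syntax, and it names the after charset explicitly, much as the `mac' argument does in the example above.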
The produced list is in fact a subset of the list produced by the option `-l'. As for option `-l', the non-option argument is interpreted as a charset name, possibly abbreviated to any unambiguous prefix.
If there are no non-option arguments, recode ignores the format
value of the option and writes a sorted list of charset names on standard
output, one per line. When a charset name has aliases or synonyms,
they follow the true charset name on its line, sorted from left to right.
Each charset or alias is followed by its implied surfaces, if any. This list
is over two hundred lines. It is best used with `grep -i', as in:
recode -l | grep -i greek
There might be one non-option argument, in which case it is interpreted as a charset name, possibly abbreviated to any unambiguous prefix. This particular usage of the `-l' option is obeyed only for charsets having a tabular style internal description (see section 7. Tabular sources (RFC 1345)). Even if most charsets have this property, some do not, and the option `-l' cannot be used to detail these particular charsets. To find out whether a particular charset can be listed this way, merely try it and see if it works.
The format value of the option is a keyword from the following list. Keywords may be abbreviated by dropping suffix letters, and even reduced to the first letter only:
UCS-2 code values, and also a descriptive comment which should be
the 10646 name for the character.
The descriptive comment is given in English and ASCII, yet if the English
description is not available but a French one is, then the French description
is given instead, using Latin-1. However, if the LANGUAGE
or LANG environment variable begins with the letters `fr',
then listing preference goes to French when both descriptions are available.
When option `-l' is used together with a charset argument,
the format defaults to decimal.
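Both charset names and format keywords may be abbreviated, as described above. The following Python sketch shows one way such prefix resolution could work. It is only an illustration: the function name `resolve' is hypothetical, the name list is a handful of charsets taken from the `-k' examples, and ignoring letter case is an assumption rather than a documented rule of recode.

```python
def resolve(prefix, names):
    """Return the single name that `prefix' abbreviates, or raise LookupError."""
    wanted = prefix.lower()
    # An exact match wins even if it is also a prefix of a longer name.
    exact = [name for name in names if name.lower() == wanted]
    if exact:
        return exact[0]
    matches = [name for name in names if name.lower().startswith(wanted)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError("unknown name: " + prefix)
    raise LookupError("ambiguous prefix " + prefix + ": " + ", ".join(matches))


# A few charset names from the examples above (not the full recode list).
CHARSETS = ["AtariST", "CWI", "IBM437", "IBM850", "IBM851", "IBM852", "IBM857"]

print(resolve("atari", CHARSETS))
print(resolve("IBM4", CHARSETS))
try:
    resolve("IBM85", CHARSETS)
except LookupError as error:
    print(error)
```

The same function applies to format keywords: with a keyword list containing `decimal', the prefix `d' resolves to it as long as no other keyword starts with that letter.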
recode, which are internally represented by a UCS-2
data table. After the listing has been produced, the program exits
without doing any recoding. The output is meant to be sorted, like
this: `recode -T | sort'. The option makes recode compare
all pairs of charsets, seeking those which are subsets of others.
The concept and results are better explained through a few examples.
Consider these three sample lines from `-T' output:
[ 0] IBM891 == IBM903
[ 1] IBM1004 < CP1252
[ 12] INVARIANT < CSA_Z243.4-1985-1
The first line means that IBM891 and IBM903 are completely
identical as far as recode is concerned, so one is fully redundant
to the other. The second line says that IBM1004 is wholly
contained within CP1252, yet there is a single character which is
in CP1252 without being in IBM1004. The third line says
that INVARIANT is wholly contained within CSA_Z243.4-1985-1,
but twelve characters are in CSA_Z243.4-1985-1 without being in
INVARIANT. The whole output could probably be reduced and
made more meaningful through a transitivity study.
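The subset relation reported by `-T' can be approximated in Python. The sketch below is an illustration under stated assumptions, not recode's own algorithm: it uses Python codec tables instead of recode's UCS-2 tables, it assumes the comparison is made code by code (a charset is contained in another when every code it defines maps to the same character in the other), and the function names are hypothetical.

```python
def table(codec):
    """Return a 256-entry list: the character for each code, or None if undefined."""
    entries = []
    for code in range(256):
        try:
            entries.append(bytes([code]).decode(codec))
        except UnicodeDecodeError:
            entries.append(None)
    return entries


def extra_codes(small, large):
    """Count codes defined only in `large', or return None if `small' is not contained."""
    extra = 0
    for small_char, large_char in zip(small, large):
        if small_char == large_char:
            continue
        if small_char is None:
            extra += 1
        else:
            return None
    return extra


def compare(first, second):
    """Return (count, name, relation, name) for a pair of charsets, or None if unrelated."""
    first_table, second_table = table(first), table(second)
    forward = extra_codes(first_table, second_table)
    if forward == 0:
        return (0, first, "==", second)
    if forward is not None:
        return (forward, first, "<", second)
    backward = extra_codes(second_table, first_table)
    if backward is not None:
        return (backward, second, "<", first)
    return None


print(compare("latin_1", "iso8859_1"))
print(compare("latin_1", "ascii"))
print(compare("cp437", "cp850"))
```

Each returned tuple carries the same information as one line of `-T' output: the number of characters present only in the larger charset, followed by the two names and the relation between them.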
Copyright © 2003 by The Free Software Foundation. Updated Jun 2003.