Mail Archives: cygwin/2004/04/02/23:49:13
Dave Korn wrote:
> I was pretty stunned to find d2u didn't have the same effect as tr -d. A
> few seconds work in the debugger, however, made it clear.
>
> Right inside conv.c, in the main convert (...) function, there's an
> attempted optimisation. After opening the file for conversion, it reads a
> char at a time until it finds the first '\n' or '\r' in the whole file. If
> a '\n' comes first, it assumes the file is in Unix format; if a '\r' comes
> first, it assumes the file must be in DOS format.
>
> Now, these assumptions are reasonable enough ways of guessing the file
> format if it hasn't been specified by the command name or command line
> switch, and therefore of deducing which kind of translation is required.
>
> But then it checks to see if the guessed format matches the format you've
> asked it to convert into. If so, it attempts to 'optimise' the conversion
> by simply not performing it: it closes the file and leaves it untouched.
This is not just a performance enhancement. It also has the following
properties:
(1) the access time on files which are not actually modified does not
get updated
(2) it's an attempt to prevent users from permanently scrogging
binary files. See: d2u, on a binary file, is an irreversible operation.
So, if you do "d2u *" you'll probably kill something deep inside some
binary file, and you can't fix it -- unless some minimal safeguards are
in place.
u2d MAY be reversible -- IF there were no pre-exising \r\n
combinations in the file to begin with -- so when (OMG-fixit-)d2u is
run, obviously the first '\n' is preceeded by a (newly-added) '\r\n', so
the prog merrily replaces ALL '\r\n' with '\n'...which MAY fix your
oops, but maybe not.
So, with the current code, if you snarf the first "line" -- all chars
until the first '\n' -- if it's a binary file the odds are pretty low
that the immediately-preceeding character is a '\r' -- so d2u as
currently coded will bail out, and no harm is done.
It doesn't work so well in the other direction -- by the same logic
above, you'll almost never bail out early if you run 'u2d' on a binary
file -- but if you immediately do a 'd2u' you MIGHT be able to recover.)
>
> Unfortunately, there is an extra unstated assumption in between deducing
> the file type from the first EOL in the file and deducing that you don't
> need to perform a conversion, and that assumption is that every other line
> in the file has the same EOL as the first line. And that assumption is
> bogus, and it means that d2u/u2d and friends are no use on files which have
> mixed EOL types, unless by good chance the very first line has the EOL type
> that you wish to convert away from.
>
> My attached patch simply removes the attempted optimisation. Like I say,
> I think it's an invalid shortcut to assume that every line in a file has the
> same EOL type. I could imagine a case could be made for keeping the
> 'optimisation' and perhaps providing a command-line switch "-f" or "--force"
> to force full processing of files even if they seem to already be in the
> right mode; OTOH I'd say that even if you wanted to keep the optimisation
> in some cases, it's a dangerous optimisation that can lead to incorrect
> output, and therefore it should only be switched on when the user
> deliberately adds a command-line option, rather than being on by default and
> disableable.
While I admit that the 'safety' aspect is not foolproof, it is a
valuable mini-feature. To make the default behavior un-"safe" kinda
defeats the purpose. Further, why do you need an extra commandline
option at all? Obvious, the first EOL is either one or the other.
So, if you have a mixed-mode file, and want to go to UNIX EOL, then this
will always work:
u2d myFile.txt && d2u myFile.txt
I'm not shooting down your patch, but I'd like further discussion on
list. I think the "safety" features of the current behavior have been
overlooked, and I hesistate to remove them without a thorough
discussion. 'cause as soon as I remove it, some poor slob is going to
do 'd2u /cygdrive/c/WINNT/nt.dat' and invent new swear words using my
name...
--
Chuck
--
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Problem reports: http://cygwin.com/problems.html
Documentation: http://cygwin.com/docs.html
FAQ: http://cygwin.com/faq/
- Raw text -