Message-ID: <406E4184.30904@cwilson.fastmail.fm>
Date: Fri, 02 Apr 2004 23:45:56 -0500
From: Charles Wilson
To: cygwin AT cygwin DOT com
Subject: Re: Bogus assumption prevents d2u/u2d/conv/etal working on mixed files.

Dave Korn wrote:
> I was pretty stunned to find d2u didn't have the same effect as tr -d. A
> few seconds work in the debugger, however, made it clear.
>
> Right inside conv.c, in the main convert (...) function, there's an
> attempted optimisation. After opening the file for conversion, it reads a
> char at a time until it finds the first '\n' or '\r' in the whole file. If
> a '\n' comes first, it assumes the file is in Unix format; if a '\r' comes
> first, it assumes the file must be in DOS format.
>
> Now, these assumptions are reasonable enough ways of guessing the file
> format if it hasn't been specified by the command name or command line
> switch, and therefore of deducing which kind of translation is required.
>
> But then it checks to see if the guessed format matches the format you've
> asked it to convert into. If so, it attempts to 'optimise' the conversion
> by simply not performing it: it closes the file and leaves it untouched.

This is not just a performance enhancement. It also has the following
properties:

(1) the access time on files which are not actually modified does not
    get updated
(2) it's an attempt to prevent users from permanently scrogging binary
    files.

See: d2u, on a binary file, is an irreversible operation.
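
To make the irreversibility concrete, here is a minimal Python sketch. The
d2u() and u2d() functions below are hypothetical stand-ins that do plain
byte replacement; they are not the code in conv.c, and the sample bytes are
made up for illustration.

```python
def d2u(data: bytes) -> bytes:
    """Hypothetical stand-in for d2u: turn every CRLF into a bare LF."""
    return data.replace(b"\r\n", b"\n")


def u2d(data: bytes) -> bytes:
    """Hypothetical stand-in for u2d: turn every LF into CRLF."""
    return data.replace(b"\n", b"\r\n")


# Two "binary" blobs that differ only in one CRLF versus one bare LF.
blob_a = b"\x00\x01\r\n\x02\n\x03"
blob_b = b"\x00\x01\n\x02\n\x03"

# d2u maps both blobs to the same bytes, so the output no longer records
# which of the two inputs it came from.
assert d2u(blob_a) == d2u(blob_b)

# Running u2d afterwards does not restore blob_a: it adds a CR before
# every LF, including the one that never had a CR.
assert u2d(d2u(blob_a)) != blob_a
```

Because two different inputs collapse to one output, no later tool can
tell which LF originally had a CR in front of it.
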
So, if you do "d2u *" you'll probably kill something deep inside some
binary file, and you can't fix it -- unless some minimal safeguards are
in place. u2d MAY be reversible -- IF there were no pre-existing \r\n
combinations in the file to begin with -- so when (OMG-fixit-)d2u is
run, obviously the first '\n' is preceded by a (newly-added) '\r', so
the prog merrily replaces ALL '\r\n' with '\n'...which MAY fix your
oops, but maybe not.

So, with the current code, if you snarf the first "line" -- all chars
until the first '\n' -- and it's a binary file, the odds are pretty low
that the immediately preceding character is a '\r' -- so d2u as
currently coded will bail out, and no harm is done. (It doesn't work so
well in the other direction -- by the same logic above, you'll almost
never bail out early if you run 'u2d' on a binary file -- but if you
immediately do a 'd2u' you MIGHT be able to recover.)

>
> Unfortunately, there is an extra unstated assumption in between deducing
> the file type from the first EOL in the file and deducing that you don't
> need to perform a conversion, and that assumption is that every other line
> in the file has the same EOL as the first line. And that assumption is
> bogus, and it means that d2u/u2d and friends are no use on files which have
> mixed EOL types, unless by good chance the very first line has the EOL type
> that you wish to convert away from.
>
> My attached patch simply removes the attempted optimisation. Like I say,
> I think it's an invalid shortcut to assume that every line in a file has the
> same EOL type.
> I could imagine a case could be made for keeping the
> 'optimisation' and perhaps providing a command-line switch "-f" or "--force"
> to force full processing of files even if they seem to already be in the
> right mode; OTOH I'd say that even if you wanted to keep the optimisation
> in some cases, it's a dangerous optimisation that can lead to incorrect
> output, and therefore it should only be switched on when the user
> deliberately adds a command-line option, rather than being on by default and
> disableable.

While I admit that the 'safety' aspect is not foolproof, it is a
valuable mini-feature. To make the default behavior un-"safe" kinda
defeats the purpose.

Further, why do you need an extra commandline option at all? Obviously,
the first EOL is either one or the other. So, if you have a mixed-mode
file, and want to go to UNIX EOL, then this will always work:

  u2d myFile.txt && d2u myFile.txt

(Either u2d converts the file, or it bails out because the first EOL is
already DOS. Whichever happens, the first EOL is DOS afterwards, so d2u
will not bail out.)

I'm not shooting down your patch, but I'd like further discussion on
list. I think the "safety" features of the current behavior have been
overlooked, and I hesitate to remove them without a thorough
discussion. 'cause as soon as I remove it, some poor slob is going to do
'd2u /cygdrive/c/WINNT/nt.dat' and invent new swear words using my
name...

--
Chuck
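
For reference, the first-EOL check and the two-step workaround discussed in
this message can be sketched in Python roughly as follows. This is only an
illustration under stated assumptions: first_eol() and convert() are
hypothetical names, the check follows the "character before the first '\n'"
description given above, and the to-DOS branch is assumed to leave existing
'\r\n' pairs alone (a plain replace of '\n' would instead leave stray '\r'
bytes behind). None of this is taken from conv.c.

```python
import re
from typing import Optional


def first_eol(data: bytes) -> Optional[str]:
    """Guess the format from the first EOL: "dos", "unix", or None if no EOL."""
    i = data.find(b"\n")
    if i < 0:
        return None
    return "dos" if i > 0 and data[i - 1:i] == b"\r" else "unix"


def convert(data: bytes, target: str) -> bytes:
    """Hypothetical converter; target is "dos" or "unix"."""
    # The safeguard under discussion: if the first EOL already matches the
    # target (or there is no EOL at all), leave the data untouched.
    if first_eol(data) in (None, target):
        return data
    if target == "unix":
        return data.replace(b"\r\n", b"\n")
    # Assumption: only add a CR before an LF that does not already have one.
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


mixed = b"one\ntwo\r\nthree\n"  # made-up mixed-EOL sample

# A single "d2u" bails out because the first EOL is already Unix, so the
# CRLF on the second line survives.
assert convert(mixed, "unix") == mixed

# "u2d && d2u": the first pass makes every EOL DOS, the second strips them.
assert convert(convert(mixed, "dos"), "unix") == b"one\ntwo\nthree\n"
```

If the mixed file starts with a DOS EOL instead, the first pass is the one
that bails out and the second does all the work; the end result under these
assumptions is the same.
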