Message-ID: <35FEF59C.9B4D8BFC@vlsi.com>
Date: Tue, 15 Sep 1998 16:17:48 -0700
From: Charles Marslett
MIME-Version: 1.0
To: DJ Delorie
CC: djgpp-workers AT delorie DOT com
Subject: Re: auto-binary-mode?
References: <199809152120 DOT RAA08510 AT delorie DOT com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Precedence: bulk

DJ Delorie wrote:
>
> Hey, I just had an idea.  When a file is opened and the first block is
> read in, if the user didn't specify binary or text, why not look at
> the data and try to guess?  The presence of null, control, or certain
> 8-bit characters should indicate binary vs text as a default.

I have used that idea in the past (personal version of Microsoft's
compiler 6-8 years ago ;-).  The best single indicator of text vs.
binary turned out not to be non-text byte values, though.  Lots of text
files had PC graphics characters in them and they often clustered at
the beginning of the file (title boxes and such).

But I found that looking for at least 3 CR/LF pairs in the first 512
bytes of the file worked pretty well (PC file format, of course) and it
worked better if you relaxed the rule when lots of backspaces showed up
(I think I counted backspaces and when the counter hit 100 I counted
that as a CR/LF pair or some such thing).  If the CR/LF counter was 0,
1 or 2 I had a binary file, more than that indicated a text file (I
actually used assembly with scan instructions, so there really wasn't a
counter as such -- just where the program counter was).  It's slower
than comparing for a 'b' or 't' in the function call, but still pretty
fast.

I also stuck in a few extra tests that looked for a few signatures I
knew I was going to be reading ("PK" for ZIP files and "MZ" for
executables, for example).

At least back in those days, nulls were not a good binary file
indicator because they seemed to occur in a lot of files captured from
stdout streams with ">".
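For anyone who wants to experiment, here is a rough C sketch of the heuristic described above. It is not the original code (that was assembly built around scan instructions); the function name guess_is_text and the three constants are hypothetical, and the backspace rule is only an approximation of the "or some such thing" recollection.

```c
#include <stddef.h>

#define SNIFF_BYTES     512  /* only the first 512 bytes are examined */
#define MIN_CRLF_PAIRS  3    /* 0, 1 or 2 pairs means binary */
#define BS_PER_PAIR     100  /* assumed: 100 backspaces count as one pair */

/* Hypothetical helper: returns 1 if buf looks like text, 0 if binary. */
int guess_is_text(const unsigned char *buf, size_t len)
{
    size_t i, pairs = 0, backspaces = 0;

    if (len > SNIFF_BYTES)
        len = SNIFF_BYTES;

    /* Known binary signatures: "PK" (ZIP) and "MZ" (DOS executable). */
    if (len >= 2 &&
        ((buf[0] == 'P' && buf[1] == 'K') ||
         (buf[0] == 'M' && buf[1] == 'Z')))
        return 0;

    for (i = 0; i < len; i++) {
        if (buf[i] == '\r' && i + 1 < len && buf[i + 1] == '\n') {
            pairs++;
            i++;                      /* skip the LF of this pair */
        } else if (buf[i] == '\b') {
            if (++backspaces == BS_PER_PAIR) {
                backspaces = 0;
                pairs++;              /* relaxed rule for backspace-heavy files */
            }
        }
        if (pairs >= MIN_CRLF_PAIRS)
            return 1;                 /* enough line endings: call it text */
    }
    return 0;                         /* too few line endings: call it binary */
}
```

A library would presumably call something like this on the first block read, and only when the caller gave neither 'b' nor 't'; how that hooks into the actual open/read path is left open here.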
Like most ad hoc designs, it got pretty baroque by the time I stopped
using it.  Looking for 01, 02, 03, FF and FE might work pretty well,
though.

--Charles