From: Hans-Bernhard Broeker
Newsgroups: comp.os.msdos.djgpp
Subject: Re: how to determine if a file is text/binary
Date: 30 Apr 2002 11:55:04 GMT
Organization: Aachen University of Technology (RWTH)

xeon wrote:
> I'm wondering, how to determine is a file is a text file, or a binary
> file, programatically. I'm thinking about reading 4 bytes from the
> file and test them if they're in the range of usual text ([a-z],
> [A-Z], etc. The 4 bytes is read from the following locations : 1st
> byte, last byte, and 2 randomly selected offset inside the file. Is
> this enough?

Quite probably not.  It's both too picky and not picky enough.

It's too picky because a file can easily be a text file without
containing a single letter in the whole file.  Think of a
spreadsheet-like collection of lots of numbers in decimal figures.  Or
the file could be in some strange foreign character mapping where
almost all letters have codes outside the ASCII range (like all that
trashy spam recently flooding the net all coming from a particular
country in East Asia).

It's not picky enough because there's a significant probability that
all four bytes you tested happen to be printable ASCII characters, but
none of the others is.  More generally speaking: *every* file can
potentially be a binary file.

To give just one example where such tests are almost guaranteed to
fail: GNU's .info files.
These files look like text files (not a single non-ASCII character in
the whole file, setting aside some control-L and control-_ ones), but
they really are binary files, because there are fseek() offsets inside
the files that would break if the file is transferred to DOS text
mode.  The DJGPP ports of info readers know how to deal with that
problem, but that was done only because it became a burden to keep
telling users to leave these binary files alone.

The usual trick is as Eli described it: read some chunk of the file
and check for any strictly forbidden characters that cannot ever
appear in text files, regardless of their encoding.  E.g. null bytes.
The free zip/unzip tools read the first kilobyte for this purpose,
IIRC.  That, too, obviously cannot be failsafe, but it works well most
of the time.

-- 
Hans-Bernhard Broeker (broeker AT physik DOT rwth-aachen DOT de)
Even if all the snow were burnt, ashes would remain.
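
For illustration, the null-byte check described above could be sketched
in C roughly as follows.  This is only a sketch under assumptions, not
what zip/unzip actually do: the names SNIFF_BYTES, has_nul_byte() and
file_looks_binary() are made up for this example, the 1024-byte chunk
size just follows the "first kilobyte" recollection, and a null byte is
the only forbidden character tested.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical chunk size: roughly the "first kilobyte" mentioned above. */
#define SNIFF_BYTES 1024

/* Return 1 if buf[0..len) contains a null byte, else 0. */
static int has_nul_byte(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, len) != NULL;
}

/* Return 1 if a null byte shows up in the first SNIFF_BYTES bytes of the
   file, 0 if none does, and -1 if the file cannot be opened or read.
   A result of 0 only means "no evidence of binary data in that chunk". */
static int file_looks_binary(const char *path)
{
    unsigned char buf[SNIFF_BYTES];
    size_t n;
    int read_failed;
    FILE *fp = fopen(path, "rb");   /* binary mode: no text-mode translation */

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf, fp);
    read_failed = ferror(fp);
    fclose(fp);
    if (read_failed)
        return -1;
    return has_nul_byte(buf, n);
}
```

Note that the file has to be opened with "rb" for the check itself to
see the raw bytes.  A GNU .info file would still come out as 0 here,
which is exactly the kind of case where this heuristic fails.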