Message-ID: <328A467A.1A58@pobox.oleane.com> Date: Wed, 13 Nov 1996 23:06:50 +0100 From: Francois Charton Organization: CCMSA MIME-Version: 1.0 To: grendel AT ananke DOT amu DOT edu DOT pl CC: djgpp AT delorie DOT com Subject: Re: Why not to use 'tar' before packing DJGPP? References: <32823D97 DOT 44DD AT sabat DOT tu DOT kielce DOT pl> <3282A82E DOT 7EE7 AT cs DOT com> <55vapk$s4l AT news DOT ox DOT ac DOT uk> <561pv7$36c AT news DOT ox DOT ac DOT uk> <32851B7C DOT 66E2 AT ananke DOT amu DOT edu DOT pl> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Mark Habersack wrote: > > The reason of better compression ratios of tar archives is very simple. > LZW compression which (or rather its modification) is used by PKZIP and > some other archivers uses something called a "sliding dictionary". It's > a structure that holds pairs of data: pattern and its numerical code. > When the compressor reads the input stream it looks up the dictionary to > whether the just-read pattern already ocurred in the previously read > stream of data. If so, then the pattern is replaced with the > corresponding code read from dictionary - thus you have just compressed > the input data. This is a simplified description of the LZW algorithm, > but it's enough to understand what follows. > > Just a small point : the algorithm used by PKZip is *not* LZW (that's the one in the GIF format), but another one called "LZ77". The LZ mean Lempel-Ziv, the two guys which invented the algorithm. They actually published two algorithms, on in 1977 (LZ77) and the second one in 1978 (LZ78), LZW is a derivative of LZ78. These two algorithm are *not* alike at all : LZ77 use a "sliding window" to look for dictionnary matches LZ78 builds a dictionnary incrementally, by adding en entry everytime it meets a string not yet in the dictionnary. Historically LZ78 was used in the first compression programs for PC, like ARC, and LZW if used in the GIF format. Since then, newer progs like PKZIP, LHA, ARJ... use LZ77. LZ77 has one very strong point (IMHO): its coding and decoding routines are very asymetric : coding is a bit complex to program, and the algorithm is rather slow. But decoding is very simple to program, and very quick to run. For this reason, it has been used a lot in executable compression. Another (not so unimportant) point is that LZW is patented... Francois