Mail Archives: djgpp/1996/12/19/14:46:13

From: mlkessle AT cip DOT physik DOT uni-wuerzburg DOT de (Manuel Kessler)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: Is DJGPP that efficient?
Date: 19 Dec 1996 15:59:53 GMT
Organization: CipPool der Physikalischen Institute, Uni Wuerzburg
Lines: 40
Message-ID: <59bopp$vn3@winx03.informatik.uni-wuerzburg.de>
References: <199612161347 DOT IAA01261 AT delorie DOT com> <32B8749B DOT 6DFD AT nlc DOT net DOT au> <32B8ECAF DOT 5F9F AT gbrmpa DOT gov DOT au>
NNTP-Posting-Host: wpax24.physik.uni-wuerzburg.de
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp


Leath Muller (leathm AT gbrmpa DOT gov DOT au) wrote:
:> > I've got the original docs from the intel homepage for assembler
:> > programmers, and the Pentium floating point mul is nowhere near 3
:> > clocks. The throughput is something like 20 to 60 clocks depending on
:> > precision. The fastest version is about 4-6 clocks faster than the
:> > integer mul. On the other hand, the mmx can do 8 8 bit muls in a couple
:> > of clocks, and the Pentium Pro can do a 32bit mul in something like 3 or
:> > 4. Fixed point seems better by the moment. (I'd still use floating point
:> > for trig though)
:> 
:> Well, I have the Pentium Programmers Manual sitting in front of me in
:> Acrobat, and it says it _does_ do 3 cycles per mul. If you want proof
:> of the speed, look at Quake. Even Abrash said he couldn't get the same
:> performance out of the pentium with fixed point as he could with
:> floating point.

I have no manuals at hand, but I KNOW that the Pentium is capable of
doing one fmul EVERY cycle, because I DID it. For serious problems you
don't get that throughput, but something around 2 cycles per flop (fmul
or fadd/fsub) is possible, if memory is not slowing things down. See the
BLAS homepage at
	http://cip.physik.uni-wuerzburg.de/~mlkessle/blas1.html
For simple functions like the dot product of short vectors coming out of the
L1 cache it's possible to achieve 79 MFLOPS on a P-133. This gives one
FPU result about every 1.7 cycles. Latency for both fmul and fadd is three cycles,
so you have to use fxch heavily, but it's mostly free anyway.
Of course, it's not very easy to get that performance, but it's
possible.
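
To illustrate the latency point, here is a minimal C sketch, not the code from the BLAS page. The function names dot3 and cycles_per_flop are made up for this example. It assumes that splitting the sum into three independent accumulators gives the FPU enough independent work to cover a three-cycle fadd latency. Whether a given compiler schedules this well, with the necessary fxch, is not guaranteed, and hand-written assembly may still be needed to get near the figures above.

```c
#include <stddef.h>

/* Clock cycles spent per floating-point result, given the clock rate
   in MHz and the measured rate in MFLOPS. Hypothetical helper. */
double cycles_per_flop(double clock_mhz, double mflops)
{
    return clock_mhz / mflops;
}

/* Dot product with three independent partial sums (hypothetical name).
   Each accumulator depends only on its own previous value, so an add
   into s1 does not have to wait for the add into s0 to finish. */
double dot3(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    size_t i = 0;

    /* Main loop: three elements per iteration. */
    for (; i + 3 <= n; i += 3) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
    }

    /* Remainder: zero, one or two leftover elements. */
    for (; i < n; i++)
        s0 += x[i] * y[i];

    return s0 + s1 + s2;
}
```

With the numbers quoted above, cycles_per_flop(133.0, 79.0) comes out at roughly 1.68. Note that the partial sums are added in a different order than in a plain loop, so the result can differ from it in the last bits.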

Ciao,
	Manuel

------------------------------------------------------------------------------
Manuel Kessler
Graduate Student at the University of Wuerzburg, Germany, Physics Department
SNAIL: Zeppelinstrasse 5, D-97074 Wuerzburg, Germany
EMAIL: mlkessle AT cip DOT physik DOT uni-wuerzburg DOT de
WWW: http://cip.physik.uni-wuerzburg.de/~mlkessle
