Mail Archives: djgpp/1996/12/19/14:46:13
Leath Muller (leathm AT gbrmpa DOT gov DOT au) wrote:
:> > I've got the original docs from the intel homepage for assembler
:> > programmers, and the Pentium floating point mul is nowhere near 3
:> > clocks. The throughput is something like 20 to 60 clocks depending on
:> > precision. The fastest version is about 4-6 clocks faster than the
:> > integer mul. On the other hand, the mmx can do 8 8 bit muls in a couple
:> > of clocks, and the Pentium Pro can do a 32bit mul in something like 3 or
:> > 4. Fixed point seems better by the moment. (I'd still use floating point
:> > for trig though)
:>
:> Well, I have the Pentium Programmers Manual sitting in front of me in
:> Acrobat, and it says it _does_ do 3 cycles per mul. If you want proof
:> of the speed, look at Quake. Even Abrash said he couldn't get the same
:> performance out of the pentium with fixed point as he could with
:> floating point.
I have no manuals at hand, but I KNOW that the Pentium is capable of
doing one fmul EVERY cycle, because I DID it. For serious problems you
don't get that throughput, but something around 2 cycles per flop (fmul
or fadd/fsub) is possible, as long as memory is not slowing things down.
See the BLAS homepage at
http://cip.physik.uni-wuerzburg.de/~mlkessle/blas1.html
For simple functions like the dot product of short vectors coming out of
the L1 cache, it's possible to achieve 79 MFLOPS on a P-133. That works
out to one FPU result roughly every 1.7 cycles (133 / 79). Latency for
both fmul and fadd is three cycles, so you have to use fxch heavily to
keep independent operations in flight, but fxch is mostly free anyway.
Of course, it's not very easy to get that performance, but it's
possible.
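
To make the arithmetic and the latency argument concrete, here is a minimal
C sketch. It is only an illustration, not the code behind the BLAS numbers
above: the function names cycles_per_flop and dot3 are made up for this
example, and whether a given compiler schedules the loop with fxch the way
hand-written assembly would is not guaranteed. The idea is that three
independent partial sums give the FPU three separate dependency chains, so
a new fadd does not have to wait out the three-cycle latency of the
previous one.

```c
#include <stddef.h>

/* Average clock cycles per floating-point result:
 * clock rate in MHz divided by the achieved rate in MFLOPS. */
double cycles_per_flop(double clock_mhz, double mflops)
{
    return clock_mhz / mflops;
}

/* Dot product with three independent partial sums (hypothetical helper).
 * Each of s0, s1, s2 is its own dependency chain, so consecutive
 * multiply-adds do not depend on each other's results. */
double dot3(const double *x, const double *y, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    size_t i = 0;

    /* Main loop: three elements per iteration, one per partial sum. */
    for (; i + 3 <= n; i += 3) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
    }

    /* Remainder: at most two leftover elements. */
    for (; i < n; i++)
        s0 += x[i] * y[i];

    return (s0 + s1) + s2;
}
```

With the figures from this post, cycles_per_flop(133.0, 79.0) comes out at
about 1.68. Note that splitting the sum changes the order of the additions,
so the result can differ in the last bits from a plain left-to-right loop.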
Ciao,
Manuel
------------------------------------------------------------------------------
Manuel Kessler
Graduate Student at the University of Wuerzburg, Germany, Physics Department
SNAIL: Zeppelinstrasse 5, D-97074 Wuerzburg, Germany
EMAIL: mlkessle AT cip DOT physik DOT uni-wuerzburg DOT de
WWW: http://cip.physik.uni-wuerzburg.de/~mlkessle