Mail Archives: djgpp/2000/04/18/14:19:42

From: buers AT gmx DOT de (Dieter Buerssner)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: inefficiency of GCC output code & -O problem
Date: 18 Apr 2000 17:08:17 GMT
Lines: 80
Message-ID: <8dib4a.3vvqvqr.0@buerssner-17104.user.cis.dfn.de>
References: <Pine DOT LNX DOT 4 DOT 10 DOT 10004180455310 DOT 1540-100000 AT darkstar DOT grendel DOT net> <38FBB719 DOT 3915C530 AT mtu-net DOT ru> <8dgvat DOT 3vvqu6v DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de> <38FC0F43 DOT 87E209B3 AT mtu-net DOT ru>
NNTP-Posting-Host: pec-114-54.tnt7.s2.uunet.de (149.225.114.54)
Mime-Version: 1.0
X-Trace: fu-berlin.de 956077697 8227418 149.225.114.54 (16 [17104])
X-Posting-Agent: Hamster/1.3.13.0
User-Agent: Xnews/03.02.04
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp
Reply-To: djgpp AT delorie DOT com

[This thread should be dead by now, but I really cannot leave some things
uncorrected]

Alexei A. Frounze wrote:

>Dieter Buerssner wrote:

[The >> > is from Alexei in a reply to Kalum Somaratna aka Grendel]
>> >Yes, we have proved. We also haven't trow away all my inline ASM. The
>> >FIDIVRL trick is still alive. :)
>> 
>> Wrong. 

[In the same reply Alexei has written]
\begin{quote}
        You've forgot (in fact, Dieter haven't mentioned) about the
        FIDIVRL instruction executed in parallel to the span() function.         
        This is a real trick that makes difference. Even Dieter didn't                         
        change it and left this piece of my inline ASM AS-IS.
\end{quote} 

I did change this, and I mentioned everything. In particular, I
mentioned that for one test I changed part of the inline assembly
to C code. (I did this in places where it seemed to me that
the inline assembly would not have much impact on performance.)
I also mentioned that for the other test I got rid of all your inline
assembly (and added one new line of inline assembly). So the
quotes are just plain wrong.

>It's not wrong, since I don't get your results with (USEC=USEC2=1 and -O
>switch). I get it *slower*. And I have no idea what's up.

Don't you see that these sentences say something totally
different from the quotes? I never stated that you would be
able to reproduce my numbers. "It's not wrong, since ..."
doesn't make any sense.

Alexei, reread the thread. I think I have always tried to write
exactly what I have done. Your statements make me look like a liar.
They are often out of context. I have reported the numbers exactly
as I told you in my post about this stupid bet. Without
any of your inline assembly, I got exactly the same performance
here. I have no doubt that you might measure something different.
I don't call you a liar. It really doesn't surprise me that the
results are highly machine dependent. But from looking at the
asm output (I use fsdb after compiling with -g; it nicely shows
C source and asm together, but other means exist), it seems to me
that there shouldn't be a big difference at all for T_Map() with
and without inline assembly (besides the rounding to int, which
I coded as one inline function). I explained that you use the
FPU stack efficiently. Some of this advantage you lose through all
those references. Count the FPU instructions in the .s output,
and you will see that the C version needs just as many
fmul/fdiv etc. instructions. It will need quite a few fxch instructions
that you don't need. It will need to discard the top of the floating
point stack a few times, where you don't need to. These things can be
very CPU dependent. To make up for it, the C code avoids many address
calculations.

Also, if you think that pairing the fidivr with span is really
important, you *might* be able to get it with the C code as well.
I delayed that part of the C code until after the span, because
it was just a tiny bit faster here. The C code is still
there in comments. Gcc will not use fidivr; it will use
fdivr instead. Evidently gcc decided to trade an inverse
division by an integer (compile-time constant) for an
inverse division by a floating-point constant.
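
A minimal sketch of why that trade does not change the result (this is
not the code from the thread): converting a 32-bit int to double is
exact, so dividing an integer constant by x and dividing the equivalent
floating-point constant by x give the same value. The constant 65536
and the function names are assumptions chosen for illustration only.

```c
/* Hypothetical compile-time constant; the actual value used in the
   program under discussion is not shown in this post. */
#define SCALE_INT 65536
#define SCALE_FLT 65536.0

/* Integer constant divided by x: the form that a hand-written
   fidivr (integer operand / st(0)) corresponds to. */
double recip_from_int(double x)
{
    return SCALE_INT / x;   /* the int is converted to double first */
}

/* Floating-point constant divided by x: the form gcc is described
   as emitting (fdivr with a floating-point constant). */
double recip_from_flt(double x)
{
    return SCALE_FLT / x;
}
```

For any x, recip_from_int(x) and recip_from_flt(x) compare equal, which
is why the compiler is free to pick whichever instruction it prefers.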

You might have optimized your code exactly for your processor.
The C code isn't optimized for any processor; it is just what
usual coding principles suggest. (And you obviously thought
the same about this, because much of it is almost unchanged from
your comments.) The "optimization" of trading two divisions
for three multiplies is, at least IMHO, not allowed to be done
by the compiler, but that is a whole different issue.
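
A minimal sketch of why a compiler may not make such a substitution on
its own, assuming IEEE double arithmetic and no options such as
-ffast-math: multiplying by a precomputed reciprocal rounds twice, so
the result can differ in the last bit from a true division. The
function names and input values are made up for illustration; they are
not from the program discussed here.

```c
/* Direct division: a single rounding. */
double div_direct(double x, double d)
{
    volatile double q = x / d;    /* volatile forces a store as double */
    return q;
}

/* Division replaced by a reciprocal multiply: two roundings. */
double div_by_reciprocal(double x, double d)
{
    volatile double r = 1.0 / d;  /* rounded reciprocal */
    volatile double q = x * r;    /* rounded product */
    return q;
}

/* Returns 1 if the two forms disagree for this input, else 0. */
int reciprocal_differs(double x, double d)
{
    return div_direct(x, d) != div_by_reciprocal(x, d);
}
```

With x = 3.0 and d = 10.0 the two forms disagree, while with d = 2.0
the reciprocal is exact and they agree. A programmer who knows the
last bit does not matter can make the change by hand; the compiler
cannot assume that.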

The numbers I have written are true; they are for the first
screen of your program. I have not bothered to find any MIN/MAX,
but playing around a little bit, I can see essentially no difference
between the C code and the inline assembly.
