From: buers AT gmx DOT de (Dieter Buerssner)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: inefficiency of GCC output code & -O problem
Date: 18 Apr 2000 17:08:17 GMT
Message-ID: <8dib4a.3vvqvqr.0@buerssner-17104.user.cis.dfn.de>
References: <38FBB719 DOT 3915C530 AT mtu-net DOT ru> <8dgvat DOT 3vvqu6v DOT 0 AT buerssner-17104 DOT user DOT cis DOT dfn DOT de> <38FC0F43 DOT 87E209B3 AT mtu-net DOT ru>
Reply-To: djgpp AT delorie DOT com

[This thread should be dead by now, but I really cannot leave some
things uncorrected]

Alexei A. Frounze wrote:
>Dieter Buerssner wrote:

[The >> > is from Alexei in a reply to Kalum Somaratna aka Grendel]

>> >Yes, we have proved. We also haven't trow away all my inline ASM. The
>> >FIDIVRL trick is still alive. :)
>>
>> Wrong.

[In the same reply Alexei has written]

\begin{quote}
You've forgot (in fact, Dieter haven't mentioned) about the FIDIVRL
instruction executed in parallel to the span() function. This is a real
trick that makes difference. Even Dieter didn't change it and left this
piece of my inline ASM AS-IS.
\end{quote}

I did change this, and I mentioned everything. In particular, I
mentioned that for one test I changed part of the inline assembly to C
code. (I did this in places where it seemed to me that the inline
assembly would not have much impact on performance.) I also mentioned
that for the other test I got rid of all your inline assembly (and added
one new line of inline assembly). So the quotes are just plain wrong.

>It's not wrong, since I don't get your results with (USEC=USEC2=1 and -O
>switch). I get it *slower*.
>And I have no idea what's up.

Don't you see that these sentences say something totally different from
the quotes? I never stated that you would be able to reproduce my
numbers. "It's not wrong, since ..." doesn't make any sense.

Alexei, reread the thread. I think I have always tried to write exactly
what I have done. Your statements make me look like a liar. They are
often out of context. I have reported the numbers exactly as I told you
in my post about this stupid bet. Without any of your inline assembly, I
got exactly the same performance here. I have no doubt that you might
measure something different. I don't call you a liar.

It really doesn't surprise me that the results are highly machine
dependent. But from looking at the asm output (I use fsdb after
compiling with -g; it nicely shows C source and asm together, but other
means exist), it seems to me that there shouldn't be a big difference at
all for T_Map() with and without inline assembly (besides the rounding
to int, which I coded as one inline function).

I explained that you use the FPU stack efficiently. Some of this
advantage you lost through all those references. Count the FPU
instructions in the .s output, and you will see that the C version needs
just as many fmul/fdiv etc. instructions. It needs quite a few fxch
instructions that you don't need. It needs to discard the top of the
floating-point stack a few times, where you don't. These things can be
very CPU dependent. The C code avoids many address calculations, to make
up for it.

Also, if you think that pairing of the fidivr with span is really
important, you *might* be able to get it with the C code as well. I
delayed that part of the C code till after the span, because it was just
a very little bit faster here. The C code is still there in comments.
Gcc will not use fidivr, it will use fdivr instead.
Obviously gcc decided to trade an inverse division by an integer (a
compile-time constant) for an inverse division by a floating-point
constant.

You might have optimized your code exactly for your processor. The C
code isn't optimized for any processor; it is just what usual coding
principles suggest. (And you obviously thought the same about this,
because much is almost unchanged from your comments.) The "optimization"
of saving two divisions at the price of three multiplies, at least IMHO,
is not allowed to be done by the compiler, but that is a whole different
issue.

The numbers I have written are true; they are for the first screen of
your program. I have not bothered to find any MIN/MAX, but playing
around a little bit, I can see essentially no difference between the C
code and the inline assembly.
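
To illustrate the remark about trading divisions for multiplies: a
minimal sketch, not taken from the program discussed in this thread, of
why a compiler cannot make that change on its own. Multiplying by a
precomputed reciprocal rounds twice, so it can give a different result
than a single division. The names count_mismatches() and report() are
made up for this example, and the volatile stores are only an assumption
about how to keep x87 excess precision out of the comparison.

```c
#include <stdio.h>

/* Hypothetical helper (not from the original program): counts the
   integers x in 1..n for which x / d and x * (1.0 / d) end up as
   different doubles. */
int count_mismatches(double d, int n)
{
    /* volatile forces each value to be stored as a double, so extra
       x87 register precision does not hide the difference */
    volatile double r = 1.0 / d;   /* reciprocal, rounded once */
    int mismatches = 0;
    int i;

    for (i = 1; i <= n; i++) {
        volatile double q = (double)i / d;  /* one rounding */
        volatile double p = (double)i * r;  /* second rounding on top of r */
        if (q != p)
            mismatches++;
    }
    return mismatches;
}

/* Prints the count for one divisor; the exact number may depend on
   the floating-point mode of the machine. */
void report(double d, int n)
{
    printf("d = %g: %d of %d quotients differ\n",
           d, count_mismatches(d, n), n);
}
```

Calling report(3.0, 1000) from main() should show a nonzero count with
IEEE double arithmetic (x = 5 is one such case), while report(2.0, 1000)
should show none, because 0.5 is exactly representable and the multiply
is then exact.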