From: buers AT gmx DOT de (Dieter Buerssner)
Newsgroups: comp.os.msdos.djgpp
Subject: Re: inefficiency of GCC output code & -O problem
Date: 17 Apr 2000 19:17:53 GMT
Message-ID: <8dfum2.3vvqu6v.0@buerssner-17104.user.cis.dfn.de>

Alexei A. Frounze wrote:
>Dieter Buerssner wrote:
>> Alexei A. Frounze wrote:
>> >Not really. The inner loop in my tmapper can not be written in pure C.
>> >Belive me.
>>
>> This is not true.
>
>Okay, interpolate U and V over a group of pixels, and don't forget &0xFF
>to be sure U and V don't exceed the 0...255 range (the span() function
>does this). I doubt your C code will be as fast as my ASM. Tell me what
>you've got when you're done.

I referred to the T_Map() function that you posted to this group. It can
clearly be written (quite efficiently) in C. I didn't look at span().

[Most of the stupid bet deleted. Seriously, I wouldn't have held the bet,
because I already knew the results.]

>> I get rid of all your inline assembly in T_Map. I will be allowed
>> to add one single line (say less than 50 characters from __asm__
>> upto the closing ')' ) of inline assembly to your source. I bet,
>> the plain C code will perform about the same, as your inline
>> code. I win, when my code is no more than 2 FPS slower, or faster, than
>> your code (The executable you sent reports 70 FPS here).
>
>How many are there such lines in your oppinion? :)

I don't understand this question.

To elaborate, and to make this on-topic again: some of the code Alexei
posted uses just "normal" floating point math. He coded almost all of
this inline. I replaced it with the equivalent C code, most of which was
already there. Some minor modifications were something like

  /* a=d/c; b=e/c; */ /* This was already there in comments */

replaced by

#if USEC
  f=1.0/c; a=d*f; b=e*f;
#else
  __asm__ /* ... */
#endif

This is the same optimization Alexei made in his inline assembler.
After this I recompiled, and the speed went up from 70 FPS to 72 FPS.
This, I think, proves that gcc is capable of producing quite efficient
floating point code. Of course, Alexei's code would have won if he had
replaced

  __asm__ volatile("fldl (%0)\n ...\nfstpl (%0)" : : "r" (&dbl));

(Alexei, you got rid of the "g", but I think "memory" is needed in the
clobber list here. I'm not totally certain, though.)

with

  __asm__ volatile("fldl %0\n ...\nfstpl %0" : "=m" (dbl) : "0" (dbl));

This would give gcc more chances to optimize. It uses fewer registers,
and also needs fewer instructions. I have not tried this, but even then
I think the C version would not compile to much less efficient code than
the inline assembly.

Where gcc produces considerably less efficient code is when you have

  int i;
  double a, b;
  i = (int)(a*b);

Here, gcc always needs to save and restore the FPU control word, and
there are a few occurrences of this type in Alexei's code. (I don't
blame gcc here; I think it is almost impossible for a compiler to do
much better.) I replaced the above code with

/* can be #ifdefed and replaced by
     #define to_int(x) ((int)(x))
   for compilers other than gcc or targets other than i386, to make
   it portable. Comments on other or more efficient methods to do
   double -> int conversions are welcome. */
__inline__ static int to_int(double x)
{
  int r;
  __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x)); /* "t" is for st0 */
  return r;
}

...
#if USEC2
  i = to_int(a*b);
#else
  __asm__ /* ... */
#endif

This is essentially what Alexei's inline code does. (I have not
bothered to look up whether the fistl instruction rounds or chops, so
this may not be the same as the C code.) While the to_int function is
not optimal (gcc has to emit one superfluous fstp instruction, compared
to using fistpl), it is still quite a bit more efficient than the plain
C cast.

With these modifications, I got rid of all the other inline assembly. I
got 70 FPS, the same as the original (either the self-compiled sources
or the executable Alexei sent me). Alexei's code "caches" some values on
the FPU stack, which gcc is not able to see (with the switches I used).
Nevertheless, even here, with the help of only one line of inline
assembly, it produces comparable results. Again, it would lose if all
those references and address-of operations were omitted.

It should be clear that the compiler won't reach the efficiency of
hand-optimized assembler code. Whether the relatively small difference
here is worth all the trouble, ...

One last comment on the T_Map function. The C version actually got
quite a bit slower (5 FPS, IIRC) when compiled with -O2 or -O3, compared
to -O only. The assembler version, not surprisingly, was not affected.

There was one bug in the other part of the sources that may be of
general interest. [All the context omitted (Alexei, it's in your linev)]

  int c; /* only low byte used */
  __asm__ volatile("movb %0, al" : : "g" (c));

This actually compiled with -O2, but gas reported an error with -O. It
should be clear why: when gcc decides that c will live in memory or in
a/b/c/dx, it will work; when it is in (say) esi, it won't. So this is a
nice example of why "but it works" doesn't buy you too much.

Alexei, I have poked some fun at you. I hope I have made up for it with
this post, which actually took longer to write than the coding. I will
send you the modified source by email. The post was hopefully of general
interest.

--
Regards, Dieter
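
As a sketch of the portability note in the to_int() comment above: the
following wraps the fistl version in a preprocessor guard and falls back
to the plain cast elsewhere. The guard macros (__GNUC__, __i386__,
__x86_64__) and the extension to x86-64 are assumptions and are not taken
from the posted source. On the x87, fistl rounds according to the current
FPU rounding mode (round-to-nearest by default), while the C cast
truncates toward zero, so the two paths can differ by one for
non-integral inputs.

```c
/* Sketch only: the guard macros and the x86-64 case are assumptions,
   not taken from the posted source. */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* x87 path: fistl stores st(0) to memory as a 32-bit integer, rounded
   according to the current FPU rounding mode, and does not pop. */
static __inline__ int to_int(double x)
{
    int r;
    __asm__ volatile ("fistl %0" : "=m" (r) : "t" (x)); /* "t" is st(0) */
    return r;
}
#else
/* Portable fallback: the plain C cast, which truncates toward zero. */
static int to_int(double x)
{
    return (int)x;
}
#endif
```

Called as i = to_int(a*b); both paths give the same result whenever the
product is exactly integral.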