Mail Archives: pgcc/1999/06/04/08:57:12

Message-ID: <19990604145506.60816@atrey.karlin.mff.cuni.cz>
Date: Fri, 4 Jun 1999 14:55:06 +0200
From: Jan Hubicka <hubicka AT atrey DOT karlin DOT mff DOT cuni DOT cz>
To: pgcc AT delorie DOT com
Subject: Re: Pgcc 1.1.3 - bad performance on P6
References: <m10prMP-00021bC AT chkw386 DOT ch DOT pwr DOT wroc DOT pl>
Mime-Version: 1.0
X-Mailer: Mutt 0.84
In-Reply-To: <m10prMP-00021bC@chkw386.ch.pwr.wroc.pl>; from Krzysztof Strasburger on Fri, Jun 04, 1999 at 12:54:13PM +0000
Reply-To: pgcc AT delorie DOT com
X-Mailing-List: pgcc AT delorie DOT com
X-Unsubscribes-To: listserv AT delorie DOT com

> Marc Lehmann <pcg AT goof DOT com> wrote:
> >On Wed, Jun 02, 1999 at 09:14:00AM +0000, Krzysztof Strasburger wrote:
> 
> >> The obvious remark is: the code produced by pgcc for P6 is suboptimal,
> >> but why high optimizations kill the performance instead of improving it? 
> 
> >Tuning pgcc for ppro is not yet finished. But I think the bigger effect
> >you see is that pgcc is tuned for integer performance. You might want
> >to try out the hints in the pgcc faq on improving fp-performance (Yes,
> >unfortunately you can not have both at the same time yet).
> Double precision variables are already double aligned and there is nothing
> more to unroll in the function "gausil". I repeated the test under different
> conditions to remove the side effect of the function "main".
> Double variables in main have been declared static and main.c has been
> compiled with gcc 2.7.2.3 -malign-double. Gausil.c has been compiled
> for _pentium_ and the different versions run on a _pentium_ 166 with 2000000 steps
> (times averaged over three runs each, on an idle machine);
> -malign-double -mstack-align-double (for pgcc) -malign-jumps=0 -malign-loops=0
> -malign-functions=0 -ffast-math used everywhere
> -O5 = -O6 (same code)
> 1. gcc 2.7.2.3 (-m486, of course ;) -O2 : t=7.21s
> 2. pgcc 1.1.3 -O4 : t=7.16s 
> 3. pgcc 1.1.3 -O6 : t=7.26s
> So, I repeat, -O5/-O6 kills the performance on P5, not only on P6.
> Let us look at an earlier version of pgcc (1.0.3a).
> It gave only two different codes: -O2 = -O3 = -O4, -O5 = -O6.
> 4. pgcc 1.0.3a -O(2,3,4) : t=7.05s
> 5. pgcc 1.0.3a -O(5,6) : t=7.16s
> Hmmm... High optimizations always killed FP performance. The old pgcc gave better
> FP code than the new one - and this is sad. Let us look at the latest snapshot.
> Again, -O2 = -O3 = -O4 and -O5 = -O6 (of course, this is not a general rule).
> 4. pgcc 2.93.03 -O(2,3,4) : t=7.15s
> 5. pgcc 2.93.03 -O(5,6) : t=7.26s
> Eh... It isn't better (in this case only, of course; I had other programs
> which were faster with pgcc 2.93.03 than with pgcc 1.1.1/2 or 1.0.3).
> The clear winner is the old version of pgcc. I'm going back to it.
> I have a cluster of pentiums, which spend about 25% of their time
> in the function "gausil".
> I really appreciate the work which the EGCS/PGCC teams do _for free_.
> Please don't treat my words as flames or complaining, but I think
> that an important part of the compiler is going in the wrong direction.
> Many programs benefit from good FP performance (not only scientific
> software).
Regarding FP performance, egcs has made great progress. There has been more
than a 30% improvement on average in my tests since the 2.7.2 days, and lots
of work has gone into improving this. I believe egcs is currently better than
pgcc in FP performance, so you might try it.
(The improvement in integer code since 2.7.2 is much smaller in egcs.)

For example, in my XaoS application (lots of FP code) I am getting about a
100% speedup with the current egcs snapshot versus 2.7.2, about 30% over
egcs 1.0.0 and 40% over 1.1.0 (yes, 1.1.0 was a bit worse than 1.0.0),
about 60% compared to Visual C++ (and that's very unusual), and 50% compared
to the current pgcc hacked to compile it.
A lot of work has been done on floating-point performance (alignment changes
and such).

The problem with FP tests is that the results are more or less random,
because the reg-stack pass depends heavily on the position of the code, so
better scheduling can result in worse code in many cases simply because
reg-stack doesn't like the new order of instructions. So it is very hard to
say anything about egcs's progress in FP code based on a single loop.
The Visual C++ and Intel solution is to prefer memory over registers
(the Pentium has good support for this, because memory operands of FP
instructions are cheap). That makes the stack pressure (and the emitting of
extra fxch instructions) much smaller. Egcs doesn't do that so far, but I am
currently working on some reg-stack improvements that can remove a lot of
this randomness. The memory solution is not ideal anyway, because, as you
can see in the XaoS case, when you are lucky and the resulting code is
reg-stack friendly, you can do much better work in registers.

I am thinking about adding support in the constraints to prefer spilling
(for example, by placing "m" as the first letter of the constraint, so a
constraint like "mrf" can tell global.c that it ought to place the value on
the stack if at all possible) and about writing integer/memory versions of
some patterns (gcc is already able to move FP data through integer
registers; this can be further extended to comparisons and simple operations
like fabs/fadd).

This will require lots of changes to global.c to avoid integer<->FP register
migration, which can be greatly increased by such changes, and I am also not
sure whether the egcs team will like my changes to constraint handling. But
my first tests indicate quite good results.
(BTW, pgcc has something similar done in an unclean way with
-fopt-reg-stack, I believe.)

In a simple numerical loop I also got a great improvement (20% in the mset
loop) by fixing the timing parameters of the Pentium FP mul and FP div
instructions. This patch is not in egcs yet, but Hans applied it to pgcc
yesterday, so you might try it as well.

I've also made the -mno-ieee-fp option work in the latest egcs/pgcc
snapshots, which may help your code as well.

BTW, can you please send me your loop? I've probably already removed it from
my inbox, and I would like to see what's wrong there.

In the meantime you might try an egcs snapshot with -O3 -funroll-all-loops
(this helps reg-stack greatly), -ffast-math, -mno-ieee-fp, and possibly
-fschedule-insns.
Also, in case you will be downloading egcs, you can apply the scheduling
patch I sent in my long letter to this list recently.
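
For the quoted test, the invocation would be something like this (assuming
the benchmark source is gausil.c, as in the quoted message):

```shell
# Suggested flags for the FP benchmark; gausil.c is the file name from the
# quoted test, substitute your own source file.
gcc -O3 -funroll-all-loops -ffast-math -mno-ieee-fp -fschedule-insns \
    -malign-double -c gausil.c
```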

So, that's just FYI, to see what is happening...

Honza
> 
> >also, you could try a snapshot (i.e. from cvs). 1.1.x was made more for
> >stableness than for performance (Yes, I know 1.1.3 is not the most stable
> >release we had).
> I tried the cvs server, but the transmission breaks very often, so I still
> don't have the cvs version. And pgcc 1.1.3 is the first acceptable 1.1.x
> release for me, because fast-math didn't work correctly with earlier
> versions (and the latest snapshot).
> Krzysztof

Honza
-- 
                       OK. Lets make a signature file.
+-------------------------------------------------------------------------+
|        Jan Hubicka (Jan Hubi\v{c}ka in TeX) hubicka AT freesoft DOT cz         |
|         Czech free software foundation: http://www.freesoft.cz          |
|AA project - the new way for computer graphics - http://www.ta.jcu.cz/aa |
|  homepage: http://www.paru.cas.cz/~hubicka/, games koules, Xonix, fast  |
|  fractal zoomer XaoS, index of Czech GNU/Linux/UN*X documentation etc.  | 
+-------------------------------------------------------------------------+


