Mail Archives: pgcc/1999/02/25/20:02:21.2

Date: Thu, 25 Feb 1999 23:52:32 +0100
To: pgcc AT delorie DOT com
Cc: johnny AT entity DOT netcologne DOT de
Subject: loop unrolling
Message-ID: <19990225235232.C20417@cerebro.laendle>
Mail-Followup-To: pgcc AT delorie DOT com, johnny AT entity DOT netcologne DOT de
References: <199902241423 DOT JAA29290 AT envy DOT delorie DOT com>
Mime-Version: 1.0
In-Reply-To: <199902241423.JAA29290@envy.delorie.com>; from DJ Delorie on Wed, Feb 24, 1999 at 09:23:29AM -0500
X-Operating-System: Linux version 2.2.2 (marc AT cerebro) (gcc driver version pgcc-2.93.04 19990131 (gcc2 ss-980929 experimental) executing gcc version 2.7.2.3)
From: Marc Lehmann <pcg AT goof DOT com>
Reply-To: pgcc AT delorie DOT com

> From: Johnny Teveßen <j DOT tevessen AT gmx DOT de>
> 
> double foo (int i, double d) {
>   int j;
>   for (j = 20; j; --j) {
>     i *= i;
>     d *= d;
>   }
>   return d*(double)i;
> }
> 
> Now compile this using -funroll-all-loops. It will result in a loop that
> runs twice and has 10 "imull" and 10 "fmul" instructions in it. What
> confused me was the way these got mixed.

Have you made a benchmark? (I haven't). Scheduling is often unintuitive.

It might indeed be the case that gcc's scheduling constants are suboptimal
for some cases. One of the problems is that the normal list scheduler
isn't up to scheduling for superscalar architectures (pentiumpro), while
the scheduling parameters aren't tuned for the haifa scheduler.

> To make a long output short,
> I replaced every imull by '.' and every fmul by '*'. Compiled using

That's a very nice technique ;)

>     -march=i386: .*.*.*.*.*.*.*.*.*.*
>     -march=i486: .*.*.*.*.*.*.*.*.*.*
>     -march=i586: ....*.*.**.*.**.*.**
>     -march=i686: ******.*...*...*...*
>     -march=k6  : *..*.*.*.*.*.*.*.*.*
> 
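For anyone who wants to reproduce the dot/star view without doing the
replacement by hand, here is a minimal sketch in C. It assumes AT&T-syntax
output from gcc -S with one instruction per line, and it only looks for the
substrings "imull" and "fmul", so a comment or label containing them would
be counted as well. The names mul_pattern and line_contains are made up for
this example and are not part of any existing tool.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the (not NUL-terminated) line of length len contains word. */
static int line_contains(const char *line, size_t len, const char *word)
{
    size_t wlen = strlen(word);
    size_t i;

    for (i = 0; i + wlen <= len; i++)
        if (memcmp(line + i, word, wlen) == 0)
            return 1;
    return 0;
}

/*
 * Walk asm_text line by line and write one mark per multiply into out:
 * '.' for a line containing "imull", '*' for a line containing "fmul"
 * (which also matches fmull and fmulp). Other lines produce nothing.
 * out is always NUL-terminated; marks that do not fit are dropped.
 * Returns the number of marks written.
 */
size_t mul_pattern(const char *asm_text, char *out, size_t out_size)
{
    const char *line = asm_text;
    size_t n = 0;

    assert(asm_text != NULL && out != NULL && out_size > 0);

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        char mark = '\0';

        if (line_contains(line, len, "imull"))
            mark = '.';
        else if (line_contains(line, len, "fmul"))
            mark = '*';

        if (mark != '\0' && n + 1 < out_size)
            out[n++] = mark;

        line += len;
        if (*line == '\n')
            line++;             /* step over the newline */
    }
    out[n] = '\0';
    return n;
}
```

Feeding it the body of the unrolled loop from the .s file should give one
string per -march setting that can be compared directly with the table
above.
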
> Especially the pentium (i586) ones look strange to me: At the beginning

Strange, yes, but it doesn't seem to run slower on Pentiums (almost every
insn depends on an earlier one, and, in addition, the integer multiply
unit is interlocked with the FP multiply unit on Pentiums).

> of the loop, the FPU is nearly totally left alone (well, I don't think
> the load-"d"-from-stack still occupies it here). And is the pentiumpro
> (i686) really capable of collecting 6 fp multiplications in its queue?

Yes ;)

> Please don't be angry if I'm totally misunderstanding something, but some
> of the scheduler effects confused me quite a bit for the last days.

No, it's good to be reminded of suboptimal code, but, esp. with
scheduling, benchmarking is better. Good scheduling parameters are much
more difficult to find since they tend to affect everything.

> Then, a little memory-juggling question:
> 
> double bar (int i, double d) {
>   return d * (double)i;
> }
> 
> Compiled using -O6, on -march={i386,i486,i686,k6} I get the (good) result:
> 
> bar:    fildl 4(%esp)
>         fmull 8(%esp)
>         ret
> 
> But -march=pentium (the default) gives this:
(the default for your version, as pgcc configures for pentiumpro when it
detects one)

> bar:    movl 4(%esp),%edx
>         pushl %edx
>         fildl (%esp)
>         addl $4,%esp
>         fmull 8(%esp)
>         ret

That's the riscification going on here. There is a pass that corrects these
problems; if you specify -frecombine you should get better code.

The problem is that pgcc seems to generate slower overall code with
-frecombine. Could you make a benchmark with -frecombine and with
-fno-recombine (and -O4 or higher, of course)? I cannot try this myself
at the moment (no pentium), but I was always a bit puzzled since the
benchmark said: "turn it off" but my eyes, looking at the resulting code
(like here) said: "turn it on"!

--  
      -----==-                                             |
      ----==-- _                                           |
      ---==---(_)__  __ ____  __       Marc Lehmann      +--
      --==---/ / _ \/ // /\ \/ /       pcg AT goof DOT com      |e|
      -=====/_/_//_/\_,_/ /_/\_\       XX11-RIPE         --+
    The choice of a GNU generation                       |
                                                         |
