Mail Archives: pgcc/1998/07/06/15:40:25

X-pop3-spooler: POP3MAIL 2.1.0 b 4 980420 -bs-
Date: Mon, 6 Jul 1998 18:35:32 +0300 (EET DST)
From: Tuukka Toivonen <tuukkat AT ees2 DOT oulu DOT fi>
X-Sender: tuukkat AT stekt3
Reply-To: Tuukka Toivonen <tuukkat AT ees2 DOT oulu DOT fi>
To: Andrea Arcangeli <arcangeli AT mbox DOT queen DOT it>
cc: Linux Programming <linux-c-programming AT tower DOT itis DOT com>,
linuxprog AT geeky1 DOT ebtech DOT net, beastium-list <beastium-list AT Desk DOT nl>
Subject: passing args in regs speed (was:something else)
In-Reply-To: <Pine.LNX.3.96.980705003927.170A-100000@dragon.bogus>
Message-ID: <Pine.SOL.3.96.980706180936.11646B-100000@stekt3>
MIME-Version: 1.0
Sender: Marc Lehmann <pcg AT goof DOT com>
Status: RO
Lines: 122

On Sun, 5 Jul 1998, Andrea Arcangeli wrote:

>I suggest you to learn and use the gcc inline asm. The way gcc implements
>inline gcc is so far the best. It allow gcc to optimize out everything as
>best.

Yes, except that I happen to hate AT&T syntax ;)

>true since for example the eax register has not to be preserved at all. It
>would be nice to pass the last parameter of the function call in the eax
>register and the other parameters across the stack as usual. I think it
>would help a lot in performance. I' ll try to discover the improvement. 

>Fast latency: 1007, normal latency 1307
 [ not using EAX ]   [ using EAX for arg pass ]

Interesting. I made some experiments too.


Test program: bzip2 0.1pl2

I added function prototypes for all functions in the program
(and removed the existing ones). For each build I told the
compiler to pass a different number of parameters in
registers, then measured how long that build took to
compress the uncompressed LyX 0.12.0 source tar file
(7997440 bytes) to /dev/null.
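
As a minimal sketch of what such a prototype could look like
(REGPARM and sum3 are hypothetical names for illustration,
not taken from the bzip2 patch), assuming gcc on 32-bit x86,
where regparm passes the first integer arguments in EAX, EDX
and ECX:

```c
/* Sketch only: REGPARM and sum3 are hypothetical names,
   not part of the bzip2 patch. */
#if defined(__GNUC__) && defined(__i386__)
#define REGPARM(n) __attribute__((regparm(n)))
#else
#define REGPARM(n) /* regparm only has meaning on 32-bit x86 */
#endif

/* The prototype carries the attribute, so every caller that
   sees it passes up to three integer arguments in EAX, EDX
   and ECX instead of on the stack. */
REGPARM(3) int sum3(int a, int b, int c);

/* The definition must use the same attribute as the prototype. */
REGPARM(3) int sum3(int a, int b, int c)
{
    return a + b + c;
}
```

Callers compiled without the prototype would still push the
arguments on the stack, which is why every function needs a
matching prototype in scope.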

My test system: Pentium 120 MHz, 24 MB main memory, 32 MB
swap, Linux 2.0.34, gcc version 2.7.2. No other active
programs were eating CPU time in the background, but the
hard disk was accessed a few times, showing that not
everything fit in the disk cache.

The tests show no significant speedup until I use all
3 registers, in which case it's about 6% faster.

Question: why doesn't gcc allow more than 3 registers
to be used? x86 would have 7 or at least 6 free registers.

Each case first shows the compiler flags used, followed by
the results of 4 test runs. The times (the "clock count"
lines) are in real-time seconds, measured with my own program
that uses the RDTSC instruction. The last number is the size
of the stripped ELF executable (so case 4 gives the smallest
executable).
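
A minimal sketch of RDTSC-based timing, not the actual
measurement program: read_ticks and ticks_to_seconds are
hypothetical names, and the fallback branch for non-x86
CPUs is an added assumption so the sketch compiles elsewhere.

```c
#include <stdint.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* Read the time-stamp counter; RDTSC returns it in EDX:EAX. */
static inline uint64_t read_ticks(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Assumed fallback: nanoseconds from a monotonic clock. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Convert a tick difference to seconds, given the number of
   ticks per second (120e6 on a 120 MHz Pentium). */
static double ticks_to_seconds(uint64_t start, uint64_t end,
                               double ticks_per_second)
{
    return (double)(end - start) / ticks_per_second;
}
```

Reading the counter before and after the compression run and
dividing the difference by the clock rate gives elapsed
real time, including any time spent waiting for the disk.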

The patch for bzip2 and some more information are in the file
http://www.ee.oulu.fi/~tuukkat/regpass-test.tar.gz

Considerations:
- All libc calls used the conventional stack parameter
  passing convention. Changing this would break compatibility.
- Why doesn't the kernel use register parameters? It would be
  ideal since it wouldn't break compatibility!

Can we consider the case closed by this test? I don't think
so. In particular, the fact that case 5 performs so much
better than any other case makes me suspect that a lot more
testing (of different real-life programs) is needed.

Surprise, surprise: case 2 is faster than case 1!

CASE 1: no register parameter passing. Compiler-selected
        inline functions.
-O3 -fomit-frame-pointer -funroll-loops -g
clock count: 100.54
clock count: 100.46
clock count: 100.77
clock count: 100.64
total clock count: 402.41 / 4
65544

CASE 2: no register parameter passing. No inline functions.
-O2 -fomit-frame-pointer -funroll-loops -g
clock count: 99.609
clock count: 99.731
clock count: 99.508
clock count: 99.617
total clock count: 398.46 / 4
54200

CASE 3: 1 register argument. No inline functions.
__attribute__ (( regparm(1) ))
-O2 -fomit-frame-pointer -funroll-loops -g
clock count: 100.14
clock count: 99.742
clock count: 100.12
clock count: 99.701
total clock count: 399.7 / 4
54040

CASE 4: 2 register arguments. No inline functions.
__attribute__ (( regparm(2) ))
-O2 -fomit-frame-pointer -funroll-loops -g
clock count: 99.725
clock count: 99.698
clock count: 99.44
clock count: 99.209
total clock count: 398.07 / 4
53896

CASE 5: 3 register arguments. No inline functions.
__attribute__ (( regparm(3) ))
-O2 -fomit-frame-pointer -funroll-loops -g
clock count: 94.509
clock count: 94.295
clock count: 94.171
clock count: 94.328
total clock count: 377.3 / 4
53912
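
For comparing the cases, the per-case averages and relative
speedups can be worked out from the run times above. A small
sketch (mean and percent_faster are hypothetical helper
names; the arrays just copy the numbers from the listing):

```c
#include <stddef.h>

/* Arithmetic mean of n timing samples. */
double mean(const double *t, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += t[i];
    return sum / (double)n;
}

/* Percentage by which `test` is faster than `base`
   (positive means `test` took less time). */
double percent_faster(double base, double test)
{
    return 100.0 * (base - test) / base;
}

/* Run times in seconds, copied from cases 1, 2 and 5. */
static const double case1[4] = {100.54, 100.46, 100.77, 100.64};
static const double case2[4] = {99.609, 99.731, 99.508, 99.617};
static const double case5[4] = {94.509, 94.295, 94.171, 94.328};
```

With these numbers the averages are about 100.60 s (case 1),
99.62 s (case 2) and 94.33 s (case 5), so case 5 is roughly
5.3% faster than case 2 and 6.2% faster than case 1.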

(I'm CCing this to the pgcc list since I think those people
could be interested; maybe they could implement automatic
register passing for static functions?)

--
| Tuukka Toivonen <tuukkat AT ee DOT oulu DOT fi>       [PGP public key
| Homepage: http://www.ee.oulu.fi/~tuukkat/       available]
| Try also finger -l tuukkat AT ee DOT oulu DOT fi
| Studying information engineering at the University of Oulu
+-----------------------------------------------------------

