Mail Archives: djgpp/1996/11/28/18:02:29

Message-ID: <329E2E2B.3D02@gbrmpa.gov.au>
Date: Fri, 29 Nov 1996 08:44:00 +0800
From: Leath Muller <leathm AT gbrmpa DOT gov DOT au>
Reply-To: leathm AT gbrmpa DOT gov DOT au
Organization: Great Barrier Reef Marine Park Authority
MIME-Version: 1.0
To: Elliott Oti <e DOT oti AT stud DOT warande DOT ruu DOT nl>
CC: djgpp AT delorie DOT com
Subject: Re: Optimization
References: <57hg9b$or5 AT kannews DOT ca DOT newbridge DOT com> <329C4CD4 DOT 7474 AT cornell DOT edu> <Pine DOT SUN DOT 3 DOT 90 DOT 961127095705 DOT 25056B-100000 AT coop10> <329C62F6 DOT 23F6 AT stud DOT warande DOT ruu DOT nl>

> > What makes you say that? I can't see how this would make it faster...
> > more cache misses, and an extra shift to index non-byte sized quantities.
> > Not to mention the fact that there are more byte sized registers.
 
> I believe in 32-bit protected mode most dword register ops are faster
> than the equivalent 16-bit ones on a 486 and above. Certainly on a P6
> 16-bit instructions are disproportionately slow.
> In any case I haven't seen djgpp generate any optimizations which utilise
> the byte registers; AFAIK it uses them only in straightforward byte ops.

On the Pentium, the following rules decide which type of instructions
to use:
i) If you are running your code in 32-bit protected mode, use 32-bit and
8-bit data and registers, and avoid 16-bit ones.
ii) If you're running in 16-bit protected/real mode, avoid 32-bit
registers.
It's all in the Pentium programmer's manual. Go to
	http://www.x86.com/
and have a look around there...
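
As a rough illustration of rule (i) in C: under a 32-bit compiler such
as DJGPP, int is 32 bits, so an int loop counter and accumulator stay on
the native operand size, while the data itself can remain 8-bit. A short
counter would instead need 16-bit operations, which carry an
operand-size prefix in 32-bit code. The function name sum_bytes is made
up for this sketch, and what the compiler actually emits will depend on
its version and flags.

```c
#include <assert.h>

/* Hypothetical helper (not from the original post): sums n bytes.
 * The index i is an int, which is 32 bits under DJGPP, and the data
 * is 8-bit, so no 16-bit quantities are involved. */
long sum_bytes(const unsigned char *data, int n)
{
    int i;          /* 32-bit counter rather than a 16-bit short */
    long total = 0; /* 32-bit accumulator under DJGPP */

    assert(n >= 0);
    for (i = 0; i < n; i++)
        total += data[i];   /* byte load, widened to 32 bits */
    return total;
}
```

Compiling with gcc -O2 -S and reading the assembly is the surest way to
see which register sizes end up in the inner loop.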
 
> > > did you actually profile your code to see where the bottlenecks are?

> > Yes. I know exactly where I need to improve.
 
> I have no idea how good your C coding skills are, so don't be offended,
> but careful C code can speed up a sloppy implementation by ~ 100%:
> on the other hand, there are limits.
> Check your algorithm to see what basic operations are being used
> (specifically multiplies, divides, sqrts etc) and check how many
> operations are duplicated in such a way that they can be removed with
> a little recoding -
> e.g    a1 = b1/(x*y);            c = x*y;
>        a2 = b2/(x*y);   ===>     a1 = b1/c; a2 = b2/c etc.
>        a3 = b3/(x*y);
> Simplistic, but you get the point.

Actually, this is even faster if you:
	c = 1 / (x * y);
	a1 = b1 * c;
	a2 = b2 * c;
	a3 = b3 * c;
A normal double divide takes 39 cycles; a mul takes 3 cycles.
Using your method, you have 3 divides (117 cycles) and one mul, for 120
cycles.
Using the second method, you have one divide (39 cycles) and four muls
(12 cycles, counting the x * y), or 51 cycles... :)
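
The two variants can be sketched side by side in C. This is only a
sketch: the function names scale_by_divide and scale_by_reciprocal and
the array-based interface are made up for the example. Note that
b * (1 / c) can differ from b / c in the last bit because of the extra
rounding step, which is why a compiler following strict IEEE rules will
generally not make this change for you.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: out[i] = b[i] / (x * y), with the product
 * hoisted out of the loop. Costs one mul plus n divides. */
void scale_by_divide(const double *b, double *out, size_t n,
                     double x, double y)
{
    double c = x * y;
    size_t i;

    assert(c != 0.0);
    for (i = 0; i < n; i++)
        out[i] = b[i] / c;      /* one divide per element */
}

/* Hypothetical helper: same result up to rounding, but with a single
 * divide to form the reciprocal and then n multiplies. */
void scale_by_reciprocal(const double *b, double *out, size_t n,
                         double x, double y)
{
    double c;
    size_t i;

    assert(x * y != 0.0);
    c = 1.0 / (x * y);          /* the only divide */
    for (i = 0; i < n; i++)
        out[i] = b[i] * c;      /* one mul per element */
}
```

With n = 3 this is the a1, a2, a3 case above; the saving grows with
every extra element that shares the same divisor.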

Leathal.
