Mail Archives: djgpp/1996/12/18/02:28:46
Murray Stokely wrote:
>
> This is a little code snippet from a vesa lens effect I coded,
> The diameter and magnification of the lens are adjustable in realtime
> via the +/- keys so I need the lens calculating function as fast as
> possible. Someone gave me a good tip with my sqrt problem earlier in
> using the difference of squares to take out one of the multiplies but
It is interesting to note that while this optimization speeds up the
calculation of (x**2 - y**2), it slows down subsequent calculations
which could have used precalculated values of x**2 or y**2. For
example, if you need to calculate (x**2 + y**2) and (x**2 - y**2),
using the diff of squares approach requires 1 mult and 2 adds, and
then 2 more mults & an add, for a total of 3 mults and 3 adds:
dif = (x + y) * (x - y);
sum = (x*x + y*y);
If you calulate x**2 and y**2 to start with, it only takes 2 mults
and 2 adds (I'm assuming that an add and a subtract are the same:
xx = x*x;
yy = y*y;
dif = xx - yy;
sum = xx + yy;
When this is in a loop, you save even more. Using the differences of
squares approach:
for (y = 0; y < 10; y++) {
/* this block executes 10 times */
for (x=0; y<10; x++) {
/* this block executes 100 times */
dif[i][j] = (x + y) * (x - y);
sum[i][j] = x*x + y*y;
}
}
/* total is 300 adds + 300 mults */
Using the save-your-intermediate-values approach:
for (y = 0; y < 10; y++) {
/* this block executes 10 times */
yy = y*y;
for (x=0; y<10; x++) {
/* this block executes 100 times */
xx = x*x;
dif[i][j] = xx - yy;
sum[i][j] = xx + yy;
}
}
/* total is 200 adds + 110 mults */
I didn't count any adds or mults in incrementing the loop indeces
or in array offset calculations since they are the same in both
cases.
> I'd like to take it a step further. This routine is actualy faster
> than I expected it would be on my 486dx4, but still every clock cycle
> counts ;) I know theres lots of room for improvement, so any tips
> would be appreciated.
>
> ( I'll eventualy convert all doubles/floats to 16.16 fixed point, so
> skip that obvious MAJOR optimization )
In general, when you have nested loops you should strive to optimize
the innermost loop, since these instructions are executed more often
than those in the outer loops.
> void calculate_tfm(int diameter, char mag)
> {
> int a,b,x,y,z,s;
> int radius;
> z=0; ^^^
This initialization looks unnecessary to me, unless I missed a use
of z before it is assigned. However, it really doesn't cost you
much (if anything).
> radius=diameter/2;
> s=round(sqrt(abs((radius-mag) * (radius+mag)))); ^^^^^^^^^^^^^^
> for(y=-radius;y<diameter-radius;y++) {
> for(x=-radius;x<diameter-radius;x++) {
> if (((x*x) + (y*y)) >= (s*s)) { ^^^^^
This is the only place you use the variable s. You would save
quite a bit of work by calculating s**2 to begin with, rather
than calculating s by round()ing, sqrt()ing & abs()ing an
expression, and then squaring s in your inner loop.
> a=x;
> b=y;
> }
> else {
> z=round(sqrt((radius-x)*(radius+x)-(y*y)));
You should be able to re-use the calculation for z by looping from
the center to the outside and calculating for both x & -x in the
same loop iteration, since z(x) == z(-x).
> a=round(x*mag/z);
> b=round(y*mag/z);
> }
>
> tfm[(y+radius)*diameter+(x+radius)] =
> (b+radius)*diameter+(a+radius);
The calculations for both the index into tfm[] and the value you
are stuffing into it should be reuseable as well -- only some sign
changes needed.
> } // end of for x
> } // end of for y
> }
So what I would do for starters would be something like this (bear
in mind that I don't know whether the bounds of diameter are really
the same as the bounds of int (signed 32 bit), so I am assuming
worst case):
void calculate_tfm(int diameter, char mag)
{
int a,b,x,y;
int moz; /* mag / z */
long long ss, yy, xx, rr,
long long qq; /* (xx + yy) */
int radius;
radius=diameter/2;
rr = radius * radius;
ss = rr - (mag * mag);
for(y=0; y<radius; y++) {
yy = y * y;
for (x = 0; x < radius; x++) {
/* NOTE: THESE BOUNDS ARE *NOT* CORRECT, but they are close */
/* Sure, it's a cop out, but I'm working for free */
/* I'll leave it up to you to make sure they loop the */
/* right # of times, and handle the boundary */
/* conditions correctly. */
xx = x * x;
qq = xx + yy;
if (qq >= ss) {
a = x;
b = y;
}
else {
moz = mag / round(sqrt((rr - qq));
a =round(x*moz);
b =round(y*moz);
}
tfm[(y+radius)*diameter+(x+radius)] =
(b+radius)*diameter+(a+radius);
tfm[(-y+radius)*diameter+(x+radius)] =
(-b+radius)*diameter+(a+radius);
tfm[(y+radius)*diameter+(-x+radius)] =
(b+radius)*diameter+(-a+radius);
tfm[(-y+radius)*diameter+(-x+radius)] =
(-b+radius)*diameter+(-a+radius);
/* some clever ptr arithmetic *might* speed this up more */
} // end of for x
} // end of for y
}
<DISCLAIMER>
Please do not expect this rewrite to work correctly. I know for
certain that the bounds of the loop aren't quite right, and I may
have introduced other errors besides. I provided it only as an
example of how you might speed up your code by re-using common
calculations which were separated and discarded in your program,
and by moving calculations out of the inner-most loop.
Furthermore, some of what I did might get done automatically by
the common subexpression elimination compiler optimization.
</DISCLAIMER>
Hope this helps
--Phil Galbiati
=================================================
Any opinions expressed here reflect the
ignorance of the author, NOT Tektronix.
=================================================
- Raw text -