www.delorie.com/archives/browse.cgi   search  
Mail Archives: djgpp/1996/12/18/02:28:46

From: Phil Galbiati <Philip DOT S DOT Galbiati AT Tek DOT com>
Newsgroups: comp.os.msdos.djgpp
Subject: Re: math optimization
Date: Tue, 17 Dec 1996 19:03:54 -0800
Organization: Tektronix
Lines: 181
Message-ID: <32B75F1A.6782@Tek.com>
References: <32b26866 DOT 241305048 AT nntp DOT southeast DOT net>
NNTP-Posting-Host: philipga.cse.tek.com
Mime-Version: 1.0
To: murray AT southeast DOT net
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

Murray Stokely wrote:
> 
> This is a little code snippet from a vesa lens effect I coded,
> The diameter and magnification of the lens are adjustable in realtime
> via the +/- keys so I need the lens calculating function as fast as
> possible.  Someone gave me a good tip with my sqrt problem earlier in
> using the difference of squares to take out one of the multiplies but

It is interesting to note that while this optimization speeds up the
calculation of (x**2 - y**2), it slows down subsequent calculations
which could have used precalculated values of x**2 or y**2.  For
example, if you need to calculate (x**2 + y**2) and (x**2 - y**2),
using the diff of squares approach requires 1 mult and 2 adds, and
then 2 more mults & an add, for a total of 3 mults and 3 adds:
   dif = (x + y) * (x - y);
   sum = (x*x + y*y);

If you calulate x**2 and y**2 to start with, it only takes 2 mults
and 2 adds (I'm assuming that an add and a subtract are the same:
   xx = x*x;
   yy = y*y;
   dif = xx - yy;
   sum = xx + yy;

When this is in a loop, you save even more.  Using the differences of 
squares approach:
   for (y = 0; y < 10; y++) {
      /* this block executes 10 times */
      for (x=0; y<10; x++) {
         /* this block executes 100 times */
         dif[i][j] = (x + y) * (x - y);
         sum[i][j] = x*x + y*y;
      }
   }
   /* total is 300 adds + 300 mults */

Using the save-your-intermediate-values approach:
   for (y = 0; y < 10; y++) {
      /* this block executes 10 times */
      yy = y*y;
      for (x=0; y<10; x++) {
         /* this block executes 100 times */
         xx = x*x;
         dif[i][j] = xx - yy;
         sum[i][j] = xx + yy;
      }
   }
   /* total is 200 adds + 110 mults */

I didn't count any adds or mults in incrementing the loop indeces
or in array offset calculations since they are the same in both
cases.

> I'd like to take it a step further.  This routine is actualy faster
> than I expected it would be on my 486dx4, but still every clock cycle
> counts ;)  I know theres lots of room for improvement, so any tips
> would be appreciated.
> 
> ( I'll eventualy convert all doubles/floats to 16.16 fixed point, so
> skip that obvious MAJOR optimization )

In general, when you have nested loops you should strive to optimize
the innermost loop, since these instructions are executed more often
than those in the outer loops.

> void calculate_tfm(int diameter, char mag)
> {
>    int a,b,x,y,z,s;
>    int radius;
>    z=0;     ^^^
This initialization looks unnecessary to me, unless I missed a use
of z before it is assigned.  However, it really doesn't cost you
much (if anything).

>    radius=diameter/2;
>    s=round(sqrt(abs((radius-mag) * (radius+mag))));       ^^^^^^^^^^^^^^

>    for(y=-radius;y<diameter-radius;y++) {
>       for(x=-radius;x<diameter-radius;x++) {
>          if (((x*x) + (y*y)) >= (s*s)) {                                  ^^^^^
This is the only place you use the variable s.  You would save
quite a bit of work by calculating s**2 to begin with, rather
than calculating s by round()ing, sqrt()ing & abs()ing an
expression, and then squaring s in your inner loop.

>             a=x;
>             b=y;
>          }
>          else {
>             z=round(sqrt((radius-x)*(radius+x)-(y*y)));

You should be able to re-use the calculation for z by looping from
the center to the outside and calculating for both x & -x in the
same loop iteration, since z(x) == z(-x).

>             a=round(x*mag/z);
>             b=round(y*mag/z);
>          }
> 
>          tfm[(y+radius)*diameter+(x+radius)] =
>                                (b+radius)*diameter+(a+radius);

The calculations for both the index into tfm[] and the value you
are stuffing into it should be reuseable as well -- only some sign
changes needed.

>       } // end of for x
>    } // end of for y
> }


So what I would do for starters would be something like this (bear
in mind that I don't know whether the bounds of diameter are really
the same as the bounds of int (signed 32 bit), so I am assuming
worst case):

void calculate_tfm(int diameter, char mag)
{
   int a,b,x,y;
   int moz;      /* mag / z */
   long long ss, yy, xx, rr, 
   long long qq;   /* (xx + yy) */
   int radius;

   radius=diameter/2;
   rr = radius * radius;
   ss = rr - (mag * mag);

   for(y=0; y<radius; y++) {
      yy = y * y;
      for (x = 0; x < radius; x++) {
         /* NOTE: THESE BOUNDS ARE *NOT* CORRECT, but they are close */
         /*       Sure, it's a cop out, but I'm working for free     */
         /*       I'll leave it up to you to make sure they loop the */
         /*       right # of times, and handle the boundary          */
         /*       conditions correctly.                              */
         xx = x * x;
         qq = xx + yy;
         if (qq >= ss) {
            a = x;
            b = y;
         }
         else {
            moz = mag / round(sqrt((rr - qq));
            a =round(x*moz);
            b =round(y*moz);
         }
 
         tfm[(y+radius)*diameter+(x+radius)] =
                                (b+radius)*diameter+(a+radius);
         tfm[(-y+radius)*diameter+(x+radius)] =
                                (-b+radius)*diameter+(a+radius);
         tfm[(y+radius)*diameter+(-x+radius)] =
                                (b+radius)*diameter+(-a+radius);
         tfm[(-y+radius)*diameter+(-x+radius)] =
                                (-b+radius)*diameter+(-a+radius);

         /* some clever ptr arithmetic *might* speed this up more */
      } // end of for x
   } // end of for y
}


<DISCLAIMER>
Please do not expect this rewrite to work correctly.  I know for
certain that the bounds of the loop aren't quite right, and I may
have introduced other errors besides.  I provided it only as an
example of how you might speed up your code by re-using common
calculations which were separated and discarded in your program,
and by moving calculations out of the inner-most loop.

Furthermore, some of what I did might get done automatically by
the common subexpression elimination compiler optimization.
</DISCLAIMER>

Hope this helps
--Phil Galbiati
=================================================
     Any opinions expressed here reflect the
     ignorance of the author, NOT Tektronix.
=================================================

- Raw text -


  webmaster     delorie software   privacy  
  Copyright © 2019   by DJ Delorie     Updated Jul 2019