From: Phil Galbiati Newsgroups: comp.os.msdos.djgpp Subject: Re: math optimization Date: Tue, 17 Dec 1996 19:03:54 -0800 Organization: Tektronix Lines: 181 Message-ID: <32B75F1A.6782@Tek.com> References: <32b26866 DOT 241305048 AT nntp DOT southeast DOT net> NNTP-Posting-Host: philipga.cse.tek.com Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit To: murray AT southeast DOT net DJ-Gateway: from newsgroup comp.os.msdos.djgpp Murray Stokely wrote: > > This is a little code snippet from a vesa lens effect I coded, > The diameter and magnification of the lens are adjustable in realtime > via the +/- keys so I need the lens calculating function as fast as > possible. Someone gave me a good tip with my sqrt problem earlier in > using the difference of squares to take out one of the multiplies but It is interesting to note that while this optimization speeds up the calculation of (x**2 - y**2), it slows down subsequent calculations which could have used precalculated values of x**2 or y**2. For example, if you need to calculate (x**2 + y**2) and (x**2 - y**2), using the diff of squares approach requires 1 mult and 2 adds, and then 2 more mults & an add, for a total of 3 mults and 3 adds: dif = (x + y) * (x - y); sum = (x*x + y*y); If you calulate x**2 and y**2 to start with, it only takes 2 mults and 2 adds (I'm assuming that an add and a subtract are the same: xx = x*x; yy = y*y; dif = xx - yy; sum = xx + yy; When this is in a loop, you save even more. Using the differences of squares approach: for (y = 0; y < 10; y++) { /* this block executes 10 times */ for (x=0; y<10; x++) { /* this block executes 100 times */ dif[i][j] = (x + y) * (x - y); sum[i][j] = x*x + y*y; } } /* total is 300 adds + 300 mults */ Using the save-your-intermediate-values approach: for (y = 0; y < 10; y++) { /* this block executes 10 times */ yy = y*y; for (x=0; y<10; x++) { /* this block executes 100 times */ xx = x*x; dif[i][j] = xx - yy; sum[i][j] = xx + yy; } } /* total is 200 adds + 110 mults */ I didn't count any adds or mults in incrementing the loop indeces or in array offset calculations since they are the same in both cases. > I'd like to take it a step further. This routine is actualy faster > than I expected it would be on my 486dx4, but still every clock cycle > counts ;) I know theres lots of room for improvement, so any tips > would be appreciated. > > ( I'll eventualy convert all doubles/floats to 16.16 fixed point, so > skip that obvious MAJOR optimization ) In general, when you have nested loops you should strive to optimize the innermost loop, since these instructions are executed more often than those in the outer loops. > void calculate_tfm(int diameter, char mag) > { > int a,b,x,y,z,s; > int radius; > z=0; ^^^ This initialization looks unnecessary to me, unless I missed a use of z before it is assigned. However, it really doesn't cost you much (if anything). > radius=diameter/2; > s=round(sqrt(abs((radius-mag) * (radius+mag)))); ^^^^^^^^^^^^^^ > for(y=-radius;y for(x=-radius;x if (((x*x) + (y*y)) >= (s*s)) { ^^^^^ This is the only place you use the variable s. You would save quite a bit of work by calculating s**2 to begin with, rather than calculating s by round()ing, sqrt()ing & abs()ing an expression, and then squaring s in your inner loop. > a=x; > b=y; > } > else { > z=round(sqrt((radius-x)*(radius+x)-(y*y))); You should be able to re-use the calculation for z by looping from the center to the outside and calculating for both x & -x in the same loop iteration, since z(x) == z(-x). > a=round(x*mag/z); > b=round(y*mag/z); > } > > tfm[(y+radius)*diameter+(x+radius)] = > (b+radius)*diameter+(a+radius); The calculations for both the index into tfm[] and the value you are stuffing into it should be reuseable as well -- only some sign changes needed. > } // end of for x > } // end of for y > } So what I would do for starters would be something like this (bear in mind that I don't know whether the bounds of diameter are really the same as the bounds of int (signed 32 bit), so I am assuming worst case): void calculate_tfm(int diameter, char mag) { int a,b,x,y; int moz; /* mag / z */ long long ss, yy, xx, rr, long long qq; /* (xx + yy) */ int radius; radius=diameter/2; rr = radius * radius; ss = rr - (mag * mag); for(y=0; y= ss) { a = x; b = y; } else { moz = mag / round(sqrt((rr - qq)); a =round(x*moz); b =round(y*moz); } tfm[(y+radius)*diameter+(x+radius)] = (b+radius)*diameter+(a+radius); tfm[(-y+radius)*diameter+(x+radius)] = (-b+radius)*diameter+(a+radius); tfm[(y+radius)*diameter+(-x+radius)] = (b+radius)*diameter+(-a+radius); tfm[(-y+radius)*diameter+(-x+radius)] = (-b+radius)*diameter+(-a+radius); /* some clever ptr arithmetic *might* speed this up more */ } // end of for x } // end of for y } Please do not expect this rewrite to work correctly. I know for certain that the bounds of the loop aren't quite right, and I may have introduced other errors besides. I provided it only as an example of how you might speed up your code by re-using common calculations which were separated and discarded in your program, and by moving calculations out of the inner-most loop. Furthermore, some of what I did might get done automatically by the common subexpression elimination compiler optimization. Hope this helps --Phil Galbiati ================================================= Any opinions expressed here reflect the ignorance of the author, NOT Tektronix. =================================================