From: leathm AT solwarra DOT gbrmpa DOT gov DOT au (Leath Muller)
Message-Id: <199701202327.JAA04787@solwarra.gbrmpa.gov.au>
Subject: Re: floating point is... fast???
To: gpt20 AT thor DOT cam DOT ac DOT uk (G.P. Tootell)
Date: Tue, 21 Jan 1997 09:27:25 +1000 (EST)
Cc: djgpp AT delorie DOT com
In-Reply-To: <5bvjeb$mji@lyra.csx.cam.ac.uk> from "G.P. Tootell" at Jan 20, 97 11:03:39 am
Content-Type: text

> well i dug the big book of cycles out today. this is what it says..
>
>            fdiv    fmul    idiv    imul    div     mul
>  486(7)    8-89    11-27   43/44   42      40      42
>  pentium   39-42   1-7     22-46   10/11   17-41   11

Your book disagrees with the information provided by Intel in their
programmer's reference manual. Go to http://www.x86.org, get Acrobat
Reader, and then grab the PDF file from Intel (there's a link to it):
241430_4.pdf. It has everything you need. I recall 3 cycles for an fmul
and 10 cycles for an fimul...

> can anyone confirm those values? just on the offchance there's a mistake in my
> book. now it strikes me that rather than do the expensive operation

I would say it's wrong. It gives the impression you can do a simple fmul
in one clock, when this isn't true. If you have:

  flds  _x0;
  fmuls _x1;
  fstps _result;

then this will take 1 + 3 + 3 = 7 cycles. However, if you have something
like this (the comments show roughly which cycles each instruction
occupies):

  flds  _x0;       // 1
  fmuls _x1;       // 2 - 4
  flds  _y0;       // 3
  fmuls _y1;       // 4 - 6
  flds  _z0;       // 5
  fmuls _z1;       // 6 - 8
  fxch  %st(2);    // free
  faddp %st(1);    // 7 - 9
  faddp %st(1);    // 10 - 12
  fstps _result;   // 12 - 15

As you are overlapping fmuls in this dot product routine, each fmul
effectively costs one cycle... (the fstp normally takes 2 cycles, but has
a 1 cycle latency when it uses the result of the previous operation). I
haven't seen an fmul take 7 cycles, but that may be what it costs on
80-bit operands.

Note: I use flds, fmuls, etc. (the single-precision forms) rather than the
double-precision ones because I store everything as floats - the
conversion between a 32-bit float in memory and the 80-bit internal
precision is free, so it makes no difference to speed. You just have less
accuracy.

> float a,b,c,d,x,y;
> c=x/b;
> d=y/b;
> a=1/b;
> c=x*a;
> d=y*a;

Definitely - the savings are huge...

> which would save a whole load of cycles, particularly on a pentium.
> in fact, if i were doing the operations with signed longs instead...
> signed long a,b,c,d,x,y;
> i would be better writing - (and changing a to a float)
> a=1.0/b; (because fdiv is still faster than idiv in most cases)
> c=(float)x*a;
> d=(float)y*a;

You are probably better off using full floating point math until the very
end, where you can store the result as a float. Loading/storing ints is
very expensive and should be avoided where possible. I think an int->float
or float->int conversion takes about 14 cycles each time...

> ie. to change the integers into floating wherever possible to make use of the
> fmul timings, which outstrip every other timing even in worst case!

Yep... :)

> so there must be a catch somewhere of course ;)

No, FPU code is fast on Intel chips now (Pentium onwards). It took a
while, but they finally caught up to Motorola.

> perhaps the changing from float->int and vice versa takes a lot of time?
> anyone?

Yep... :) Use floating point - if you're programming for the Pentium
onwards, anyway. Remember, if someone says otherwise, just say one word:
Quake.

Leathal.
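
PS: if anyone wants to try the divide-vs-reciprocal idea from above in C,
here's a rough, untested sketch - the names (scale_div, scale_mul, N) are
just made up for illustration, and the actual cycle savings will depend on
your CPU:

  #include <stdio.h>

  #define N 1000

  /* One fdiv per element: tens of cycles each. */
  static void scale_div(float *dst, const float *src, float b)
  {
      int i;
      for (i = 0; i < N; i++)
          dst[i] = src[i] / b;
  }

  /* One fdiv up front, then one fmul per element - the fmuls can
     overlap in the FPU pipeline, so this is much cheaper. */
  static void scale_mul(float *dst, const float *src, float b)
  {
      float a = 1.0f / b;    /* pay for the divide once */
      int i;
      for (i = 0; i < N; i++)
          dst[i] = src[i] * a;
  }

  int main(void)
  {
      static float src[N], d1[N], d2[N];
      int i;

      for (i = 0; i < N; i++)
          src[i] = (float)i;

      scale_div(d1, src, 3.0f);
      scale_mul(d2, src, 3.0f);

      /* The two results can differ in the last bit or so, because
         x * (1/b) is not exactly x / b in floating point. */
      printf("%f %f\n", d1[N - 1], d2[N - 1]);
      return 0;
  }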