Date: Sun, 06 Nov 1994 14:25:28 +1100
From: Bill Metzenthen
Subject: Re: 2.4/3-2.4/3!=0
To: djgpp AT sun DOT soe DOT clarkson DOT edu

Peter Csizmadia <cspt AT ludens DOT elte DOT hu> writes (Thu, 03 Nov 1994)
about the following program:

> #include <stdio.h>
>
> main()
> {
>     register double a1=2.4, b1=3.0;
>     volatile double a2=2.4, b2=3.0;
>     printf("2.4/3-2.4/3 = %le\n", a1/b1-a2/b2);
> }

I have not tried this on djgpp, but this code should not be expected to
print a zero result.  Any compiler which does simple optimizations can
replace a1/b1 by its result at compile time; the same cannot be done
with a2/b2 because of the 'volatile' declaration.

When asked to apply optimizations, gcc will replace a1/b1 by a 'double'
(53-bit precision) number.  a2/b2 will be computed at run time at the
precision set by the FPU control bits, which as I recall are set for
64-bit precision by go32.  This intermediate result will not be
converted to 53-bit precision, so the result is:

    (53-bit precision)(2.4/3) - (64-bit precision)(2.4/3)

and it is not surprising that this is not zero!

If you are _really_ concerned about this particular problem, you could
force conversion to 53-bit precision by re-writing your code to store
all intermediate results in 'volatile' 53-bit precision locations (a
sketch is given at the end of this message).  Alternatively, you could
change the FPU control bits to 53-bit precision; this is not
recommended unless you know _exactly_ what you are doing (a second
sketch below shows how).  Note also that the "problem" in this program
will go away if you use 'long double' variables, since both quotients
are then carried at the FPU's 64-bit precision.

The program is just one example of the folly of relying upon exact
results with floating-point code.  A good rule with floating-point code
is to _never_ rely upon exact results, explicitly or implicitly (a
tolerance-based comparison is sketched at the end of this message).

Peter's other problem was with the following program:

> #include <stdio.h>
> #include <math.h>
>
> main()
> {
>     volatile double a = 1.54734129e8;
>     double b = sqrt(1+a*a);
>     double x = a/b;
>     printf("1-x=%le\n", 1.0-x);
>     printf("1-x=%le\n", 1.0-x);
>     printf("x%s1\n", (x==1.0)? "==":"!=");
>     printf("1-x=%le\n", 1.0-x);
> }

The problems Peter had with this program are due to the fact that gcc
uses built-in code for some standard functions by default, but g++ does
not appear to do so.  In the gcc case, the expression '1+a*a' is
computed to 64-bit precision and passed to the fsqrt instruction as a
64-bit precision quantity.  In the g++ case, the 64-bit result of
'1+a*a' is converted to a 53-bit precision quantity (a 'double') and
passed to the sqrt() function.  Note that for this particular value of
'a', the 53-bit argument passed to sqrt() is the same as would have
been passed if just 'a*a' had been computed: a*a is about 2.4e16, where
adjacent 53-bit numbers are 4 apart, so the added 1 is lost in the
rounding.  It should therefore come as no surprise that the two cases
produce different results.

Peter was disturbed that a negative result was produced in the g++
case.  As pointed out above, in the g++ case the program is effectively
using 'a*a', not 'a*a+1'.  The program is essentially comparing 'a'
with 'sqrt(a*a)', and given the various conversions it should not be
surprising that the difference is sometimes negative and at other times
positive.

This particular "problem" can be worked around by passing the
"-fno-builtin" option to gcc, so that gcc calls the library sqrt() just
as g++ does.  The 'C' program should then produce the same negative
result as the 'C++' program.  (A third sketch below shows another way
to make the two compilers agree.)

This program does not show any bug in gcc or g++, but once again
demonstrates the wisdom of the rule that programs should (almost) never
be written to rely upon exact (or "near exact") results from
floating-point computations.
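Here is a minimal (untested) sketch of the volatile-store workaround
for the first program; the staging variables t1 and t2 are my own:

    #include <stdio.h>

    int main(void)
    {
        register double a1 = 2.4, b1 = 3.0;
        volatile double a2 = 2.4, b2 = 3.0;
        volatile double t1, t2;   /* 53-bit staging locations */

        t1 = a1 / b1;   /* compile-time result, already 53-bit   */
        t2 = a2 / b2;   /* 64-bit result rounded to 53 bits here */
        printf("2.4/3-2.4/3 = %e\n", t1 - t2);
        return 0;
    }

Storing through the volatile doubles forces both quotients down to 53
bits before the subtraction, so the difference should now print as
zero (though double rounding of the 64-bit intermediate can, in rare
cases, still leave a one-ulp discrepancy).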
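And a sketch of the control-word alternative, using gcc inline
assembly on the 80386/387.  The masks for the precision-control field
(0x0300, with 0x0200 meaning 53-bit) are from the 80387 reference; the
function name set_fpu_53bit() is my own.  Again, do this only if you
know exactly what you are doing -- go32 and library code may assume
64-bit precision, and you should save and restore the old control word:

    /* Set the 80387 precision-control field to 53 bits (sketch). */
    static void set_fpu_53bit(void)
    {
        unsigned short cw;

        __asm__ __volatile__ ("fnstcw %0" : "=m" (cw));
        cw = (cw & ~0x0300) | 0x0200;   /* PC = 10b: 53-bit */
        __asm__ __volatile__ ("fldcw %0" : : "m" (cw));
    }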
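For the second program, another way to get gcc and g++ to agree (an
untested sketch; the variable s is my own) is to round the argument of
sqrt() to 53 bits yourself, so that the built-in fsqrt and the library
sqrt() see exactly the same quantity:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        volatile double a = 1.54734129e8;
        volatile double s = 1.0 + a*a;  /* rounded to 53 bits on store */
        double b = sqrt(s);
        volatile double x = a/b;        /* keep x at 53 bits too */

        printf("1-x=%e\n", 1.0 - x);
        printf("x%s1\n", (x == 1.0) ? "==" : "!=");
        return 0;
    }

Both compilers should then produce the same (possibly negative) value
for 1-x; whether x compares equal to 1.0 remains, as before, nothing
to rely upon.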
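Finally, the usual way to follow the "no exact results" rule is to
compare with a tolerance rather than with ==.  A sketch (the name
nearly_equal() and the tolerance are mine; the tolerance must be
chosen to suit the problem at hand):

    #include <math.h>

    /* Relative comparison: do p and q agree to within tol? */
    static int nearly_equal(double p, double q, double tol)
    {
        return fabs(p - q) <= tol * fabs(q);
    }

    /* e.g. replace (x == 1.0) with nearly_equal(x, 1.0, 1e-12) */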
Cheers,
Bill

ps: Note that my detailed comments above about precision apply to
80386/486 machines and that the details may vary on other
architectures.