Date: Sun, 06 Nov 1994 14:25:28 +1100
From: Bill Metzenthen
Subject: Re: 2.4/3-2.4/3!=0
To: djgpp AT sun DOT soe DOT clarkson DOT edu

Peter Csizmadia <cspt AT ludens DOT elte DOT hu> writes (Thu, 03 Nov 1994)
about the following program:

> #include <stdio.h>
>
> main()
> {
>     register double a1=2.4, b1=3.0;
>     volatile double a2=2.4, b2=3.0;
>     printf("2.4/3-2.4/3 = %le\n", a1/b1-a2/b2);
> }

I have not tried this on djgpp, but this code should not be expected to
print a zero result.  Any compiler which does simple optimizations can
replace a1/b1 by its result at compile time; the same cannot be done
with a2/b2 because of the 'volatile' declaration.

When asked to apply optimizations, gcc will replace a1/b1 by a 'double'
(53-bit precision) number.  a2/b2 will be computed at run time at the
precision set by the FPU control bits, which as I recall are set for
64-bit precision by go32.  This intermediate result will not be
converted to 53-bit precision, so the result is:

    (53-bit precision)(2.4/3) - (64-bit precision)(2.4/3)

and it is not surprising that this is not zero!

If you are _really_ concerned about this particular problem, you could
force conversion to 53-bit precision by re-writing your code to store
all intermediate results in 'volatile' 53-bit precision locations (a
sketch is given at the end of this message).  Alternatively, you could
change the FPU control bits to 53-bit precision; this is not
recommended unless you know _exactly_ what you are doing (a second
sketch below shows how).  Note also that the "problem" in this program
will go away if you use 'long double' variables, since both quotients
are then carried at the FPU's 64-bit precision.

The program is just one example of the folly of relying upon exact
results with floating-point code.  A good rule with floating-point code
is to _never_ rely upon exact results, explicitly or implicitly (a
tolerance-based comparison is sketched at the end of this message).

Peter's other problem was with the following program:

> #include <stdio.h>
> #include <math.h>
>
> main()
> {
>     volatile double a = 1.54734129e8;
>     double b = sqrt(1+a*a);
>     double x = a/b;
>     printf("1-x=%le\n", 1.0-x);
>     printf("1-x=%le\n", 1.0-x);
>     printf("x%s1\n", (x==1.0)? "==":"!=");
>     printf("1-x=%le\n", 1.0-x);
> }

The problems Peter had with this program are due to the fact that gcc
uses built-in code for some standard functions by default, but g++ does
not appear to do so.  In the gcc case, the expression '1+a*a' is
computed to 64-bit precision and passed to the fsqrt instruction as a
64-bit precision quantity.  In the g++ case, the 64-bit result of
'1+a*a' is converted to a 53-bit precision quantity (a 'double') and
passed to the sqrt() function.  Note that for this particular value of
'a', the 53-bit argument passed to sqrt() is the same as would have
been passed if just 'a*a' had been computed: a*a is about 2.4e16, where
adjacent 53-bit numbers are 4 apart, so the added 1 is lost in the
rounding.  It should therefore come as no surprise that the two cases
produce different results.

Peter was disturbed that a negative result was produced in the g++
case.  As pointed out above, in the g++ case the program is effectively
using 'a*a', not 'a*a+1'.  The program is essentially comparing 'a'
with 'sqrt(a*a)', and given the various conversions it should not be
surprising that the difference is sometimes negative and at other times
positive.

This particular "problem" can be worked around by passing the
"-fno-builtin" option to gcc, so that gcc calls the library sqrt() just
as g++ does.  The 'C' program should then produce the same negative
result as the 'C++' program.  (A third sketch below shows another way
to make the two compilers agree.)

This program does not show any bug in gcc or g++, but once again
demonstrates the wisdom of the rule that programs should (almost) never
be written to rely upon exact (or "near exact") results from
floating-point computations.
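Here is a minimal (untested) sketch of the volatile-store workaround
for the first program; the staging variables t1 and t2 are my own:

    #include <stdio.h>

    int main(void)
    {
        register double a1 = 2.4, b1 = 3.0;
        volatile double a2 = 2.4, b2 = 3.0;
        volatile double t1, t2;   /* 53-bit staging locations */

        t1 = a1 / b1;   /* compile-time result, already 53-bit   */
        t2 = a2 / b2;   /* 64-bit result rounded to 53 bits here */
        printf("2.4/3-2.4/3 = %e\n", t1 - t2);
        return 0;
    }

Storing through the volatile doubles forces both quotients down to 53
bits before the subtraction, so the difference should now print as
zero (though double rounding of the 64-bit intermediate can, in rare
cases, still leave a one-ulp discrepancy).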
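And a sketch of the control-word alternative, using gcc inline
assembly on the 80386/387.  The masks for the precision-control field
(0x0300, with 0x0200 meaning 53-bit) are from the 80387 reference; the
function name set_fpu_53bit() is my own.  Again, do this only if you
know exactly what you are doing -- go32 and library code may assume
64-bit precision, and you should save and restore the old control word:

    /* Set the 80387 precision-control field to 53 bits (sketch). */
    static void set_fpu_53bit(void)
    {
        unsigned short cw;

        __asm__ __volatile__ ("fnstcw %0" : "=m" (cw));
        cw = (cw & ~0x0300) | 0x0200;   /* PC = 10b: 53-bit */
        __asm__ __volatile__ ("fldcw %0" : : "m" (cw));
    }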
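For the second program, another way to get gcc and g++ to agree (an
untested sketch; the variable s is my own) is to round the argument of
sqrt() to 53 bits yourself, so that the built-in fsqrt and the library
sqrt() see exactly the same quantity:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        volatile double a = 1.54734129e8;
        volatile double s = 1.0 + a*a;  /* rounded to 53 bits on store */
        double b = sqrt(s);
        volatile double x = a/b;        /* keep x at 53 bits too */

        printf("1-x=%e\n", 1.0 - x);
        printf("x%s1\n", (x == 1.0) ? "==" : "!=");
        return 0;
    }

Both compilers should then produce the same (possibly negative) value
for 1-x; whether x compares equal to 1.0 remains, as before, nothing
to rely upon.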
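Finally, the usual way to follow the "no exact results" rule is to
compare with a tolerance rather than with ==.  A sketch (the name
nearly_equal() and the tolerance are mine; the tolerance must be
chosen to suit the problem at hand):

    #include <math.h>

    /* Relative comparison: do p and q agree to within tol? */
    static int nearly_equal(double p, double q, double tol)
    {
        return fabs(p - q) <= tol * fabs(q);
    }

    /* e.g. replace (x == 1.0) with nearly_equal(x, 1.0, 1e-12) */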
Cheers,
Bill

ps: Note that my detailed comments above about precision apply to
80386/486 machines and that the details may vary on other
architectures.