From: frabb AT worldaccess DOT nl
Newsgroups: comp.os.msdos.djgpp
Subject: float, double & long double
Date: Fri, 07 Feb 97 07:13:35 GMT
Organization: World Access, Internet, E-mail and Videotex
Lines: 90
Message-ID:
NNTP-Posting-Host: hrv1-4.worldaccess.nl
To: djgpp AT delorie DOT com
DJ-Gateway: from newsgroup comp.os.msdos.djgpp

While thinking about floats and doubles I made the following program.
Please have a look at it:

---------------------------------------------------------------------
// investigate double, float, etc.
#include <stdio.h>
#include <conio.h>   // clrscr()

#define SFL sizeof(float)
#define SDB sizeof(double)
#define SLD sizeof(long double)

union{
  long double l;
  double d;
  float f;
  unsigned short s[8]; // too many shorts, to be on the safe side
}mn;
/*---------------------------------------------------------*/
void showflt(float f)
{
  printf("\n      float: %.20f, ",f);
  mn.f = f;
  // most significant short first (the x86 stores it last)
  for(int i = SFL/2 - 1; i>=0; i--) printf("%04X",mn.s[i]);
}
/*---------------------------------------------------------*/
void showdbl(double d)
{
  printf("\n     double: %.20f, ",d);
  mn.d = d;
  for(int i = SDB/2 - 1; i>=0; i--) printf("%04X",mn.s[i]);
}
/*---------------------------------------------------------*/
void showldb(long double ld)
{
  printf("\nlong double: %.20Lf, ",ld);
  mn.l = ld;
  // -2 instead of -1: skip the topmost short, which is padding
  for(int i = SLD/2 - 2; i>=0; i--) printf("%04X",mn.s[i]);
}
/*---------------------------------------------------------*/
int main(void)
{
  float f;
  double d;
  long double l;

  l = 1.2345678901234567890123456789L;
  d = l;
  f = d;
  clrscr();
  showflt(f);
  showdbl(d);
  showldb(l);

#define I 1.0
  // to prove that my method is correct. The hexa result should contain
  // ABCDE somewhere:
  showldb(I/2+I/8+I/32+I/128+I/256+I/512+I/1024+I/8192+I/16384+I/65536+
          I/131072+I/262144+I/524288);
  return 0;
}
----------------------------------------------------------------------

A long double occupies 12 bytes here, but only 80 of those 96 bits carry
data. The top 16 bits are redundant; they are there to make it fit in a
32 bit scheme. That explains the constant -2 in "showldb".

The calls to showflt, showdbl and showldb should all show more or less
the same value. Here is the result:

      float: 1.23456788063049316406, 3F9E0652
     double: 1.23456789012345669043, 3FF3C0CA428C59FB
long double: 1.23456789012345678899, 3FFF9E06521462CFDB8D
long double: 0.67111015319824218750, 3FFEABCDE00000000000

(This would look better with fewer digits specified for the fractional
part.)

Writing the hexa numbers in binary makes it clear that some shifting is
enough to convert float<-->double<-->long double:

3   F   9   E   0   6   5   2
00111111100111100000011001010010

3   F   F   3   C   0   C   A   4   2   8   C   5   9   F   B
0011111111110011110000001100101001000010100011000101100111111011

3   F   F   F   9   E   0   6   5   2   1   4   6   2   C   F   D   B   8   D
00111111111111111001111000000110010100100001010001100010110011111101110110001101

It is also clear that the mantissa bits of float and long double line up
nicely: the digits 9E0652 appear in both, so converting between the two
looks like a matter of truncating or appending zeros. (Strictly, the
exponent also has to be widened from 8 to 15 bits and rebiased, and the
top bit of that 9 is the lowest exponent bit in the float but the
explicit leading 1 in the long double.) The double, however, has an
11 bit exponent, so its mantissa starts 3 bits later than the float's
and is never a whole number of bytes away from either of the others.
This will always make bitshifting necessary when converting. That may be
why programs using double run slightly slower than programs using float
or long double.

frank abbing