Mth 351
Numerical Analysis
The 8087 and 80287 floating point coprocessors differ in minor ways from later versions. The major difference is probably the handling of infinity and NaN's. The 8087 and 80287 default to using projective infinity, that is, a single unsigned infinity. The later FPUs default to using affine infinity, that is, plus and minus infinity are different.
The exponent is stored as an integer, regarded as unsigned. This is achieved by adding an offset, a bias, to the actual (or logical) exponent.
The biased exponent with all bits 0 is reserved for zero and denormals.
The biased exponent with all bits 1 is reserved for the two infinities and for NaN's. NaN means Not a Number. NaN's propagate through a calculation and eventually may signal an exception. The details do not concern us here and, in any case, I do not know them well enough to discuss them. They may be found in the Intel literature.
A normalized number is a number in which the integer part (most significant bit) of the mantissa (significand) is 1. In the packed formats this bit is not explicitly stored. Thus the logical length of the mantissa is one bit more than the physical length. Of course, we have to unpack the number (that is, insert the missing bit) before doing any arithmetic. Internally the coprocessor uses 80 bit registers. It automatically unpacks fp numbers when loading the registers, and packs them when storing them to memory, at least for those formats which are less than 80 bits long. There is no packed 80 bit format, since the coprocessor's registers would not be long enough to unpack it.
A denormalized number (denormal) is a number in which the integer part (most significant bit) of the mantissa is 0 and the biased (physical) exponent is 0. An unnormalized number (unnormal) is a number in which the integer part of the mantissa is 0 and the exponent is arbitrary. Note that unnormals can occur only in the unpacked format (extended precision) since otherwise we have no way to recognize them!
While denormals and unnormals are necessary in the course of a calculation the denormals may also occur in the stored formats to represent (with a loss of significance) very small numbers. This scheme allows the coprocessor to underflow gracefully. The physical exponent 0 is reserved to indicate a denormal. Since the biased or physical exponent 0 signals a denormal we do not need to store the leading bit, which is 0, of the mantissa, except when we wish to do actual arithmetic. Thus in the packed formats we omit the leading bit also for denormals.
Note that we can store integers as floating point numbers, but for very large integers the gap between successive exactly storable integers becomes larger than one. Still this gives a much larger range of integers than would be available with ordinary integer storage (that is, fixed point). The C language provides some useful functions for handling floating point integers: floor(), ceil() and fmod(). Thus floor(x) returns the largest floating point integer less than or equal to x. If x is very large then floor(x) may be much smaller than the greatest integer in x, gid(x), if gid(x) cannot be stored exactly.
The largest floating point integer N such that N-1 is also a floating point integer (that is, exactly representable) is called the largest fp integer with a predecessor. It is of course much smaller than the largest fp integer.
The unit round is the smallest positive exactly representable floating point number u such that 1.0 + u > 1.0. It is very easy to write code to determine the unit round but one will obtain incorrect results unless one allows for the fact that the coprocessor uses the full 80 bit precision internally and also that some compilers are very clever about eliminating "useless" loops.
It is not difficult to see that for any real number y (within the range of normalized fp numbers) if fp(y) is its floating point representative then
y = fp(y) + e y
where |e| < u and where u is the unit round. This is if we chop the mantissa of y to obtain fp(y). If instead we round the mantissa of y to obtain fp(y) then we have |e| < u/2.
The largest fp integer with a predecessor and the unit round are convenient measures of the precision of the floating point representation.
Note if M is the largest logical exponent for normalized fp numbers and n is the number of logical bits in the mantissa then the largest normalized floating point magnitude (apart from infinity) is
N = 2^M (2 - 2^(1-n)).
Here we have used
(1 + 2^(-1) + ... + 2^(1-n)) = 2 - 2^(1-n)
and we have assumed 1 <= mantissa < 2 is the normalization used for the mantissa. Since the binary point is not actually stored we have to have some agreement about where it falls. Another common convention is 1/2 <= mantissa < 1. Since M >= n in the formats below we see N is an integer, and so is the largest fp integer. If N0 is the largest floating point integer with N0 < N then
N - N0 = 2^(M-n+1),
a very large integer. Thus N0 and N are very far from being consecutive.
| Short Real | |
| Single Precision Real | |
| typical c data type | float |
| length of format | 32 bits |
| storage for sign | 1 bit |
| storage for exponent | 8 bits |
| storage for mantissa | 23 bits |
| packed? | yes |
| mantissa normalization | 1 <= mantissa < 2 |
| mantissa precision (logical length) | 24 bits |
| exponent bias | 127 |
| reserved exponent for denormals | physical 0, nominal logical -127 (values are scaled by 2^-126) |
| reserved exponent for infinity and NaN's | physical 255, logical 128 |
| range of physical exponent for normalized fp's | 1 to 254 |
| range of logical exponent for normalized fp's | -126 to 127 |
| smallest positive normalized fp | 2^-126 = 1.175 × 10^-38 |
| largest normalized fp 2^127 (2 - 2^-23) | 2^128 - 2^104 = 3.403 × 10^38 |
| smallest positive denormal 2^-126 × 2^-23 | 2^-149 = 1.401 × 10^-45 |
| largest denormal 2^-126 (1 - 2^-23) | 2^-126 - 2^-149 = 1.175 × 10^-38 |
| largest fp integer | 2^128 - 2^104 = 3.403 × 10^38 |
| gap from largest fp integer to previous fp integer | 2^104 = 2.028 × 10^31 |
| largest fp integer with a predecessor | 2^24 = 16,777,216 |
| unit round (chop precision) | 2^-23 = 1.192 × 10^-7 |
| precision (round precision) | 2^-24 = 5.960 × 10^-8 |
| Long Real | |
| Double Precision Real | |
| typical c data type | double |
| length of format | 64 bits |
| storage for sign | 1 bit |
| storage for exponent | 11 bits |
| storage for mantissa | 52 bits |
| packed? | yes |
| mantissa normalization | 1 <= mantissa < 2 |
| mantissa precision (logical length) | 53 bits |
| exponent bias | 1023 |
| reserved exponent for denormals | physical 0, nominal logical -1023 (values are scaled by 2^-1022) |
| reserved exponent for infinity and NaN's | physical 2047, logical 1024 |
| range of physical exponent for normalized fp's | 1 to 2046 |
| range of logical exponent for normalized fp's | -1022 to 1023 |
| smallest positive normalized fp | 2^-1022 = 2.225 × 10^-308 |
| largest normalized fp 2^1023 (2 - 2^-52) | 2^1024 - 2^971 = 1.798 × 10^308 |
| smallest positive denormal 2^-1022 × 2^-52 | 2^-1074 = 4.941 × 10^-324 |
| largest denormal 2^-1022 (1 - 2^-52) | 2^-1022 - 2^-1074 = 2.225 × 10^-308 |
| largest fp integer | 2^1024 - 2^971 = 1.798 × 10^308 |
| gap from largest fp integer to previous fp integer | 2^971 = 1.996 × 10^292 |
| largest fp integer with a predecessor | 2^53 = 9,007,199,254,740,992 |
| unit round (chop precision) | 2^-52 = 2.220 × 10^-16 |
| precision (round precision) | 2^-53 = 1.110 × 10^-16 |
| Temporary Real | |
| Extended Precision Real | |
| typical c data type | long double |
| length of format | 80 bits |
| storage for sign | 1 bit |
| storage for exponent | 15 bits |
| storage for mantissa | 64 bits |
| packed? | no |
| mantissa normalization | 1 <= mantissa < 2 |
| mantissa precision (logical = physical length) | 64 bits |
| exponent bias | 16383 |
| reserved exponent for denormals | physical 0, nominal logical -16383 (values are scaled by 2^-16382) |
| reserved exponent for infinity and NaN's | physical 32767, logical 16384 |
| range of physical exponent for normalized fp's | 1 to 32766 |
| range of logical exponent for normalized fp's | -16382 to 16383 |
| smallest positive normalized fp | 2^-16382 = 3.362 × 10^-4932 |
| largest normalized fp 2^16383 (2 - 2^-63) | 2^16384 - 2^16320 = 1.190 × 10^4932 |
| smallest positive denormal 2^-16382 × 2^-63 | 2^-16445 = 3.645 × 10^-4951 |
| largest denormal 2^-16382 (1 - 2^-63) | 2^-16382 - 2^-16445 = 3.362 × 10^-4932 |
| largest fp integer | 2^16384 - 2^16320 = 1.190 × 10^4932 |
| gap from largest fp integer to previous fp integer | 2^16320 = 6.450 × 10^4912 |
| largest fp integer with a predecessor | 2^64 = 18,446,744,073,709,551,616 |
| unit round (chop precision) | 2^-63 = 1.084 × 10^-19 |
Corrections are welcome! petersen@math.orst.edu
References
Robert L. Hummel, PC Magazine: Programmer's Technical Reference: The Processor and Coprocessor, Ziff-Davis Press, Emeryville, California, 1992.
Stephen P. Morse, Eric J. Isaacson, Douglas J. Albert, The 80386/387 Architecture, John Wiley & Sons, Inc., New York, 1987.
Richard Startz, 8087 Applications and Programming for the IBM PC, XT, and AT, Revised and Expanded, Prentice Hall, New York, 1985.
John F. Palmer, Stephen P. Morse, The 8087 Primer, John Wiley & Sons, New York, 1984.
Intel Corporation, i486 Processor Programmer's Reference Manual, Intel, Osborne McGraw-Hill, 1990.
| Updated Sunday, October 26, 2003 |
| Bent E. Petersen (541) 737-5163 |
| email: petersen@math.orst.edu |
| Fax: (541) 737-0517 |