The rest of this chapter uses a number of terms. The following informal definitions should help you work your way through the material:
Accuracy
A floating-point calculation’s accuracy is how close it comes to the real (paper and pencil) value.
Error
The difference between what the result of a computation “should be” and what it actually is. It is best to minimize error as much as possible.
Exponent
The order of magnitude of a value; some number of bits in a floating-point value store the exponent.
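Inf
A special value representing infinity. Arithmetic that combines infinity with a finite number generally produces infinity again (for example, infinity plus 1), though not always: a finite number divided by infinity is zero.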
NaN
“Not a number.” A special value that results from attempting a calculation that has no answer as a real number. In such a case, programs can either receive a floating-point exception, or get NaN back as the result. The IEEE 754 standard recommends that systems return NaN. Some examples:
sqrt(-1)
This makes sense in the range of complex numbers, but not in the range of real numbers, so the result is NaN.
log(-8)
-8 is out of the domain of log(), so the result is NaN.
Normalized
How the significand (see later in this list) is usually stored. The value is adjusted so that the first bit is one, and then that leading one is assumed instead of physically stored. This provides one extra bit of precision.
Precision
The number of bits used to represent a floating-point number. The more bits, the more digits you can represent. Binary and decimal precisions are related approximately, according to the formula:
prec = 3.322 * dps
Here, prec denotes the binary precision (measured in bits) and dps (short for decimal places) denotes the decimal precision (measured in digits). The factor 3.322 approximates log2(10), the number of bits needed per decimal digit.
Rounding mode
How numbers are rounded up or down when necessary. More details are provided later.
Significand
A floating-point value consists of the significand multiplied by 10 to the power of the exponent. For example, in 1.2345e67, the significand is 1.2345.
Stability
From the Wikipedia article on numerical stability: “Calculations that can be proven not to magnify approximation errors are called numerically stable.”
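Several of the terms above (error, infinity, and NaN) can be observed directly. The following is a minimal sketch in Python rather than awk, chosen only because a Python float is an IEEE 754 double on common platforms. The helper names safe_sqrt and safe_log are hypothetical, introduced purely for illustration. Python's math module happens to take the "floating-point exception" route for sqrt(-1) and log(-8), so the helpers convert that exception into the NaN result the definition describes.

```python
import math

# Error: 0.1 and 0.2 have no exact binary representation, so their sum
# differs slightly from the paper-and-pencil value 0.3.
total = 0.1 + 0.2
error = abs(total - 0.3)

# Inf: adding a finite number to infinity gives infinity again.
inf_plus_one = math.inf + 1.0

# NaN: infinity minus infinity has no answer as a real number.
nan = math.inf - math.inf


def safe_sqrt(x):
    """Hypothetical helper: return NaN where math.sqrt raises."""
    try:
        return math.sqrt(x)
    except ValueError:  # Python signals a domain error with an exception
        return math.nan


def safe_log(x):
    """Hypothetical helper: return NaN where math.log raises."""
    try:
        return math.log(x)
    except ValueError:
        return math.nan


print(total == 0.3, error)          # the comparison is False; error is tiny but nonzero
print(inf_plus_one, nan)
print(safe_sqrt(-1), safe_log(-8))
```

Note that a NaN compares unequal to everything, including itself, which is why math.isnan() is the usual way to test for it.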
See the Wikipedia article on accuracy and precision for more information on some of those terms.
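The precision formula and the significand/exponent split defined above can be tried out with a short sketch. This is Python, used here only for illustration, and the function names bits_for_digits and digits_for_bits are hypothetical. Rounding up when converting digits to bits, and down when converting bits to digits, is an assumption made to stay on the safe side; the formula itself is only approximate. The decomposition uses base 2, as floating-point hardware does, rather than the base 10 of the 1.2345e67 notation.

```python
import math

BITS_PER_DIGIT = math.log2(10)  # about 3.322, the factor in prec = 3.322 * dps


def bits_for_digits(dps):
    """Bits of binary precision needed to hold dps decimal digits (rounded up)."""
    return math.ceil(dps * BITS_PER_DIGIT)


def digits_for_bits(prec):
    """Decimal digits that prec bits can always hold (rounded down)."""
    return math.floor(prec / BITS_PER_DIGIT)


# math.frexp() splits x into m * 2**e with 0.5 <= m < 1.
m, e = math.frexp(6.0)

# Rescale to the normalized form, whose leading bit is one: 1.f * 2**exponent.
significand, exponent = m * 2, e - 1

print(bits_for_digits(16), digits_for_bits(53))
print(significand, exponent)  # 6.0 is 1.5 * 2**2
```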
On modern systems, floating-point hardware uses the representation and
operations defined by the IEEE 754 standard.
Three of the standard IEEE 754 types are 32-bit single precision,
64-bit double precision, and 128-bit quadruple precision.
The standard also specifies extended precision formats
to allow greater precisions and larger exponent ranges.
(awk uses only the 64-bit double-precision format.)
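To get a feel for the difference between the single and double formats, the sketch below rounds a double to single precision and back. It uses Python's standard struct module for illustration; the helper name to_single is hypothetical, and the code assumes a platform whose native float formats are IEEE 754, which is the common case.

```python
import struct


def to_single(x):
    """Hypothetical helper: round a 64-bit double to 32-bit single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


single_bits = struct.calcsize("<f") * 8  # size of the single format in bits
double_bits = struct.calcsize("<d") * 8  # size of the double format in bits

x = 0.1
as_single = to_single(x)

print(single_bits, double_bits)
# The two printed values differ: the single format keeps fewer significand bits.
print(repr(x), repr(as_single))
# Values that fit in a short significand, such as 0.5, survive unchanged.
print(to_single(0.5) == 0.5)
```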
Table 16.3 lists the precision and exponent field values for the basic IEEE 754 binary formats.
Name | Total bits | Precision | Minimum exponent | Maximum exponent |
---|---|---|---|---|
Single | 32 | 24 | -126 | +127 |
Double | 64 | 53 | -1022 | +1023 |
Quadruple | 128 | 113 | -16382 | +16383 |
NOTE: The precision numbers include the implied leading one that gives them one extra bit of significand.
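The Double row of the table can be checked against a running system. The following sketch uses Python's sys.float_info, again only for illustration, and assumes the platform's double is the IEEE 754 64-bit format. One detail to watch: the C-style limits that Python reports treat the significand as lying in the range [0.5, 1), so they are one larger than the exponents in the table, which assume a leading one before the binary point.

```python
import math
import struct
import sys

info = sys.float_info

precision_bits = info.mant_dig   # includes the implied leading one
max_exponent = info.max_exp - 1  # convert from the C convention to the table's
min_exponent = info.min_exp - 1

# 2**min_exponent is the smallest positive normalized double.
smallest_normal = math.ldexp(1.0, min_exponent)

# The value 1.0 is stored with an all-zero fraction field: its leading
# one is assumed rather than physically stored.
raw = struct.unpack("<Q", struct.pack("<d", 1.0))[0]
stored_fraction = raw & ((1 << 52) - 1)  # low 52 bits hold the fraction

print(precision_bits, min_exponent, max_exponent)
print(smallest_normal == info.min, stored_fraction)
```

The 52 stored fraction bits plus the one implied bit account for the precision of 53 listed in the table.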
Thanks to Michael Brennan for the description of NaN, which we have paraphrased, and for the examples.