Abstract
In this chapter, we present some short yet useful algorithms and some basic properties that can be derived from specifications of floating-point arithmetic systems, such as the ones given in the successive IEEE 754 standards. Thanks to these standards, we now have an accurate definition of floating-point formats and operations. The behavior of a sequence of operations becomes at least partially predictable. We therefore can build algorithms and proofs that refer to these specifications.
Notes
- 1.
In some cases, intermediate calculations may be performed in a wider internal format. Some examples are given in Section 3.2.
- 2.
Beware! We remind the reader that by “no underflow” we mean that the absolute value of the result (before or after rounding, this depends on the definition) is not less than the smallest normal number \(\beta ^{e_{\mathrm{min}}}\). When subnormal numbers are available, as requested by the IEEE 754 standards, it is possible to represent smaller nonzero numbers, but with a precision that does not always suffice to represent the product exactly.
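As an illustration of this loss of precision (a minimal Python sketch in binary64; the operand values are chosen here for the example, they are not taken from the text):

```python
from fractions import Fraction

# x and y are normal binary64 numbers, but their exact product,
# (2^52 + 1) * 2^-1075, falls in the subnormal range, where fewer
# than 53 significand bits are available.
x = (2**52 + 1) * 2.0**-1060
y = 2.0**-15

p = x * y  # rounded to the nearest subnormal (round-to-even on the tie)
assert p == 2.0**-1023
assert Fraction(x) * Fraction(y) != Fraction(p)  # the product is inexact
```

The subnormal result is nonzero, as the standards guarantee, but the last bit of the exact product is lost.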
- 3.
When \(|\mu| \leq \beta^{p} - 1\), the difference s − a is representable with exponent \(e_a\), but not necessarily in normal form. This is why the availability of subnormal numbers is necessary.
- 4.
Do not forget that \(|a| \geq |b|\) implies that the exponent of a is larger than or equal to that of b. Hence, it suffices to compare the two variables.
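The Fast2Sum algorithm these notes refer to fits in a few lines (a Python sketch, assuming binary64 arithmetic and \(|a| \geq |b|\)):

```python
from fractions import Fraction

def fast2sum(a: float, b: float):
    """Fast2Sum: assuming |a| >= |b|, return (s, t) with s = RN(a + b)
    and t the rounding error, so that a + b == s + t exactly."""
    s = a + b
    z = s - a
    t = b - z
    return s, t

s, t = fast2sum(1.0, 2.0**-60)
assert (s, t) == (1.0, 2.0**-60)
# the pair (s, t) represents a + b exactly:
assert Fraction(1.0) + Fraction(2.0**-60) == Fraction(s) + Fraction(t)
```

As the note above points out, testing \(|a| \geq |b|\) directly is enough; the exponents never need to be extracted.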
- 5.
- 6.
In radix 2, we will use the fact that a (2g + 1)-bit number can be split into two g-bit numbers. This explains why Dekker’s algorithm works if the precision is even or if the radix is 2.
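Veltkamp’s splitting, on which Dekker’s algorithm relies, illustrates this (a Python sketch for binary64, where p = 53 = 2 · 26 + 1: both halves fit in 26 bits, the sign of the low part absorbing the extra bit; the helper names are ours):

```python
from fractions import Fraction

def veltkamp_split(x: float, s: int = 27):
    """Split x exactly into xh + xl, with xh on p - s = 26 bits and
    xl on s - 1 = 26 bits (binary64; overflow is assumed not to occur)."""
    c = (2.0**s + 1.0) * x
    xh = c - (c - x)
    xl = x - xh
    return xh, xl

def sig_bits(v: float) -> int:
    """Bit length of the odd part of v's integral significand."""
    n = Fraction(v).numerator
    while n % 2 == 0:
        n //= 2
    return n.bit_length()

xh, xl = veltkamp_split(1.0 / 3.0)
assert xh + xl == 1.0 / 3.0                      # the split is exact
assert sig_bits(xh) <= 26 and sig_bits(xl) <= 26  # two 26-bit halves
```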
- 7.
These assumptions hold on any “reasonable” floating-point system.
- 8.
For example, in the IEEE 754 binary64 format, with \(x = (2^{53} - 1) \cdot 2^{940}\) and \(y = 2^{31}\), we obtain \(x_h = 2^{993}\) and \(y_h = y\). The floating-point multiplication \(\mathop{\mathrm{RN}}\nolimits (x_{h}y_{h})\) overflows, whereas \(\mathop{\mathrm{RN}}\nolimits (xy) =\varOmega = (2^{53} - 1) \cdot 2^{971}\).
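This example can be checked directly in binary64 arithmetic (a Python sketch; the value of \(x_h\) is taken from the note rather than recomputed with the splitting):

```python
import math
import sys

x = (2**53 - 1) * 2.0**940
y = 2.0**31

xh = 2.0**993   # high part of x, as given in the note
yh = y          # y is already representable on few bits

assert math.isinf(xh * yh)          # the split product overflows...
assert x * y == sys.float_info.max  # ...although RN(xy) = Omega is finite
```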
- 9.
Caution: M is not necessarily the integral significand of x.
- 10.
Or q is the largest finite floating-point number Ω, in the case where x/y is between that number and the overflow threshold (the same thing applies on the negative side).
- 11.
Part of what we are going to explain does not generalize to decimal arithmetic.
- 12.
For instance, the frcpa instruction of the IA-64 instruction set returns approximations to reciprocals with relative error less than or equal to \(2^{-8.886}\). Such tables are easily implemented using the bipartite method; see [131].
- 13.
When the radix is an odd number, values exactly halfway between two consecutive floating-point numbers are represented with infinitely many digits: such a midpoint is an odd multiple of half an ulp, and 1/2 has no finite expansion in an odd radix.
- 14.
A very similar study can be done when it is a power of 2.
- 15.
A necessary and sufficient condition for all numbers representable in radix β with a finite number of digits to be representable in radix γ with a finite number of digits is that β should divide an integer power of γ.
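Both sides of this condition are easy to test (a Python sketch; the helper names are ours):

```python
from fractions import Fraction
from math import gcd

def divides_a_power(beta: int, gamma: int) -> bool:
    """Does beta divide some integer power of gamma?"""
    while (g := gcd(beta, gamma)) > 1:
        beta //= g
    return beta == 1

def finite_in_radix(q: Fraction, gamma: int) -> bool:
    """Does q have a finite radix-gamma expansion?"""
    d = q.denominator
    while (g := gcd(d, gamma)) > 1:
        d //= g
    return d == 1

# 2 divides 10, so every finite binary value is finite in decimal;
# 10 divides no power of 2, and indeed 1/10 is not finite in binary.
assert divides_a_power(2, 10) and not divides_a_power(10, 2)
assert finite_in_radix(Fraction(1, 2), 10)
assert not finite_in_radix(Fraction(1, 10), 2)
```

This is why every binary64 number has a finite decimal representation, while 0.1 has no finite binary one.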
- 16.
This formula is valid for all possible values of \(p_2\) and \(e_{\mathrm{min}}\) (provided \(e_{\mathrm{min}} \approx -e_{\mathrm{max}}\)). And yet, for all usual formats, it can be simplified: a simple continued fraction argument (see Section A.1) shows that for \(p_2 \geq 16\) and \(e_{\mathrm{min}} \geq -28000\), it is equal to
$$\displaystyle{-e_{\mathrm{min}} + p_{2} + \left \lfloor (e_{\mathrm{min}} + 1)\log _{10}(2)\right \rfloor.}$$
- 17.
At the time of writing this book, it can be obtained at http://www.netlib.org/fp/ (file dtoa.c).
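For the binary64 format (\(p_2 = 53\), \(e_{\mathrm{min}} = -1022\)), the simplified formula of note 16 gives 767, which can be cross-checked against the exact decimal expansion of the binary64 number \((2^{53}-1)\cdot 2^{-1074}\) (a Python sketch):

```python
import math

e_min, p2 = -1022, 53  # binary64 parameters
digits = -e_min + p2 + math.floor((e_min + 1) * math.log10(2))

# (2^53 - 1) * 2^-1074 = (2^53 - 1) * 5^1074 / 10^1074; the integer
# numerator is odd, so its digit count equals the number of
# significant decimal digits of the exact expansion.
n = (2**53 - 1) * 5**1074
assert digits == 767 == len(str(n))
```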
- 18.
The algorithm works for other radices. See [86] for details.
- 19.
In round-to-nearest modes, it required that the error introduced by the conversion should be at most 0.97 ulp. The major reason for this somewhat weak requirement is that the conversion algorithms presented here were not known at the time that standard was designed.
- 20.
Another solution consists in using a precomputed table of powers of 10 in the binary format.
- 21.
If a wider internal format is available, one can use it and possibly save one step.
- 22.
A straightforward analysis of the error induced by the truncation of the digit chain \(D^{*}\) would give \(|\varepsilon_2| \leq 10^{-\min\{n-1,\,j\}}\), but when \(j \geq n - 1\), \(D^{*} = \hat{D}\) and there is no truncation error at all.
- 23.
The IEEE 754-2008 standard allows the conversion of out-of-range numbers, infinity or NaN. In that case, either there should be a dedicated signaling mechanism or the invalid operation exception should be signaled.
- 24.
Unless \(u_2\) is a power of 2, but this case is easily handled separately.
References
S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers. The IBM System/360 Model 91: floating-point execution unit. IBM Journal of Research and Development, 1967. Reprinted in [583].
M. Andrysco, R. Jhala, and S. Lerner. Printing floating-point numbers: A faster, always correct method. In 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 555–567, 2016.
G. Bohlender, W. Walter, P. Kornerup, and D. W. Matula. Semantics for exact floating point operations. In 10th IEEE Symposium on Computer Arithmetic (ARITH-10), pages 22–26, June 1991.
S. Boldo. Pitfalls of a full floating-point proof: example on the formal proof of the Veltkamp/Dekker algorithms. In 3rd International Joint Conference on Automated Reasoning (IJCAR), volume 4130 of Lecture Notes in Computer Science, pages 52–66, Seattle, WA, USA, 2006.
S. Boldo and M. Daumas. Representable correcting terms for possibly underflowing floating point operations. In 16th IEEE Symposium on Computer Arithmetic (ARITH-16), pages 79–86, Santiago de Compostela, Spain, 2003.
S. Boldo, M. Daumas, C. Moreau-Finot, and L. Théry. Computer validated proofs of a toolset for adaptable arithmetic. Technical report, École Normale Supérieure de Lyon, 2001. Available at http://arxiv.org/pdf/cs.MS/0107025.
S. Boldo, S. Graillat, and J.-M. Muller. On the robustness of the 2Sum and Fast2Sum algorithms. ACM Transactions on Mathematical Software, 44(1):4:1–4:14, 2017.
S. Boldo, J.-H. Jourdan, X. Leroy, and G. Melquiond. Verified compilation of floating-point computations. Journal of Automated Reasoning, 54(2):135–163, 2015.
S. Boldo and G. Melquiond. Emulation of FMA and correctly rounded sums: proved algorithms using rounding to odd. IEEE Transactions on Computers, 57(4):462–471, 2008.
S. Boldo and G. Melquiond. Computer Arithmetic and Formal Proofs. ISTE Press – Elsevier, 2017.
S. Boldo and J.-M. Muller. Exact and approximated error of the FMA. IEEE Transactions on Computers, 60(2):157–164, 2011.
A. D. Booth. A signed binary multiplication technique. Quarterly Journal of Mechanics and Applied Mathematics, 4(2):236–240, 1951. Reprinted in [583].
N. Brisebarre and J.-M. Muller. Correctly rounded multiplication by arbitrary precision constants. IEEE Transactions on Computers, 57(2):165–174, 2008.
R. G. Burger and R. K. Dybvig. Printing floating-point numbers quickly and accurately. In SIGPLAN’96 Conference on Programming Languages Design and Implementation (PLDI), pages 108–116, June 1996.
W. D. Clinger. How to read floating-point numbers accurately. ACM SIGPLAN Notices, 25(6):92–101, 1990.
W. D. Clinger. Retrospective: how to read floating-point numbers accurately. ACM SIGPLAN Notices, 39(4):360–371, 2004.
M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium®-based Systems. Intel Press, Hillsboro, OR, 2002.
M. A. Cornea-Hasegan, R. A. Golliver, and P. Markstein. Correctness proofs outline for Newton–Raphson based floating-point divide and square root algorithms. In 14th IEEE Symposium on Computer Arithmetic (ARITH-14), pages 96–105, April 1999.
A. Dahan-Dalmedico and J. Pfeiffer. Histoire des Mathématiques. Editions du Seuil, Paris, 1986. In French.
C. Daramy-Loirat, D. Defour, F. de Dinechin, M. Gallet, N. Gast, C. Q. Lauter, and J.-M. Muller. CR-LIBM, a library of correctly-rounded elementary functions in double-precision. Technical report, LIP Laboratory, Arenaire team, December 2006. Available at https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804.
D. Das Sarma and D. W. Matula. Measuring the accuracy of ROM reciprocal tables. IEEE Transactions on Computers, 43(8):932–940, 1994.
D. Das Sarma and D. W. Matula. Faithful bipartite ROM reciprocal tables. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 17–28, June 1995.
D. Das Sarma and D. W. Matula. Faithful interpolation in reciprocal tables. In 13th IEEE Symposium on Computer Arithmetic (ARITH-13), pages 82–91, July 1997.
F. de Dinechin, A. V. Ershov, and N. Gast. Towards the post-ultimate libm. In 17th IEEE Symposium on Computer Arithmetic (ARITH-17), pages 288–295, 2005.
F. de Dinechin and A. Tisserand. Multipartite table methods. IEEE Transactions on Computers, 54(3):319–330, 2005.
T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.
M. D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, Boston, MA, 1994.
M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, CA, 2004.
M. Ercegovac, J.-M. Muller, and A. Tisserand. Simple seed architectures for reciprocal and square root reciprocal. In 39th Asilomar Conference on Signals, Systems, and Computers, November 2005.
G. Even, P.-M. Seidel, and W. E. Ferguson. A parametric error analysis of Goldschmidt’s division algorithm. Journal of Computer and System Sciences, 70(1):118–139, 2005.
W. E. Ferguson, Jr. Exact computation of a sum or difference with applications to argument reduction. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 216–221, Bath, UK, July 1995.
D. Fowler and E. Robson. Square root approximations in old Babylonian mathematics: YBC 7289 in context. Historia Mathematica, 25:366–378, 1998.
D. M. Gay. Correctly-rounded binary-decimal and decimal-binary conversions. Technical Report Numerical Analysis Manuscript 90–10, ATT & Bell Laboratories (Murray Hill, NJ), November 1990.
W. M. Gentleman and S. B. Marovich. More on algorithms that reveal properties of floating-point arithmetic units. Communications of the ACM, 17(5):276–277, 1974.
I. B. Goldberg. 27 bits are not enough for 8-digit accuracy. Communications of the ACM, 10(2):105–106, 1967.
R. E. Goldschmidt. Applications of division by convergence. Master’s thesis, Dept. of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, June 1964.
J. R. Hauser. Handling floating-point exceptions in numeric programs. ACM Transactions on Programming Languages and Systems, 18(2):139–174, 1996.
IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008, August 2008. Available at http://ieeexplore.ieee.org/servlet/opac?punumber=4610933.
W. Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, 1965.
R. Karpinski. PARANOIA: a floating-point benchmark. BYTE, 10(2), 1985.
D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.
P. Kornerup, V. Lefèvre, N. Louvet, and J.-M. Muller. On the computation of correctly rounded sums. IEEE Transactions on Computers, 61(3):289–298, 2012.
P. Kornerup and J.-M. Muller. Choosing starting values for certain Newton–Raphson iterations. Theoretical Computer Science, 351(1):101–110, 2006.
S. Linnainmaa. Software for doubled-precision floating-point computations. ACM Transactions on Mathematical Software, 7(3):272–283, 1981.
F. Loitsch. Printing floating-point numbers quickly and accurately with integers. In 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ‘10), pages 233–243, 2010.
M. A. Malcolm. Algorithms to reveal properties of floating-point arithmetic. Communications of the ACM, 15(11):949–951, 1972.
P. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research and Development, 34(1):111–119, 1990.
P. Markstein. IA-64 and Elementary Functions: Speed and Precision. Hewlett-Packard Professional Books. Prentice-Hall, Englewood Cliffs, NJ, 2000.
D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50, 1968.
O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.
J.-M. Muller. Avoiding double roundings in scaled Newton-Raphson division. In 47th Asilomar Conference on Signals, Systems, and Computers, pages 396–399, November 2013.
J.-M. Muller. Elementary Functions, Algorithms and Implementation. Birkhäuser Boston, MA, 3rd edition, 2016.
I. Newton. Methodus Fluxionum et Serierum Infinitarum. 1664–1671.
A. Panhaleux. Génération d’itérations de type Newton-Raphson pour la division de deux flottants à l’aide d’un FMA. Master’s thesis, École Normale Supérieure de Lyon, Lyon, France, 2008. In French.
B. Parhami. On the complexity of table lookup for iterative division. IEEE Transactions on Computers, C-36(10):1233–1236, 1987.
J. A. Pineiro and J. D. Bruguera. High-speed double-precision computation of reciprocal, division, square root, and inverse square root. IEEE Transactions on Computers, 51(12):1377–1388, 2002.
D. M. Priest. On Properties of Floating-Point Arithmetics: Numerical Stability and the Cost of Accurate Computations. Ph.D. thesis, University of California at Berkeley, 1992.
S. M. Rump. Solving algebraic problems with high accuracy (Habilitationsschrift). In A New Approach to Scientific Computation, pages 51–120, 1983.
S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation part I: Faithful rounding. SIAM Journal on Scientific Computing, 31(1):189–224, 2008.
J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete Computational Geometry, 18:305–363, 1997.
T. Simpson. Essays on several curious and useful subjects in speculative and mix’d mathematicks, illustrated by a variety of examples. London, 1740.
G. L. Steele, Jr. and J. L. White. How to print floating-point numbers accurately. ACM SIGPLAN Notices, 25(6):112–126, 1990.
G. L. Steele, Jr. and J. L. White. Retrospective: how to print floating-point numbers accurately. ACM SIGPLAN Notices, 39(4):372–389, 2004.
P. H. Sterbenz. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, NJ, 1974.
J. E. Stine and M. J. Schulte. The symmetric table addition method for accurate function approximation. Journal of VLSI Signal Processing, 21:167–177, 1999.
G. W. Veltkamp. ALGOL procedures voor het berekenen van een inwendig product in dubbele precisie. Technical Report 22, RC-Informatie, Technische Hogeschool Eindhoven, 1968.
G. W. Veltkamp. ALGOL procedures voor het rekenen in dubbele lengte. Technical Report 21, RC-Informatie, Technische Hogeschool Eindhoven, 1969.
T. J. Ypma. Historical development of the Newton-Raphson method. SIAM Rev., 37(4):531–551, 1995.
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Muller, JM. et al. (2018). Basic Properties and Algorithms. In: Handbook of Floating-Point Arithmetic. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-76526-6_4
Print ISBN: 978-3-319-76525-9
Online ISBN: 978-3-319-76526-6