Abstract
In this chapter, we present some short yet useful algorithms and some basic properties that can be derived from specifications of floating-point arithmetic systems, such as the ones given in the successive IEEE 754 standards. Thanks to these standards, we now have an accurate definition of floating-point formats and operations. The behavior of a sequence of operations becomes at least partially predictable. We therefore can build algorithms and proofs that refer to these specifications.
Notes
- 1.
In some cases, intermediate calculations may be performed in a wider internal format. Some examples are given in Section 3.2.
- 2.
Beware! We remind the reader that by “no underflow” we mean that the absolute value of the result (before or after rounding, this depends on the definition) is not less than the smallest normal number \(\beta ^{e_{\mathrm{min}}}\). When subnormal numbers are available, as requested by the IEEE 754 standards, it is possible to represent smaller nonzero numbers, but with a precision that does not always suffice to represent the product exactly.
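As an illustration of this loss of precision (a minimal Python sketch in binary64; the operand values are chosen here for the example, they are not taken from the text):

```python
from fractions import Fraction

# x and y are normal binary64 numbers, but their exact product,
# (2^52 + 1) * 2^-1075, falls in the subnormal range, where fewer
# than 53 significand bits are available.
x = (2**52 + 1) * 2.0**-1060
y = 2.0**-15

p = x * y  # rounded to the nearest subnormal (round-to-even on the tie)
assert p == 2.0**-1023
assert Fraction(x) * Fraction(y) != Fraction(p)  # the product is inexact
```

The subnormal result is nonzero, as the standards guarantee, but the last bit of the exact product is lost.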
- 3.
When \(|\mu| \leq \beta^{p} - 1\), the difference s − a is representable with exponent \(e_a\), but not necessarily in normal form. This is why the availability of subnormal numbers is necessary.
- 4.
Do not forget that \(|a| \geq |b|\) implies that the exponent of a is larger than or equal to that of b. Hence, it suffices to compare the two variables.
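The Fast2Sum algorithm these notes refer to fits in a few lines (a Python sketch, assuming binary64 arithmetic and \(|a| \geq |b|\)):

```python
from fractions import Fraction

def fast2sum(a: float, b: float):
    """Fast2Sum: assuming |a| >= |b|, return (s, t) with s = RN(a + b)
    and t the rounding error, so that a + b == s + t exactly."""
    s = a + b
    z = s - a
    t = b - z
    return s, t

s, t = fast2sum(1.0, 2.0**-60)
assert (s, t) == (1.0, 2.0**-60)
# the pair (s, t) represents a + b exactly:
assert Fraction(1.0) + Fraction(2.0**-60) == Fraction(s) + Fraction(t)
```

As the note above points out, testing \(|a| \geq |b|\) directly is enough; the exponents never need to be extracted.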
- 5.
- 6.
In radix 2, we will use the fact that a (2g + 1)-bit number can be split into two g-bit numbers. This explains why Dekker’s algorithm works if the precision is even or if the radix is 2.
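Veltkamp’s splitting, on which Dekker’s algorithm relies, illustrates this (a Python sketch for binary64, where p = 53 = 2 · 26 + 1: both halves fit in 26 bits, the sign of the low part absorbing the extra bit; the helper names are ours):

```python
from fractions import Fraction

def veltkamp_split(x: float, s: int = 27):
    """Split x exactly into xh + xl, with xh on p - s = 26 bits and
    xl on s - 1 = 26 bits (binary64; overflow is assumed not to occur)."""
    c = (2.0**s + 1.0) * x
    xh = c - (c - x)
    xl = x - xh
    return xh, xl

def sig_bits(v: float) -> int:
    """Bit length of the odd part of v's integral significand."""
    n = Fraction(v).numerator
    while n % 2 == 0:
        n //= 2
    return n.bit_length()

xh, xl = veltkamp_split(1.0 / 3.0)
assert xh + xl == 1.0 / 3.0                      # the split is exact
assert sig_bits(xh) <= 26 and sig_bits(xl) <= 26  # two 26-bit halves
```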
- 7.
These assumptions hold on any “reasonable” floating-point system.
- 8.
For example, in the IEEE 754 binary64 format, with \(x = (2^{53} - 1) \cdot 2^{940}\) and \(y = 2^{31}\), we obtain \(x_h = 2^{993}\) and \(y_h = y\). The floating-point multiplication \(\mathop{\mathrm{RN}}\nolimits (x_{h}y_{h})\) overflows, whereas \(\mathop{\mathrm{RN}}\nolimits (xy) =\varOmega = (2^{53} - 1) \cdot 2^{971}\).
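This example can be checked directly in binary64 arithmetic (a Python sketch; the value of \(x_h\) is taken from the note rather than recomputed with the splitting):

```python
import math
import sys

x = (2**53 - 1) * 2.0**940
y = 2.0**31

xh = 2.0**993   # high part of x, as given in the note
yh = y          # y is already representable on few bits

assert math.isinf(xh * yh)          # the split product overflows...
assert x * y == sys.float_info.max  # ...although RN(xy) = Omega is finite
```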
- 9.
Caution: M is not necessarily the integral significand of x.
- 10.
Or q is the largest finite floating-point number Ω, in the case where x/y is between that number and the overflow threshold (the same thing applies on the negative side).
- 11.
Part of what we are going to explain does not generalize to decimal arithmetic.
- 12.
For instance, the frcpa instruction of the IA-64 instruction set returns approximations to reciprocals with relative error less than or equal to \(2^{-8.886}\). Such tables are easily implemented using the bipartite method; see [131].
- 13.
When the radix is an odd number, values exactly halfway between two consecutive floating-point numbers are represented with infinitely many digits: such a midpoint is an odd multiple of half an ulp, and 1/2 has no finite expansion in an odd radix.
- 14.
A very similar study can be done when it is a power of 2.
- 15.
A necessary and sufficient condition for all numbers representable in radix β with a finite number of digits to be representable in radix γ with a finite number of digits is that β should divide an integer power of γ.
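Both sides of this condition are easy to test (a Python sketch; the helper names are ours):

```python
from fractions import Fraction
from math import gcd

def divides_a_power(beta: int, gamma: int) -> bool:
    """Does beta divide some integer power of gamma?"""
    while (g := gcd(beta, gamma)) > 1:
        beta //= g
    return beta == 1

def finite_in_radix(q: Fraction, gamma: int) -> bool:
    """Does q have a finite radix-gamma expansion?"""
    d = q.denominator
    while (g := gcd(d, gamma)) > 1:
        d //= g
    return d == 1

# 2 divides 10, so every finite binary value is finite in decimal;
# 10 divides no power of 2, and indeed 1/10 is not finite in binary.
assert divides_a_power(2, 10) and not divides_a_power(10, 2)
assert finite_in_radix(Fraction(1, 2), 10)
assert not finite_in_radix(Fraction(1, 10), 2)
```

This is why every binary64 number has a finite decimal representation, while 0.1 has no finite binary one.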
- 16.
This formula is valid for all possible values of \(p_2\) and \(e_{\mathrm{min}}\) (provided \(e_{\mathrm{min}} \approx -e_{\mathrm{max}}\)). And yet, for all usual formats, it can be simplified: a simple continued fraction argument (see Section A.1) shows that for \(p_2 \geq 16\) and \(e_{\mathrm{min}} \geq -28000\), it is equal to
$$\displaystyle{-e_{\mathrm{min}} + p_{2} + \left \lfloor (e_{\mathrm{min}} + 1)\log _{10}(2)\right \rfloor.}$$
- 17.
At the time of writing this book, it can be obtained at http://www.netlib.org/fp/ (file dtoa.c).
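For the binary64 format (\(p_2 = 53\), \(e_{\mathrm{min}} = -1022\)), the simplified formula of note 16 gives 767, which can be cross-checked against the exact decimal expansion of the binary64 number \((2^{53}-1)\cdot 2^{-1074}\) (a Python sketch):

```python
import math

e_min, p2 = -1022, 53  # binary64 parameters
digits = -e_min + p2 + math.floor((e_min + 1) * math.log10(2))

# (2^53 - 1) * 2^-1074 = (2^53 - 1) * 5^1074 / 10^1074; the integer
# numerator is odd, so its digit count equals the number of
# significant decimal digits of the exact expansion.
n = (2**53 - 1) * 5**1074
assert digits == 767 == len(str(n))
```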
- 18.
The algorithm works for other radices. See [86] for details.
- 19.
In round-to-nearest modes, it required that the error introduced by the conversion should be at most 0.97 ulp. The major reason for this somewhat weak requirement is that the conversion algorithms presented here were not known at the time that standard was designed.
- 20.
Another solution consists in using a precomputed table of powers of 10 in the binary format.
- 21.
If a wider internal format is available, one can use it and possibly save one step.
- 22.
A straightforward analysis of the error induced by the truncation of the digit chain \(D^{*}\) would give \(|\varepsilon_2| \leq 10^{-\min\{n-1,\,j\}}\), but when \(j \geq n - 1\), \(D^{*} = \hat{D}\) and there is no truncation error at all.
- 23.
The IEEE 754-2008 standard allows the conversion of out-of-range numbers, infinity or NaN. In that case, either there should be a dedicated signaling mechanism or the invalid operation exception should be signaled.
- 24.
Unless \(u_2\) is a power of 2, but this case is easily handled separately.
References
S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers. The IBM System/360 Model 91: floating-point execution unit. IBM Journal of Research and Development, 1967. Reprinted in [583].
M. Andrysco, R. Jhala, and S. Lerner. Printing floating-point numbers: A faster, always correct method. In 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 555–567, 2016.
G. Bohlender, W. Walter, P. Kornerup, and D. W. Matula. Semantics for exact floating point operations. In 10th IEEE Symposium on Computer Arithmetic (ARITH-10), pages 22–26, June 1991.
S. Boldo. Pitfalls of a full floating-point proof: example on the formal proof of the Veltkamp/Dekker algorithms. In 3rd International Joint Conference on Automated Reasoning (IJCAR), volume 4130 of Lecture Notes in Computer Science, pages 52–66, Seattle, WA, USA, 2006.
S. Boldo and M. Daumas. Representable correcting terms for possibly underflowing floating point operations. In 16th IEEE Symposium on Computer Arithmetic (ARITH-16), pages 79–86, Santiago de Compostela, Spain, 2003.
S. Boldo, M. Daumas, C. Moreau-Finot, and L. Théry. Computer validated proofs of a toolset for adaptable arithmetic. Technical report, École Normale Supérieure de Lyon, 2001. Available at http://arxiv.org/pdf/cs.MS/0107025.
S. Boldo, S. Graillat, and J.-M. Muller. On the robustness of the 2Sum and Fast2Sum algorithms. ACM Transactions on Mathematical Software, 44(1):4:1–4:14, 2017.
S. Boldo, J.-H. Jourdan, X. Leroy, and G. Melquiond. Verified compilation of floating-point computations. Journal of Automated Reasoning, 54(2):135–163, 2015.
S. Boldo and G. Melquiond. Emulation of FMA and correctly rounded sums: proved algorithms using rounding to odd. IEEE Transactions on Computers, 57(4):462–471, 2008.
S. Boldo and G. Melquiond. Computer Arithmetic and Formal Proofs. ISTE Press – Elsevier, 2017.
S. Boldo and J.-M. Muller. Exact and approximated error of the FMA. IEEE Transactions on Computers, 60(2):157–164, 2011.
A. D. Booth. A signed binary multiplication technique. Quarterly Journal of Mechanics and Applied Mathematics, 4(2):236–240, 1951. Reprinted in [583].
N. Brisebarre and J.-M. Muller. Correctly rounded multiplication by arbitrary precision constants. IEEE Transactions on Computers, 57(2):165–174, 2008.
R. G. Burger and R. K. Dybvig. Printing floating-point numbers quickly and accurately. In SIGPLAN’96 Conference on Programming Languages Design and Implementation (PLDI), pages 108–116, June 1996.
W. D. Clinger. How to read floating-point numbers accurately. ACM SIGPLAN Notices, 25(6):92–101, 1990.
W. D. Clinger. Retrospective: how to read floating-point numbers accurately. ACM SIGPLAN Notices, 39(4):360–371, 2004.
M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium®-based Systems. Intel Press, Hillsboro, OR, 2002.
M. A. Cornea-Hasegan, R. A. Golliver, and P. Markstein. Correctness proofs outline for Newton–Raphson based floating-point divide and square root algorithms. In 14th IEEE Symposium on Computer Arithmetic (ARITH-14), pages 96–105, April 1999.
A. Dahan-Dalmedico and J. Pfeiffer. Histoire des Mathématiques. Editions du Seuil, Paris, 1986. In French.
C. Daramy-Loirat, D. Defour, F. de Dinechin, M. Gallet, N. Gast, C. Q. Lauter, and J.-M. Muller. CR-LIBM, a library of correctly-rounded elementary functions in double-precision. Technical report, LIP Laboratory, Arenaire team, December 2006. Available at https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804.
D. Das Sarma and D. W. Matula. Measuring the accuracy of ROM reciprocal tables. IEEE Transactions on Computers, 43(8):932–940, 1994.
D. Das Sarma and D. W. Matula. Faithful bipartite ROM reciprocal tables. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 17–28, June 1995.
D. Das Sarma and D. W. Matula. Faithful interpolation in reciprocal tables. In 13th IEEE Symposium on Computer Arithmetic (ARITH-13), pages 82–91, July 1997.
F. de Dinechin, A. V. Ershov, and N. Gast. Towards the post-ultimate libm. In 17th IEEE Symposium on Computer Arithmetic (ARITH-17), pages 288–295, 2005.
F. de Dinechin and A. Tisserand. Multipartite table methods. IEEE Transactions on Computers, 54(3):319–330, 2005.
T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.
M. D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, Boston, MA, 1994.
M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, CA, 2004.
M. Ercegovac, J.-M. Muller, and A. Tisserand. Simple seed architectures for reciprocal and square root reciprocal. In 39th Asilomar Conference on Signals, Systems, and Computers, November 2005.
G. Even, P.-M. Seidel, and W. E. Ferguson. A parametric error analysis of Goldschmidt’s division algorithm. Journal of Computer and System Sciences, 70(1):118–139, 2005.
W. E. Ferguson, Jr. Exact computation of a sum or difference with applications to argument reduction. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 216–221, Bath, UK, July 1995.
D. Fowler and E. Robson. Square root approximations in old Babylonian mathematics: YBC 7289 in context. Historia Mathematica, 25:366–378, 1998.
D. M. Gay. Correctly-rounded binary-decimal and decimal-binary conversions. Technical Report Numerical Analysis Manuscript 90–10, ATT & Bell Laboratories (Murray Hill, NJ), November 1990.
W. M. Gentleman and S. B. Marovich. More on algorithms that reveal properties of floating-point arithmetic units. Communications of the ACM, 17(5):276–277, 1974.
I. B. Goldberg. 27 bits are not enough for 8-digit accuracy. Communications of the ACM, 10(2):105–106, 1967.
R. E. Goldschmidt. Applications of division by convergence. Master’s thesis, Dept. of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, June 1964.
J. R. Hauser. Handling floating-point exceptions in numeric programs. ACM Transactions on Programming Languages and Systems, 18(2):139–174, 1996.
IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008, August 2008. Available at http://ieeexplore.ieee.org/servlet/opac?punumber=4610933.
W. Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, 1965.
R. Karpinski. PARANOIA: a floating-point benchmark. BYTE, 10(2), 1985.
D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.
P. Kornerup, V. Lefèvre, N. Louvet, and J.-M. Muller. On the computation of correctly rounded sums. IEEE Transactions on Computers, 61(3):289–298, 2012.
P. Kornerup and J.-M. Muller. Choosing starting values for certain Newton–Raphson iterations. Theoretical Computer Science, 351(1):101–110, 2006.
S. Linnainmaa. Software for doubled-precision floating-point computations. ACM Transactions on Mathematical Software, 7(3):272–283, 1981.
F. Loitsch. Printing floating-point numbers quickly and accurately with integers. In 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ‘10), pages 233–243, 2010.
M. A. Malcolm. Algorithms to reveal properties of floating-point arithmetic. Communications of the ACM, 15(11):949–951, 1972.
P. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research and Development, 34(1):111–119, 1990.
P. Markstein. IA-64 and Elementary Functions: Speed and Precision. Hewlett-Packard Professional Books. Prentice-Hall, Englewood Cliffs, NJ, 2000.
D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50, 1968.
O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.
J.-M. Muller. Avoiding double roundings in scaled Newton-Raphson division. In 47th Asilomar Conference on Signals, Systems, and Computers, pages 396–399, November 2013.
J.-M. Muller. Elementary Functions, Algorithms and Implementation. Birkhäuser Boston, MA, 3rd edition, 2016.
I. Newton. Methodus Fluxionum et Serierum Infinitarum. 1664–1671.
A. Panhaleux. Génération d’itérations de type Newton-Raphson pour la division de deux flottants à l’aide d’un FMA. Master’s thesis, École Normale Supérieure de Lyon, Lyon, France, 2008. In French.
B. Parhami. On the complexity of table lookup for iterative division. IEEE Transactions on Computers, C-36(10):1233–1236, 1987.
J. A. Pineiro and J. D. Bruguera. High-speed double-precision computation of reciprocal, division, square root, and inverse square root. IEEE Transactions on Computers, 51(12):1377–1388, 2002.
D. M. Priest. On Properties of Floating-Point Arithmetics: Numerical Stability and the Cost of Accurate Computations. Ph.D. thesis, University of California at Berkeley, 1992.
S. M. Rump. Solving algebraic problems with high accuracy (Habilitationsschrift). In A New Approach to Scientific Computation, pages 51–120, 1983.
S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation part I: Faithful rounding. SIAM Journal on Scientific Computing, 31(1):189–224, 2008.
J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete Computational Geometry, 18:305–363, 1997.
T. Simpson. Essays on several curious and useful subjects in speculative and mix’d mathematicks, illustrated by a variety of examples. London, 1740.
G. L. Steele, Jr. and J. L. White. How to print floating-point numbers accurately. ACM SIGPLAN Notices, 25(6):112–126, 1990.
G. L. Steele, Jr. and J. L. White. Retrospective: how to print floating-point numbers accurately. ACM SIGPLAN Notices, 39(4):372–389, 2004.
P. H. Sterbenz. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, NJ, 1974.
J. E. Stine and M. J. Schulte. The symmetric table addition method for accurate function approximation. Journal of VLSI Signal Processing, 21:167–177, 1999.
G. W. Veltkamp. ALGOL procedures voor het berekenen van een inwendig product in dubbele precisie. Technical Report 22, RC-Informatie, Technische Hogeschool Eindhoven, 1968.
G. W. Veltkamp. ALGOL procedures voor het rekenen in dubbele lengte. Technical Report 21, RC-Informatie, Technische Hogeschool Eindhoven, 1969.
T. J. Ypma. Historical development of the Newton-Raphson method. SIAM Rev., 37(4):531–551, 1995.
© 2018 Springer International Publishing AG, part of Springer Nature
Cite this chapter
Muller, JM. et al. (2018). Basic Properties and Algorithms. In: Handbook of Floating-Point Arithmetic. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-76526-6_4
Print ISBN: 978-3-319-76525-9
Online ISBN: 978-3-319-76526-6