Basic Properties and Algorithms

Handbook of Floating-Point Arithmetic

Abstract

In this chapter, we present some short yet useful algorithms and some basic properties that can be derived from specifications of floating-point arithmetic systems, such as the ones given in the successive IEEE 754 standards. Thanks to these standards, we now have an accurate definition of floating-point formats and operations, and the behavior of a sequence of operations becomes at least partially predictable. We therefore can build algorithms and proofs that refer to these specifications.


Notes

  1.

    In some cases, intermediate calculations may be performed in a wider internal format. Some examples are given in Section 3.2.

  2.

    Beware! We remind the reader that by “no underflow” we mean that the absolute value of the result (before or after rounding, depending on the definition) is not less than the smallest normal number \(\beta^{e_{\mathrm{min}}}\). When subnormal numbers are available, as required by the IEEE 754 standards, smaller nonzero numbers can be represented, but with a precision that does not always suffice to represent the product exactly.

  3.

    When \(|\mu| \leq \beta^{p} - 1\), the difference \(s - a\) is representable with exponent \(e_a\), but not necessarily in normal form. This is why the availability of subnormal numbers is necessary.

  4.

    Do not forget that | a | ≥ | b | implies that the exponent of a is larger than or equal to that of b. Hence, it suffices to compare the two variables.
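
This condition is the one under which the classical Fast2Sum algorithm returns the exact error of a floating-point addition. A minimal Python sketch follows (Python floats are IEEE 754 binary64); the function name is ours, for illustration only, and the magnitude test simply compares the two values, as the note suggests, instead of extracting exponents.

```python
# Fast2Sum sketch: if |a| >= |b|, then s = RN(a + b) and t are computed
# with three FP operations and a + b == s + t exactly.
def fast2sum(a, b):
    if abs(b) > abs(a):   # per the note, comparing the two values suffices;
        a, b = b, a       # no exponent extraction is needed
    s = a + b             # rounded sum
    z = s - a             # exact under the |a| >= |b| condition
    t = b - z             # exact error term of the addition
    return s, t

s, t = fast2sum(1.0, 2.0**-60)   # s == 1.0, t == 2.0**-60
```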

  5.

    http://flocq.gforge.inria.fr/.

  6.

    In radix 2, we will use the fact that a (2g + 1)-bit number can be split into two g-bit numbers. This explains why Dekker’s algorithm works if the precision is even or if the radix is 2.
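
The splitting in question is usually performed with Veltkamp's algorithm. Here is a hedged Python sketch for binary64 (p = 53) with the common choice s = 27; the constant \(C = 2^s + 1\) and the variable names follow standard presentations and are not necessarily the chapter's notation.

```python
# Veltkamp splitting sketch for binary64 (p = 53): x is split into xh + xl,
# where each part fits in roughly half the significand bits, as needed for
# Dekker's exact product.
def veltkamp_split(x, s=27):
    C = float(2**s + 1)
    gamma = C * x         # rounded
    delta = x - gamma     # rounded
    xh = gamma + delta    # high-order part of x
    xl = x - xh           # low-order part; xh + xl equals x exactly
    return xh, xl

xh, xl = veltkamp_split(1.0 + 2.0**-52)   # xh == 1.0, xl == 2.0**-52
```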

  7.

    These assumptions hold on any “reasonable” floating-point system.

  8.

    For example, in the IEEE 754 binary64 format, with \(x = (2^{53} - 1) \cdot 2^{940}\) and \(y = 2^{31}\), we obtain \(x_h = 2^{993}\) and \(y_h = y\). The floating-point multiplication \(\mathop{\mathrm{RN}}\nolimits(x_h y_h)\) overflows, whereas \(\mathop{\mathrm{RN}}\nolimits(xy) = \varOmega = (2^{53} - 1) \cdot 2^{971}\).
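
This example can be checked directly in Python, whose floats are binary64; here the high part \(x_h = 2^{993}\) is taken from the note rather than recomputed.

```python
import math

# Check the note's overflow example (Python floats are IEEE 754 binary64).
x = float((2**53 - 1) * 2**940)  # exactly representable: 53-bit significand
y = 2.0**31
xh = 2.0**993                    # high part of x after splitting, per the note

# The exact product xy equals Omega = (2^53 - 1) * 2^971, the largest
# finite double, so RN(xy) does not overflow ...
assert x * y == (2.0**53 - 1.0) * 2.0**971
# ... but the product of the high parts does: 2^993 * 2^31 = 2^1024.
assert math.isinf(xh * y)
```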

  9.

    Caution: M is not necessarily the integral significand of x.

  10.

    Or q is the largest finite floating-point number Ω, in the case where xy is between that number and the overflow threshold (the same thing applies on the negative side).

  11.

    Part of what we are going to explain does not generalize to decimal arithmetic.

  12.

    For instance, the frcpa instruction of the IA-64 instruction set returns approximations to reciprocals with relative error less than or equal to \(2^{-8.886}\). Such tables are easily implemented using the bipartite method; see [131].

  13.

    When the radix is an odd number, values exactly halfway between two consecutive floating-point numbers are represented with infinitely many digits.

  14.

    A very similar study can be done when it is a power of 2.

  15.

    A necessary and sufficient condition for all numbers representable in radix β with a finite number of digits to be representable in radix γ with a finite number of digits is that β should divide an integer power of γ.
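
The criterion can be tested mechanically. The sketch below is our own helper, not from the text: a rational number has a finite expansion in radix \(\gamma\) exactly when its reduced denominator divides a power of \(\gamma\), i.e., when stripping from the denominator every prime factor shared with \(\gamma\) leaves 1.

```python
import math
from fractions import Fraction

# True iff x has a finite digit expansion in radix gamma, i.e. iff the
# reduced denominator of x divides some integer power of gamma.
def finite_in_radix(x: Fraction, gamma: int) -> bool:
    d = x.denominator
    g = math.gcd(d, gamma)
    while g > 1:
        while d % g == 0:
            d //= g
        g = math.gcd(d, gamma)
    return d == 1

# 2 divides 10, so finite binary expansions stay finite in decimal ...
assert finite_in_radix(Fraction(1, 2**20), 10)
# ... but 10 divides no power of 2: 1/10 is not finite in binary.
assert not finite_in_radix(Fraction(1, 10), 2)
```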

  16.

    This formula is valid for all possible values of \(p_2\) and \(e_{\mathrm{min}}\) (provided \(e_{\mathrm{min}} \approx -e_{\mathrm{max}}\)). And yet, for all usual formats, it can be simplified: a simple continued fraction argument (see Section A.1) shows that for \(p_2 \geq 16\) and \(e_{\mathrm{min}} \geq -28000\), it is equal to

    $$\displaystyle{-e_{\mathrm{min}} + p_{2} + \left \lfloor (e_{\mathrm{min}} + 1)\log _{10}(2)\right \rfloor.}$$
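
Assuming the usual binary64 parameters (\(p_2 = 53\), \(e_{\mathrm{min}} = -1022\)), the simplified formula can be checked against an exact decimal expansion; the witness value \((2^{53}-1) \cdot 2^{-1074}\) below is our choice of example, not the text's.

```python
import math

# Evaluate the simplified formula for binary64 (p2 = 53, emin = -1022).
p2, emin = 53, -1022
formula = -emin + p2 + math.floor((emin + 1) * math.log10(2))

# Witness: the double (2^53 - 1) * 2^-1074 equals (2^53 - 1) * 5^1074 / 10^1074,
# so its exact decimal expansion has as many significant digits as the
# integer (2^53 - 1) * 5^1074 has digits (no trailing zeros: it is odd * 5^k).
exact_digits = len(str((2**53 - 1) * 5**1074))
assert formula == exact_digits == 767
```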
  17.

    At the time of writing this book, it can be obtained at http://www.netlib.org/fp/ (file dtoa.c).

  18.

    The algorithm works for other radices. See [86] for details.

  19.

    In round-to-nearest modes, it required that the error introduced by the conversion be at most 0.97 ulp. The major reason for this somewhat weak requirement is that the conversion algorithms presented here were not known at the time that standard was designed.

  20.

    Another solution consists in using a precomputed table of powers of 10 in the binary format.

  21.

    If a wider internal format is available, one can use it and possibly save one step.

  22.

    A straightforward analysis of the error induced by the truncation of the digit chain D would give \(|\varepsilon_2| \leq 10^{-\min\{n-1,\, j\}}\), but when \(j \geq n - 1\), \(D^{*} = \hat{D}\) and there is no truncation error at all.

  23.

    The IEEE 754-2008 standard allows the conversion of out-of-range numbers, infinity or NaN. In that case, either there should be a dedicated signaling mechanism or the invalid operation exception should be signaled.

  24.

    Unless \(u_2\) is a power of 2, but this case is easily handled separately.

References

  1. S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers. The IBM 360/370 model 91: floating-point execution unit. IBM Journal of Research and Development, 1967. Reprinted in [583].
  2. M. Andrysco, R. Jhala, and S. Lerner. Printing floating-point numbers: a faster, always correct method. In 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 555–567, 2016.
  3. G. Bohlender, W. Walter, P. Kornerup, and D. W. Matula. Semantics for exact floating point operations. In 10th IEEE Symposium on Computer Arithmetic (ARITH-10), pages 22–26, June 1991.
  4. S. Boldo. Pitfalls of a full floating-point proof: example on the formal proof of the Veltkamp/Dekker algorithms. In 3rd International Joint Conference on Automated Reasoning (IJCAR), volume 4130 of Lecture Notes in Computer Science, pages 52–66, Seattle, WA, USA, 2006.
  5. S. Boldo and M. Daumas. Representable correcting terms for possibly underflowing floating point operations. In 16th IEEE Symposium on Computer Arithmetic (ARITH-16), pages 79–86, Santiago de Compostela, Spain, 2003.
  6. S. Boldo, M. Daumas, C. Moreau-Finot, and L. Théry. Computer validated proofs of a toolset for adaptable arithmetic. Technical report, École Normale Supérieure de Lyon, 2001. Available at http://arxiv.org/pdf/cs.MS/0107025.
  7. S. Boldo, S. Graillat, and J.-M. Muller. On the robustness of the 2Sum and Fast2Sum algorithms. ACM Transactions on Mathematical Software, 44(1):4:1–4:14, 2017.
  8. S. Boldo, J.-H. Jourdan, X. Leroy, and G. Melquiond. Verified compilation of floating-point computations. Journal of Automated Reasoning, 54(2):135–163, 2015.
  9. S. Boldo and G. Melquiond. Emulation of FMA and correctly rounded sums: proved algorithms using rounding to odd. IEEE Transactions on Computers, 57(4):462–471, 2008.
  10. S. Boldo and G. Melquiond. Computer Arithmetic and Formal Proofs. ISTE Press – Elsevier, 2017.
  11. S. Boldo and J.-M. Muller. Exact and approximated error of the FMA. IEEE Transactions on Computers, 60(2):157–164, 2011.
  12. A. D. Booth. A signed binary multiplication technique. Quarterly Journal of Mechanics and Applied Mathematics, 4(2):236–240, 1951. Reprinted in [583].
  13. N. Brisebarre and J.-M. Muller. Correctly rounded multiplication by arbitrary precision constants. IEEE Transactions on Computers, 57(2):165–174, 2008.
  14. R. G. Burger and R. K. Dybvig. Printing floating-point numbers quickly and accurately. In SIGPLAN’96 Conference on Programming Languages Design and Implementation (PLDI), pages 108–116, June 1996.
  15. W. D. Clinger. How to read floating-point numbers accurately. ACM SIGPLAN Notices, 25(6):92–101, 1990.
  16. W. D. Clinger. Retrospective: how to read floating-point numbers accurately. ACM SIGPLAN Notices, 39(4):360–371, 2004.
  17. M. Cornea, J. Harrison, and P. T. P. Tang. Scientific Computing on Itanium®-based Systems. Intel Press, Hillsboro, OR, 2002.
  18. M. A. Cornea-Hasegan, R. A. Golliver, and P. Markstein. Correctness proofs outline for Newton–Raphson based floating-point divide and square root algorithms. In 14th IEEE Symposium on Computer Arithmetic (ARITH-14), pages 96–105, April 1999.
  19. A. Dahan-Dalmedico and J. Pfeiffer. Histoire des Mathématiques [History of Mathematics]. Éditions du Seuil, Paris, 1986. In French.
  20. C. Daramy-Loirat, D. Defour, F. de Dinechin, M. Gallet, N. Gast, C. Q. Lauter, and J.-M. Muller. CR-LIBM, a library of correctly-rounded elementary functions in double-precision. Technical report, LIP Laboratory, Arenaire team, December 2006. Available at https://hal-ens-lyon.archives-ouvertes.fr/ensl-01529804.
  21. D. Das Sarma and D. W. Matula. Measuring the accuracy of ROM reciprocal tables. IEEE Transactions on Computers, 43(8):932–940, 1994.
  22. D. Das Sarma and D. W. Matula. Faithful bipartite ROM reciprocal tables. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 17–28, June 1995.
  23. D. Das Sarma and D. W. Matula. Faithful interpolation in reciprocal tables. In 13th IEEE Symposium on Computer Arithmetic (ARITH-13), pages 82–91, July 1997.
  24. F. de Dinechin, A. V. Ershov, and N. Gast. Towards the post-ultimate libm. In 17th IEEE Symposium on Computer Arithmetic (ARITH-17), pages 288–295, 2005.
  25. F. de Dinechin and A. Tisserand. Multipartite table methods. IEEE Transactions on Computers, 54(3):319–330, 2005.
  26. T. J. Dekker. A floating-point technique for extending the available precision. Numerische Mathematik, 18(3):224–242, 1971.
  27. M. D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, Boston, MA, 1994.
  28. M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann Publishers, San Francisco, CA, 2004.
  29. M. Ercegovac, J.-M. Muller, and A. Tisserand. Simple seed architectures for reciprocal and square root reciprocal. In 39th Asilomar Conference on Signals, Systems, and Computers, November 2005.
  30. G. Even, P.-M. Seidel, and W. E. Ferguson. A parametric error analysis of Goldschmidt’s division algorithm. Journal of Computer and System Sciences, 70(1):118–139, 2005.
  31. W. E. Ferguson, Jr. Exact computation of a sum or difference with applications to argument reduction. In 12th IEEE Symposium on Computer Arithmetic (ARITH-12), pages 216–221, Bath, UK, July 1995.
  32. D. Fowler and E. Robson. Square root approximations in old Babylonian mathematics: YBC 7289 in context. Historia Mathematica, 25:366–378, 1998.
  33. D. M. Gay. Correctly-rounded binary-decimal and decimal-binary conversions. Technical Report Numerical Analysis Manuscript 90-10, AT&T Bell Laboratories (Murray Hill, NJ), November 1990.
  34. W. M. Gentleman and S. B. Marovich. More on algorithms that reveal properties of floating-point arithmetic units. Communications of the ACM, 17(5):276–277, 1974.
  35. I. B. Goldberg. 27 bits are not enough for 8-digit accuracy. Communications of the ACM, 10(2):105–106, 1967.
  36. R. E. Goldschmidt. Applications of division by convergence. Master’s thesis, Dept. of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, June 1964.
  37. J. R. Hauser. Handling floating-point exceptions in numeric programs. ACM Transactions on Programming Languages and Systems, 18(2):139–174, 1996.
  38. IEEE Computer Society. IEEE Standard for Floating-Point Arithmetic. IEEE Standard 754-2008, August 2008. Available at http://ieeexplore.ieee.org/servlet/opac?punumber=4610933.
  39. W. Kahan. Pracniques: further remarks on reducing truncation errors. Communications of the ACM, 8(1):40, 1965.
  40. R. Karpinsky. PARANOIA: a floating-point benchmark. BYTE, 10(2), 1985.
  41. D. E. Knuth. The Art of Computer Programming, volume 2. Addison-Wesley, Reading, MA, 3rd edition, 1998.
  42. P. Kornerup, V. Lefèvre, N. Louvet, and J.-M. Muller. On the computation of correctly rounded sums. IEEE Transactions on Computers, 61(3):289–298, 2012.
  43. P. Kornerup and J.-M. Muller. Choosing starting values for certain Newton–Raphson iterations. Theoretical Computer Science, 351(1):101–110, 2006.
  44. S. Linnainmaa. Software for doubled-precision floating-point computations. ACM Transactions on Mathematical Software, 7(3):272–283, 1981.
  45. F. Loitsch. Printing floating-point numbers quickly and accurately with integers. In 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’10), pages 233–243, 2010.
  46. M. A. Malcolm. Algorithms to reveal properties of floating-point arithmetic. Communications of the ACM, 15(11):949–951, 1972.
  47. P. Markstein. Computation of elementary functions on the IBM RISC System/6000 processor. IBM Journal of Research and Development, 34(1):111–119, 1990.
  48. P. Markstein. IA-64 and Elementary Functions: Speed and Precision. Hewlett-Packard Professional Books. Prentice-Hall, Englewood Cliffs, NJ, 2000.
  49. D. W. Matula. In-and-out conversions. Communications of the ACM, 11(1):47–50, 1968.
  50. O. Møller. Quasi double-precision in floating-point addition. BIT, 5:37–50, 1965.
  51. J.-M. Muller. Avoiding double roundings in scaled Newton–Raphson division. In 47th Asilomar Conference on Signals, Systems, and Computers, pages 396–399, November 2013.
  52. J.-M. Muller. Elementary Functions, Algorithms and Implementation. Birkhäuser Boston, MA, 3rd edition, 2016.
  53. I. Newton. Methodus Fluxionum et Serierum Infinitarum. 1664–1671.
  54. A. Panhaleux. Génération d’itérations de type Newton-Raphson pour la division de deux flottants à l’aide d’un FMA [Generating Newton-Raphson iterations for the division of two floating-point numbers using an FMA]. Master’s thesis, École Normale Supérieure de Lyon, Lyon, France, 2008. In French.
  55. B. Parhami. On the complexity of table lookup for iterative division. IEEE Transactions on Computers, C-36(10):1233–1236, 1987.
  56. J. A. Pineiro and J. D. Bruguera. High-speed double-precision computation of reciprocal, division, square root, and inverse square root. IEEE Transactions on Computers, 51(12):1377–1388, 2002.
  57. D. M. Priest. On Properties of Floating-Point Arithmetics: Numerical Stability and the Cost of Accurate Computations. Ph.D. thesis, University of California at Berkeley, 1992.
  58. S. M. Rump. Solving algebraic problems with high accuracy (Habilitationsschrift). In A New Approach to Scientific Computation, pages 51–120, 1983.
  59. S. M. Rump, T. Ogita, and S. Oishi. Accurate floating-point summation part I: faithful rounding. SIAM Journal on Scientific Computing, 31(1):189–224, 2008.
  60. J. R. Shewchuk. Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry, 18:305–363, 1997.
  61. T. Simpson. Essays on several curious and useful subjects in speculative and mix’d mathematicks, illustrated by a variety of examples. London, 1740.
  62. G. L. Steele, Jr. and J. L. White. How to print floating-point numbers accurately. ACM SIGPLAN Notices, 25(6):112–126, 1990.
  63. G. L. Steele, Jr. and J. L. White. Retrospective: how to print floating-point numbers accurately. ACM SIGPLAN Notices, 39(4):372–389, 2004.
  64. P. H. Sterbenz. Floating-Point Computation. Prentice-Hall, Englewood Cliffs, NJ, 1974.
  65. J. E. Stine and M. J. Schulte. The symmetric table addition method for accurate function approximation. Journal of VLSI Signal Processing, 21:167–177, 1999.
  66. G. W. Veltkamp. ALGOL procedures voor het berekenen van een inwendig product in dubbele precisie [ALGOL procedures for computing an inner product in double precision]. Technical Report 22, RC-Informatie, Technische Hogeschool Eindhoven, 1968. In Dutch.
  67. G. W. Veltkamp. ALGOL procedures voor het rekenen in dubbele lengte [ALGOL procedures for computing in double length]. Technical Report 21, RC-Informatie, Technische Hogeschool Eindhoven, 1969. In Dutch.
  68. T. J. Ypma. Historical development of the Newton–Raphson method. SIAM Review, 37(4):531–551, 1995.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Muller, JM. et al. (2018). Basic Properties and Algorithms. In: Handbook of Floating-Point Arithmetic. Birkhäuser, Cham. https://doi.org/10.1007/978-3-319-76526-6_4
