On the definition of unit roundoff

Rump, Siegfried M.; Lange, Marko

doi:10.1007/s10543-015-0554-0

Published: 17 March 2015

Volume 56, pages 309–317, (2016)
BIT Numerical Mathematics

Abstract

The result of a floating-point operation is usually defined to be the floating-point number nearest to the exact real result together with a tie-breaking rule. This is called the first standard model of floating-point arithmetic, and the analysis of numerical algorithms is often solely based on that. In addition, a second standard model is used specifying the maximum relative error with respect to the computed result. In this note we take a more general perspective. For an arbitrary finite set of real numbers we identify the rounding to minimize the relative error in the first or the second standard model. The optimal “switching points” are the arithmetic or the harmonic means of adjacent floating-point numbers. Moreover, the maximum relative error of both models is minimized by taking the geometric mean. If the maximum relative error in one model is \(\alpha\), then \(\alpha/(1-\alpha)\) is the maximum relative error in the other model. Those maximal errors, that is the unit roundoff, are characteristic constants of a given finite set of reals: The floating-point model to be optimized identifies the rounding and the unit roundoff.
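The relationships stated in the abstract can be checked numerically for a single pair of adjacent numbers. The following Python sketch is an illustration only, not code from the paper. It assumes two positive adjacent numbers a < b and a switching point s with a < s < b, where reals below s round to a and reals above s round to b. The helper name `worst_errors` is hypothetical.

```python
import math


def worst_errors(a, b, s):
    """Worst-case relative errors on [a, b] for switching point s (assumes 0 < a < s < b).

    Returns (e1, e2):
      e1: error measured relative to the exact value x (first standard model),
      e2: error measured relative to the rounded value (second standard model).
    """
    # Rounding down, the error grows with x; rounding up, it shrinks with x.
    # Both branches therefore attain their supremum as x approaches s.
    e1 = max((s - a) / s, (b - s) / s)
    e2 = max((s - a) / a, (b - s) / b)
    return e1, e2


a, b = 1.0, 2.0
switching_points = {
    "arithmetic": (a + b) / 2,
    "harmonic": 2 * a * b / (a + b),
    "geometric": math.sqrt(a * b),
}

for name, s in switching_points.items():
    e1, e2 = worst_errors(a, b, s)
    print(f"{name:10s} s={s:.6f}  model 1: {e1:.6f}  model 2: {e2:.6f}")
```

For a = 1 and b = 2, the arithmetic mean gives 1/3 in the first model and 1/2 in the second, the harmonic mean gives the reverse, and the geometric mean gives \(\sqrt{2}-1\) in both. With \(\alpha = 1/3\), the value \(\alpha/(1-\alpha) = 1/2\) matches the error in the other model.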

Notes

For the standard formats \(\mathbb{F}\) in IEEE 754 the range could be slightly wider: if \(f\) denotes the rounded-to-nearest result in \(\mathbb{F}\) with infinite exponent range, return this \(f\) whenever it belongs to \(\mathbb{F}\) with the bounded exponent range. Since we are aiming at general sets \(\mathfrak{F}\), there is no notion of “exponent range”.
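One reading of this note can be illustrated with IEEE 754 binary64 in Python. The sketch below is not from the paper. It assumes that CPython's integer-to-float conversion rounds to nearest (ties to even) and raises `OverflowError` when the rounded result is not a finite double. The variable names are chosen for illustration.

```python
import sys

# Largest finite binary64 number, as an exact integer: 2**1024 - 2**971.
max_double = int(sys.float_info.max)

# Midpoint between max_double and 2**1024, which would be the next
# floating-point number if the exponent range were unbounded.
midpoint = 2**1024 - 2**970

# A real number above max_double but below the midpoint: with unbounded
# exponents its nearest neighbour is max_double, which belongs to the
# bounded format, so the conversion returns it.
just_below = float(midpoint - 1)
print(just_below == sys.float_info.max)

# At the midpoint the tie goes to 2**1024, which is outside the bounded
# format, so the conversion is expected to fail.
try:
    float(midpoint)
    overflowed = False
except OverflowError:
    overflowed = True
print(overflowed)
```

Under that assumption, the set of reals that round to a finite double extends slightly beyond the largest double, which is the wider range the note refers to.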

Acknowledgments

Our dearest thanks go to Claude-Pierre Jeannerod from Lyon for his many detailed comments and for very helpful discussions and suggestions. Moreover, many thanks to the anonymous referees for their valuable and constructive comments.

Author information

Authors and Affiliations

Institute for Reliable Computing, Hamburg University of Technology, Schwarzenbergstraße 95, 21071, Hamburg, Germany
Siegfried M. Rump & Marko Lange
Faculty of Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo, 169-8555, Japan
Siegfried M. Rump

Corresponding author

Correspondence to Siegfried M. Rump.

Additional information

Communicated by Axel Ruhe.

About this article

Cite this article

Rump, S.M., Lange, M. On the definition of unit roundoff. BIT Numer Math 56, 309–317 (2016). https://doi.org/10.1007/s10543-015-0554-0

Received: 25 August 2014
Accepted: 02 March 2015
Published: 17 March 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s10543-015-0554-0

Mathematics Subject Classification

65G50
