
Parallel Vectorized Implementations of Compensated Summation Algorithms

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2022)

Abstract

The aim of this paper is to show that Kahan's and the Gill-Møller compensated summation algorithms, which make it possible to sum long sequences of floating-point numbers with high accuracy, can be efficiently vectorized and parallelized. The new implementation uses Intel AVX-512 intrinsics together with OpenMP constructs to exploit the SIMD extensions of modern multicore processors. We describe the vectorization technique in detail and show how to define custom reduction operators in OpenMP. Numerical experiments performed on a server with Intel Xeon Gold 6342 processors show that the new implementations of the compensated summation algorithms achieve much better accuracy than ordinary summation, while their performance is comparable to that of an automatically optimized ordinary summation algorithm. Moreover, the experiments show that the vectorized implementation of the Gill-Møller algorithm is faster than the vectorized implementation of Kahan's algorithm.



Author information

Correspondence to Przemysław Stpiczyński.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Dmitruk, B., Stpiczyński, P. (2023). Parallel Vectorized Implementations of Compensated Summation Algorithms. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_6

  • DOI: https://doi.org/10.1007/978-3-031-30445-3_6
  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30444-6

  • Online ISBN: 978-3-031-30445-3

  • eBook Packages: Computer Science, Computer Science (R0)
