Abstract
The aim of this paper is to show that Kahan's and the Gill-Møller compensated summation algorithms, which achieve high accuracy when summing long sequences of floating-point numbers, can be efficiently vectorized and parallelized. The new implementation uses Intel AVX-512 intrinsics together with OpenMP constructs to exploit the SIMD extensions of modern multicore processors. We describe the vectorization technique in detail and show how to define custom reduction operators in OpenMP. Numerical experiments performed on a server with Intel Xeon Gold 6342 processors show that the new implementations of the compensated summation algorithms achieve much better accuracy than ordinary summation, while their performance is comparable to that of an automatically optimized ordinary summation algorithm. Moreover, the experiments show that the vectorized implementation of the Gill-Møller algorithm is faster than the vectorized implementation of Kahan's algorithm.
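For context, the two compensated summation schemes named in the abstract can be sketched in scalar C as follows. This is a minimal illustration of the well-known algorithms, not the paper's vectorized implementation: Kahan's method carries a running correction term applied to each addend, while the Gill-Møller method accumulates the per-step rounding errors separately and folds them in once at the end.

```c
#include <stddef.h>

/* Kahan compensated summation: c captures the low-order bits
   lost in each addition and feeds them into the next one. */
double kahan_sum(const double *x, size_t n) {
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;  /* apply the pending correction */
        double t = s + y;     /* low-order digits of y may be lost here */
        c = (t - s) - y;      /* recover the excess that was added */
        s = t;
    }
    return s;
}

/* Gill-Møller summation: e accumulates the rounding error of
   every step; it is added to the sum only once, at the end. */
double gill_moller_sum(const double *x, size_t n) {
    double s = 0.0, old = 0.0, e = 0.0;
    for (size_t i = 0; i < n; i++) {
        s = old + x[i];
        e += x[i] - (s - old);  /* rounding error of this step */
        old = s;
    }
    return s + e;
}
```

The shorter dependency chain in the Gill-Møller loop body (the error accumulation is independent of the next addition) is consistent with the abstract's observation that its vectorized form outruns Kahan's.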
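The abstract also mentions defining custom reduction operators in OpenMP. The paper's own operators are not reproduced here, but the general mechanism is `#pragma omp declare reduction`; the following hedged sketch shows one plausible way to reduce a (sum, correction) pair so that each thread runs Kahan summation on its chunk and the partial compensated sums are then merged. The struct name `ksum_t` and the combine rule are illustrative assumptions, not taken from the paper.

```c
#include <stddef.h>

/* Pair of running sum s and Kahan correction c (illustrative type). */
typedef struct { double s, c; } ksum_t;

/* One compensated addition into the accumulator. */
static inline void ksum_add(ksum_t *a, double x) {
    double y = x - a->c;
    double t = a->s + y;
    a->c = (t - a->s) - y;
    a->s = t;
}

/* Merge two partial compensated sums: add b's sum with compensation,
   then carry over b's pending correction as well. */
static inline ksum_t ksum_combine(ksum_t a, ksum_t b) {
    ksum_add(&a, b.s);
    a.c += b.c;
    return a;
}

/* Custom OpenMP reduction over the (s, c) pair. */
#pragma omp declare reduction(ksum : ksum_t :                 \
        omp_out = ksum_combine(omp_out, omp_in))              \
        initializer(omp_priv = (ksum_t){0.0, 0.0})

double kahan_sum_omp(const double *x, size_t n) {
    ksum_t acc = {0.0, 0.0};
    #pragma omp parallel for reduction(ksum : acc)
    for (size_t i = 0; i < n; i++)
        ksum_add(&acc, x[i]);
    return acc.s - acc.c;  /* apply the final pending correction */
}
```

Without `-fopenmp` the pragmas are ignored and the function degenerates to the sequential Kahan loop, so the result is accurate either way; the paper's actual operators additionally reduce AVX-512 vector lanes.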
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Dmitruk, B., Stpiczyński, P. (2023). Parallel Vectorized Implementations of Compensated Summation Algorithms. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol 13827. Springer, Cham. https://doi.org/10.1007/978-3-031-30445-3_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30444-6
Online ISBN: 978-3-031-30445-3