Abstract
Parallel computing is almost the only practical way to solve complex problems in scientific computing, and global reduction is one of its most frequently used operations. Because of floating-point rounding errors, existing global reduction algorithms may produce results that are inaccurate or that differ between runs, which makes it difficult to meet the needs of complex applications. The communication cost of RingAllreduce is constant, independent of the number of processes, so it is an effective algorithm when a large amount of data must be communicated. However, it suffers from the same rounding problem as general global reduction operations, so a high-precision RingAllreduce algorithm is needed. In this paper, we combine double-double arithmetic with the RingAllreduce algorithm to obtain a high-precision RingAllreduce algorithm, called ddRingAllreduce. We analyze the theoretical error of the proposed algorithm and derive tight error bounds. Extensive parallel numerical experiments give results consistent with the theoretical analysis, and ddRingAllreduce remains accurate in cases where RingAllreduce is inaccurate or returns a wrong result. We also study experimentally how the cost of double-double arithmetic depends on the problem size: for small problem sizes, ddRingAllreduce achieves higher accuracy with relatively little additional time.
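The idea of pairing double-double arithmetic with a ring reduction can be illustrated with a minimal sketch. It is not the authors' implementation: the ring is simulated inside one Python process rather than with MPI, the names `two_sum`, `dd_add`, and `ring_allreduce` are hypothetical, and `dd_add` is a simple textbook-style double-double addition built on Knuth's TwoSum, chosen here only for illustration.

```python
import operator
from typing import Callable, List, Tuple

DD = Tuple[float, float]  # (hi, lo): the represented value is hi + lo


def two_sum(a: float, b: float) -> DD:
    """Knuth's TwoSum: returns s = fl(a + b) and the rounding error e."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def dd_add(x: DD, y: DD) -> DD:
    """Simple double-double addition (illustrative variant, an assumption)."""
    s, e = two_sum(x[0], y[0])   # add the high parts and capture the error
    e += x[1] + y[1]             # fold in the low parts
    return two_sum(s, e)         # renormalize into (hi, lo)


def ring_allreduce(data: List[List[list]], add: Callable) -> List[List[list]]:
    """Simulated ring allreduce. data[r][c] is chunk c held by rank r."""
    p = len(data)
    buf = [[list(chunk) for chunk in rank] for rank in data]
    # Phase 1, reduce-scatter: in each step rank r sends one chunk to rank
    # r + 1, which accumulates it into its own copy of that chunk.
    for step in range(p - 1):
        sent = [(r, (r - step) % p) for r in range(p)]
        payloads = [list(buf[r][c]) for r, c in sent]  # snapshot before updating
        for (r, c), payload in zip(sent, payloads):
            dst = (r + 1) % p
            buf[dst][c] = [add(u, v) for u, v in zip(buf[dst][c], payload)]
    # Phase 2, allgather: fully reduced chunks travel around the ring and
    # overwrite the partial copies.
    for step in range(p - 1):
        sent = [(r, (r + 1 - step) % p) for r in range(p)]
        payloads = [list(buf[r][c]) for r, c in sent]
        for (r, c), payload in zip(sent, payloads):
            buf[(r + 1) % p][c] = payload
    return buf


# One contribution per rank, repeated in every chunk; the exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]
p = len(values)
plain_in = [[[v] for _ in range(p)] for v in values]
dd_in = [[[(v, 0.0)] for _ in range(p)] for v in values]

plain_out = ring_allreduce(plain_in, operator.add)
dd_out = ring_allreduce(dd_in, dd_add)

print("double       :", [chunk[0] for chunk in plain_out[0]])
print("double-double:", [chunk[0][0] + chunk[0][1] for chunk in dd_out[0]])
```

In this toy input the small terms are absorbed by the large ones when plain doubles are used, and each chunk is summed in a different rank order, so the chunks do not even agree with one another. The double-double version carries the rounding error along the ring in the low word and recovers the exact sum in every chunk.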
Acknowledgements
The second author was supported by the Foundation of the Key Laboratory of Computational Physics, China. The third author was supported by the National Natural Science Foundation of China (62032023).
Ethics declarations
Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
About this article
Cite this article
Lei, X., Gu, T. & Xu, X. ddRingAllreduce: a high-precision RingAllreduce algorithm. CCF Trans. HPC 5, 245–257 (2023). https://doi.org/10.1007/s42514-023-00150-2
DOI: https://doi.org/10.1007/s42514-023-00150-2