Abstract
Infinite-precision operations incur no rounding error except when the computed result is rounded to a finite-precision value, which makes them an effective remedy for the accuracy and reproducibility concerns associated with floating-point arithmetic. This research presents an infinite-precision inner product (IP-DOT) and sparse matrix-vector multiplication (IP-SpMV) on FP64 data for manycore processors. We propose performing the 106-bit internal computations with Dot2 in the Ozaki scheme, an existing IP-DOT method. We first analyze the theoretical performance of our method using the roofline model, and then demonstrate its actual performance as IP-DOT and as reproducible conjugate gradient (CG) solvers, whose primary operation is IP-SpMV, on an Ice Lake CPU and an Ampere GPU. Although the benefits and performance depend on the input data, our IP-DOT experiments demonstrated a speedup of approximately 1.9–3.4 times over the previous method, with an execution-time overhead of approximately 10–25 times relative to the standard FP64 operation. For reproducible CG, a speedup of 1.1–1.7 times was achieved over the existing method, with an execution-time overhead of approximately 3–19 times relative to the non-reproducible standard solvers.
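For context, the sketch below illustrates the Dot2 algorithm of Ogita, Rump, and Oishi [22] on which the proposed method builds: error-free transformations (TwoSum and an FMA-based TwoProd) carry roughly 106 significand bits of the dot product through two FP64 accumulators. This is a minimal sketch, not the implementation evaluated in the paper; all function and variable names are ours.

// Dot2 (Ogita, Rump, and Oishi [22]): a dot product evaluated in roughly
// twice the FP64 working precision via error-free transformations.
// Minimal sketch; not the implementation evaluated in this paper.
#include <cmath>    // std::fma
#include <cstddef>  // std::size_t

// TwoSum (Knuth [12]): s + e == a + b exactly, with s = fl(a + b).
static inline void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double z = s - a;
    e = (a - (s - z)) + (b - z);
}

// TwoProd via FMA: p + e == a * b exactly, with p = fl(a * b).
static inline void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    e = std::fma(a, b, -p);  // exact low-order part of the product
}

// Dot2: u accumulates the high part, v the low-order terms.
double dot2(const double *x, const double *y, std::size_t n) {
    double u = 0.0, v = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double p, ep, es;
        two_prod(x[i], y[i], p, ep);  // exact product
        two_sum(u, p, u, es);         // exact accumulation into u
        v += ep + es;                 // low parts, ordinary FP64
    }
    return u + v;  // final rounding fl(u + v) to FP64 (cf. note 3)
}

Compiled with strict IEEE 754 semantics (round-to-nearest, no value-unsafe optimizations such as -ffast-math), the pair (u, v) carries about 106 significand bits, which is the precision the proposed method exploits inside the Ozaki scheme.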
Notes
- 1. Be aware, however, that infinite-precision operations do not necessarily improve the stability or accuracy of numerical algorithms.
- 2. The concept of reproducibility is independent of accuracy; it simply means that the same result can be reproduced.
- 3. The original Dot2 algorithm is designed to return the output as an FP64 value by computing \(\texttt{fl}(u+v)\) at the end.
- 4. In fact, it is even possible to adjust the accuracy of the result by varying the number of split vectors. The result is then no longer infinite-precision, but reproducibility can still be preserved; see [17] for details. (A sketch of the underlying splitting appears after these notes.)
- 5. This problem is not encountered in SpMM.
- 6. For example, XBLAS [14] supports 106-bit operations.
- 7. 9.7 TFlops is the FP64 performance without Tensor Cores; with Tensor Cores it is 19.5 TFlops, but Tensor Cores cannot be used for Dot2.
- 8. The major improvements are: (1) the use of [15], (2) the use of in-house GEMM and SpMM with asymmetric splitting on CPUs, and (3) the use of more recent vendor libraries.
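To make note 4 concrete, the following is a minimal sketch of one level of the error-free splitting that underlies the Ozaki scheme [23, 24]: each element is shifted so that its high part keeps few enough significand bits that the n pairwise products of the high parts of two split vectors accumulate in FP64 without rounding. The function name, the exact shift constant, and the zero-vector guard are our assumptions, not the paper's code.

// One level of the error-free splitting behind the Ozaki scheme [23, 24].
// Minimal sketch under our assumptions; the paper's implementation may differ.
#include <algorithm>  // std::max
#include <cmath>      // std::ceil, std::fabs, std::ldexp, std::log2
#include <cstddef>    // std::size_t
#include <vector>

void ozaki_split(const std::vector<double> &x,
                 std::vector<double> &hi, std::vector<double> &lo) {
    const std::size_t n = x.size();
    hi.assign(n, 0.0);
    lo.assign(n, 0.0);
    double mu = 0.0;  // max |x_i|
    for (double v : x) mu = std::max(mu, std::fabs(v));
    if (mu == 0.0) return;  // x is the zero vector; nothing to split
    // Shift amount: each high part keeps roughly 53 - rho significand bits,
    // so n pairwise products of two high parts sum exactly in FP64.
    const int rho = (int)std::ceil((53.0 + std::log2((double)n)) / 2.0);
    const double sigma = std::ldexp(1.0, rho + (int)std::ceil(std::log2(mu)));
    for (std::size_t i = 0; i < n; ++i) {
        hi[i] = (x[i] + sigma) - sigma;  // extract leading bits (exact)
        lo[i] = x[i] - hi[i];            // exact residual
    }
}

Applying the split recursively to the residual until it vanishes yields the full set of split vectors; the infinite-precision dot product is then the reproducible sum of the pairwise dot products of the pieces, and, as note 4 states, truncating the set of split vectors trades accuracy for speed while preserving reproducibility. The extraction (x[i] + sigma) - sigma relies on round-to-nearest and must not be reassociated by the compiler (e.g., avoid -ffast-math).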
References
Arteaga, A., Fuhrer, O., Hoefler, T.: Designing bit-reproducible portable high-performance applications. In: Proceedings of IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS 2014), pp. 1235–1244 (2014). https://doi.org/10.1109/IPDPS.2014.127
Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2009), Article 18, pp. 1–11 (2009). https://doi.org/10.1145/1654059.1654078
Briggs, K.: Implementing exact real arithmetic in Python, C++ and C. Theoret. Comput. Sci. 351(1), 74–81 (2006). https://doi.org/10.1016/j.tcs.2005.09.058
Chohra, C., Langlois, P., Parello, D.: Reproducible, accurately rounded and efficient BLAS. In: 22nd International European Conference on Parallel and Distributed Computing (Euro-Par 2016), pp. 609–620 (2016). https://doi.org/10.1007/978-3-319-58943-5_49
Collange, S., Defour, D., Graillat, S., Iakymchuk, R.: Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Comput. 49, 83–97 (2015). https://doi.org/10.1016/j.parco.2015.09.001
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011). https://doi.org/10.1145/2049662.2049663
Demmel, J., Ahrens, P., Nguyen, H.D.: Efficient Reproducible Floating Point Summation and BLAS. Technical Report UCB/EECS-2016-121, EECS Department, University of California, Berkeley (2016)
Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O., Spillinger, O.: Communication-optimal parallel recursive rectangular matrix multiplication. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 261–272 (2013). https://doi.org/10.1109/IPDPS.2013.80
Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., Zimmermann, P.: MPFR: a multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33(2), 13:1–13:15 (2007). https://doi.org/10.1145/1236463.1236468
Iakymchuk, R., Barreda, M., Graillat, S., Aliaga, J.I., Quintana-Ortí, E.S.: Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments. Int. J. High Perform. Comput. Appl. (2020). https://doi.org/10.1177/1094342020932650
Karp, A.H., Markstein, P.: High-precision division and square root. ACM Trans. Math. Softw. 23, 561–589 (1997). https://doi.org/10.1145/279232.279237
Knuth, D.E.: The Art of Computer Programming. Seminumerical Algorithms, vol. 2. Addison-Wesley, Boston (1969)
Lambov, B.: Reallib: an efficient implementation of exact real arithmetic. Math. Struct. Comp. Sci. 17(1), 81–98 (2007). https://doi.org/10.1017/S0960129506005822
Li, X.S., et al.: Design, implementation and testing of extended and mixed precision BLAS. ACM Trans. Math. Softw. 28(2), 152–205 (2002). https://doi.org/10.1145/567806.567808
Minamihata, A., Ozaki, K., Ogita, T., Oishi, S.: Preconditioner for ill-conditioned tall and skinny matrices. In: The 40th JSST Annual International Conference on Simulation Technology (JSST2016) (2016)
Mukunoki, D., Ogita, T.: Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs. J. Comput. Appl. Math. 372, 112701 (2020). https://doi.org/10.1016/j.cam.2019.112701
Mukunoki, D., Ogita, T., Ozaki, K.: Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds.) PPAM 2019. LNCS, vol. 12043, pp. 516–527. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43229-4_44
Mukunoki, D., Ozaki, K., Ogita, T., Imamura, T.: DGEMM using tensor cores, and its accurate and reproducible versions. In: Sadayappan, P., Chamberlain, B.L., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12151, pp. 230–248. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50743-5_12
Mukunoki, D., Ozaki, K., Ogita, T., Iakymchuk, R.: Conjugate gradient solvers with high accuracy and bit-wise reproducibility between CPU and GPU using Ozaki scheme. In: Proceedings of The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021), pp. 100–109 (2021). https://doi.org/10.1145/3432261.3432270
Müller, N.T.: The iRRAM: exact arithmetic in C++. In: Computability and Complexity in Analysis, pp. 222–252. Springer, Berlin, Heidelberg (2001). https://doi.org/10.1007/3-540-45335-0_14
Nakata, M.: MPLAPACK version 1.0.0 user manual (2021)
Ogita, T., Rump, S.M., Oishi, S.: Accurate sum and dot product. SIAM J. Sci. Comput. 26, 1955–1988 (2005). https://doi.org/10.1137/030601818
Ozaki, K., Ogita, T., Oishi, S., Rump, S.M.: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numer. Algorithms 59(1), 95–118 (2012). https://doi.org/10.1007/s11075-011-9478-1
Ozaki, K., Ogita, T., Oishi, S., Rump, S.M.: Generalization of error-free transformation for matrix multiplication and its application. Nonlinear Theory Appl. IEICE 4, 2–11 (2013). https://doi.org/10.1587/nolta.4.2
Rump, S.M., Ogita, T., Oishi, S.: Accurate floating-point summation Part II: sign, K-Fold faithful and rounding to nearest. SIAM J. Sci. Comput. 31(2), 1269–1302 (2009). https://doi.org/10.1137/07068816X
Todd, R.: Introduction to Conditional Numerical Reproducibility (CNR) (2012). https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
Wei, S., Tang, E., Liu, T., Müller, N.T., Chen, Z.: Automatic numerical analysis based on infinite-precision arithmetic. In: 2014 Eighth International Conference on Software Security and Reliability (SERE), pp. 216–224 (2014). https://doi.org/10.1109/SERE.2014.35
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009). https://doi.org/10.1145/1498765.1498785
Acknowledgment
This research was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant #19K20286. This research was conducted using the FUJITSU Server PRIMERGY GX2570 (Wisteria/BDEC-01) at the Information Technology Center, The University of Tokyo (project #jh220022).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mukunoki, D., Ozaki, K., Ogita, T., Imamura, T. (2023). Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol. 13826. Springer, Cham. https://doi.org/10.1007/978-3-031-30442-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30441-5
Online ISBN: 978-3-031-30442-2