Abstract
Infinite-precision operations incur no rounding error except when the computed result is rounded to a finite-precision value, which makes them an effective remedy for the accuracy and reproducibility concerns associated with floating-point arithmetic. This research presents an infinite-precision inner product (IP-DOT) and sparse matrix-vector multiplication (IP-SpMV) on FP64 data for manycore processors. We propose performing the 106-bit internal computations with Dot2 in the Ozaki scheme, an existing IP-DOT method. We first analyze the theoretical performance of our method using the roofline model, and then demonstrate its actual performance as IP-DOT and as reproducible conjugate gradient (CG) solvers, whose primary operation is IP-SpMV, on an Ice Lake CPU and an Ampere GPU. Although the benefits and performance depend on the input data, our IP-DOT experiments demonstrated a speedup of approximately 1.9–3.4 times over the previous method, with an execution-time overhead of approximately 10–25 times relative to the standard FP64 operation. For reproducible CG, a speedup of 1.1–1.7 times was achieved over the existing method, with an execution-time overhead of approximately 3–19 times relative to the non-reproducible standard solvers.
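For context, the sketch below illustrates the Dot2 algorithm of Ogita, Rump, and Oishi [22] on which the proposed method builds: error-free transformations (TwoSum and an FMA-based TwoProd) carry roughly 106 significand bits of the dot product through two FP64 accumulators. This is a minimal sketch, not the implementation evaluated in the paper; all function and variable names are ours.

// Dot2 (Ogita, Rump, and Oishi [22]): a dot product evaluated in roughly
// twice the FP64 working precision via error-free transformations.
// Minimal sketch; not the implementation evaluated in this paper.
#include <cmath>    // std::fma
#include <cstddef>  // std::size_t

// TwoSum (Knuth [12]): s + e == a + b exactly, with s = fl(a + b).
static inline void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double z = s - a;
    e = (a - (s - z)) + (b - z);
}

// TwoProd via FMA: p + e == a * b exactly, with p = fl(a * b).
static inline void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    e = std::fma(a, b, -p);  // exact low-order part of the product
}

// Dot2: u accumulates the high part, v the low-order terms.
double dot2(const double *x, const double *y, std::size_t n) {
    double u = 0.0, v = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double p, ep, es;
        two_prod(x[i], y[i], p, ep);  // exact product
        two_sum(u, p, u, es);         // exact accumulation into u
        v += ep + es;                 // low parts, ordinary FP64
    }
    return u + v;  // final rounding fl(u + v) to FP64 (cf. note 3)
}

Compiled with strict IEEE 754 semantics (round-to-nearest, no value-unsafe optimizations such as -ffast-math), the pair (u, v) carries about 106 significand bits, which is the precision the proposed method exploits inside the Ozaki scheme.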
Notes
- 1. Be aware, however, that infinite-precision operations do not necessarily improve the stability or accuracy of numerical algorithms.
- 2. The concept of reproducibility is independent of accuracy; it simply means that the same result can be reproduced.
- 3. The original Dot2 algorithm is designed to return the output as an FP64 value by computing \(\texttt{fl}(u+v)\) at the end.
- 4. In fact, it is even possible to adjust the accuracy of the result by varying the number of split vectors. The result is then no longer infinite-precision, but reproducibility can still be preserved; see [17] for details. (A sketch of the underlying splitting appears after these notes.)
- 5. This problem is not encountered in SpMM.
- 6. For example, XBLAS [14] supports 106-bit operations.
- 7. 9.7 TFlops is the FP64 performance without Tensor Cores; with Tensor Cores it is 19.5 TFlops, but Tensor Cores cannot be used for Dot2.
- 8. The major improvements are: (1) the use of [15], (2) the use of in-house GEMM and SpMM with asymmetric splitting on CPUs, and (3) the use of more recent vendor libraries.
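To make note 4 concrete, the following is a minimal sketch of one level of the error-free splitting that underlies the Ozaki scheme [23, 24]: each element is shifted so that its high part keeps few enough significand bits that the n pairwise products of the high parts of two split vectors accumulate in FP64 without rounding. The function name, the exact shift constant, and the zero-vector guard are our assumptions, not the paper's code.

// One level of the error-free splitting behind the Ozaki scheme [23, 24].
// Minimal sketch under our assumptions; the paper's implementation may differ.
#include <algorithm>  // std::max
#include <cmath>      // std::ceil, std::fabs, std::ldexp, std::log2
#include <cstddef>    // std::size_t
#include <vector>

void ozaki_split(const std::vector<double> &x,
                 std::vector<double> &hi, std::vector<double> &lo) {
    const std::size_t n = x.size();
    hi.assign(n, 0.0);
    lo.assign(n, 0.0);
    double mu = 0.0;  // max |x_i|
    for (double v : x) mu = std::max(mu, std::fabs(v));
    if (mu == 0.0) return;  // x is the zero vector; nothing to split
    // Shift amount: each high part keeps roughly 53 - rho significand bits,
    // so n pairwise products of two high parts sum exactly in FP64.
    const int rho = (int)std::ceil((53.0 + std::log2((double)n)) / 2.0);
    const double sigma = std::ldexp(1.0, rho + (int)std::ceil(std::log2(mu)));
    for (std::size_t i = 0; i < n; ++i) {
        hi[i] = (x[i] + sigma) - sigma;  // extract leading bits (exact)
        lo[i] = x[i] - hi[i];            // exact residual
    }
}

Applying the split recursively to the residual until it vanishes yields the full set of split vectors; the infinite-precision dot product is then the reproducible sum of the pairwise dot products of the pieces, and, as note 4 states, truncating the set of split vectors trades accuracy for speed while preserving reproducibility. The extraction (x[i] + sigma) - sigma relies on round-to-nearest and must not be reassociated by the compiler (e.g., avoid -ffast-math).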
References
Arteaga, A., Fuhrer, O., Hoefler, T.: Designing bit-reproducible portable high-performance applications. In: Proceedings of IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS 2014), pp. 1235–1244 (2014). https://doi.org/10.1109/IPDPS.2014.127
Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughput-oriented processors. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2009), Article 18, pp. 1–11 (2009). https://doi.org/10.1145/1654059.1654078
Briggs, K.: Implementing exact real arithmetic in Python, C++ and C. Theoret. Comput. Sci. 351(1), 74–81 (2006). https://doi.org/10.1016/j.tcs.2005.09.058
Chohra, C., Langlois, P., Parello, D.: Reproducible, accurately rounded and efficient BLAS. In: 22nd International European Conference on Parallel and Distributed Computing (Euro-Par 2016), pp. 609–620 (2016). https://doi.org/10.1007/978-3-319-58943-5_49
Collange, S., Defour, D., Graillat, S., Iakymchuk, R.: Numerical reproducibility for the parallel reduction on multi- and many-core architectures. Parallel Comput. 49, 83–97 (2015). https://doi.org/10.1016/j.parco.2015.09.001
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011). https://doi.org/10.1145/2049662.2049663
Demmel, J., Ahrens, P., Nguyen, H.D.: Efficient Reproducible Floating Point Summation and BLAS. Technical Report UCB/EECS-2016-121, EECS Department, University of California, Berkeley (2016)
Demmel, J., Eliahu, D., Fox, A., Kamil, S., Lipshitz, B., Schwartz, O., Spillinger, O.: Communication-optimal parallel recursive rectangular matrix multiplication. In: 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp. 261–272 (2013). https://doi.org/10.1109/IPDPS.2013.80
Fousse, L., Hanrot, G., Lefèvre, V., Pélissier, P., Zimmermann, P.: MPFR: a multiple-precision binary floating-point library with correct rounding. ACM Trans. Math. Softw. 33(2), 13:1–13:15 (2007). https://doi.org/10.1145/1236463.1236468
Iakymchuk, R., Barreda, M., Graillat, S., Aliaga, J.I., Quintana-Ortí, E.S.: Reproducibility of parallel preconditioned conjugate gradient in hybrid programming environments. Int. J. High Perform. Comput. Appl. (2020). https://doi.org/10.1177/1094342020932650
Karp, A.H., Markstein, P.: High-precision division and square root. ACM Trans. Math. Softw. 23, 561–589 (1997). https://doi.org/10.1145/279232.279237
Knuth, D.E.: The Art of Computer Programming. Seminumerical Algorithms, vol. 2. Addison-Wesley, Boston (1969)
Lambov, B.: Reallib: an efficient implementation of exact real arithmetic. Math. Struct. Comp. Sci. 17(1), 81–98 (2007). https://doi.org/10.1017/S0960129506005822
Li, X.S., et al.: Design, implementation and testing of extended and mixed precision BLAS. ACM Trans. Math. Softw. 28(2), 152–205 (2002). https://doi.org/10.1145/567806.567808
Minamihata, A., Ozaki, K., Ogita, T., Oishi, S.: Preconditioner for ill-conditioned tall and skinny matrices. In: The 40th JSST Annual International Conference on Simulation Technology (JSST2016) (2016)
Mukunoki, D., Ogita, T.: Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs. J. Comput. Appl. Math. 372, 112701 (2020). https://doi.org/10.1016/j.cam.2019.112701
Mukunoki, D., Ogita, T., Ozaki, K.: Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K. (eds.) PPAM 2019. LNCS, vol. 12043, pp. 516–527. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-43229-4_44
Mukunoki, D., Ozaki, K., Ogita, T., Imamura, T.: DGEMM using tensor cores, and its accurate and reproducible versions. In: Sadayappan, P., Chamberlain, B.L., Juckeland, G., Ltaief, H. (eds.) ISC High Performance 2020. LNCS, vol. 12151, pp. 230–248. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-50743-5_12
Mukunoki, D., Ozaki, K., Ogita, T., Iakymchuk, R.: Conjugate gradient solvers with high accuracy and bit-wise reproducibility between CPU and GPU using Ozaki scheme. In: Proceedings of The International Conference on High Performance Computing in Asia-Pacific Region (HPC Asia 2021), pp. 100–109 (2021). https://doi.org/10.1145/3432261.3432270
Müller, N.T.: The iRRAM: exact arithmetic in C++. In: Computability and Complexity in Analysis, pp. 222–252. Springer, Berlin, Heidelberg (2001). https://doi.org/10.1007/3-540-45335-0_14
Nakata, M.: MPLAPACK version 1.0.0 user manual (2021)
Ogita, T., Rump, S.M., Oishi, S.: Accurate sum and dot product. SIAM J. Sci. Comput. 26, 1955–1988 (2005). https://doi.org/10.1137/030601818
Ozaki, K., Ogita, T., Oishi, S., Rump, S.M.: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numer. Algorithms 59(1), 95–118 (2012). https://doi.org/10.1007/s11075-011-9478-1
Ozaki, K., Ogita, T., Oishi, S., Rump, S.M.: Generalization of error-free transformation for matrix multiplication and its application. Nonlinear Theory Appl. IEICE 4, 2–11 (2013). https://doi.org/10.1587/nolta.4.2
Rump, S.M., Ogita, T., Oishi, S.: Accurate floating-point summation Part II: sign, K-Fold faithful and rounding to nearest. SIAM J. Sci. Comput. 31(2), 1269–1302 (2009). https://doi.org/10.1137/07068816X
Todd, R.: Introduction to Conditional Numerical Reproducibility (CNR) (2012). https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr
Wei, S., Tang, E., Liu, T., Müller, N.T., Chen, Z.: Automatic numerical analysis based on infinite-precision arithmetic. In: 2014 Eighth International Conference on Software Security and Reliability (SERE), pp. 216–224 (2014). https://doi.org/10.1109/SERE.2014.35
Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009). https://doi.org/10.1145/1498765.1498785
Acknowledgment
This research was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant #19K20286. This research was conducted using the FUJITSU Server PRIMERGY GX2570 (Wisteria/BDEC-01) at the Information Technology Center, The University of Tokyo (project #jh220022).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mukunoki, D., Ozaki, K., Ogita, T., Imamura, T. (2023). Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) Parallel Processing and Applied Mathematics. PPAM 2022. Lecture Notes in Computer Science, vol. 13826. Springer, Cham. https://doi.org/10.1007/978-3-031-30442-2_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30441-5
Online ISBN: 978-3-031-30442-2