Abstract
Batched linear algebra problems are becoming increasingly important in engineering and scientific applications. As the performance of graphics processing units (GPUs) improves rapidly, GPUs are an attractive platform for solving this class of problems. This paper presents a parallel blocked Jacobi SVD algorithm for many small matrices on GPUs. The algorithm fully exploits the parallelism of the Jacobi method, and its blocked structure maps naturally onto the GPU memory hierarchy. Reduction operations for computing inner products, which suffer from low thread utilization, are replaced by applying the Jacobi rotation to the Gram matrix in parallel. We identify kernels that share data and fuse them to improve memory locality, placing the shared data, originally passed via off-chip global memory, into on-chip shared memory. Numerical results on an NVIDIA Tesla V100 GPU show that our batched SVD routine outperforms state-of-the-art approaches by between \(2.0\times \) and \(4.1\times \) on the examples tested. As one application of our routine, we test the numerical simulation of quantum lattice systems, achieving up to \(54.1\times \) speedup over a CPU implementation running on a 48-core Xeon CPU.
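To make the Gram-matrix idea concrete, the following is a minimal NumPy sketch of a one-sided Jacobi SVD for a single small matrix. It is an illustration of the general technique only, not the authors' GPU implementation: each plane rotation is chosen from three entries of the \(2\times 2\) Gram matrix of a column pair (three inner products), which is the quantity the paper proposes to update in parallel in place of per-step reductions. The function name and tolerances are ours.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """One-sided Jacobi SVD: orthogonalize the columns of A by plane rotations.

    Each rotation is determined by the 2x2 Gram matrix of a column pair
    (entries aii, ajj, aij), so only three inner products are needed per step.
    Returns U, sigma, V with A ~= U @ diag(sigma) @ V.T (sigma unsorted).
    """
    U = np.array(A, dtype=float, copy=True)
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                # Entries of the 2x2 Gram matrix for columns i and j.
                aii = U[:, i] @ U[:, i]
                ajj = U[:, j] @ U[:, j]
                aij = U[:, i] @ U[:, j]
                if abs(aij) > tol * np.sqrt(aii * ajj):
                    converged = False
                    # Rotation angle that zeroes the off-diagonal Gram entry,
                    # via the numerically stable small-root formula.
                    tau = (ajj - aii) / (2.0 * aij)
                    sign = 1.0 if tau >= 0.0 else -1.0
                    t = sign / (abs(tau) + np.sqrt(1.0 + tau * tau))
                    c = 1.0 / np.sqrt(1.0 + t * t)
                    s = c * t
                    G = np.array([[c, s], [-s, c]])
                    U[:, [i, j]] = U[:, [i, j]] @ G
                    V[:, [i, j]] = V[:, [i, j]] @ G
        if converged:
            break
    sigma = np.linalg.norm(U, axis=0)
    return U / sigma, sigma, V
```

In a batched GPU version, one thread block would typically own one matrix, with the Gram entries of all independent column pairs of a sweep computed concurrently in shared memory; the serial double loop above exists only to keep the sketch readable.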
This work is supported by the National Key Research and Development Program of China (2017YFB0202202) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC05000000).
Acknowledgment
We would like to acknowledge He L. and Dong S. for helpful conversations and insights on numerical simulations of quantum lattice systems.
© 2022 Springer Nature Switzerland AG
Huang, R., Yu, T., Liu, S., Zhang, X., Zhao, Y. (2022). A Batched Jacobi SVD Algorithm on GPUs and Its Application to Quantum Lattice Systems. In: Shen, H., et al. Parallel and Distributed Computing, Applications and Technologies. PDCAT 2021. Lecture Notes in Computer Science(), vol 13148. Springer, Cham. https://doi.org/10.1007/978-3-030-96772-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96771-0
Online ISBN: 978-3-030-96772-7