Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators

  • Ahmad Abdelfattah
  • Jack Dongarra
  • David Keyes
  • Hatem Ltaief
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7851)

Abstract

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount for improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product (SYMV) on NVIDIA Fermi GPUs. Because of its inherently memory-bound nature, this kernel is critical to the tridiagonalization of a symmetric dense matrix, which is a preprocessing step in computing the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the equivalent CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
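
As an illustration of the access pattern such a kernel must optimize, the sketch below is a deliberately naive CUDA SYMV (y = alpha*A*x + beta*y) that reads only the lower triangle of a column-major matrix and reconstructs the upper part by symmetry. This is a minimal sketch under assumed conventions (the kernel name symv_lower_naive, column-major storage with leading dimension lda, and one thread per row are all hypothetical); it is not the blocked, latency-hiding design proposed in the paper, but it exposes the irregular reads that an optimized kernel is built to avoid.

```cuda
// Naive illustrative SYMV: y = alpha*A*x + beta*y, with A symmetric and only
// its lower triangle referenced. NOT the paper's optimized kernel; the paper's
// design additionally blocks the matrix, stages tiles in shared memory, and
// overlaps the lower/upper contributions to hide memory latency.
#include <cuda_runtime.h>

__global__ void symv_lower_naive(int n, double alpha,
                                 const double *A, int lda,
                                 const double *x, double beta, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row handled by this thread
    if (i >= n) return;

    double sum = 0.0;
    for (int j = 0; j < n; ++j) {
        // A is stored column-major; only the lower triangle is touched.
        double aij = (j <= i) ? A[i + j * lda]   // element (i,j), lower part
                              : A[j + i * lda];  // element (j,i) = (i,j) by symmetry
        sum += aij * x[j];
    }
    y[i] = alpha * sum + beta * y[i];
}
```

Launched, for example, as symv_lower_naive<<<(n + 255) / 256, 256>>>(n, alpha, dA, n, dx, beta, dy), the j <= i iterations read contiguously down a column (coalesced across threads), while the j > i iterations read with stride lda; the latter are exactly the irregular, non-coalesced accesses that the optimized design hides through latency overlap and higher effective bandwidth.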


References

  1. Matrix Algebra on GPU and Multicore Architectures. Innovative Computing Laboratory, University of Tennessee, http://icl.cs.utk.edu/magma/
  2.
  3. Performance Application Programming Interface (PAPI). Innovative Computing Laboratory, University of Tennessee, http://icl.cs.utk.edu/papi/
  4. Datta, K., Williams, S., Volkov, V., Carter, J., Oliker, L., Shalf, J., Yelick, K.: Auto-tuning the 27-Point Stencil for Multicore. In: Proc. iWAPT 2009: The Fourth International Workshop on Automatic Performance Tuning (2009)
  5. Glaskowsky, P.N.: NVIDIA's Fermi: The First Complete GPU Computing Architecture. Technical report (2009)
  6. Kirk, D., Hwu, W.-m. W.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann (2010)
  7. Kurzak, J., Buttari, A., Dongarra, J.J.: Solving systems of linear equations on the CELL processor using Cholesky factorization. IEEE Transactions on Parallel and Distributed Systems 19(9), 1–11 (2008)
  8. McCalpin, J.: STREAM: Sustainable memory bandwidth in high performance computers, http://www.cs.virginia.edu/stream/
  9. Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing symmetric dense matrix-vector multiplication on GPUs. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 6:1–6:10. ACM, New York (2011)
  10. Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU Kernels for Dense Linear Algebra. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 83–92. Springer, Heidelberg (2011)
  11. Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 31:1–31:11. IEEE Press, Piscataway (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ahmad Abdelfattah (1)
  • Jack Dongarra (2)
  • David Keyes (1)
  • Hatem Ltaief (3)
  1. Division of Mathematical and Computer Sciences and Engineering, KAUST, Thuwal, Saudi Arabia
  2. Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
  3. Supercomputing Laboratory, KAUST, Thuwal, Saudi Arabia
