
Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators

  • Conference paper
High Performance Computing for Computational Science - VECPAR 2012 (VECPAR 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7851)

Abstract

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product on nVidia Fermi GPUs. Due to its inherent memory-bound nature, this kernel is critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step to calculate the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the similar CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
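For readers unfamiliar with the operation, the symmetric matrix-vector product (SYMV in BLAS terminology) computes y = alpha*A*x + beta*y for a symmetric matrix A, so only one triangle of A needs to be read. The following is a minimal CPU reference sketch in plain Python, not the GPU kernel described in the paper. The function names `symv_lower` and `flops_per_byte` are hypothetical, and the intensity estimate assumes that only matrix traffic matters and that one triangle is read exactly once.

```python
def symv_lower(alpha, a, x, beta, y):
    """Return alpha*A*x + beta*y, reading only the lower triangle of symmetric A.

    `a` is a list of rows; entries above the diagonal are never accessed.
    """
    n = len(x)
    out = [beta * yi for yi in y]
    for i in range(n):
        for j in range(i):
            aij = a[i][j]  # loaded once, used for (i, j) and its mirror (j, i)
            out[i] += alpha * aij * x[j]
            out[j] += alpha * aij * x[i]
        out[i] += alpha * a[i][i] * x[i]  # diagonal entry has no mirror
    return out


def flops_per_byte(n, bytes_per_element):
    """Rough arithmetic intensity: 2*n*n flops over one triangle of matrix data.

    Assumption: vector traffic is ignored and each stored element is read once.
    """
    flops = 2 * n * n
    bytes_moved = bytes_per_element * n * (n + 1) // 2
    return flops / bytes_moved
```

Under these assumptions the ratio approaches 4 divided by the element size, that is, roughly 1 flop per byte in single precision and 0.5 in double precision. Ratios this low mean the kernel's speed is limited by how fast the matrix can be streamed from memory rather than by arithmetic throughput, which is what "memory-bound" refers to in the abstract.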




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Abdelfattah, A., Dongarra, J., Keyes, D., Ltaief, H. (2013). Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators. In: Daydé, M., Marques, O., Nakajima, K. (eds) High Performance Computing for Computational Science - VECPAR 2012. VECPAR 2012. Lecture Notes in Computer Science, vol 7851. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38718-0_10


  • DOI: https://doi.org/10.1007/978-3-642-38718-0_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38717-3

  • Online ISBN: 978-3-642-38718-0

  • eBook Packages: Computer Science, Computer Science (R0)
