Optimizing Memory-Bound SYMV Kernel on GPU Hardware Accelerators

  • Ahmad Abdelfattah
  • Jack Dongarra
  • David Keyes
  • Hatem Ltaief
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7851)


Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming language extensions (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount for improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product (SYMV) on nVidia Fermi GPUs. Due to its inherently memory-bound nature, this kernel is critical in the tridiagonalization of a symmetric dense matrix, which is a preprocessing step for calculating the eigenpairs. Using a novel design that addresses the irregular memory accesses by hiding latency and increasing bandwidth, our preliminary asymptotic results show 3.5x and 2.5x speedups over the corresponding CUBLAS 4.0 kernel, and 7-8% and 30% improvements over the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library, in single and double precision arithmetic, respectively.
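The key property a SYMV kernel exploits is that only one triangle of the matrix needs to be read from memory: each off-diagonal element contributes to two output entries. The sketch below is not the paper's CUDA kernel; it is a minimal NumPy reference implementation, illustrating why reusing each loaded element halves the memory traffic of this memory-bound operation.

```python
import numpy as np

def symv_lower(A, x):
    """Reference SYMV (y = A @ x for symmetric A) reading only the
    lower triangle of A.

    Each strictly-lower element A[i, j] (i > j) is loaded once and
    used twice: y[i] += A[i, j] * x[j] and y[j] += A[i, j] * x[i].
    A tuned GPU kernel exploits exactly this reuse, since the
    operation's performance is bounded by memory bandwidth.
    """
    n = A.shape[0]
    y = np.zeros(n)
    for i in range(n):
        y[i] += A[i, i] * x[i]          # diagonal term
        for j in range(i):              # strictly lower triangle
            y[i] += A[i, j] * x[j]      # row contribution
            y[j] += A[i, j] * x[i]      # mirrored (transposed) contribution
    return y
```

The result matches a full matrix-vector product `A @ x` while touching only n(n+1)/2 matrix elements instead of n^2.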


Keywords: Shared Memory, Global Memory, Double Precision, Memory Bandwidth, Thread Block
(These keywords were added by machine and not by the authors.)




References

  1. Matrix Algebra on GPU and Multicore Architectures (MAGMA). Innovative Computing Laboratory, University of Tennessee
  2.
  3. Performance Application Programming Interface (PAPI). Innovative Computing Laboratory, University of Tennessee
  4. Datta, K., Williams, S., Volkov, V., Carter, J., Oliker, L., Shalf, J., Yelick, K.: Auto-tuning the 27-Point Stencil for Multicore. In: Proc. iWAPT 2009: The Fourth International Workshop on Automatic Performance Tuning (2009)
  5. Glaskowsky, P.N.: nVidia’s Fermi: The First Complete GPU Computing Architecture. Technical report (2009)
  6. Kirk, D., Hwu, W.-m.: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann (2010)
  7. Kurzak, J., Buttari, A., Dongarra, J.J.: Solving systems of linear equations on the CELL processor using Cholesky factorization. IEEE Transactions on Parallel and Distributed Systems 19(9), 1–11 (2008)
  8. McCalpin, J.: STREAM: Sustainable memory bandwidth in high performance computers
  9. Nath, R., Tomov, S., Dong, T., Dongarra, J.: Optimizing symmetric dense matrix-vector multiplication on GPUs. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2011, pp. 6:1–6:10. ACM, New York (2011)
  10. Nath, R., Tomov, S., Dongarra, J.: Accelerating GPU Kernels for Dense Linear Algebra. In: Palma, J.M.L.M., Daydé, M., Marques, O., Lopes, J.C. (eds.) VECPAR 2010. LNCS, vol. 6449, pp. 83–92. Springer, Heidelberg (2011)
  11. Volkov, V., Demmel, J.W.: Benchmarking GPUs to Tune Dense Linear Algebra. In: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC 2008, pp. 31:1–31:11. IEEE Press, Piscataway (2008)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Ahmad Abdelfattah (1)
  • Jack Dongarra (2)
  • David Keyes (1)
  • Hatem Ltaief (3)

  1. Division of Mathematical and Computer Sciences and Engineering, KAUST, Thuwal, Saudi Arabia
  2. Innovative Computing Laboratory, University of Tennessee, Knoxville, USA
  3. Supercomputing Laboratory, KAUST, Thuwal, Saudi Arabia
