Efficient Triangular Matrix Vector Multiplication on the GPU

  • Takahiro Inoue
  • Hiroki Tokura
  • Koji Nakano
  • Yasuaki Ito
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12043)


The main purpose of this paper is to present a very efficient GPU implementation of the trmv, the product of a triangular matrix and a vector. Developers usually compute the trmv with cuBLAS, a linear algebra library optimized for each of the various generations of GPUs. To attain better performance than cuBLAS, our GPU implementation of the trmv uses various acceleration techniques for the latest GPUs. More specifically, it has the following features: (1) only one kernel is called; (2) the maximum number of threads is invoked; (3) all accesses to the global memory are coalesced; (4) no access to the shared memory causes a bank conflict; and (5) shared memory access is minimized by using a warp shuffle function. Experimental results on five generations of NVIDIA GPUs, for fp32 matrices of sizes from \(32\times 32\) to \(\mathrm {16K}\times \mathrm {16K}\), show that our GPU implementation is faster than cuBLAS and muBLAS for almost all matrix sizes and GPU generations.
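To make the operation concrete, the following is a minimal CPU sketch of the trmv and of the lane-strided access and shuffle-style reduction pattern that features (3) and (5) allude to. It is not the paper's kernel: the lower-triangular, row-major layout, the one-warp-per-row mapping, and the function names `trmv_lower`, `warp_reduce_sum`, and `trmv_lower_warp` are all assumptions made for illustration, and the reduction only simulates in Python what a CUDA shuffle-down intrinsic does in registers.

```python
# Illustrative sketch only: a CPU model of trmv, NOT the authors' GPU kernel.
# Assumptions: lower-triangular matrix, full row-major storage (list of rows),
# one simulated 32-lane warp per row. All names here are hypothetical.

WARP_SIZE = 32  # number of lanes in an NVIDIA warp


def trmv_lower(a, x):
    """Reference result y = L x, where L is the lower triangle of a."""
    n = len(x)
    # Row i only touches columns 0..i, so roughly half of the multiply-adds
    # of a full matrix-vector product are skipped.
    return [sum(a[i][j] * x[j] for j in range(i + 1)) for i in range(n)]


def warp_reduce_sum(lanes):
    """Simulate a shuffle-down tree reduction over one warp's registers.

    Models the pattern v += shfl_down(v, offset) for offset = 16, 8, 4, 2, 1.
    Lane 0 ends up holding the sum of all 32 inputs.
    """
    assert len(lanes) == WARP_SIZE
    v = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane reads the register of lane (lane + offset); a lane whose
        # source would be out of range gets its own value back.
        shifted = [v[lane + offset] if lane + offset < WARP_SIZE else v[lane]
                   for lane in range(WARP_SIZE)]
        v = [v[lane] + shifted[lane] for lane in range(WARP_SIZE)]
        offset //= 2
    return v[0]


def trmv_lower_warp(a, x):
    """Same product, organized the way a single warp per row might compute it."""
    n = len(x)
    y = []
    for i in range(n):
        # Lane l accumulates columns l, l + 32, l + 64, ... up to column i.
        # Neighbouring lanes read neighbouring elements of the row, which is
        # the access pattern that coalesces on a row-major matrix.
        partial = [sum(a[i][j] * x[j] for j in range(lane, i + 1, WARP_SIZE))
                   for lane in range(WARP_SIZE)]
        # Combine the 32 partial sums without going through shared memory.
        y.append(warp_reduce_sum(partial))
    return y
```

Calling `trmv_lower_warp(a, x)` on any square `a` and matching `x` should agree with `trmv_lower(a, x)`; entries above the diagonal of `a` are never read. On a GPU the inner list comprehensions would correspond to per-thread register work, which this sketch deliberately does not attempt to model for performance.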


Keywords: Matrix multiplication, Trmv, Parallel algorithm, GPGPU



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Takahiro Inoue (1)
  • Hiroki Tokura (1)
  • Koji Nakano (1)
  • Yasuaki Ito (1)

  1. Department of Information Engineering, Hiroshima University, Higashi-Hiroshima, Japan
