A Note on Auto-tuning GEMM for GPUs

  • Yinan Li
  • Jack Dongarra
  • Stanimire Tomov
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5544)

Abstract

The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and in particular on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g. up to 375 GFlop/s in single precision and up to 75 GFlop/s in double precision arithmetic on NVIDIA’s GTX 280, is difficult to achieve. Its development requires extensive GPU knowledge and even reverse engineering to uncover undocumented details of the architecture that have proved to be of key importance. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, existing ideas. Auto-tuning, as we show in this paper, is a very practical solution: in addition to easy portability, it can often yield substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280).
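
To make the idea of auto-tuning concrete, the following is a minimal sketch of an empirical parameter search, assuming a blocked GEMM with a single tunable tile size. It is an illustration written in plain Python on the CPU, not the GPU kernels or the search strategy of the paper; the names `gemm_blocked` and `autotune`, the candidate tile sizes, and the test matrices are all hypothetical.

```python
import time


def gemm_blocked(A, B, n, block):
    """Return C = A * B for n x n matrices stored as lists of rows.

    The loops are tiled by `block`, the one tunable parameter of this sketch.
    """
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for k0 in range(0, n, block):
            for j0 in range(0, n, block):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(i0, min(i0 + block, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(k0, min(k0 + block, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(j0, min(j0 + block, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune(n, candidates, repeats=3):
    """Time every candidate tile size and return (best_block, timings).

    This is an exhaustive empirical search: each variant is run and timed,
    and the fastest one on the current machine is selected.
    """
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for block in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            gemm_blocked(A, B, n, block)
            best = min(best, time.perf_counter() - start)
        timings[block] = best  # keep the best of `repeats` runs
    best_block = min(timings, key=timings.get)
    return best_block, timings


if __name__ == "__main__":
    best_block, timings = autotune(n=64, candidates=[4, 8, 16, 32])
    for block, seconds in sorted(timings.items()):
        print(f"block={block:3d}  {seconds * 1e3:8.2f} ms")
    print("selected block size:", best_block)
```

The selected tile size depends on the machine the search runs on. That dependence is the reason to search empirically: the same script can be rerun on new hardware rather than retuning the kernel by hand.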

Keywords

Auto-tuning, matrix multiply, dense linear algebra, GPUs

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Yinan Li (1)
  • Jack Dongarra (1, 2, 3)
  • Stanimire Tomov (1)

  1. University of Tennessee, USA
  2. Oak Ridge National Laboratory, USA
  3. University of Manchester, UK
