Cluster Computing, Volume 20, Issue 3, pp 2133–2142

Parameter based tuning model for optimizing performance on GPU

  • Nhat-Phuong Tran
  • Myungho Lee
  • Jaeyoung Choi


Abstract

Recently, graphics processing units (GPUs) have become increasingly popular for high performance computing applications. Although GPUs provide high peak performance, exploiting that full performance potential in application programs remains a challenging task for programmers. When launching a parallel kernel of an application on the GPU, the programmer needs to carefully select the number of blocks (the grid size) and the number of threads per block (the block size). These values determine the degree of SIMD parallelism and multithreading, and greatly influence performance. With a huge range of possible combinations of these values, choosing the right grid size and block size is not straightforward. In this paper, we propose a mathematical model for tuning the grid size and the block size based on GPU architecture parameters. Using our model, we first calculate a small set of candidate grid size and block size values, and then search for the optimal values among the candidates through experiments. Our approach significantly reduces the search space compared with the exhaustive search approaches of previous research, and can therefore be practically applied to real applications.
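The paper's actual model is not reproduced in this abstract, but the idea of pruning the launch-configuration space with architecture parameters can be illustrated with a minimal sketch. The constraints below (block sizes as warp-size multiples, block sizes that divide the per-SM thread limit, enough blocks to occupy every SM) are common CUDA tuning heuristics assumed for illustration, not the authors' published formulas; the parameter defaults are likewise hypothetical device values.

```python
import math

def candidate_launch_configs(n_elements,
                             warp_size=32,
                             max_threads_per_block=1024,
                             max_threads_per_sm=2048,
                             num_sms=16):
    """Enumerate a small set of candidate (grid size, block size) pairs.

    Instead of exhaustively trying every (grid, block) combination,
    block sizes are restricted to multiples of the warp size that
    divide evenly into the SM's thread limit, which shrinks the
    search space to a handful of candidates worth benchmarking.
    """
    candidates = []
    block = warp_size
    while block <= max_threads_per_block:
        # Only keep block sizes that allow full SM occupancy.
        if max_threads_per_sm % block == 0:
            # One thread per element: grid covers the whole input.
            grid = math.ceil(n_elements / block)
            # Keep every SM busy with at least one block.
            if grid >= num_sms:
                candidates.append((grid, block))
        block += warp_size
    return candidates
```

For a one-million-element kernel on the assumed device, this yields only six candidate pairs (block sizes 32, 64, 128, 256, 512, 1024), so a short experimental sweep over them is practical where an exhaustive search is not.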


Keywords: GPU · High performance computing · Performance tuning · Multi-threading · Micro-benchmark



This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (NRF-2015M3C4A7065662).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. Department of Computer Science and Engineering, Myongji University, Yongin, Korea
  2. School of Computer Science and Engineering, Soongsil University, Seoul, Korea
