Parameter based tuning model for optimizing performance on GPU
Graphics processing units (GPUs) are becoming increasingly popular for high performance computing applications. Although GPUs provide high peak performance, exploiting their full performance potential in application programs remains a challenging task for programmers. When launching a parallel kernel on the GPU, the programmer must carefully select the number of blocks (grid size) and the number of threads per block (block size). These values determine the degree of SIMD parallelism and multithreading, and greatly influence performance. Because the range of possible combinations is huge, choosing the right grid size and block size is not straightforward. In this paper, we propose a mathematical model for tuning the grid size and the block size based on GPU architecture parameters. Using our model, we first calculate a small set of candidate grid size and block size values, then search the candidates experimentally for the optimal values. Our approach significantly reduces the search space compared with the exhaustive search approaches used in previous research, and can therefore be applied in practice to real applications.
Keywords: GPU, High performance computing, Performance tuning, Multi-threading, Micro-benchmark
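The two-step idea in the abstract (derive a small candidate set from architecture parameters, then pick the best candidate by measurement) can be illustrated with a minimal sketch. The abstract does not state the model itself, so the rule below is an assumption and not the authors' formula: it keeps block sizes that are multiples of the warp size and fill a multiprocessor's thread capacity exactly, and grid sizes that are whole multiples of the number of blocks the device can hold at once. The names `candidate_configs`, `pick_best`, and `run_kernel`, and every parameter, are hypothetical.

```python
def candidate_configs(num_sms, warp_size, max_threads_per_block,
                      max_threads_per_sm, max_blocks_per_sm, max_waves=4):
    """Return (grid_size, block_size) pairs from a simple occupancy rule.

    Assumption: this is a generic heuristic, not the model from the paper.
    """
    candidates = []
    # Block sizes: multiples of the warp size, up to the per-block limit.
    for block in range(warp_size, max_threads_per_block + 1, warp_size):
        # Keep only block sizes that fill an SM's thread capacity exactly...
        if max_threads_per_sm % block != 0:
            continue
        blocks_per_sm = max_threads_per_sm // block
        # ...without exceeding the resident-block limit per SM.
        if blocks_per_sm > max_blocks_per_sm:
            continue
        # Grid sizes: whole "waves" of blocks across all SMs.
        resident_blocks = num_sms * blocks_per_sm
        for wave in range(1, max_waves + 1):
            candidates.append((resident_blocks * wave, block))
    return candidates


def pick_best(candidates, run_kernel):
    """Measure each candidate and return the fastest one.

    run_kernel(grid, block) is a hypothetical callable that launches the
    kernel with that configuration and returns its elapsed time.
    """
    return min(candidates, key=lambda cfg: run_kernel(cfg[0], cfg[1]))
```

With illustrative parameters (15 multiprocessors, warp size 32, 1024 threads per block, 2048 threads and 16 blocks per multiprocessor) and `max_waves=2`, this rule leaves eight candidates to time, rather than every possible grid and block size pair.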
This research was supported by Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science, and Technology (NRF-2015M3C4A7065662).