Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture
Parallel codes are written primarily for performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single-processor systems, this goal has been less extensively studied, and is more difficult to achieve, for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing, called Rigel. We demonstrate performance improvements with these optimizations over naïve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.
Keywords: Global Memory · Cache Line · Thread Block · Load Imbalance · Performance Portability