
Implementing a GPU Programming Model on a Non-GPU Accelerator Architecture

  • Stephen M. Kofsky
  • Daniel R. Johnson
  • John A. Stratton
  • Wen-mei W. Hwu
  • Sanjay J. Patel
  • Steven S. Lumetta
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6161)

Abstract

Parallel code is written primarily for performance, so it is highly desirable that it be portable across parallel architectures without significant performance loss or rewriting. While performance portability and its limits have been studied thoroughly on single-processor systems, the goal is less well studied and harder to achieve for parallel systems, and emerging single-chip parallel platforms are no exception: writing code that performs well across GPUs and other many-core CMPs is challenging. In this paper we focus on CUDA code, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on Rigel, a MIMD accelerator architecture that we are developing. These optimizations yield performance improvements over naïve translations, with final performance comparable to that of code hand-optimized for Rigel.
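As a concrete illustration of the translation problem, the sketch below (not taken from the paper; kernel and function names are hypothetical) shows a simple CUDA kernel and an MCUDA-style serialization of it for a MIMD core: each thread block becomes a task, and the block's threads are flattened into a sequential loop.

    // CUDA kernel: each GPU thread computes one output element.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Illustrative MIMD translation: one call processes one thread block,
    // serializing that block's threads into a loop on a single core.
    void vec_add_block(const float *a, const float *b, float *c, int n,
                       int block_idx, int block_dim) {
        for (int tid = 0; tid < block_dim; ++tid) {  // was threadIdx.x
            int i = block_idx * block_dim + tid;
            if (i < n)
                c[i] = a[i] + b[i];
        }
    }

This direct serialization is only valid for kernels without __syncthreads() barriers; a barrier forces the thread loop to be split at each synchronization point. Such naïve per-block loops are the baseline over which the paper's optimizations demonstrate their improvements.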

Keywords

Global Memory · Cache Line · Thread Block · Load Imbalance · Performance Portability



Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Stephen M. Kofsky¹
  • Daniel R. Johnson¹
  • John A. Stratton¹
  • Wen-mei W. Hwu¹
  • Sanjay J. Patel¹
  • Steven S. Lumetta¹

  1. University of Illinois at Urbana-Champaign, Urbana, USA
