Improving Performance Portability in OpenCL Programs

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 7905)

Abstract

We study the performance portability of OpenCL across diverse architectures, including an NVIDIA GPU, an Intel Ivy Bridge CPU, and an AMD Fusion APU. We present a detailed assembly-level performance analysis of three exemplar OpenCL benchmarks: SGEMM, SpMV, and FFT. We also identify a number of tuning knobs that are critical to performance portability, including threads-data mapping, data layout, tiling size, data caching, and operation-specific factors. We further demonstrate that proper tuning could improve OpenCL's portable performance from the current 15% to a potential 67% of state-of-the-art performance on the Ivy Bridge CPU. Finally, we evaluate the current OpenCL programming model and propose a list of extensions that improve performance portability.
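
This page only summarizes the paper, but one of the tuning knobs the abstract names, tiling size, is easy to picture concretely. Below is a minimal OpenCL C sketch (not taken from the paper) of a tiled SGEMM kernel in which the tile edge TILE is a build-time parameter, for example supplied to clBuildProgram via the option "-DTILE=16". The sketch assumes square, row-major matrices with N a multiple of TILE and a TILE x TILE work-group.

    // Hypothetical sketch, not the authors' code: tiled SGEMM (C = A * B)
    // with the tile edge TILE exposed as a build-time tuning knob,
    // e.g. clBuildProgram(..., "-DTILE=16", ...).
    // Assumes square, row-major matrices, N a multiple of TILE,
    // and a TILE x TILE work-group per output block.
    #ifndef TILE
    #define TILE 16
    #endif

    __kernel void sgemm_tiled(const int N,
                              __global const float *A,   // N x N, row-major
                              __global const float *B,   // N x N, row-major
                              __global float *C)         // N x N, row-major
    {
        __local float Asub[TILE][TILE];   // cached tile of A
        __local float Bsub[TILE][TILE];   // cached tile of B

        const int lx = get_local_id(0);               // column within the tile
        const int ly = get_local_id(1);               // row within the tile
        const int gx = get_group_id(0) * TILE + lx;   // global column in C
        const int gy = get_group_id(1) * TILE + ly;   // global row in C

        float acc = 0.0f;
        for (int t = 0; t < N; t += TILE) {
            // Stage one tile of A and one tile of B in local memory.
            Asub[ly][lx] = A[gy * N + (t + lx)];
            Bsub[ly][lx] = B[(t + ly) * N + gx];
            barrier(CLK_LOCAL_MEM_FENCE);

            // Accumulate the partial dot product for this tile.
            for (int k = 0; k < TILE; ++k)
                acc += Asub[ly][k] * Bsub[k][lx];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        C[gy * N + gx] = acc;
    }

Retuning TILE (and, with it, the threads-data mapping and local-memory footprint) per device is the kind of adjustment the paper argues is needed to close the gap between portable and architecture-specific performance.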





Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Y., Sinclair, M., Chien, A.A. (2013). Improving Performance Portability in OpenCL Programs. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2013. Lecture Notes in Computer Science, vol 7905. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38750-0_11

  • DOI: https://doi.org/10.1007/978-3-642-38750-0_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38749-4

  • Online ISBN: 978-3-642-38750-0

  • eBook Packages: Computer Science (R0)
