Journal of Computer Science and Technology, Volume 31, Issue 2, pp 235–252

Pragma Directed Shared Memory Centric Optimizations on GPUs

  • Jing Li (email author)
  • Lei Liu
  • Yuan Wu
  • Xiang-Hua Liu
  • Yi Gao
  • Xiao-Bing Feng
  • Cheng-Yong Wu
Regular Paper


GPUs have become a ubiquitous choice as coprocessors because of their excellent capacity for concurrent processing. In the GPU architecture, shared memory plays a very important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications with regular access patterns, optimizing for shared memory is not easy: it often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage can even leave GPU resources underutilized. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), shared memory remains hard to exploit, since these models lack inherent support for describing shared memory optimizations and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data management hints to the compiler. We also devise a compiler framework that uses the polyhedral model to automatically select optimal parameters for shared arrays, and we further propose optimization techniques that expose higher memory-level and instruction-level parallelism. Experimental results show that our shared memory centric approach improves the performance of five typical GPU applications across four widely used platforms by 3.7x on average, without burdening programmers with many pragmas.


Keywords: GPU, shared memory, pragma directed, data centric



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Jing Li (1, 2), email author
  • Lei Liu (1)
  • Yuan Wu (3)
  • Xiang-Hua Liu (3)
  • Yi Gao (3)
  • Xiao-Bing Feng (1)
  • Cheng-Yong Wu (1)

  1. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. Beijing Samsung Telecom Research and Development Center, Beijing, China
