Pragma Directed Shared Memory Centric Optimizations on GPUs
GPUs have become a ubiquitous choice as coprocessors because of their excellent capacity for concurrent processing. In the GPU architecture, shared memory plays a very important role in system performance, as it can greatly improve bandwidth utilization and accelerate memory operations. However, even for affine GPU applications with regular access patterns, optimizing for shared memory is not easy. It often requires programmer expertise and nontrivial parameter selection, and improper shared memory usage may even leave GPU resources underutilized. Even with state-of-the-art high-level programming models (e.g., OpenACC and OpenHMPP), it is still hard to exploit shared memory, since these models lack inherent support for describing shared memory optimizations and selecting suitable parameters, let alone maintaining high resource utilization. Targeting higher productivity for affine applications, we propose a data-centric approach to shared memory optimization on GPUs. We design a pragma extension to OpenACC that conveys programmers' data management hints to the compiler. We also devise a compiler framework that automatically selects optimal parameters for shared arrays using the polyhedral model. We further propose optimization techniques that expose higher memory-level and instruction-level parallelism. The experimental results show that our shared memory centric approaches improve the performance of five typical GPU applications across four widely used platforms by 3.7x on average, without burdening programmers with many pragmas.
Keywords: GPU, shared memory, pragma directed, data centric
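The abstract's remark that improper shared memory usage can leave GPU resources underutilized can be illustrated with a small occupancy calculation. The sketch below is not the paper's parameter-selection method; it only applies the usual rule that the number of resident thread blocks per streaming multiprocessor (SM) is the minimum over the shared memory, thread, and block limits. The per-SM limits used in the note that follows (48 KB of shared memory, 1536 threads, 8 blocks) and the names `sm_limits`, `tile_bytes`, `resident_blocks`, and `occupancy` are assumptions chosen for illustration; real limits depend on the GPU generation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-SM resource limits (assumed, not taken from the paper). */
typedef struct {
    size_t smem_bytes;   /* shared memory available per SM */
    int    max_threads;  /* maximum resident threads per SM */
    int    max_blocks;   /* maximum resident blocks per SM */
} sm_limits;

/* Bytes used by a square tile of floats with `pad` extra columns per row.
   One padding column is a common way to avoid shared memory bank conflicts. */
size_t tile_bytes(int tile, int pad) {
    assert(tile > 0 && pad >= 0);
    return (size_t)tile * (size_t)(tile + pad) * sizeof(float);
}

/* Resident blocks per SM: the tightest of the three limits wins. */
int resident_blocks(sm_limits sm, size_t smem_per_block, int threads_per_block) {
    assert(smem_per_block > 0 && threads_per_block > 0);
    int by_smem    = (int)(sm.smem_bytes / smem_per_block);
    int by_threads = sm.max_threads / threads_per_block;
    int blocks     = sm.max_blocks;
    if (by_smem < blocks)    blocks = by_smem;
    if (by_threads < blocks) blocks = by_threads;
    return blocks;
}

/* Occupancy: resident threads divided by the SM's thread capacity. */
double occupancy(sm_limits sm, size_t smem_per_block, int threads_per_block) {
    int blocks = resident_blocks(sm, smem_per_block, threads_per_block);
    return (double)(blocks * threads_per_block) / (double)sm.max_threads;
}
```

Under the assumed limits and 4-byte floats, a 256-thread block that uses one 16x16 tile (1 KB) is bounded by the thread limit and reaches 6 resident blocks, which fills the SM. A 256-thread block that uses three padded 32x33 tiles (12,672 bytes) is bounded by shared memory at 3 resident blocks, which is half of the thread capacity. This is the kind of trade-off that makes tile-size selection nontrivial.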