
Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers

Published in: International Journal of Parallel Programming

Abstract

We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Once annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of the various data movements by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
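To make the starting point concrete, the sketch below shows a sequential 7-point 3D Jacobi stencil in plain C, the kind of loop nest that a directive-based, source-to-source framework such as Panda takes as input. The directive shown in the comment (#pragma panda parallel for ...) and its clauses are illustrative placeholders only, not the framework's actual syntax; the idea conveyed is that a small annotation on the loop nest is what the compiler works from when generating the hybrid MPI+CUDA+OpenMP version.

/* Minimal sketch (not Panda's actual directive syntax): a sequential 7-point
 * 3D Jacobi stencil in plain C, the kind of code a directive-based
 * source-to-source compiler could translate into hybrid MPI+CUDA+OpenMP code.
 * The "#pragma panda ..." line below is an illustrative placeholder only. */
#include <stdio.h>
#include <stdlib.h>

#define N 64  /* interior points per dimension, excluding the one-point halo */
#define IDX(i, j, k) ((size_t)(i) * (N + 2) * (N + 2) + (size_t)(j) * (N + 2) + (k))

static void jacobi_sweep(const double *u, double *u_new)
{
    /* Hypothetical directive marking the loop nest for parallelization;
     * the real framework's directives may differ in name and clauses.
     * #pragma panda parallel for collapse(3) halo(1) */
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            for (int k = 1; k <= N; k++)
                u_new[IDX(i, j, k)] = (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                                       u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                                       u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)]) / 6.0;
}

int main(void)
{
    size_t total  = (size_t)(N + 2) * (N + 2) * (N + 2);
    double *u     = calloc(total, sizeof *u);
    double *u_new = calloc(total, sizeof *u_new);
    if (!u || !u_new) return 1;

    /* Dirichlet-style condition: heat the i == 0 face in both buffers so the
     * boundary values survive the pointer swap below. */
    for (int j = 0; j < N + 2; j++)
        for (int k = 0; k < N + 2; k++)
            u[IDX(0, j, k)] = u_new[IDX(0, j, k)] = 1.0;

    for (int iter = 0; iter < 100; iter++) {
        jacobi_sweep(u, u_new);
        double *tmp = u; u = u_new; u_new = tmp;  /* swap buffers */
    }

    printf("u(center) = %f\n", u[IDX(N / 2, N / 2, N / 2)]);
    free(u);
    free(u_new);
    return 0;
}

In an auto-generated hybrid version, the work in such a loop nest would be divided between the CPU cores and the GPU, with halo exchanges and other data movements overlapped with computation, as described in the abstract above.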



Acknowledgments

This work was supported by the FriNatek program of the Research Council of Norway, through Grant No. 214113/F20. The authors thank the High Performance Computing Service at the University of Cambridge, UK. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Author information

Correspondence to Mohammed Sourouri.


Cite this article

Sourouri, M., Baden, S.B. & Cai, X. Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers. Int J Parallel Prog 45, 711–729 (2017). https://doi.org/10.1007/s10766-016-0454-1
