Abstract
Unified Parallel C (UPC), a parallel extension of ANSI C, is designed for high-performance computing on large-scale parallel machines. With general-purpose graphics processing units (GPUs) becoming an increasingly important high-performance computing platform, we propose new language extensions to UPC that exploit GPU clusters. We extend UPC with hierarchical data distribution, revise its execution model to combine SPMD with a fork-join model, and modify the semantics of upc_forall to reflect data-thread affinity on a thread hierarchy. We implement the compilation system, including affinity-aware loop tiling, GPU code generation, and several memory optimizations targeting NVIDIA CUDA. We also introduce unified data management for each UPC thread to optimize data transfer and memory layout across the separate memory modules of CPUs and GPUs. Experimental results show that the extended UPC offers better programmability than the mixed MPI/CUDA approach, and that the integrated compile-time and runtime optimizations are effective in achieving good performance on GPU clusters.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Chen, L. et al. (2011). Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation. In: Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds) Languages and Compilers for Parallel Computing. LCPC 2010. Lecture Notes in Computer Science, vol 6548. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19595-2_11
DOI: https://doi.org/10.1007/978-3-642-19595-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19594-5
Online ISBN: 978-3-642-19595-2
eBook Packages: Computer Science (R0)