Abstract
Unified Parallel C (UPC), a parallel extension of ANSI C, is designed for high-performance computing on large-scale parallel machines. With general-purpose graphics processing units (GPUs) becoming an increasingly important high-performance computing platform, we propose new language extensions to UPC that exploit GPU clusters. We extend UPC with hierarchical data distribution, revise its execution model to combine SPMD with a fork-join model, and modify the semantics of upc_forall to reflect data-thread affinity on a thread hierarchy. We implement the compilation system, including affinity-aware loop tiling, GPU code generation, and several memory optimizations targeting NVIDIA CUDA. We also introduce unified data management for each UPC thread to optimize data transfer and memory layout across the separate memory modules of CPUs and GPUs. Experimental results show that the extended UPC offers better programmability than the mixed MPI/CUDA approach, and that the integrated compile-time and runtime optimizations are effective in achieving good performance on GPU clusters.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Chen, L. et al. (2011). Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation. In: Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds) Languages and Compilers for Parallel Computing. LCPC 2010. Lecture Notes in Computer Science, vol 6548. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19595-2_11
DOI: https://doi.org/10.1007/978-3-642-19595-2_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19594-5
Online ISBN: 978-3-642-19595-2
eBook Packages: Computer Science (R0)