Abstract
The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies. Instead, future computer systems are expected to be built using homogeneous and heterogeneous many-core processors with tens to hundreds of cores per chip, and complex hardware designs that address the challenges of concurrency, energy efficiency, and resiliency. Unlike previous generations of hardware evolution, this shift towards many-core computing will have a profound impact on software. These software challenges are further compounded by the need to enable parallelism in workloads and application domains that did not previously have to contend with multiprocessor parallelism. A recent trend in mainstream desktop systems is the use of graphics processing units (GPUs) to obtain order-of-magnitude performance improvements relative to general-purpose CPUs. Unfortunately, hybrid programming models that support multithreaded execution on CPUs in parallel with CUDA execution on GPUs have proved too complex for mainstream programmers and domain experts, especially when targeting platforms with multiple CPU cores and multiple GPU devices.
In this paper, we extend past work on Intel's Concurrent Collections (CnC) programming model to address the hybrid programming challenge with a model called CnC-CUDA. CnC is a declarative and implicitly parallel coordination language that supports flexible combinations of task and data parallelism while retaining determinism. CnC computations are built from steps related by data and control dependence edges, which together form a CnC graph. The CnC-CUDA extensions in this paper include the definition of multithreaded steps for execution on GPUs, and the automatic generation of data and control flow between CPU steps and GPU steps. Experimental results show that this approach can yield significant performance benefits with both GPU execution and hybrid CPU/GPU execution.
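The CnC concepts the abstract relies on can be illustrated in plain Java: a tag collection prescribes step instances, steps read and write single-assignment item collections, and determinism holds regardless of the order in which prescribed steps run. The sketch below is an illustrative miniature under those assumptions only; the class and collection names are hypothetical and are not the CnC-HJ or CnC-CUDA API.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical miniature of the CnC model: a tag collection prescribes
// step instances; steps read and write single-assignment item collections.
public class MiniCnC {
    // Item collections: tag -> value, each key written at most once
    // (dynamic single assignment).
    static final Map<Integer, Double> items = new ConcurrentHashMap<>();
    static final Map<Integer, Double> results = new ConcurrentHashMap<>();
    // Tag collection: each tag prescribes one instance of the step below.
    static final Queue<Integer> tags = new ConcurrentLinkedQueue<>();

    // A "step": a pure function of the items it reads. Determinism follows
    // because items are single-assignment and steps have no other effects.
    static void squareStep(int tag) {
        double x = items.get(tag);   // in real CnC, a get on a missing item
        results.put(tag, x * x);     // would suspend the step until it is put
    }

    public static void main(String[] args) {
        for (int t = 0; t < 4; t++) {
            items.put(t, (double) t);
            tags.add(t);
        }
        // A CnC runtime would schedule prescribed step instances in parallel;
        // the execution order does not affect the result.
        tags.parallelStream().forEach(MiniCnC::squareStep);
        System.out.println(results.get(3)); // prints 9.0
    }
}
```

In CnC-CUDA, a step like `squareStep` would instead be declared as a multithreaded GPU step, with the generated runtime moving its input and output items between CPU and GPU memory.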
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Grossman, M., Simion Sbîrlea, A., Budimlić, Z., Sarkar, V. (2011). CnC-CUDA: Declarative Programming for GPUs. In: Cooper, K., Mellor-Crummey, J., Sarkar, V. (eds) Languages and Compilers for Parallel Computing. LCPC 2010. Lecture Notes in Computer Science, vol 6548. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19595-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19594-5
Online ISBN: 978-3-642-19595-2