Encapsulated Synchronization and Load-Balance in Heterogeneous Programming

  • Yuri Torres
  • Arturo Gonzalez-Escribano
  • Diego Llanos
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7484)


Programming models and techniques to exploit parallelism in accelerators, such as GPUs, differ from those used in traditional parallel models for shared- or distributed-memory systems. Blending different programming models to coordinate and exploit devices with very different characteristics and computational power is a challenge. This paper presents a new extensible framework model that encapsulates run-time decisions related to data partition, granularity, load balance, synchronization, and communication for systems that include assorted GPUs. The main parallel code thus becomes independent of these decisions, and the framework uses internal topology and system information to adapt the computation transparently to the system. The programmer can develop specific functions for each architecture, or use existing specialized library functions for different CPU-core or GPU architectures. The high-level coordination is expressed using a programming model built on top of message-passing, providing portability across distributed- and shared-memory systems. We show with an example how to produce parallel code that runs efficiently on systems ranging from a Beowulf cluster to a machine with mixed GPUs. Our experimental results show how the run-time system, guided by hints about the computational-power ratios of the different devices, can automatically partition and distribute large computations across heterogeneous systems, improving overall performance.
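
The idea of distributing work according to hints about computational-power ratios can be illustrated with a minimal sketch. This is not the framework described in the paper: the function name `partition_by_power`, the contiguous block layout, and the example ratios are assumptions made only for illustration.

```python
def partition_by_power(total_items, power_ratios):
    """Split the index range [0, total_items) into contiguous blocks.

    Each block size is proportional to the relative computational power
    of the device that receives it. Hypothetical helper, not the paper's API.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if not power_ratios or any(r <= 0 for r in power_ratios):
        raise ValueError("power_ratios must be a non-empty list of positive numbers")

    total_power = sum(power_ratios)
    blocks = []
    start = 0
    cumulative = 0
    for i, ratio in enumerate(power_ratios):
        cumulative += ratio
        if i == len(power_ratios) - 1:
            # The last device takes whatever remains, so every item is assigned.
            end = total_items
        else:
            # Rounding the cumulative boundary keeps rounding error from piling up.
            end = round(total_items * cumulative / total_power)
        blocks.append((start, end))
        start = end
    return blocks


# Assumed example: two CPU cores (ratio 1 each) and two GPUs (ratio 4 each).
blocks = partition_by_power(1000, [1, 1, 4, 4])
```

Here each device receives a half-open index range `(start, end)`. Computing every boundary from the cumulative ratio, rather than rounding each block size separately, guarantees that the blocks cover the whole range with no gaps or overlaps.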


Keywords (machine-generated, not supplied by the authors): Logical Process, Virtual Topology, Hardware Accelerator, Memory Access Pattern, Beowulf Cluster





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Yuri Torres (1)
  • Arturo Gonzalez-Escribano (1)
  • Diego Llanos (1)
  1. Departamento de Informatica, Universidad de Valladolid, Spain
