The Journal of Supercomputing, Volume 66, Issue 1, pp 381–405

A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters



In this paper, we propose a program development toolkit called OMPICUDA for hybrid CPU/GPU clusters. With this toolkit, users can develop applications on a hybrid CPU/GPU cluster using a familiar programming model, i.e., compound OpenMP and MPI, instead of mixed CUDA and MPI or SDSM. In addition, they can select the type of resource used to execute each parallel region in the same program by means of an extended device directive, according to the properties of that region. Moreover, the toolkit provides a set of data-partition interfaces that allow users to achieve load balance at the application level, regardless of the type of resources used to execute their programs.


Keywords: Hybrid CPU/GPU clusters · Compound OpenMP/MPI · CUDA · Load balance · Device directive



We would like to thank the National Science Council of the Republic of China for its grant support under project number NSC 99-2221-E-151-055-MY3.



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, R.O.C.
