International Journal of Parallel Programming, Volume 44, Issue 6, pp 1268–1295

Combining Data and Computation Distribution Directives for Hybrid Parallel Programming: A Transformation System

  • Rachid Habel
  • Frédérique Silber-Chaussumier
  • François Irigoin
  • Elisabeth Brunet
  • François Trahay


This paper describes dSTEP, a directive-based programming model for hybrid shared and distributed memory machines. The originality of our work lies in the definition and implementation of a unified high-level programming model that addresses both data and computation distribution and provides particularly fine control over the computation. The goal is to improve programmer productivity while providing good performance in terms of execution time and memory usage. We define a generic compilation scheme for computation mapping and communication generation. We implement the solution in a source-to-source compiler together with a runtime library. We provide a series of optimizations to improve the performance of the generated code, with a special focus on reducing communication time. We evaluate our solution on several scientific kernels as well as on the more challenging NAS BT benchmark, and compare our results with the handwritten Fortran MPI and UPC implementations. The results show, first, that the dSTEP directives make it possible to express explicitly the non-trivial parallel execution of the NAS BT benchmark. Second, they show that our generated MPI+OpenMP BT program achieves a speedup of 83.35 over the original NAS OpenMP C benchmark on a hybrid cluster composed of 64 quad-core processors (256 cores). Overall, our solution dramatically reduces the programming effort while providing good execution time and memory usage. This programming model is suitable for a large variety of machines, such as multi-core and accelerator clusters.


Distributed-memory, Shared-memory, Source-to-source transformation, OpenMP, MPI, Optimization



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Rachid Habel (1)
  • Frédérique Silber-Chaussumier (1)
  • François Irigoin (2)
  • Elisabeth Brunet (1)
  • François Trahay (1)

  1. TELECOM SudParis, Evry, France
  2. MINES ParisTech, Fontainebleau, France
