Comparing Runtime Systems with Exascale Ambitions Using the Parallel Research Kernels

  • Rob F. Van der Wijngaart
  • Abdullah Kayi
  • Jeff R. Hammond
  • Gabriele Jost
  • Tom St. John
  • Srinivas Sridharan
  • Timothy G. Mattson
  • John Abercrombie
  • Jacob Nelson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9697)


We use three Parallel Research Kernels (PRK) to compare the performance of a set of programming models*: MPI1 (MPI two-sided communication), MPIOPENMP (MPI+OpenMP), MPISHM (MPI1 with MPI-3 interprocess shared memory), MPIRMA (MPI one-sided communication), SHMEM, UPC, Charm++, and Grappa. The kernels in our study – Stencil, Synch_p2p, and Transpose – underlie a wide range of computational science applications. They enable direct probing of properties of programming models, especially communication and synchronization. In contrast to mini- or proxy applications, the PRK allow for rapid implementation, measurement, and verification. Our experimental results show MPISHM to be the overall winner, with MPI1, MPIOPENMP, and SHMEM also performing well. MPISHM and MPIOPENMP outperform the other models in the strong-scaling limit due to their effective use of shared memory and good granularity control. The non-evolutionary models Grappa and Charm++ are not competitive with the traditional models (MPI and PGAS) for two of the kernels; these models favor irregular algorithms, while the PRK considered here are regular.

*We employ the term programming model as it is commonly used in the application community. A more accurate term is programming environment: the collective of an abstract programming model, the embodiment of the model in an Application Programmer Interface (API), and the runtime that implements it.


Keywords: Programming models · MPI · PGAS · Charm++



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Intel Corporation, Hillsboro, USA: Rob F. Van der Wijngaart, Abdullah Kayi, Jeff R. Hammond, Gabriele Jost, Tom St. John, Srinivas Sridharan, Timothy G. Mattson
  2. University of Washington, Seattle, USA: John Abercrombie, Jacob Nelson
