Effect of MPI tasks location on cluster throughput using NAS

  • Manuel Rodríguez-Pascual
  • José A. MoríñigoEmail author
  • Rafael Mayo-García


In this work the Numerical Aerodynamic Simulation (NAS) benchmarks have been executed in a systematic way on two clusters of rather different architectures and CPUs, to identify dependencies between MPI tasks mapping and the speedup or resource occupation. To this respect, series of experiments with the NAS kernels have been designed to take into account the context complexity when running scientific applications on HPC environments (CPU, I/O or memory-bound, execution time, degree of parallelism, dedicated computational resources, strong- and weak-scaling behaviour, to cite some). This context includes scheduling decisions, which have a great influence on the performance of the applications, making difficult to achieve an optimal exploitation with cost-effective strategies of the HPC resources. An analysis on how task grouping strategies under various cluster setups drive the execution time of jobs and the infrastructure throughput is provided. As a result, criteria for cluster setup arise linked to maximize performance of individual jobs, total cluster throughput or achieving better scheduling. To this respect, a criterion for execution decisions is suggested. This work is expected to be of interest on the design of scheduling policies and useful to HPC administrators.


MPI application performance Benchmarking Cluster throughput NAS Parallel Benchmarks 



This work was partially funded by the Spanish Ministry of Economy, Industry and Competitiveness project CODEC2 (TIN2015-63562-R) with European Regional Development Fund (ERDF) as well as carried out on computing facilities provided by the CYTED Network RICAP (517RT0529).


  1. 1.
    Bailey, D., et al.: The NAS Parallel Benchmarks. Tech. Rep. (1994)Google Scholar
  2. 2.
  3. 3.
    Chai, L., Gao, Q., Panda, D.K.: Understanding the impact of multi-core architecture in cluster computing: a case study with Intel dual-core system. In: Proceedings of 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pp. 471–478 (2007)Google Scholar
  4. 4.
    Chavarría-Miranda, D., Nieplocha, J., Tipparaju, V.: Topology-aware tile mapping for clusters of SMPs. In: Proceedings of 3rd Conference on Computing Frontiers 2006, pp. 383–392 (2006)Google Scholar
  5. 5.
    Intel Memory Latency Checker 3.1:
  6. 6.
    Jeannot, E., Mercier, G., Tessier, F.: Process placement in multicore clusters: algorithmic issues and practical techniques. IEEE Trans. Parallel Distrib. Syst. 25(4), 993–1002 (2014)CrossRefGoogle Scholar
  7. 7.
    McCalpin, J.D.: Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE TCCA Newsletter (May), pp. 19–25 (1995)Google Scholar
  8. 8.
  9. 9.
    Ribeiro, C.P.: Evaluating CPU and memory affinity for numerical scientific multithreaded benchmarks on multi-cores. Int. J. Comput. Sci. Inf. Security 7(1), 79–93 (2012)Google Scholar
  10. 10.
    Rodrigues, E.R., Madruga, F.L., Navaux, P.O.A., Panetta, J.: Multi-core aware process mapping and its impact on communication overhead of parallel applications. In: Proceedings of IEEE Symposium on Computers and Communications, pp. 811–817 (2009)Google Scholar
  11. 11.
    Shainer, G., Lui, P., Liu, T., Wilde, T., Layton, J.: The impact of inter-node latency versus intra-node latency on HPC applications. In: Proceedings of IASTED International Conference on Parallel and Distributed Computing and Systems, pp. 455–460 (2011)Google Scholar
  12. 12.
    Smith, B., Bode, B.: Performance effects of node mappings on the IBM BlueGene/L machine. In: Euro-Par 2005 Parallel Processing, pp. 1005–1013 (2005)Google Scholar
  13. 13.
    Top 500:
  14. 14.
    Wu, X., Taylor, V.: Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore supercomputers. J. Comput. Syst. Sci. 79(8), 1256–1268 (2013)MathSciNetCrossRefGoogle Scholar
  15. 15.
    Xingfu, W., Taylor, V.: Processor partitioning: an experimental performance analysis of parallel applications on SMP clusters systems. In: 19th International Conference on Parallel Distributed Computing and Systems (PDCS’07), pp. 13–18, Los Angeles, CA, USA (2007)Google Scholar
  16. 16.
    Xingfu, W., Taylor, V.: Using processor partitioning to evaluate the performance of MPI, OpenMP and Hybrid Parallel Applications on Dual- and Quad-core Cray XT4 Systems. In: Cray UG Proceedings (CUG 2009), pp. 4–7. Atlanta, USA (2009)Google Scholar
  17. 17.
    Zhang, C., Yuan, X., Srinivasan, A.: Processor affinity and MPI performance on SMP-CMP clusters. In: IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum, pp. 1–8. Atlanta, USA (2010)Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.Department of Technology, Centro de Investigaciones Energéticas, Medioambientales y TecnológicasCIEMATMadridSpain

Personalised recommendations