An Application-Level Solution for the Dynamic Reconfiguration of MPI Applications

  • Iván Cores
  • Patricia González
  • Emmanuel Jeannot
  • María J. Martín
  • Gabriel Rodríguez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10150)

Abstract

Current parallel environments aggregate large numbers of computational resources with a high rate of change in their availability and load conditions. To obtain the best performance on this type of infrastructure, parallel applications must be able to adapt to these changing conditions. This paper presents an application-level proposal to automatically and transparently adapt MPI applications to the available resources. The architecture includes: automatic code transformation of the parallel applications, a system to reschedule processes on available nodes, and migration capabilities based on checkpoint-and-restart techniques to move selected processes to target nodes. Experimental results show a good degree of adaptability and good performance under different availability scenarios.
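
The reconfiguration described above relies on application-level checkpoint-and-restart: processes periodically save their state so that selected processes can be stopped and resumed on a different set of nodes. As a rough illustration only, the following minimal C/MPI sketch shows the general pattern of an iterative code that checks an external reconfiguration trigger, checkpoints its local state, and exits so a scheduler can relaunch it elsewhere. The per-rank checkpoint files, the reconfigure.flag trigger, and the restart logic are hypothetical placeholders, not the paper's CPPC-based implementation.

    /*
     * Minimal sketch (illustrative only) of application-level
     * checkpoint-and-restart for an iterative MPI code.
     * File names and the reconfiguration trigger are hypothetical.
     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N_ITERS   1000
    #define CKPT_FILE "ckpt_rank%d.dat"   /* hypothetical per-rank checkpoint */

    /* Save per-rank state so a relaunch (possibly on other nodes) can resume. */
    static void save_checkpoint(int rank, int iter, const double *x, int n) {
        char name[64];
        snprintf(name, sizeof(name), CKPT_FILE, rank);
        FILE *f = fopen(name, "wb");
        if (!f) { perror("checkpoint"); return; }
        fwrite(&iter, sizeof(int), 1, f);
        fwrite(x, sizeof(double), n, f);
        fclose(f);
    }

    /* Resume from a previous checkpoint; returns the iteration to restart
     * from, or 0 if no checkpoint is found. */
    static int load_checkpoint(int rank, double *x, int n) {
        char name[64];
        snprintf(name, sizeof(name), CKPT_FILE, rank);
        FILE *f = fopen(name, "rb");
        if (!f) return 0;
        int iter = 0;
        if (fread(&iter, sizeof(int), 1, f) != 1) iter = 0;
        else if (fread(x, sizeof(double), n, f) != (size_t)n) iter = 0;
        fclose(f);
        return iter;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        enum { N_LOCAL = 1024 };
        double x[N_LOCAL];
        for (int i = 0; i < N_LOCAL; i++) x[i] = rank;

        int start = load_checkpoint(rank, x, N_LOCAL);

        for (int iter = start; iter < N_ITERS; iter++) {
            /* ... one iteration of computation and communication ... */

            /* Rank 0 polls an external trigger (here: a file dropped by the
             * resource manager) and broadcasts the decision to all ranks. */
            int reconfigure = 0;
            if (rank == 0) {
                FILE *flag = fopen("reconfigure.flag", "r");
                if (flag) { reconfigure = 1; fclose(flag); }
            }
            MPI_Bcast(&reconfigure, 1, MPI_INT, 0, MPI_COMM_WORLD);

            if (reconfigure) {
                save_checkpoint(rank, iter + 1, x, N_LOCAL);
                if (rank == 0)
                    printf("Checkpoint at iteration %d; exiting for relaunch\n", iter);
                MPI_Finalize();
                return 0;   /* scheduler restarts the job on the new node set */
            }
        }

        MPI_Finalize();
        return 0;
    }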

Keywords

HPC · MPI · Checkpointing · Migration · Scheduling

Acknowledgments

This research was partially supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds of the EU (Project TIN2013-42148-P), by the Galician Government and FEDER funds of the EU (consolidation program of competitive reference groups GRC2013/055) and by the EU under the COST programme Action IC1305, Network for Sustainable Ultrascale Computing.


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Iván Cores (1)
  • Patricia González (1)
  • Emmanuel Jeannot (2)
  • María J. Martín (1)
  • Gabriel Rodríguez (1)

  1. Grupo de Arquitectura de Computadores, Universidade da Coruña, A Coruña, Spain
  2. INRIA Bordeaux Sud-Ouest, Bordeaux, France
