Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?

  • Valentin Le Fèvre (corresponding author)
  • George Bosilca
  • Aurelien Bouteiller
  • Thomas Herault
  • Atsushi Hori
  • Yves Robert
  • Jack Dongarra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11339)


This paper compares the performance of different approaches to tolerating failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) \(\textsc{Rigid}\) applications, which use a constant number of processors throughout execution; (ii) \(\textsc{Moldable}\) applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) \(\textsc{GridShaped}\) applications, which are moldable applications restricted to rectangular processor grids (as is the case for many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic application scenario and make it publicly available for further use.
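The distinction between the three application types can be made concrete with a small sketch. It only illustrates the definitions given in the abstract and does not reproduce the paper's performance model. All function names are hypothetical, and the near-square rule used for the grid-shaped case is an assumption chosen for simplicity, not the authors' method.

```python
from math import isqrt


def rigid_procs(allocated: int, needed: int, failures: int) -> int:
    """Rigid: runs on exactly `needed` processors, or cannot run at all.

    Assumption: the extra `allocated - needed` processors act as spares.
    """
    survivors = max(allocated - failures, 0)
    return needed if survivors >= needed else 0


def moldable_procs(allocated: int, failures: int) -> int:
    """Moldable: after a restart, may use every surviving processor."""
    return max(allocated - failures, 0)


def grid_shaped_procs(allocated: int, failures: int) -> tuple[int, int]:
    """GridShaped: must use a rectangular p x q grid of processors.

    Assumption (illustrative only): pick p = floor(sqrt(survivors)) and
    the widest q such that p * q fits in the surviving processors.
    """
    survivors = max(allocated - failures, 0)
    if survivors == 0:
        return (0, 0)
    p = isqrt(survivors)
    q = survivors // p
    return (p, q)
```

Under these assumptions, an allocation of 100 processors that has lost 3 leaves a moldable application with 97 processors, while the grid-shaped rule above settles on a 9 x 10 grid (90 processors). A rigid application needing 96 processors keeps running with one spare left, but would have to relinquish the allocation after a fifth failure.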


Resilience, Spare nodes, Moldable applications, Checkpoint, Restart, Allocation length, Wait time



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Valentin Le Fèvre (1), corresponding author
  • George Bosilca (2)
  • Aurelien Bouteiller (2)
  • Thomas Herault (2)
  • Atsushi Hori (3)
  • Yves Robert (1, 2)
  • Jack Dongarra (2, 4)
  1. Laboratoire LIP, École Normale Supérieure de Lyon & Inria, Lyon, France
  2. University of Tennessee, Knoxville, USA
  3. RIKEN Center for Computational Science, Kobe, Japan
  4. University of Manchester, Manchester, UK
