Towards Understanding Post-recovery Efficiency for Shrinking and Non-shrinking Recovery

  • Aiman Fang
  • Hajime Fujita
  • Andrew A. Chien
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9523)


We explore the post-recovery efficiency of shrinking and non-shrinking recovery schemes on high performance computing systems using a synthetic benchmark, and study the impact of network topology on post-recovery communication performance. Our experiments on the IBM BG/Q system Mira show that shrinking recovery can deliver up to 7.5% better efficiency for the neighbor communication pattern, because non-shrinking recovery can degrade communication performance. We expected a similar outcome for our synthetic benchmark with collective communication, but the situation is quite different: both shrinking and non-shrinking recovery reduce MPI performance (MPICH 3.1) dramatically on collective communication, up to 14× worse, swamping any differences between the two approaches. This suggests that making MPI performance less sensitive to irregularity in performance and communicator size is critical for both recovery approaches.
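The trade-off described above can be sketched with a back-of-envelope model. This is an illustrative assumption, not the paper's methodology: the function names and penalty factors below are hypothetical, with shrinking recovery modeled as redistributing the failed ranks' work over the survivors, and non-shrinking recovery as keeping the rank count constant while a possibly off-topology spare slows communication.

```python
# Illustrative model of post-recovery efficiency (efficiency = ideal
# time / post-recovery time). All names and numbers are assumptions
# for illustration, not measurements from the paper.

def shrinking_efficiency(n_ranks, n_failed, comm_penalty=1.0):
    """After shrinking recovery, the failed ranks' work is spread over
    the survivors, so per-rank load grows by n_ranks / (n_ranks -
    n_failed); comm_penalty (>= 1) models any extra cost from the
    now-irregular communicator size."""
    survivors = n_ranks - n_failed
    return (survivors / n_ranks) / comm_penalty

def nonshrinking_efficiency(spare_penalty):
    """Non-shrinking recovery keeps the rank count constant, but the
    spare node may sit far away in the network topology; spare_penalty
    (>= 1) models the resulting communication slowdown."""
    return 1.0 / spare_penalty

# Example: 1024 ranks, one failure. If the off-topology spare costs
# roughly 7.5% in communication (the gap reported for the neighbor
# pattern), shrinking comes out ahead despite the slightly higher
# per-rank load.
s = shrinking_efficiency(1024, 1)    # ~0.999
ns = nonshrinking_efficiency(1.075)  # ~0.930
print(f"shrinking: {s:.3f}, non-shrinking: {ns:.3f}")
```

Under this toy model, shrinking wins whenever the spare's communication penalty exceeds the load-imbalance cost of losing one rank; the paper's collective-communication result shows that in practice a large MPI-level penalty can dwarf both terms.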


Keywords: Resilience · HPC · Post-recovery · Shrinking · Non-shrinking · Network topology



This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Award DE-SC0008603 and completed in part with resources provided by ALCF under Contract DE-AC02-06CH11357. We thank Ignacio Laguna and David Richards of LLNL for discussion and suggestion that assisted our work.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. University of Chicago, Chicago, USA
  2. Argonne National Laboratory, Lemont, USA
