Impact of Over-Decomposition on Coordinated Checkpoint/Rollback Protocol

  • Xavier Besseron
  • Thierry Gautier
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7156)


Failure-free execution will become rare on future exascale computers, so fault tolerance is now an active field of research. In this paper, we study how decomposing an application into much more parallelism than the physical parallelism affects the rollback step of fault-tolerant coordinated protocols. This over-decomposition gives the runtime a better opportunity to balance the workload after a failure without the need for spare nodes, while preserving performance. We show that the overhead on normal execution remains low for relevant factors of over-decomposition. With over-decomposition, restarting execution on the remaining nodes after failures shows very good performance compared to the classic decomposition approach: our experiments show that the execution time after restart can be reduced by 42%. We also consider a partial restart protocol that reduces the amount of lost work in case of failure by tracking the task dependencies inside processes. In some cases, thanks to over-decomposition, this partial restart time can represent only 54% of the global restart time.
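The load-balancing argument can be illustrated with a toy model. The sketch below is not the authors' protocol or runtime: it assumes equal-sized, independent tasks, ignores communication and checkpoint costs, and uses a hypothetical function name (`time_after_failure`) and illustrative node counts that do not come from the paper. It only shows why finer-grained tasks redistribute more evenly over the surviving nodes.

```python
def time_after_failure(nodes, factor, failed=1):
    """Per-iteration time after `failed` nodes are lost, in units of the
    failure-free per-iteration time (toy model, assumptions stated above).

    nodes  -- number of physical nodes before the failure
    factor -- over-decomposition factor (tasks per node); 1 = classic decomposition
    """
    tasks = nodes * factor            # total number of equal-sized tasks
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving node to restart on")
    # The most loaded survivor receives ceil(tasks / survivors) tasks,
    # and each task costs 1/factor of a node's original workload.
    max_tasks = -(-tasks // survivors)  # integer ceiling division
    return max_tasks / factor


if __name__ == "__main__":
    for factor in (1, 2, 10, 100):
        print(factor, time_after_failure(nodes=100, factor=factor))
```

In this model, with 100 nodes and one failure, classic decomposition (factor 1) forces one survivor to run two whole tasks, so each iteration takes twice as long, while higher factors approach the ideal slowdown of 100/99. These figures are properties of the toy model only and are not the paper's measurements.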


Keywords: parallel computing, checkpoint/rollback, over-decomposition, global restart, partial restart





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Xavier Besseron (1)
  • Thierry Gautier (2)
  1. Dept. of Computer Science and Engineering, The Ohio State University, USA
  2. MOAIS Project, INRIA, France
