Optimised Recovery with a Coordinated Checkpoint/Rollback Protocol for Domain Decomposition Applications

  • Xavier Besseron
  • Thierry Gautier
Part of the Communications in Computer and Information Science book series (CCIS, volume 14)


Fault-tolerance protocols play an important role in today's long-running scientific parallel applications. The probability of a failure can be significant because of the number of unreliable components involved in an execution. In this paper we present our approach and preliminary results for a new checkpoint/rollback protocol based on a coordinated scheme. One feature of this protocol is that fault recovery requires only a partial restart of the other processes, thanks to the availability of an abstract representation of the execution. Simulations on a domain decomposition application show that the amount of computation required to restart and the number of processes involved are reduced compared with the classical global rollback protocol.
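To make the contrast between partial and global rollback concrete, the following is a minimal sketch of a toy model, not the protocol described in the paper. It assumes a 1D ring domain decomposition in which each process updates its subdomain from its own previous state and that of its two neighbours, that the last coordinated checkpoint holds iteration 0, and that intermediate results are not kept. The names `tasks_to_reexecute` and `stencil_graph` are hypothetical. Walking the data flow graph backwards from the task lost by the failed process gives the tasks, and hence the processes, that a partial restart would involve in this model.

```python
from collections import deque


def tasks_to_reexecute(deps, lost):
    """Return every task that must run again to rebuild the `lost` tasks.

    `deps` maps a task to the tasks whose outputs it reads. Tasks that do
    not appear as keys of `deps` are treated as checkpointed data and are
    never re-executed (assumption of this toy model).
    """
    needed = set()
    queue = deque(lost)
    while queue:
        task = queue.popleft()
        if task in needed or task not in deps:
            continue
        needed.add(task)
        queue.extend(deps[task])
    return needed


def stencil_graph(n_procs, n_iters):
    """Data flow graph of a 1D ring stencil.

    Task (p, t) is the update of subdomain p at iteration t; it reads
    iteration t - 1 of p and of its two ring neighbours. Iteration 0 is
    the checkpointed state and therefore has no entry in the graph.
    """
    deps = {}
    for t in range(1, n_iters + 1):
        for p in range(n_procs):
            deps[(p, t)] = [((p + d) % n_procs, t - 1) for d in (-1, 0, 1)]
    return deps


if __name__ == "__main__":
    n_procs, n_iters, failed = 16, 4, 8
    graph = stencil_graph(n_procs, n_iters)

    # Partial restart: rebuild only what the failed process's latest state needs.
    partial = tasks_to_reexecute(graph, [(failed, n_iters)])
    involved = {p for (p, _) in partial}

    # Global rollback: every process redoes every iteration since the checkpoint.
    print("partial restart:", len(partial), "tasks on", len(involved), "processes")
    print("global rollback:", len(graph), "tasks on", n_procs, "processes")
```

In this toy setting, the set of tasks to re-execute grows with the number of iterations since the checkpoint and with the width of the neighbour dependencies, so the benefit over a global rollback depends on how often checkpoints are taken relative to how quickly dependencies spread across the domain.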


grid, fault-tolerance, parallel computing, data flow graph





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Xavier Besseron (1)
  • Thierry Gautier (1)
  1. MOAIS Project, Laboratoire d’Informatique de Grenoble, ENSIMAG - Antenne de Montbonnot, Montbonnot Saint Martin, France
