The External Recovery Problem

  • Arkadiusz Danilecki
  • Mateusz Hołenko
  • Anna Kobusińska
  • Piotr Zierhoffer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8805)


We consider the external recovery problem, in which a system is divided into autonomous subsystems that can be recovered only by logging the messages exchanged between them. The question is: what restrictions on the subsystems' autonomy are required to make external recovery possible? We present example solutions, each affecting a different aspect of the system's independence.


Message logging, fault tolerance, checkpointing, distributed system





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Arkadiusz Danilecki (1)
  • Mateusz Hołenko (1)
  • Anna Kobusińska (1)
  • Piotr Zierhoffer (1)
  1. Institute of Computing Science, Poznań University of Technology, Poland
