Abstract
We consider the external recovery problem, in which a system is divided into autonomous subsystems that can be recovered only by logging the messages exchanged between them. This raises a question: what restrictions on the subsystems' autonomy are required to make external recovery possible? We present example solutions, each affecting a different aspect of the system's independence.
This work was supported by the Polish National Science Center under Grant No. DEC-2011/03/D/ST6/01331.
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Danilecki, A., Hołenko, M., Kobusińska, A., Zierhoffer, P. (2014). The External Recovery Problem. In: Lopes, L., et al. Euro-Par 2014: Parallel Processing Workshops. Euro-Par 2014. Lecture Notes in Computer Science, vol 8805. Springer, Cham. https://doi.org/10.1007/978-3-319-14325-5_46
DOI: https://doi.org/10.1007/978-3-319-14325-5_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14324-8
Online ISBN: 978-3-319-14325-5
eBook Packages: Computer Science (R0)