A hybrid approach towards reduced checkpointing overhead in cloud-based applications


In recent years, the cloud has been widely used to host distributed applications. This expanding usage has made the environment more sensitive to failures, so most applications require an effective fault-tolerance mechanism. Such a mechanism involves both detection of and recovery from failures, and checkpointing has traditionally served this purpose. Conventional checkpointing methods, such as periodic checkpointing and application-based checkpointing, have also been tried in the cloud; however, periodic checkpointing is time inefficient and application-based checkpointing is space inefficient. Moreover, these methods have been implemented using a synchronous approach, which is inherently message inefficient, scales poorly, and incurs high synchronization latency. Asynchronous approaches, in turn, are not practically viable owing to their inability to detect failures. Because the cloud entails massive scalability, we propose a quasi-synchronous checkpointing algorithm for cloud-based distributed applications that exhibits better space efficiency while keeping latency under strict control. Our claims are substantiated with static analysis and suitable simulation experiments.
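
To make the term concrete, the following is a minimal sketch of a generic index-based quasi-synchronous rule. It is not the algorithm proposed in this article, whose details are not given in the abstract. Each process checkpoints on its own schedule without coordination messages, piggybacks its checkpoint sequence number on every outgoing message, and takes a forced checkpoint only when it receives a message carrying a higher number. The names `Process`, `Message`, `sn`, `basic_checkpoint` and `receive` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint sequence number, piggybacked on the message


@dataclass
class Process:
    name: str
    sn: int = 0                                      # number of the latest checkpoint
    state: list = field(default_factory=list)        # toy application state
    checkpoints: list = field(default_factory=list)  # (sn, kind, state snapshot)

    def _save(self, kind: str) -> None:
        # Store a copy of the state so later updates do not alter the snapshot.
        self.checkpoints.append((self.sn, kind, list(self.state)))

    def basic_checkpoint(self) -> None:
        """Local, timer-driven checkpoint: no coordination messages are sent."""
        self.sn += 1
        self._save("basic")

    def send(self, payload: str) -> Message:
        return Message(payload, self.sn)

    def receive(self, msg: Message) -> None:
        # Forced checkpoint: the sender is ahead, so checkpoint *before*
        # delivering the message to keep equal-numbered checkpoints consistent.
        if msg.sn > self.sn:
            self.sn = msg.sn
            self._save("forced")
        self.state.append(msg.payload)


p, q = Process("P"), Process("Q")
p.basic_checkpoint()        # P's timer fires: P.sn becomes 1
q.receive(p.send("m1"))     # Q is behind, so it is forced to checkpoint
p.receive(q.send("m2"))     # sequence numbers match: no checkpoint needed
print(q.checkpoints)        # [(1, 'forced', [])]
print(p.checkpoints)        # [(1, 'basic', [])]
```

In this sketch the only coordination cost is one integer per application message, which is the sense in which such schemes sit between fully synchronous and fully asynchronous checkpointing.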

Author information



Corresponding author

Correspondence to Bharati Sinha.

Cite this article

Sinha, B., Singh, A.K. & Saini, P. A hybrid approach towards reduced checkpointing overhead in cloud-based applications. Peer-to-Peer Netw. Appl. (2021).

  • Checkpoints
  • Failure detectors
  • Cloud computing
  • Crash faults