A hybrid approach towards reduced checkpointing overhead in cloud-based applications

Sinha, Bharati; Singh, Awadhesh Kumar; Saini, Poonam

doi:10.1007/s12083-021-01230-2

Published: 26 October 2021

Peer-to-Peer Networking and Applications, Volume 15, pages 473–483 (2022)

Abstract

In recent years, the cloud has been widely used to host numerous distributed applications. This expanding usage has introduced greater sensitivity into the environment, so most applications require an effective fault-tolerance mechanism. Such a mechanism involves both detection of and recovery from failures, and checkpointing has traditionally served this purpose. Conventional checkpointing methods, such as periodic checkpointing and application-based checkpointing, have also been tried in the cloud; however, periodic checkpointing is time inefficient and application-based checkpointing is space inefficient. Moreover, these methods have been implemented using a synchronous approach, which is inherently message inefficient, less scalable, and subject to high synchronization latency. Asynchronous approaches, in turn, are not practically viable owing to their inability to detect failures. Because the cloud entails massive scalability, we propose a quasi-synchronous checkpointing algorithm for cloud-based distributed applications that exhibits better space efficiency while keeping latency under strict control. Our claims are substantiated with static analysis and suitable simulation experiments.
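
To make the term "quasi-synchronous" concrete, the following is a minimal sketch of a generic sequence-number scheme of the kind found in the quasi-synchronous (communication-induced) checkpointing literature. It is not the algorithm proposed in this article, whose details are not available in this preview. The names `Process`, `Message`, `basic_checkpoint`, and the in-memory `checkpoints` list are all assumptions made for illustration. Each process takes local checkpoints on its own schedule and piggybacks its checkpoint sequence number on outgoing messages; a receiver that is behind the sender takes a forced checkpoint before delivering the message, with no extra control messages.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint sequence number, piggybacked on the message


@dataclass
class Process:
    name: str
    sn: int = 0  # current checkpoint sequence number
    state: list = field(default_factory=list)        # toy application state
    checkpoints: list = field(default_factory=list)  # (sn, kind, saved state)

    def _save(self, kind: str) -> None:
        # Hypothetical stable storage: here just an in-memory copy of the state.
        self.checkpoints.append((self.sn, kind, list(self.state)))

    def basic_checkpoint(self) -> None:
        """Local, uncoordinated checkpoint (for example, triggered by a timer)."""
        self.sn += 1
        self._save("basic")

    def send(self, payload: str) -> Message:
        self.state.append(f"sent:{payload}")
        return Message(payload, self.sn)

    def receive(self, msg: Message) -> None:
        # Forced checkpoint: taken BEFORE delivery when the sender is ahead,
        # so checkpoints sharing a sequence number stay mutually consistent.
        if msg.sn > self.sn:
            self.sn = msg.sn
            self._save("forced")
        self.state.append(f"recv:{msg.payload}")


p, q = Process("p"), Process("q")
p.basic_checkpoint()     # p advances to sequence number 1
q.receive(p.send("m1"))  # q is behind, so it takes a forced checkpoint first
p.receive(q.send("m2"))  # sequence numbers are equal, no checkpoint needed
```

In this sketch the only coordination cost is one integer per application message, which is the usual argument for why such schemes sit between fully synchronous and fully asynchronous checkpointing.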

Author information

Authors and Affiliations

Department of Computer Engineering, National Institute of Technology, Kurukshetra, 136119, India
Bharati Sinha & Awadhesh Kumar Singh
Department of Computer Science and Engineering, Punjab Engineering College, Chandigarh, 160012, India
Poonam Saini

Corresponding author

Correspondence to Bharati Sinha.

About this article

Cite this article

Sinha, B., Singh, A.K. & Saini, P. A hybrid approach towards reduced checkpointing overhead in cloud-based applications. Peer-to-Peer Netw. Appl. 15, 473–483 (2022). https://doi.org/10.1007/s12083-021-01230-2

Received: 17 March 2019
Accepted: 26 July 2021
Published: 26 October 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s12083-021-01230-2
