Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud

Wireless Personal Communications

Abstract

Cloud computing offers seemingly unlimited resources and a suitable environment for executing large-scale computing applications. However, it is also susceptible to frequent failures, which can adversely affect both users and service providers. Fault tolerance techniques are therefore necessary for the reliable execution of applications in the cloud. This work presents checkpointing-based fault tolerance protocols for two types of distributed applications. The first type is the Bag-of-Tasks (BoT) application, which comprises a set of independent tasks that do not communicate with each other during execution. Because the tasks are independent, an uncoordinated checkpointing algorithm is proposed for fault tolerance of BoT applications. The second type is the large-scale distributed application composed of multiple tasks that depend on each other through inter-task message passing. An uncoordinated checkpointing and message logging protocol is presented for this type of application. The proposed protocols use storage at the edge switches of a data center to reduce the bandwidth consumed in saving checkpoints and message logs. Simulation results demonstrate that the proposed protocols achieve a higher rate of successful recovery from failures and incur lower resource overhead than other contemporary and related schemes.
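The idea of uncoordinated checkpointing for independent tasks can be illustrated with a minimal Python sketch. This is not the protocol proposed in the article, whose details are not given in the abstract; it only shows the general technique. The names `CheckpointStore`, `run_task`, `run_bag`, the `interval` parameter, and the injected `fail_at` crash are all hypothetical, and an in-memory dictionary stands in for storage located near the workers.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointStore:
    """Hypothetical stand-in for checkpoint storage close to the workers."""
    saved: dict = field(default_factory=dict)

    def save(self, task_id, state):
        # Copy the state so later updates by the task do not alter the checkpoint.
        self.saved[task_id] = dict(state)

    def load(self, task_id):
        # A task with no checkpoint starts from its initial state.
        return dict(self.saved.get(task_id, {"next_item": 0, "total": 0}))


def run_task(task_id, items, store, interval, fail_at=None):
    """Sum `items`, saving a checkpoint after every `interval` items.

    The task resumes from its own latest checkpoint; no other task is consulted.
    `fail_at` injects a simulated crash just before that item index is processed.
    """
    state = store.load(task_id)
    for i in range(state["next_item"], len(items)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"task {task_id} failed at item {i}")
        state["total"] += items[i]
        state["next_item"] = i + 1
        if state["next_item"] % interval == 0:
            store.save(task_id, state)
    return state["total"]


def run_bag(bag, store, interval, failures=None):
    """Run every task in the bag, restarting a failed task from its checkpoint."""
    failures = dict(failures or {})
    results = {}
    for task_id, items in bag.items():
        try:
            results[task_id] = run_task(
                task_id, items, store, interval, fail_at=failures.pop(task_id, None)
            )
        except RuntimeError:
            # Only the failed task rolls back: independent tasks need no
            # coordinated rollback with the rest of the bag.
            results[task_id] = run_task(task_id, items, store, interval)
    return results


if __name__ == "__main__":
    bag = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30]}
    print(run_bag(bag, CheckpointStore(), interval=2, failures={"a": 3}))
```

In this sketch, task `a` crashes before its fourth item, reloads the checkpoint taken after its second item, and redoes only the work done since then, while task `b` is unaffected. Applications whose tasks exchange messages would additionally need something like message logging so that a restarted task can replay what it received, which this sketch does not attempt.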

Author information

Corresponding author

Correspondence to Parmeet Kaur.

About this article

Cite this article

Kumari, P., Kaur, P. Checkpointing Algorithms for Fault-Tolerant Execution of Large-Scale Distributed Applications in Cloud. Wireless Pers Commun 117, 1853–1877 (2021). https://doi.org/10.1007/s11277-020-07949-0

  • DOI: https://doi.org/10.1007/s11277-020-07949-0
