Fault-Tolerant Scheme of Cloud Task Allocation Based on Deep Reinforcement Learning

  • Conference paper
Bio-Inspired Computing: Theories and Applications (BIC-TA 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1566)

Abstract

Resources in the cloud are prone to faults during task execution, which causes tasks to fail. Recent research mostly uses the Primary-Backup model (PB model) to handle fault-tolerant tasks, but the choice between the passive scheme and the active scheme is fixed in advance, so the respective advantages of the two schemes are not fully exploited. Based on deep reinforcement learning, this paper proposes an adaptive PB model selection algorithm, Active-Passive Scheme DQN (APSDQN). Fault tolerance for a faulty task is modeled as a Markov decision process: the passive scheme and the active scheme form the action space, and the shortest task completion time and the highest resource utilization serve as the reward feedback. Combined with state information from the real environment, the algorithm selects the most suitable fault-tolerant scheme for each faulty task, saving resources and improving the robustness of the cloud system. The experimental results show that APSDQN has some advantage in the total task finish time of task allocation, and significantly improves resource utilization and the task success rate in the cloud.
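
The decision loop described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the reward formula, the weights, the state encoding, or the network, so the names `Scheme`, `reward`, `select_scheme`, and `q_update`, the equal weights, and the normalization by `max_time` are hypothetical. A plain list of two Q-values stands in for the output of the deep Q-network.

```python
import random
from enum import Enum


class Scheme(Enum):
    PASSIVE = 0  # backup copy runs only if the primary copy fails
    ACTIVE = 1   # backup copy runs alongside the primary copy


def reward(completion_time, utilization, max_time, w_time=0.5, w_util=0.5):
    """Higher reward for shorter completion time and higher utilization.

    The weights and the normalization by max_time are illustrative
    assumptions, not values taken from the paper.
    """
    time_score = 1.0 - min(completion_time / max_time, 1.0)
    return w_time * time_score + w_util * utilization


def select_scheme(q_values, epsilon, rng=random):
    """Epsilon-greedy choice between the two fault-tolerant schemes."""
    if rng.random() < epsilon:
        return rng.choice(list(Scheme))  # explore
    return max(Scheme, key=lambda s: q_values[s.value])  # exploit


def q_update(q_values, scheme, r, next_q_values, alpha=0.1, gamma=0.9):
    """One temporal-difference update of the chosen scheme's Q-value."""
    target = r + gamma * max(next_q_values)
    q_values[scheme.value] += alpha * (target - q_values[scheme.value])
    return q_values
```

In a full DQN the list of Q-values would be replaced by a neural network evaluated on the current environment state, and `q_update` by a gradient step on the same temporal-difference target; the sketch only shows how the two schemes act as the action space and how the two objectives could be folded into one scalar reward.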

Author information

Corresponding author

Correspondence to Tingting Dong.

Copyright information

© 2022 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Tang, H., Tang, Z., Dong, T., Hai, Q., Xue, F. (2022). Fault-Tolerant Scheme of Cloud Task Allocation Based on Deep Reinforcement Learning. In: Pan, L., Cui, Z., Cai, J., Li, L. (eds) Bio-Inspired Computing: Theories and Applications. BIC-TA 2021. Communications in Computer and Information Science, vol 1566. Springer, Singapore. https://doi.org/10.1007/978-981-19-1253-5_5

  • DOI: https://doi.org/10.1007/978-981-19-1253-5_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-1252-8

  • Online ISBN: 978-981-19-1253-5

  • eBook Packages: Computer Science, Computer Science (R0)
