Fault-Tolerant Scheme of Cloud Task Allocation Based on Deep Reinforcement Learning

Tang, Hengliang; Tang, Zifang; Dong, Tingting; Hai, Qiuru; Xue, Fei

doi:10.1007/978-981-19-1253-5_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1566)

Included in the following conference series:

International Conference on Bio-Inspired Computing: Theories and Applications

Abstract

Resources in the cloud are prone to faults during task execution, which causes tasks to fail. Recent research mostly uses the Primary-Backup model (PB model) to handle fault-tolerant tasks, but the choice between the passive scheme and the active scheme is fixed in advance, so the respective advantages of the two schemes are not fully exploited. Based on deep reinforcement learning, this paper proposes an adaptive PB model selection algorithm, Active-Passive Scheme DQN (APSDQN). Fault tolerance for a faulty task is modeled as a Markov decision process: the passive scheme and the active scheme form the action space, and the shortest task completion time and the highest resource utilization serve as the reward feedback. Combined with state information from the real environment, the algorithm selects the most suitable fault-tolerant scheme for each faulty task, which saves resources and improves the robustness of the cloud system. The experimental results show that APSDQN has some advantage in the total task finish time of task allocation, and significantly improves resource utilization and the task success rate in the cloud.
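The decision process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: tabular Q-learning stands in for the DQN, each fault is treated as a one-step episode, and the state labels (`low_risk`, `high_risk`), the outcome numbers in `toy_outcome`, and the reward weights are all hypothetical values chosen only for illustration. The one assumption taken from the general PB model is that the active scheme runs the backup alongside the primary, while the passive scheme starts the backup only after the primary fails.

```python
import random
from collections import defaultdict

ACTIONS = ("passive", "active")  # the two PB schemes named in the abstract


def reward(completion_time, utilization, w_time=1.0, w_util=1.0):
    """Hypothetical reward: shorter completion time and higher utilization score higher."""
    return -w_time * completion_time + w_util * utilization


def toy_outcome(state, action):
    """Made-up (completion_time, utilization) pairs, for illustration only."""
    if action == "active":
        return (1.0, 0.5)  # backup runs alongside the primary: fast but redundant
    if state == "high_risk":
        return (2.0, 0.8)  # passive backup starts late when the primary fails
    return (1.0, 0.9)      # passive backup is rarely needed


class SchemeSelector:
    """Tabular Q-learning stand-in for the DQN; a state is any hashable value."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def choose(self, state, greedy=False):
        # Epsilon-greedy: explore occasionally, otherwise take the best-valued scheme.
        if not greedy and self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        values = self.q[state]
        return max(ACTIONS, key=values.get)

    def update(self, state, action, r, next_state=None):
        # next_state=None marks a terminal transition (no bootstrapped value).
        best_next = 0.0 if next_state is None else max(self.q[next_state].values())
        target = r + self.gamma * best_next
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def train(selector, steps=2000):
    states = ("low_risk", "high_risk")
    for _ in range(steps):
        state = selector.rng.choice(states)
        action = selector.choose(state)
        r = reward(*toy_outcome(state, action))
        selector.update(state, action, r, next_state=None)
    return selector


if __name__ == "__main__":
    selector = train(SchemeSelector())
    for s in ("low_risk", "high_risk"):
        print(s, "->", selector.choose(s, greedy=True))
```

With these toy numbers the learned policy favors the passive scheme when faults are unlikely and the active scheme when they are likely, which is the kind of adaptive selection the abstract argues for. A faithful reproduction would replace the Q-table with a neural network and the toy outcomes with measurements from a cloud simulator.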

Author information

Authors and Affiliations

School of Information, Beijing Wuzi University, Beijing, 101149, China
Hengliang Tang, Zifang Tang, Qiuru Hai & Fei Xue
Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
Tingting Dong

Corresponding author

Correspondence to Tingting Dong.

Editor information

Editors and Affiliations

Huazhong University of Science and Technology, Wuhan, China
Linqiang Pan
Taiyuan University of Science and Technology, Taiyuan, China
Zhihua Cui
Taiyuan University of Science and Technology, Taiyuan, China
Jianghui Cai
Huazhong University of Science and Technology, Wuhan, China
Lianghao Li

About this paper

Cite this paper

Tang, H., Tang, Z., Dong, T., Hai, Q., Xue, F. (2022). Fault-Tolerant Scheme of Cloud Task Allocation Based on Deep Reinforcement Learning. In: Pan, L., Cui, Z., Cai, J., Li, L. (eds) Bio-Inspired Computing: Theories and Applications. BIC-TA 2021. Communications in Computer and Information Science, vol 1566. Springer, Singapore. https://doi.org/10.1007/978-981-19-1253-5_5

DOI: https://doi.org/10.1007/978-981-19-1253-5_5
Published: 24 March 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-1252-8
Online ISBN: 978-981-19-1253-5
eBook Packages: Computer Science, Computer Science (R0)
