Abstract
Cloud environments rely on data centers with a huge number of computational resources, and the probability that some resource fails increases with scale. Failures make services unavailable, which reduces the reliability of the system. Application deployment in the cloud must therefore account for resource failures. In this work, we address reliability-aware scheduling of tasks with hard deadlines in the cloud environment. We design, analyze, and provide solutions for two special cases of the problem, where (a) tasks share a common deadline and run on machines with equal failure rates, and (b) tasks have equal execution times. For the general case of the problem, we propose two-phase heuristic approaches: the first phase orders the tasks, and the second maps the tasks to machines. The performance of different task ordering and task mapping approaches is evaluated through simulation using synthetic and real traces. Based on the simulation results, earliest-due-date ordering of tasks, combined with mapping the current task to the most reliable machine and dropping long tasks, performs better in general settings. We observe that task repetition and replication further improve the performance of the heuristics.
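The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the names `Machine` and `schedule_edd_most_reliable` are hypothetical, the reliability of a task is assumed to be exp(-λ·e) for failure rate λ and execution time e (a common exponential failure model that the abstract does not state), and a task is simply dropped when no machine can finish it by its deadline, which stands in for the long-task-dropping rule whose details are not given here.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Machine:
    failure_rate: float            # assumed constant failure rate per time unit
    free_at: float = 0.0           # time at which the machine becomes idle
    assigned: list = field(default_factory=list)


def schedule_edd_most_reliable(tasks, machines):
    """Two-phase sketch: order tasks, then map each task to a machine.

    tasks: iterable of (name, exec_time, deadline) tuples.
    Returns (accepted, dropped), where accepted holds (name, reliability).
    """
    accepted, dropped = [], []
    # Phase 1: earliest-due-date ordering (sort by deadline, stable for ties).
    for name, exec_time, deadline in sorted(tasks, key=lambda t: t[2]):
        # Phase 2: keep only machines that can finish the task on time.
        feasible = [m for m in machines if m.free_at + exec_time <= deadline]
        if not feasible:
            # Simplified dropping rule (an assumption, see the text above).
            dropped.append(name)
            continue
        # Map to the most reliable feasible machine (lowest failure rate).
        best = min(feasible, key=lambda m: m.failure_rate)
        best.free_at += exec_time
        best.assigned.append(name)
        # Assumed exponential model for the task's success probability.
        accepted.append((name, math.exp(-best.failure_rate * exec_time)))
    return accepted, dropped


machines = [Machine(failure_rate=0.01), Machine(failure_rate=0.05)]
tasks = [("a", 4, 5), ("b", 3, 5), ("c", 2, 6), ("d", 5, 6)]
accepted, dropped = schedule_edd_most_reliable(tasks, machines)
print(accepted, dropped)
```

In this toy instance, task `d` cannot meet its deadline on either machine once the earlier tasks are placed, so it is dropped. Repetition or replication, which the abstract reports as further improvements, would be layered on top of this mapping step and are not shown.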
Cite this article
Swain, C.K., Saini, N. & Sahu, A. Reliability aware scheduling of bag of real time tasks in cloud environment. Computing 102, 451–475 (2020). https://doi.org/10.1007/s00607-019-00749-w
DOI: https://doi.org/10.1007/s00607-019-00749-w