Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources

Javadi, Bahman; Thulasiraman, Parimala; Buyya, Rajkumar

doi:10.1007/s11227-012-0826-2


Published: 13 October 2012

Volume 63, pages 467–489, (2013)

The Journal of Supercomputing


Abstract

In this paper, we investigate Cloud computing resource provisioning to extend the computing capacity of local clusters in the presence of failures. We consider three steps in resource provisioning: resource brokering, dispatch sequences, and scheduling. The proposed brokering strategy is based on the stochastic analysis of routing in distributed parallel queues and takes into account the response times of the Cloud provider and the local cluster as well as the computing cost of both sides. Moreover, we propose dispatching with probabilistic and deterministic sequences to redirect requests to the resource providers. We also incorporate checkpointing into some well-known scheduling algorithms to provide a fault-tolerant environment. We propose two cost-aware and failure-aware provisioning policies that can be used by an organization that operates a cluster managed by virtual machine technology and seeks to use resources from a public Cloud provider. Simulation results demonstrate that the proposed policies improve the response time of users’ requests by a factor of 4.10 under a moderate load, at a limited cost on a public Cloud.
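
The difference between probabilistic and deterministic dispatch sequences can be illustrated with a minimal sketch. It is not the authors' implementation: the routing probability `p_cloud` is assumed to be supplied by the brokering step, the function names are hypothetical, and the deterministic rule shown (send request k to the Cloud when floor((k+1)p) exceeds floor(kp)) is just one simple way to space redirected requests evenly.

```python
import math
import random

LOCAL, CLOUD = "local", "cloud"


def probabilistic_sequence(p_cloud, n, seed=None):
    """Bernoulli dispatching: each of n requests goes to the cloud
    independently with probability p_cloud (assumed given by the broker)."""
    rng = random.Random(seed)
    return [CLOUD if rng.random() < p_cloud else LOCAL for _ in range(n)]


def deterministic_sequence(p_cloud, n):
    """Evenly spaced dispatching: request k goes to the cloud exactly when
    floor((k + 1) * p_cloud) is larger than floor(k * p_cloud)."""
    return [
        CLOUD if math.floor((k + 1) * p_cloud) > math.floor(k * p_cloud) else LOCAL
        for k in range(n)
    ]


def cloud_share(sequence):
    """Fraction of requests in the sequence that were sent to the cloud."""
    return sequence.count(CLOUD) / len(sequence)


if __name__ == "__main__":
    p = 0.25  # illustrative value only, not taken from the paper
    print(deterministic_sequence(p, 8))
    print(probabilistic_sequence(p, 8, seed=1))
```

Both sequences redirect roughly the same long-run share of requests to the Cloud, but the deterministic one removes the randomness in the spacing between redirected requests, so neither provider sees an avoidable burst.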


Notes

http://www.Cloudbus.org/.
http://haizea.cs.uchicago.edu.
There are several approximations for this queue in the literature, but we choose one which is a good estimate for heavily loaded systems.
This assumption is made just to focus on performance degradation due to failure.
This is the maximum amount of data for a real scientific workflow application [40].
The network latency is negligible as it is less than a second for public Cloud environments [41].
All prices were obtained at the time of writing this paper, during May–June 2011.

References

Sotomayor B, Montero RS, Llorente IM, Foster I (2009) Virtual infrastructure management in private and hybrid clouds. IEEE Internet Comput 13(5):14–22
Kondo D, Javadi B, Malecot P, Cappello F, Anderson DP (2009) Cost-benefit analysis of Cloud computing versus desktop grids. In: Proceedings of the 23rd IEEE international parallel and distributed processing symposium (IPDPS 2009), Rome, Italy. IEEE Computer Society, Washington, pp 1–12
Deelman E, Singh G, Livny M, Berriman B, Good J (2008) The cost of doing science on the Cloud: the montage example. In: Proceedings of the 19th ACM/IEEE international conference on supercomputing (SC 2008), Austin, Texas. IEEE Press, Piscataway, pp 1–12
Palankar MR, Iamnitchi A, Ripeanu M, Garfinkel S (2008) Amazon S3 for science Grids: a viable solution? In: Proceedings of the 1st international workshop on data-aware distributed computing (DADC’08) in conjunction with HPDC 2008, Boston, MA. ACM, New York, pp 55–64
de Assunção MD, di Costanzo A, Buyya R (2009) Evaluating the cost-benefit of using cloud computing to extend the capacity of clusters. In: Proceedings of the 18th international symposium on high performance parallel and distributed computing (HPDC 2009), Garching, Germany. ACM, New York, pp 141–150
Kondo D, Javadi B, Iosup A, Epema DHJ (2010) The failure trace archive: enabling comparative analysis of failures in diverse distributed systems. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid 2010), Melbourne, Australia. IEEE Computer Society, Washington, pp 398–407
Anselmi J, Gaujal B (2010) Optimal routing in parallel, non-observable queues and the price of anarchy revisited. In: 22nd international teletraffic congress (ITC), Amsterdam, The Netherlands
di Costanzo A, de Assunção MD, Buyya R (2009) Harnessing cloud technologies for a virtualized distributed computing infrastructure. IEEE Internet Comput 13(5):24–33
Fontán J, Vázquez T, Gonzalez L, Montero RS, Llorente IM (2008) OpenNEbula: the open source virtual machine manager for cluster computing. In: Open source grid and cluster software conference, book of abstracts, San Francisco, CA
Nurmi D, Wolski R, Grzegorczyk C, Obertelli G, Soman S, Youseff L, Zagorodnov D (2009) The Eucalyptus open-source cloud-computing system. In: Proceedings of the 9th IEEE/ACM international symposium on cluster computing and the grid (CCGrid 2009), Shanghai, China. IEEE Computer Society, Washington, pp 124–131
Vecchiola C, Chu X, Buyya R (2009) Aneka: a software platform for .NET-based cloud computing. IOS Press, Amsterdam, pp 267–295
Amazon Inc., Amazon elastic compute cloud (Amazon EC2). http://aws.amazon.com/ec2
Tatezono M, Maruyama N, Matsuoka S (2006) Making wide-area, multi-site MPI feasible using Xen VM. In: Proceedings of the 4th workshop on frontiers of high performance computing and networking in conjunction with ISPA 2006, Sorrento, Italy. Springer, Berlin, pp 387–396
Iosup A, Epema DHJ, Tannenbaum T, Farrellee M, Livny M (2007) Inter-operating grids through delegated matchmaking. In: Proceedings of the 18th ACM/IEEE conference on supercomputing (SC 2007), Reno, Nevada. ACM, New York, pp 1–12
Balazinska M, Balakrishnan H, Stonebraker M (2004) Contract-based load management in federated distributed systems. In: Proceedings of the 1st symposium on networked systems design and implementation (NSDI 2004), San Francisco, CA. USENIX Association, Berkeley, pp 197–210
Irwin D, Chase J, Grit L, Yumerefendi A, Becker D, Yocum KG (2006) Sharing networked resources with brokered leases. In: Proceedings of the USENIX annual technical conference, Boston, MA. USENIX Association, Berkeley, pp 199–212
Grit L, Irwin D, Yumerefendi A, Chase J (2006) Virtual machine hosting for networked clusters: building the foundations for ‘autonomic’ orchestration. In: Proceedings of the 1st international workshop on virtualization technology in distributed computing (VTDC 2006), Tampa, Florida. IEEE Computer Society, Washington, pp 7–15
Ruth P, McGachey P, Xu D (2005) VioCluster: virtualization for dynamic computational domain. In: Proceedings of the 7th IEEE international conference on cluster computing (Cluster 2005), Burlington, MA. IEEE Press, Piscataway, pp 1–10
Rubio-Montero AJ, Huedo E, Montero RS, Llorente IM (2007) Management of virtual machines on globus grids using GridWay. In: Proceedings of the 21st IEEE international parallel and distributed processing symposium (IPDPS 2007), Long Beach, USA. IEEE Press, Piscataway, pp 1–7
Huedo E, Montero RS, Llorente IM (2010) Grid architecture from a metascheduling perspective. Computer 43(7):51–56
Garfinkel S (2007) Commodity grid computing with Amazon’s S3 and EC2. Usenix Login 32(1):7–13
Marshall P, Keahey K, Freeman T (2010) Elastic site: using clouds to elastically extend site resources. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid 2010), Melbourne, Australia. IEEE Computer Society, Washington, pp 43–52
Moschakis I, Karatza H (2010) Evaluation of gang scheduling performance and cost in a cloud computing system. J Supercomput 1:1–18
Guo X, Lu Y, Squillante MS (2004) Optimal probabilistic routing in distributed parallel queues. ACM SIGMETRICS Perform Eval Rev 32(2):53–54
Ross SM (1997) Stochastic processes, 2nd edn. Wiley, New York
Hordijk A, van der Laan D (2004) Periodic routing to parallel queues and billiard sequences. Math Methods Oper Res 59:173–192
Mu’alem AW, Feitelson DG (2001) Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans Parallel Distrib Syst 12(6):529–543
Lifka DA (1995) The ANL/IBM SP scheduling system. In: Proceedings of the 1st workshop on job scheduling strategies for parallel processing (JSSPP’95), Santa Barbara, CA. Springer, London, pp 295–303
Srinivasan S, Kettimuthu R, Subramani V, Sadayappan P (2002) Selective reservation strategies for backfill job scheduling. In: Proceedings of the 8th international workshop on job scheduling strategies for parallel processing (JSSPP’02), Edinburgh, Scotland, UK. Springer, London, pp 55–71
Bouguerra M, Gautier T, Trystram D, Vincent J-M (2010) A flexible checkpoint/restart model in distributed systems. In: Proceedings of the 9th international conference on parallel processing and applied mathematics (PPAM 2010), Torun, Poland. Springer, Berlin, pp 206–215
Kleinrock L, Korfhage W (1993) Collecting unused processing capacity: an analysis of transient distributed systems. IEEE Trans Parallel Distrib Syst 4(5):535–546
Varia J (2011) Best practices in architecting cloud applications in the AWS cloud. Wiley, Hoboken, pp 459–490
Hoelzle U, Barroso LA (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Morgan and Claypool Publishers, San Rafael
Ostermann S, Iosup A, Yigitbasi N, Prodan R, Fahringer T, Epema D (2009) A performance analysis of EC2 Cloud computing services for scientific computing. In: Proceedings of the 1st international conference on cloud computing (CloudComp 2009), Beijing, China. Springer, Berlin, pp 115–131
Calheiros RN, Ranjan R, Beloglazov A, De Rose CAF, Buyya R (2011) CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw Pract Exp 41(1):23–50
Grimme C, Lepping J, Papaspyrou A (2008) Prospects of collaboration between compute providers by means of job interchange. In: 13th job scheduling strategies for parallel processing. Lecture notes in computer science, vol 4942. Springer, Berlin, pp 132–151
Feitelson DG, Rudolph L, Schwiegelshohn U, Sevcik KC, Wong P (1997) Theory and practice in parallel job scheduling. In: Proceedings of the 3rd international workshop on job scheduling strategies for parallel processing (JSSPP’97), Seattle, WA. Springer, London, pp 1–34
Iosup A, Li H, Jan M, Anoep S, Dumitrescu C, Wolters L, Epema DHJ (2008) The grid workloads archive. Future Gener Comput Syst 24(7):672–686
Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Proceedings of the 10th international workshop on job scheduling strategies for parallel processing (JSSPP’04), New York, USA. Springer, Berlin, pp 176–193
Pandey S, Voorsluys W, Rahman M, Buyya R, Dobson JE, Chiu K (2009) A grid workflow environment for brain imaging analysis on distributed systems. Concurr Comput 21(16):2118–2139
CloudHarmony, http://cloudharmony.com/


Acknowledgements

The authors would like to thank Jonatha Anselmi, Rodrigo N. Calheiros, Mohsen Amini, and Amir Vahid for useful discussions.

Author information

Authors and Affiliations

School of Computing, Engineering and Mathematics, University of Western Sydney, Sydney, Australia
Bahman Javadi
InterDisciplinary Evolving Algorithmic Sciences (IDEAS) Laboratory, Department of Computer Science, University of Manitoba, Winnipeg, Canada
Parimala Thulasiraman
Cloud Computing and Distributed Systems (CLOUDS) Laboratory, Department of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Rajkumar Buyya


Corresponding author

Correspondence to Bahman Javadi.


About this article

Cite this article

Javadi, B., Thulasiraman, P. & Buyya, R. Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources. J Supercomput 63, 467–489 (2013). https://doi.org/10.1007/s11227-012-0826-2


Published: 13 October 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s11227-012-0826-2
