Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources

The Journal of Supercomputing

Abstract

In this paper, we investigate Cloud computing resource provisioning to extend the computing capacity of local clusters in the presence of failures. We consider three steps in resource provisioning: resource brokering, dispatch sequences, and scheduling. The proposed brokering strategy is based on the stochastic analysis of routing in distributed parallel queues and takes into account the response time of the Cloud provider and the local cluster while considering the computing cost of both sides. Moreover, we propose dispatching with probabilistic and deterministic sequences to redirect requests to the resource providers. We also incorporate checkpointing in some well-known scheduling algorithms to provide a fault-tolerant environment. We propose two cost-aware and failure-aware provisioning policies that can be utilized by an organization that operates a cluster managed by virtual machine technology and seeks to use resources from a public Cloud provider. Simulation results demonstrate that the proposed policies improve the response time of users’ requests by a factor of 4.10 under a moderate load with a limited cost on a public Cloud.
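
The abstract does not spell out the dispatch sequences, so the following is only a minimal sketch of the difference between probabilistic and deterministic dispatching. It assumes the brokering step has already produced a routing probability for the Cloud (here the hypothetical parameter `p_cloud`); the function names and the balanced-sequence rule are illustrative choices, not necessarily the authors' construction.

```python
import math
import random


def probabilistic_sequence(p_cloud, n, seed=0):
    """Bernoulli dispatching: each request independently goes to the
    Cloud with probability p_cloud, otherwise to the local cluster."""
    rng = random.Random(seed)
    return ["cloud" if rng.random() < p_cloud else "cluster" for _ in range(n)]


def deterministic_sequence(p_cloud, n):
    """Balanced deterministic dispatching (an assumed, illustrative rule).

    Request k goes to the Cloud exactly when
    floor((k + 1) * p_cloud) - floor(k * p_cloud) == 1,
    so the first n requests send floor(n * p_cloud) requests to the
    Cloud, spaced as evenly as possible.
    """
    seq = []
    for k in range(n):
        to_cloud = math.floor((k + 1) * p_cloud) - math.floor(k * p_cloud) == 1
        seq.append("cloud" if to_cloud else "cluster")
    return seq


if __name__ == "__main__":
    print(deterministic_sequence(0.25, 8))
    print(probabilistic_sequence(0.25, 8))
```

Both sequences send the same long-run fraction of requests to the Cloud. The deterministic one removes the randomness in the gaps between Cloud-bound requests, which is the usual motivation for periodic routing to parallel queues.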

Notes

  1. http://www.Cloudbus.org/.

  2. http://haizea.cs.uchicago.edu.

  3. There are several approximations for this queue in the literature, but we choose one which is a good estimate for heavily loaded systems.

  4. This assumption is made just to focus on performance degradation due to failure.

  5. This is the maximum amount of data for a real scientific workflow application [40].

  6. The network latency is negligible as it is less than a second for public Cloud environments [41].

  7. All prices were obtained at the time of writing, during May–June 2011.

References

  1. Sotomayor B, Montero RS, Llorente IM, Foster I (2009) Virtual infrastructure management in private and hybrid clouds. IEEE Internet Comput 13(5):14–22

  2. Kondo D, Javadi B, Malecot P, Cappello F, Anderson DP (2009) Cost-benefit analysis of Cloud computing versus desktop grids. In: Proceedings of the 23rd IEEE international parallel and distributed processing symposium (IPDPS 2009), Rome, Italy. IEEE Computer Society, Washington, pp 1–12

  3. Deelman E, Singh G, Livny M, Berriman B, Good J (2008) The cost of doing science on the Cloud: the montage example. In: Proceedings of the 19th ACM/IEEE international conference on supercomputing (SC 2008), Austin, Texas. IEEE Press, Piscataway, pp 1–12

  4. Palankar MR, Iamnitchi A, Ripeanu M, Garfinkel S (2008) Amazon S3 for science Grids: a viable solution? In: Proceedings of the 1st international workshop on data-aware distributed computing (DADC’08) in conjunction with HPDC 2008, Boston, MA. ACM, New York, pp 55–64

  5. de Assunção MD, di Costanzo A, Buyya R (2009) Evaluating the cost-benefit of using cloud computing to extend the capacity of clusters. In: Proceedings of the 18th international symposium on high performance parallel and distributed computing (HPDC 2009), Garching, Germany. ACM, New York, pp 141–150

  6. Kondo D, Javadi B, Iosup A, Epema DHJ (2010) The failure trace archive: enabling comparative analysis of failures in diverse distributed systems. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid 2010), Melbourne, Australia. IEEE Computer Society, Washington, pp 398–407

  7. Anselmi J, Gaujal B (2010) Optimal routing in parallel, non-observable queues and the price of anarchy revisited. In: 22nd international teletraffic congress (ITC), Amsterdam, The Netherlands

  8. di Costanzo A, de Assunção MD, Buyya R (2009) Harnessing cloud technologies for a virtualized distributed computing infrastructure. IEEE Internet Comput 13(5):24–33

  9. Fontán J, Vázquez T, Gonzalez L, Montero RS, Llorente IM (2008) OpenNEbula: the open source virtual machine manager for cluster computing. In: Open source grid and cluster software conference, book of abstracts, San Francisco, CA

  10. Nurmi D, Wolski R, Grzegorczyk C, Obertelli G, Soman S, Youseff L, Zagorodnov D (2009) The Eucalyptus open-source cloud-computing system. In: Proceedings of the 9th IEEE/ACM international symposium on cluster computing and the grid (CCGrid 2009), Shanghai, China. IEEE Computer Society, Washington, pp 124–131

  11. Vecchiola C, Chu X, Buyya R (2009) Aneka: a software platform for .NET-based cloud computing. IOS Press, Amsterdam, pp 267–295

  12. Amazon Inc., Amazon elastic compute cloud (Amazon EC2). http://aws.amazon.com/ec2

  13. Tatezono M, Maruyama N, Matsuoka S (2006) Making wide-area, multi-site MPI feasible using Xen VM. In: Proceedings of the 4th workshop on frontiers of high performance computing and networking in conjunction with ISPA 2006, Sorrento, Italy. Springer, Berlin, pp 387–396

  14. Iosup A, Epema DHJ, Tannenbaum T, Farrellee M, Livny M (2007) Inter-operating grids through delegated matchmaking. In: Proceedings of the 18th ACM/IEEE conference on supercomputing (SC 2007), Reno, Nevada. ACM, New York, pp 1–12

  15. Balazinska M, Balakrishnan H, Stonebraker M (2004) Contract-based load management in federated distributed systems. In: Proceedings of the 1st symposium on networked systems design and implementation (NSDI 2004), San Francisco, CA. USENIX Association, Berkeley, pp 197–210

  16. Irwin D, Chase J, Grit L, Yumerefendi A, Becker D, Yocum KG (2006) Sharing networked resources with brokered leases. In: Proceedings of the USENIX annual technical conference, Boston, MA. USENIX Association, Berkeley, pp 199–212

  17. Grit L, Irwin D, Yumerefendi A, Chase J (2006) Virtual machine hosting for networked clusters: building the foundations for ‘autonomic’ orchestration. In: Proceedings of the 1st international workshop on virtualization technology in distributed computing (VTDC 2006), Tampa, Florida. IEEE Computer Society, Washington, pp 7–15

  18. Ruth P, McGachey P, Xu D (2005) VioCluster: virtualization for dynamic computational domain. In: Proceedings of the 7th IEEE international conference on cluster computing (Cluster 2005), Burlington, MA. IEEE Press, Piscataway, pp 1–10

  19. Rubio-Montero AJ, Huedo E, Montero RS, Llorente IM (2007) Management of virtual machines on globus grids using GridWay. In: Proceedings of the 21st IEEE international parallel and distributed processing symposium (IPDPS 2007), Long Beach, USA. IEEE Press, Piscataway, pp 1–7

  20. Huedo E, Montero RS, Llorente IM (2010) Grid architecture from a metascheduling perspective. Computer 43(7):51–56

  21. Garfinkel S (2007) Commodity grid computing with Amazon’s S3 and EC2. Usenix Login 32(1):7–13

  22. Marshall P, Keahey K, Freeman T (2010) Elastic site: using clouds to elastically extend site resources. In: Proceedings of the 10th IEEE/ACM international conference on cluster, cloud and grid computing (CCGrid 2010), Melbourne, Australia. IEEE Computer Society, Washington, pp 43–52

  23. Moschakis I, Karatza H (2010) Evaluation of gang scheduling performance and cost in a cloud computing system. J Supercomput 1:1–18

  24. Guo X, Lu Y, Squillante MS (2004) Optimal probabilistic routing in distributed parallel queues. ACM SIGMETRICS Perform Eval Rev 32(2):53–54

  25. Ross SM (1997) Stochastic processes, 2nd edn. Wiley, New York

  26. Hordijk A, van der Laan D (2004) Periodic routing to parallel queues and billiard sequences. Math Methods Oper Res 59:173–192

  27. Mu’alem AW, Feitelson DG (2001) Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Trans Parallel Distrib Syst 12(6):529–543

  28. Lifka DA (1995) The ANL/IBM SP scheduling system. In: Proceedings of the 1st workshop on job scheduling strategies for parallel processing (JSSPP’95), Santa Barbara, CA. Springer, London, pp 295–303

  29. Srinivasan S, Kettimuthu R, Subramani V, Sadayappan P (2002) Selective reservation strategies for backfill job scheduling. In: Proceedings of the 8th international workshop on job scheduling strategies for parallel processing (JSSPP’02), Edinburgh, Scotland, UK. Springer, London, pp 55–71

  30. Bouguerra M, Gautier T, Trystram D, Vincent J-M (2010) A flexible checkpoint/restart model in distributed systems. In: Proceedings of the 9th international conference on parallel processing and applied mathematics (PPAM 2010), Torun, Poland. Springer, Berlin, pp 206–215

  31. Kleinrock L, Korfhage W (1993) Collecting unused processing capacity: an analysis of transient distributed systems. IEEE Trans Parallel Distrib Syst 4(5):535–546

  32. Varia J (2011) Best practices in architecting cloud applications in the AWS cloud. Wiley, Hoboken, pp 459–490

  33. Hoelzle U, Barroso LA (2009) The datacenter as a computer: an introduction to the design of warehouse-scale machines. Morgan and Claypool Publishers, San Rafael

  34. Ostermann S, Iosup A, Yigitbasi N, Prodan R, Fahringer T, Epema D (2009) A performance analysis of EC2 Cloud computing services for scientific computing. In: Proceedings of the 1st international conference on cloud computing (CloudComp 2009), Beijing, China. Springer, Berlin, pp 115–131

  35. Calheiros RN, Ranjan R, Beloglazov A, De Rose CAF, Buyya R (2011) CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Softw Pract Exp 41(1):23–50

  36. Grimme C, Lepping J, Papaspyrou A (2008) Prospects of collaboration between compute providers by means of job interchange. In: 13th job scheduling strategies for parallel processing. Lecture notes in computer science, vol 4942. Springer, Berlin, pp 132–151

  37. Feitelson DG, Rudolph L, Schwiegelshohn U, Sevcik KC, Wong P (1997) Theory and practice in parallel job scheduling. In: Proceedings of the 3rd international workshop on job scheduling strategies for parallel processing (JSSPP’97), Seattle, WA. Springer, London, pp 1–34

  38. Iosup A, Li H, Jan M, Anoep S, Dumitrescu C, Wolters L, Epema DHJ (2008) The grid workloads archive. Future Gener Comput Syst 24(7):672–686

  39. Li H, Groep D, Wolters L (2004) Workload characteristics of a multi-cluster supercomputer. In: Proceedings of the 10th international workshop on job scheduling strategies for parallel processing (JSSPP’04), New York, USA. Springer, Berlin, pp 176–193

  40. Pandey S, Voorsluys W, Rahman M, Buyya R, Dobson JE, Chiu K (2009) A grid workflow environment for brain imaging analysis on distributed systems. Concurr Comput 21(16):2118–2139

  41. CloudHarmony, http://cloudharmony.com/

Acknowledgements

The authors would like to thank Jonatha Anselmi, Rodrigo N. Calheiros, Mohsen Amini, and Amir Vahid for useful discussions.

Author information

Corresponding author

Correspondence to Bahman Javadi.

About this article

Cite this article

Javadi, B., Thulasiraman, P. & Buyya, R. Enhancing performance of failure-prone clusters by adaptive provisioning of cloud resources. J Supercomput 63, 467–489 (2013). https://doi.org/10.1007/s11227-012-0826-2

  • DOI: https://doi.org/10.1007/s11227-012-0826-2
