Abstract
Cloud data center providers benefit from software-defined infrastructure because it promotes flexibility, automation, and scalability. This new paradigm helps address the current management challenges of large-scale infrastructure and helps guarantee service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper addresses this gap by proposing a methodology that automatically acquires the data center hardware configuration and assesses its availability through models. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Using this approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (named Performance Optimized Data center (POD), with redundant servers) to a next-generation virtual Performance Optimized Data center (named virtual POD (vPOD), composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
Notes
A system is considered highly available if it achieves five 9’s of availability (99.999%), meaning that its downtime is only about 5.26 min per year.
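The downtime figure in this note follows from simple arithmetic on the availability level. A minimal sketch of that conversion is given below; the function name `annual_downtime_minutes` and the 365-day year are assumptions made for illustration and are not taken from the paper.

```python
# Assumption: a non-leap year of 365 days, i.e. 525,600 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Return the expected downtime per year, in minutes.

    `availability` is the steady-state availability as a fraction
    between 0 and 1 (for example, 0.99999 for five 9's).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # Unavailability (1 - A) is the fraction of time the system is down.
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for level in (0.999, 0.9999, 0.99999):
        minutes = annual_downtime_minutes(level)
        print(f"A = {level}: {minutes:.2f} min of downtime per year")
```

For five 9's, the unavailability is 0.00001, and 0.00001 × 525,600 min gives roughly 5.26 min per year, which matches the figure quoted in the note.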
Acknowledgements
This work was supported by the Research, Development and Innovation Center, Ericsson Telecomunicações S.A., Brazil. The authors would like to thank Carolina Cani for her support with our images.
Cite this article
Rosendo, D., Gomes, D., Santos, G.L. et al. A methodology to assess the availability of next-generation data centers. J Supercomput 75, 6361–6385 (2019). https://doi.org/10.1007/s11227-019-02852-3
DOI: https://doi.org/10.1007/s11227-019-02852-3