A methodology to assess the availability of next-generation data centers

  • Daniel Rosendo
  • Demis Gomes
  • Guto Leoni Santos
  • Glauco Goncalves
  • Andre Moreira
  • Leylane Ferreira
  • Patricia Takako Endo
  • Judith Kelner
  • Djamel Sadok
  • Amardeep Mehta
  • Mattias Wildeman


Cloud data center providers benefit from software-defined infrastructure because it promotes flexibility, automation, and scalability. This new paradigm helps address the current challenges of managing a large-scale infrastructure and of guaranteeing service level agreements with established availability levels. Assessing the availability of a data center remains a complex task, as it requires gathering information about a complex infrastructure and generating accurate models to estimate its availability. This paper addresses this gap by proposing a methodology that automatically acquires the data center hardware configuration and assesses its availability through models. The proposed methodology leverages the emerging standardized Redfish API and relevant modeling frameworks. Using this approach, we analyzed the availability benefits of migrating from a conventional data center infrastructure (named Performance Optimization Data center (POD), with redundant servers) to a next-generation virtual Performance Optimized Data center (named virtual POD (vPOD), composed of a pool of disaggregated hardware resources). Results show that vPOD improves availability compared to conventional data center configurations.
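
As a rough illustration of what model-based availability assessment involves, the sketch below uses the textbook steady-state formula A = MTTF / (MTTF + MTTR) together with series and k-out-of-n compositions of independent components. It is not the model used in the paper; the function names and the MTTF/MTTR figures are hypothetical and chosen only for illustration.

```python
from math import comb


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Textbook steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series(*availabilities: float) -> float:
    """Series composition: every independent component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def k_out_of_n(k: int, n: int, a: float) -> float:
    """At least k of n identical, independent components must be up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical figures (assumptions for illustration, not taken from the paper).
server = steady_state_availability(mttf_hours=8760.0, mttr_hours=8.0)
single = server                            # one server, no redundancy
redundant_pair = k_out_of_n(1, 2, server)  # 1-out-of-2 redundant servers
```

Under the independence assumption, adding a redundant component can only raise the computed availability, which is the kind of comparison a model-based assessment makes between configurations.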


Next-generation data center; Redfish standard; Hardware disaggregation; vPOD; Availability; Sensitivity analysis



This work was supported by the Research, Development and Innovation Center, Ericsson Telecomunicações S.A., Brazil. The authors would like to thank Carolina Cani for her support with our images.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Daniel Rosendo (1)
  • Demis Gomes (1)
  • Guto Leoni Santos (1)
  • Glauco Goncalves (2)
  • Andre Moreira (1)
  • Leylane Ferreira (1)
  • Patricia Takako Endo (3, corresponding author)
  • Judith Kelner (1)
  • Djamel Sadok (1)
  • Amardeep Mehta (4)
  • Mattias Wildeman (4)
  1. Universidade Federal de Pernambuco (UFPE), Recife, Brazil
  2. Universidade Federal Rural de Pernambuco (UFRPE), Recife, Brazil
  3. Universidade de Pernambuco (UPE), Recife, Brazil
  4. Ericsson Research, Stockholm, Sweden
