Cluster Computing, Volume 19, Issue 2, pp 865–878

Towards operator-less data centers through data-driven, predictive, proactive autonomics

  • Alina Sîrbu
  • Ozalp Babaoglu


Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools. At that point, human intervention will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster, with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of job executions that would otherwise have been wasted due to failures, by redirecting them to other nodes when a failure of the present node is predicted. We discuss the feasibility of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs).
All of the scripts used for BigQuery and classification analyses are publicly available on GitHub.
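To make the evaluation metrics quoted above concrete, the following is a minimal sketch of how a decision threshold might be chosen so that the false positive rate stays within a 5% limit, and how the true positive rate and precision follow from that choice. It is not taken from the published scripts: the function names, the toy scores, and the labels are hypothetical, and the scores merely stand in for the per-node failure scores that a classifier such as the ensemble described above would produce.

```python
from typing import List, Tuple


def threshold_for_fpr(scores: List[float], labels: List[int], max_fpr: float) -> float:
    """Return the lowest threshold whose false positive rate is at most max_fpr.

    A node is flagged as "predicted to fail" when its score >= threshold.
    Labels: 1 = node failed in the prediction window, 0 = node did not fail.
    """
    negatives = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    # Number of non-failing nodes we are allowed to flag by mistake.
    allowed = int(max_fpr * len(negatives) + 1e-9)
    if allowed >= len(negatives):
        return min(scores)
    # The threshold must lie strictly above the (allowed + 1)-th highest negative score.
    cutoff = negatives[allowed]
    higher = [s for s in scores if s > cutoff]
    return min(higher) if higher else float("inf")


def rates(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float, float]:
    """Compute (false positive rate, true positive rate, precision) at a threshold."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if fp + tn else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return fpr, tpr, precision


# Hypothetical toy data: 20 nodes that did not fail and 4 nodes that did.
neg_scores = [0.05, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.25, 0.28, 0.30,
              0.32, 0.35, 0.38, 0.40, 0.42, 0.45, 0.50, 0.55, 0.60, 0.80]
pos_scores = [0.30, 0.65, 0.70, 0.90]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

threshold = threshold_for_fpr(scores, labels, max_fpr=0.05)
fpr, tpr, precision = rates(scores, labels, threshold)
print(f"threshold={threshold:.2f} FPR={fpr:.2f} TPR={tpr:.2f} precision={precision:.2f}")
```

Because failures are rare compared with normal operation, precision can remain well below the true positive rate even at a low false positive rate, which is why the abstract reports both figures.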


Data science, Predictive analytics, Google cluster trace, Log data analysis, Failure prediction, Machine learning classification, Ensemble classifier, Random forest, BigQuery



BigQuery analysis was carried out through a generous Cloud Credits grant from Google. We are grateful to John Wilkes of Google for helpful discussions regarding the cluster trace data.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Computer Science, University of Pisa, Pisa, Italy
  2. Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
