
Towards operator-less data centers through data-driven, predictive, proactive autonomics

Abstract

Continued reliance on human operators for managing data centers is a major impediment preventing them from ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools; at that point, human intervention will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster, with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether nodes will fail in a future 24-h window. Our evaluation reveals that if we limit false positive rates to 5 %, we can achieve true positive rates between 27 and 88 %, with precision varying between 50 and 72 %. This level of performance allows us to recover a large fraction of job executions (by redirecting them to other nodes when a failure of the current node is predicted) that would otherwise have been wasted due to failures. We discuss the feasibility of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for the BigQuery and classification analyses are publicly available on GitHub at https://github.com/alinasirbu/google_cluster_failure_prediction.
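
To make the approach concrete, here is a minimal sketch, in Python with scikit-learn, of the kind of pipeline the abstract describes: an ensemble of Random Forest classifiers whose averaged failure probabilities are thresholded so that the false positive rate stays at or below 5 %. The synthetic data, the balanced-subsampling scheme for handling class imbalance, and all hyperparameters below are illustrative assumptions, not the authors' actual pipeline (their real scripts are at the GitHub link above).

```python
# Sketch: ensemble of Random Forests predicting node failure in the next
# 24-hour window, with the decision threshold chosen to cap FPR at 5 %.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: rows are node-time windows, X holds log-derived
# features, y marks whether the node fails within the next 24 hours.
X = rng.normal(size=(5000, 20))
y = (rng.random(5000) < 0.05).astype(int)  # failures are rare (imbalanced)
X[y == 1] += 1.0  # give failures a weak signal so the sketch learns something

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Train several Random Forests, each on a balanced subsample that keeps all
# failure examples plus an equal-sized draw of non-failures -- one common
# way to handle class imbalance (an assumption, not necessarily the
# authors' scheme).
fail_idx = np.flatnonzero(y_train == 1)
ok_idx = np.flatnonzero(y_train == 0)
ensemble = []
for seed in range(10):
    sub_ok = rng.choice(ok_idx, size=len(fail_idx), replace=False)
    idx = np.concatenate([fail_idx, sub_ok])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_train[idx], y_train[idx])
    ensemble.append(clf)

# Combine members by averaging their predicted failure probabilities.
proba = np.mean([c.predict_proba(X_test)[:, 1] for c in ensemble], axis=0)

# Pick the largest-recall threshold whose false positive rate is <= 5 %.
fpr, tpr, thresholds = roc_curve(y_test, proba)
threshold = thresholds[fpr <= 0.05][-1]
pred = (proba >= threshold).astype(int)

print(f"TPR: {recall_score(y_test, pred):.2f}  "
      f"precision: {precision_score(y_test, pred):.2f}")
```

Averaging member probabilities and then thresholding on the ROC curve is one standard way to trade true positives against a fixed false-positive budget; at the 5 % operating point the paper reports true positive rates between 27 and 88 %.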


Notes

  1. Delfina Eberly, Director of Data Center Operations at Facebook, speaking on “Operations at Scale” at the 7×24 Exchange 2013 Fall Conference.

  2. Based on current Google BigQuery pricing.


Acknowledgments

BigQuery analysis was carried out through a generous Cloud Credits grant from Google. We are grateful to John Wilkes of Google for helpful discussions regarding the cluster trace data.

Author information

Correspondence to Alina Sîrbu.


Cite this article

Sîrbu, A., Babaoglu, O. Towards operator-less data centers through data-driven, predictive, proactive autonomics. Cluster Comput 19, 865–878 (2016). https://doi.org/10.1007/s10586-016-0564-y

