
Failure prediction using machine learning in a virtualised HPC system and application


Failure is an increasingly important issue in high performance computing and cloud systems. As large-scale systems continue to grow in scale and complexity, mitigating the impact of failure and providing accurate predictions with sufficient lead time remains a challenging research problem. Traditional fault-tolerance strategies such as regular check-pointing and replication are not adequate because of the emerging complexities of high performance computing systems. This calls for an effective, proactive failure management approach that minimizes the effect of failure within the system. Because machine learning techniques can learn from past information to predict future patterns of behaviour, they make it possible to predict potential system failure more accurately. Thus, in this paper, we explore the predictive abilities of machine learning by applying a number of algorithms to improve the accuracy of failure prediction. We have developed a failure prediction model using time series and machine learning, and performed comparison-based tests on the prediction accuracy. The primary algorithms we considered are the support vector machine (SVM), random forest (RF), k-nearest neighbors (KNN), classification and regression trees (CART) and linear discriminant analysis (LDA). Experimental results indicate that our model using SVM achieves an average failure prediction accuracy of 90% and outperforms the other algorithms considered. This finding suggests that our method can effectively predict future system and application failures.
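The general idea of combining a time series with a classifier can be illustrated with a minimal sketch. It is not the authors' model: it uses only KNN (one of the five algorithms named above), a plain-Python implementation, and a synthetic trace. The function names, the window width, the lead time, and the `synthetic_trace` generator are all hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def make_windows(series, failures, width, lead):
    """Turn a metric series into (window, label) pairs.

    The label is 1 if a failure occurs within `lead` steps after the
    window ends, so the classifier learns to warn ahead of time.
    """
    samples = []
    for end in range(width, len(series) - lead + 1):
        window = series[end - width:end]
        label = int(any(failures[end:end + lead]))
        samples.append((window, label))
    return samples


def knn_predict(train, query, k=3):
    """Majority vote among the k training windows closest to `query`."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test windows whose label is predicted correctly."""
    hits = sum(knn_predict(train, w, k) == y for w, y in test)
    return hits / len(test)


def synthetic_trace(n, seed=0):
    """Hypothetical load metric that ramps up before each failure."""
    rng = random.Random(seed)
    series, failures = [], []
    level = 0.2
    for _ in range(n):
        level += rng.uniform(0.0, 0.05)
        failed = level > 1.0
        series.append(level + rng.gauss(0, 0.01))
        failures.append(failed)
        if failed:
            level = 0.2  # assume the node is restarted after a failure
    return series, failures


if __name__ == "__main__":
    series, failures = synthetic_trace(600)
    samples = make_windows(series, failures, width=5, lead=3)
    split = int(0.7 * len(samples))  # chronological split, no shuffling
    train, test = samples[:split], samples[split:]
    print(f"KNN accuracy: {accuracy(train, test):.2f}")
```

The chronological train/test split is deliberate: shuffling windows from a time series would leak future information into training. Accuracy on this synthetic trace says nothing about accuracy on real failure logs.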






The authors would like to thank the anonymous reviewers for their useful reviews, which improved the quality of this paper. We would also like to thank Bill Kramer and Akbar Mokhtarani from NERSC for collecting the data and sharing it. One of the authors, Bashir Mohammed, is a Petroleum Technology Development Fund (PTDF) scholar. We would like to express our sincere gratitude to PTDF for its funding support under the OSS scheme with Grant Number (PTDF/E/OSS/PHD/MB/651/14).

Author information



Corresponding author

Correspondence to Bashir Mohammed.


About this article


Cite this article

Mohammed, B., Awan, I., Ugail, H. et al. Failure prediction using machine learning in a virtualised HPC system and application. Cluster Comput 22, 471–485 (2019).



  • Failure
  • Machine learning
  • High performance computing
  • Cloud computing