To Fail or Not to Fail: Predicting Hard Disk Drive Failure Time Windows

  • Conference paper
Measurement, Modelling and Evaluation of Computing Systems (MMB 2020)

Abstract

Due to the increasing size of today’s data centers and the expectation of 24/7 availability, administering hardware is becoming ever more complex. Techniques such as the Self-Monitoring, Analysis, and Reporting Technology (S.M.A.R.T.) support hardware monitoring. However, these techniques often lack algorithms for intelligent data analytics. In particular, integrating machine learning to identify potential failures in advance appears promising for reducing administration overhead. In this work, we present three machine learning approaches to (i) identify imminent failures, (ii) predict time windows for failures, and (iii) predict the exact time-to-failure. In a case study with real data from 369 hard disks, we achieve F1-scores of up to 98.0% and 97.6% for predicting potential failures with two and multiple time windows, respectively, and a hit rate of 84.9% (with a mean absolute error of 4.5 h) for predicting the time-to-failure.
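The abstract names three prediction targets (imminent failure, failure time window, exact time-to-failure) and reports the F1-score and the mean absolute error. The following Python sketch shows one possible way to turn a per-record time-to-failure into labels for the first two tasks and how the two metrics are computed. The window bounds, the function names, and the convention that `None` marks a drive that never failed are hypothetical illustrations, not the paper's configuration. The hit-rate metric is left out because its exact definition is not given on this page.

```python
"""Sketch: labels for failure prediction tasks and two evaluation metrics.

ASSUMPTIONS (hypothetical, not from the paper): the window bounds in hours,
the function names, and `None` meaning "this drive never failed".
"""
from bisect import bisect_left

# Hypothetical upper bounds (in hours) of the failure time windows.
WINDOW_BOUNDS_H = [24, 72, 168]


def binary_label(hours_to_failure, horizon_h=WINDOW_BOUNDS_H[-1]):
    """Task (i): 1 if a failure occurs within the horizon, else 0."""
    return int(hours_to_failure is not None and hours_to_failure <= horizon_h)


def window_label(hours_to_failure, bounds=WINDOW_BOUNDS_H):
    """Task (ii): index of the window (prev_bound, bound] containing the failure.

    Returns len(bounds) for drives that do not fail within the last window.
    """
    if hours_to_failure is None:
        return len(bounds)
    return bisect_left(bounds, hours_to_failure)


# Task (iii) regresses hours_to_failure directly, so it needs no label mapping.


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation, e.g. in hours for time-to-failure."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, `f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` gives 2/3 (two true positives, one false positive, one false negative). How the paper averages the F1-score over multiple time windows is not stated here, so the sketch only covers a single positive class.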

Notes

  1. Weak learners are classification methods whose predictions correlate only weakly with the true classification, whereas strong learners correlate closely with it.

  2. Sampling with replacement means that an instance can be selected multiple times in the same sample.

  3. Handling multiple classes is not strictly required when comparing data preparation techniques for binary classification. However, it is necessary for the multi-class classification in Sect. 4.2 and to keep the approaches comparable.

  4. This does not apply if the data set is large enough to remain sufficiently large after undersampling.

  5. HDD data set: http://dsp.ucsd.edu/~jfmurray/software.htm.
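Notes 2 and 4 refer to sampling with replacement and to undersampling. The minimal Python sketch below illustrates both ideas, assuming plain random undersampling of every class down to the size of the smallest class; this is a hypothetical choice, since the chapter's exact data preparation procedure is not shown on this page. It also shows why note 4 matters: 5 failing and 95 healthy records shrink to only 10 records.

```python
"""Sketch: sampling with replacement vs. random undersampling.

ASSUMPTION (hypothetical): undersampling reduces every class to the size of
the smallest class by random selection; the function names are illustrative.
"""
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items *with* replacement: an item may appear several times."""
    return [rng.choice(data) for _ in range(len(data))]


def random_undersample(records, labels, rng):
    """Randomly drop records of larger classes until all classes are equally large."""
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    n_min = min(len(recs) for recs in by_class.values())
    out_records, out_labels = [], []
    for lab, recs in by_class.items():
        for rec in rng.sample(recs, n_min):  # sampling *without* replacement
            out_records.append(rec)
            out_labels.append(lab)
    return out_records, out_labels


if __name__ == "__main__":
    labels = [1] * 5 + [0] * 95  # 5 failing drives, 95 healthy drives
    _, kept = random_undersample(list(range(100)), labels, random.Random(1))
    print(Counter(kept))  # 5 records per class remain
```

Undersampling balances the classes without synthesizing data, but it discards most of the majority class, so it only leaves a usable training set when the original data set is large.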

Acknowledgements

This work was co-funded by the German Research Foundation (DFG) under grant No. KO 3445/11-1 and the IHK (Industrie- und Handelskammer, Chamber of Industry and Commerce) Würzburg-Schweinfurt.

Author information

Corresponding author

Correspondence to Marwin Züfle.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Züfle, M., Krupitzer, C., Erhard, F., Grohmann, J., Kounev, S. (2020). To Fail or Not to Fail: Predicting Hard Disk Drive Failure Time Windows. In: Hermanns, H. (ed.) Measurement, Modelling and Evaluation of Computing Systems. MMB 2020. Lecture Notes in Computer Science, vol 12040. Springer, Cham. https://doi.org/10.1007/978-3-030-43024-5_2

  • DOI: https://doi.org/10.1007/978-3-030-43024-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-43023-8

  • Online ISBN: 978-3-030-43024-5

  • eBook Packages: Computer Science, Computer Science (R0)