Characterizing and Modeling for Proactive Disk Failure Prediction to Improve Reliability of Data Centers

  • Conference paper
Large-Scale Disk Failure Prediction (AI Ops 2020)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1261))

Abstract

In modern data centers, hard disk drives have the highest failure rate. Current storage systems provide data protection features to avoid data loss caused by disk failure. However, the data reconstruction process always slows down or even suspends system services. If disk failures can be predicted accurately, data protection mechanisms can be triggered before the failures actually happen. Disk failure prediction can therefore dramatically improve the reliability and availability of storage systems. This paper analyzes disk SMART data features in detail. Based on the analysis results, we design an effective feature extraction and preprocessing method, and we optimize XGBoost's hyperparameters. Finally, ensemble learning is applied to further improve prediction accuracy. Experimental results on the Alibaba data set show that our system predicts disk failures within 30 days, achieving an F-score of 39.98.
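To make the evaluation setting in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows one plausible way to label a SMART sample as positive when the disk fails within a 30-day horizon, to combine binary predictions from several models, and to compute an F-score. The function names, the simple majority vote (used here as a stand-in for the unspecified ensemble method), and the plain F1 definition are assumptions for illustration; the contest's exact metric and the paper's ensemble scheme are not given in this text.

```python
from datetime import date


def label_within_horizon(sample_day, failure_day, horizon_days=30):
    """Return 1 if the disk fails within `horizon_days` after `sample_day`, else 0.

    `failure_day` is None for disks that never failed (hypothetical convention).
    """
    if failure_day is None:
        return 0
    delta = (failure_day - sample_day).days
    return 1 if 0 <= delta <= horizon_days else 0


def majority_vote(predictions):
    """Combine binary predictions from several models by strict majority.

    `predictions` is a list of equal-length lists, one per model.
    This is an assumed, simple ensemble rule, not the paper's method.
    """
    return [1 if sum(votes) * 2 > len(votes) else 0 for votes in zip(*predictions)]


def f_score(y_true, y_pred):
    """Compute the F1 score (harmonic mean of precision and recall)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Three hypothetical models voting on three disks.
    combined = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
    print(combined)
    print(f_score([1, 0, 1], combined))
```

In practice each model's predictions would come from a trained classifier such as XGBoost; the lists above are placeholder values chosen only to exercise the functions.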

Acknowledgements

We thank Alibaba and PAKDD for organizing the PAKDD2020 Alibaba Intelligent Operation and Maintenance Algorithm Contest, which gave us valuable training data sets. The competition also gave us the opportunity to communicate with experts. We especially thank Inspur and its leaders for their support.

Author information

Corresponding author

Correspondence to Qiang Li.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Wang, T., Liang, X., Xie, Q., Li, Q., Li, H., Zhang, K. (2020). Characterizing and Modeling for Proactive Disk Failure Prediction to Improve Reliability of Data Centers. In: He, C., Feng, M., Lee, P., Wang, P., Han, S., Liu, Y. (eds) Large-Scale Disk Failure Prediction. AI Ops 2020. Communications in Computer and Information Science, vol 1261. Springer, Singapore. https://doi.org/10.1007/978-981-15-7749-9_12

  • DOI: https://doi.org/10.1007/978-981-15-7749-9_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-7748-2

  • Online ISBN: 978-981-15-7749-9

  • eBook Packages: Computer Science, Computer Science (R0)
