Characterizing and Modeling for Proactive Disk Failure Prediction to Improve Reliability of Data Centers

Wang, Tuanjie; Liang, Xinhui; Xie, Quanquan; Li, Qiang; Li, Hui; Zhang, Kai

doi:10.1007/978-981-15-7749-9_12

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1261)

Included in the following conference series:

AI Ops Competition

Abstract

In modern data centers, hard disk drives have the highest failure rate. Current storage systems provide data protection features to avoid data loss caused by disk failures. However, the data reconstruction process always slows down or even suspends system services. If disk failures can be predicted accurately, data protection mechanisms can be triggered before the failures actually happen. Disk failure prediction can therefore dramatically improve the reliability and availability of storage systems. This paper analyzes disk SMART data features in detail. Based on the analysis results, we design an effective feature extraction and preprocessing method, and we optimize XGBoost's hyperparameters. Finally, ensemble learning is applied to further improve prediction accuracy. Experimental results on the Alibaba data set show that our system predicts disk failures within 30 days and achieves an F-score of 39.98.
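As a rough illustration of the last two steps, the sketch below averages failure probabilities from several models and scores the result with the standard F-score, the harmonic mean of precision and recall. The abstract does not say which ensemble scheme the authors used or how the competition computed its F-score, so the averaging rule, the 0.5 threshold, the function names, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean


def soft_vote(prob_lists):
    """Average per-disk failure probabilities across models (assumed ensemble rule)."""
    return [mean(probs) for probs in zip(*prob_lists)]


def f_score(y_true, y_pred):
    """Standard F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical failure probabilities from three models for five disks.
model_probs = [
    [0.9, 0.2, 0.6, 0.1, 0.4],
    [0.8, 0.3, 0.4, 0.2, 0.7],
    [0.7, 0.1, 0.8, 0.3, 0.1],
]
labels = [1, 0, 1, 0, 1]  # 1 = disk failed within the prediction window

averaged = soft_vote(model_probs)
preds = [1 if p >= 0.5 else 0 for p in averaged]  # assumed decision threshold
score = f_score(labels, preds)
print(preds, round(score, 4))
```

In this toy data the ensemble flags two of the three failing disks and raises no false alarms, so precision is 1 and recall is 2/3. A score reported on a 0 to 100 scale, as 39.98 in the abstract appears to be, would correspond to this value multiplied by 100.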

Acknowledgements

Thanks to Alibaba and PAKDD for organizing the PAKDD2020 Alibaba Intelligent Operation and Maintenance Algorithm Contest, which gave us precious training data sets. This competition also gave us the opportunity to communicate with experts. We especially thank Inspur and its leaders for their support.

Author information

Authors and Affiliations

State Key Laboratory of High-end Server and Storage Technology, Beijing, China
Tuanjie Wang, Xinhui Liang, Quanquan Xie, Qiang Li, Hui Li & Kai Zhang

Corresponding author

Correspondence to Qiang Li.

Editor information

Editors and Affiliations

Alibaba Group (China), Hangzhou, China
Cheng He
National University of Singapore, Singapore, Singapore
Mengling Feng
Chinese University of Hong Kong, Hong Kong, China
Patrick P. C. Lee
Xi'an Jiaotong University, Xi'an, China
Pinghui Wang
Chinese University of Hong Kong, Hong Kong, China
Shujie Han
Alibaba Group (China), Hangzhou, China
Yi Liu

Cite this paper

Wang, T., Liang, X., Xie, Q., Li, Q., Li, H., Zhang, K. (2020). Characterizing and Modeling for Proactive Disk Failure Prediction to Improve Reliability of Data Centers. In: He, C., Feng, M., Lee, P., Wang, P., Han, S., Liu, Y. (eds) Large-Scale Disk Failure Prediction. AI Ops 2020. Communications in Computer and Information Science, vol 1261. Springer, Singapore. https://doi.org/10.1007/978-981-15-7749-9_12

DOI: https://doi.org/10.1007/978-981-15-7749-9_12
Published: 06 August 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7748-2
Online ISBN: 978-981-15-7749-9
eBook Packages: Computer Science, Computer Science (R0)
