Abstract
In modern datacenters, hard disk drives have the highest failure rate. Current storage systems include data protection features to avoid data loss caused by disk failure. However, the data reconstruction process always slows down or even suspends system services. If disk failures can be predicted accurately, data protection mechanisms can be triggered before the failures actually happen. Disk failure prediction can therefore dramatically improve the reliability and availability of storage systems. This paper analyzes the features of disk SMART data in detail. Based on the analysis results, we design an effective feature extraction and preprocessing method and optimize the hyperparameters of XGBoost. Finally, ensemble learning is applied to further improve prediction accuracy. Experimental results on the Alibaba data set show that our system can predict disk failures within 30 days and achieves an F-score of 39.98.
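As a rough illustration of the evaluation idea in the abstract, the following minimal sketch labels daily SMART samples as positive when the disk fails within a 30-day horizon and computes an F-score from predictions. The function names `label_samples` and `f_score`, the per-sample labeling rule, and the sample dates are assumptions made for illustration; the abstract does not specify the paper's labeling scheme or the contest's exact metric definition.

```python
from datetime import date, timedelta


def label_samples(sample_days, failure_day, horizon_days=30):
    """Label a daily sample 1 if the disk fails within horizon_days after it.

    Hypothetical labeling rule: samples taken after the failure day, or more
    than horizon_days before it, are labeled 0. A disk that never fails
    (failure_day is None) gets only 0 labels.
    """
    if failure_day is None:
        return [0] * len(sample_days)
    horizon = timedelta(days=horizon_days)
    return [
        1 if timedelta(0) <= failure_day - day <= horizon else 0
        for day in sample_days
    ]


def f_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (F1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative dates only; not taken from the Alibaba data set.
days = [date(2018, 1, 1), date(2018, 1, 20), date(2018, 2, 25)]
labels = label_samples(days, failure_day=date(2018, 2, 10))
score = f_score([1, 1, 0, 0], [1, 0, 1, 0])
```

Here only the second sample falls inside the 30-day window before the assumed failure date, so it is the only positive label. A real pipeline would feed such labels, together with extracted SMART features, into the classifier.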
Acknowledgements
We thank Alibaba and PAKDD for organizing the PAKDD2020 Alibaba Intelligent Operation and Maintenance Algorithm Contest, which gave us valuable training data sets. The competition also gave us the opportunity to communicate with experts. We especially thank Inspur and its leaders for their support.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, T., Liang, X., Xie, Q., Li, Q., Li, H., Zhang, K. (2020). Characterizing and Modeling for Proactive Disk Failure Prediction to Improve Reliability of Data Centers. In: He, C., Feng, M., Lee, P., Wang, P., Han, S., Liu, Y. (eds) Large-Scale Disk Failure Prediction. AI Ops 2020. Communications in Computer and Information Science, vol 1261. Springer, Singapore. https://doi.org/10.1007/978-981-15-7749-9_12
DOI: https://doi.org/10.1007/978-981-15-7749-9_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-7748-2
Online ISBN: 978-981-15-7749-9
eBook Packages: Computer Science, Computer Science (R0)