Abstract
Growing demand for software reliability requires developers to analyze many production logs under time pressure. Unfortunately, in large, complex systems some failures cannot be detected during the testing phase because they are specific to deployment, configuration parameters, non-deterministic system behaviour, and real-life user input. This article presents a novel, lightweight approach to failure diagnosis based on natural language processing techniques. The aim is to extract as much information as possible from the data available in the standard logs attached to a problem description. The approach uses unit test logs (test suites) to gather knowledge about the system. This knowledge is then used to analyze the production log and identify the test suites and the corresponding code blocks that most likely describe the runtime scenario. Experiments on Apache Hadoop HDFS and NOKIA systems show that the hints given by the framework help to locate the fault.
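The matching idea summarized above can be illustrated with a minimal sketch. The abstract does not say which NLP technique or similarity measure the framework uses, so this example assumes plain TF-IDF weighting with cosine similarity as a stand-in. The suite names, log lines, and function names below are all hypothetical and are not taken from the paper or from Hadoop.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase log text and keep alphabetic tokens only (drops ids and numbers)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return sparse TF-IDF vectors (dicts) for token lists, plus the IDF table."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF so that no weight is zero or negative.
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    vectors = [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_test_suites(test_logs, production_log):
    """Rank test suites by similarity of their logs to a production log."""
    names = list(test_logs)
    vectors, idf = tfidf_vectors([tokenize(test_logs[name]) for name in names])
    # Tokens never seen in any test log carry no knowledge and are ignored.
    counts = Counter(tok for tok in tokenize(production_log) if tok in idf)
    query = {tok: tf * idf[tok] for tok, tf in counts.items()}
    scores = [(name, cosine(query, vec)) for name, vec in zip(names, vectors)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical test-suite logs and production log, for illustration only.
test_logs = {
    "TestBlockReplication": (
        "INFO BlockManager: replicating block blk_1 to datanode dn1\n"
        "WARN BlockManager: replication of block timed out"
    ),
    "TestNameNodeStartup": (
        "INFO NameNode: loading fsimage\n"
        "INFO NameNode: startup complete safe mode off"
    ),
}
production_log = "WARN BlockManager: replication of block blk_77 timed out on datanode dn4"

for suite, score in rank_test_suites(test_logs, production_log):
    print(f"{suite}: {score:.3f}")
```

In this sketch the highest-ranked suite is the hint for where to look in the code. A real pipeline would also need log parsing, template extraction, and a mapping from test suites to code blocks, none of which are shown here.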
Acknowledgements
The authors appreciate the valuable comments provided by the anonymous reviewers. This work was supported by NOKIA and financed by the Polish Ministry of Education and Science. Funds were allocated from the “Implementation Doctorate” program.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Dobrowolski, W., Nikodem, M., Zawistowski, M., Unold, O. (2022). Improved Software Reliability Through Failure Diagnosis Based on Clues from Test and Production Logs. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) New Advances in Dependability of Networks and Systems. DepCoS-RELCOMEX 2022. Lecture Notes in Networks and Systems, vol 484. Springer, Cham. https://doi.org/10.1007/978-3-031-06746-4_5
DOI: https://doi.org/10.1007/978-3-031-06746-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06745-7
Online ISBN: 978-3-031-06746-4
eBook Packages: Intelligent Technologies and Robotics (R0)