Improved Software Reliability Through Failure Diagnosis Based on Clues from Test and Production Logs

  • Conference paper
New Advances in Dependability of Networks and Systems (DepCoS-RELCOMEX 2022)

Abstract

Growing demand for software reliability requires developers to analyze many production logs under time pressure. Unfortunately, in large complex systems some failures cannot be detected during the testing phase because they depend on the deployment, configuration parameters, non-deterministic system behaviour, and real-life user input. This article presents a novel, lightweight approach to failure diagnosis based on natural language processing techniques. The aim is to extract as much information as possible from the data available in standard logs attached to a problem description. The approach uses unit test logs (test suites) to gather knowledge about the system. This knowledge is then used to analyze the production log and determine the test suites and the corresponding code blocks that most likely describe the runtime scenario. Experiments on Apache Hadoop HDFS and NOKIA systems show that the hints given by the framework help locate the fault.
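
The idea of matching a production log against knowledge gathered from test-suite logs can be illustrated with a minimal sketch. The abstract does not specify the text representation or the matching step, so the TF-IDF weighting and cosine similarity used here are assumptions chosen for illustration, not the authors' method. The suite names (`suite_replication`, `suite_lease`) and all log lines are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a log and keep alphabetic tokens, dropping numbers and IDs."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per tokenized document, plus the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so terms present in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_test_suites(suite_logs, production_log):
    """Rank test suites by how closely their logs resemble the production log."""
    names = list(suite_logs)
    vectors, idf = tfidf_vectors([tokenize(suite_logs[name]) for name in names])
    counts = Counter(tokenize(production_log))
    # Terms never seen in any test log have no learned weight and are ignored.
    prod_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [(name, cosine(vec, prod_vec)) for name, vec in zip(names, vectors)]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Invented logs, for illustration only.
SUITE_LOGS = {
    "suite_replication": (
        "INFO BlockManager: replicating block blk_1 to datanode\n"
        "WARN BlockManager: block blk_1 is under replicated"
    ),
    "suite_lease": (
        "INFO LeaseManager: lease recovery started for file\n"
        "INFO LeaseManager: lease released"
    ),
}
PRODUCTION_LOG = "WARN BlockManager: block blk_77 is under replicated, replicating to datanode"

ranking = rank_test_suites(SUITE_LOGS, PRODUCTION_LOG)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy input the production log shares no tokens with the lease suite, so that suite scores zero and the replication suite is ranked first. A realistic setup would also need log parsing to separate templates from parameters, which this sketch only approximates by discarding digits.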

Notes

  1. https://github.com/dobrowol/defects4all.

Acknowledgements

The authors appreciate the valuable comments provided by the anonymous reviewers. This work was supported by NOKIA and financed by the Polish Ministry of Education and Science. Funds were allocated from the “Implementation Doctorate” program.

Author information

Corresponding author

Correspondence to Wojciech Dobrowolski.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Dobrowolski, W., Nikodem, M., Zawistowski, M., Unold, O. (2022). Improved Software Reliability Through Failure Diagnosis Based on Clues from Test and Production Logs. In: Zamojski, W., Mazurkiewicz, J., Sugier, J., Walkowiak, T., Kacprzyk, J. (eds) New Advances in Dependability of Networks and Systems. DepCoS-RELCOMEX 2022. Lecture Notes in Networks and Systems, vol 484. Springer, Cham. https://doi.org/10.1007/978-3-031-06746-4_5
