Abstract
Log-based anomaly detection identifies systems’ anomalous behaviors by analyzing system runtime information recorded in logs. While many approaches have been proposed, all of them have in common an essential pre-processing step called log parsing. This step is needed because automated log analysis requires structured input logs, whereas original logs contain semi-structured text printed by logging statements. Log parsing bridges this gap by converting the original logs into structured input logs fit for anomaly detection.
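To make the log parsing step concrete, the following minimal Python sketch converts semi-structured log messages into templates by masking their variable parts. The masking rules, the `<*>` placeholder, and the names `MASKS`, `parse_message`, and `parse_log` are illustrative assumptions, not the approach of any particular parser discussed in this paper.

```python
import re

# Hypothetical masking rules: which token shapes count as "variable" parts
# is an assumption of this sketch, not a rule taken from the paper.
MASKS = [
    re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"),  # IPv4-like addresses
    re.compile(r"\b0x[0-9a-fA-F]+\b"),      # hexadecimal identifiers
    re.compile(r"\b\d+\b"),                 # plain integers
]

PLACEHOLDER = "<*>"


def parse_message(message: str) -> str:
    """Abstract one log message into a template by masking variable parts."""
    template = message
    for pattern in MASKS:
        template = pattern.sub(PLACEHOLDER, template)
    return template


def parse_log(messages):
    """Turn a sequence of raw messages into a sequence of templates."""
    return [parse_message(m) for m in messages]


log = [
    "Connected to 10.0.0.5 on port 8080",
    "Connected to 10.0.0.7 on port 9090",
    "Disk usage at 91 percent",
]
templates = parse_log(log)
for raw, template in zip(log, templates):
    print(f"{raw!r} -> {template!r}")
```

In this sketch, the first two messages collapse to the same template, which is what lets a downstream anomaly detector treat them as occurrences of the same event.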
Despite the intrinsic dependency between log parsing and anomaly detection, no existing work has investigated the impact of the “quality” of log parsing results on anomaly detection. In particular, the concept of “ideal” log parsing results with respect to anomaly detection has not been formalized yet. This makes it difficult to determine, upon obtaining inaccurate results from anomaly detection, if (and why) the root cause for such results lies in the log parsing step.
In this short paper, we lay the theoretical foundations for defining the concept of “ideal” log parsing results for anomaly detection. Based on these foundations, we discuss practical implications regarding the identification and localization of root causes, when dealing with inaccurate anomaly detection, and the identification of irrelevant log messages.
This work has received funding from the Celtic-Next project CRITISEC and NSERC of Canada under the Discovery and CRC programs. Donghwan Shin was partially supported by the Basic Science Research Programme through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2019R1A6A3A03033444).
Notes
1. In general, logs may contain extra information, such as timestamps and logging levels (e.g., info, debug) for individual log messages. However, we omit such information since log parsing deals with log messages characterizing the states or events of the system.
2. This is because each distinct \(\tau (m)\) of a message m that appears in L can lead to one or more dimensions. In ML, dimensionality reduction is an essential topic to improve predictive power [1].
3. Though the length of logs can be reduced in a pre-processing step by omitting certain messages or events based on domain knowledge, this is independent of log parsing, which just abstracts messages.
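The link between parsing results and dimensionality mentioned in note 2 can be illustrated with a small Python sketch. It assumes that \(\tau (m)\) denotes the template assigned to message m and that a log is encoded as a vector of template counts; both the encoding and the names `count_vector`, `fine_parse`, and `coarse_parse` are hypothetical choices made for illustration only.

```python
from collections import Counter


def count_vector(templates, vocabulary):
    """Encode a parsed log as one count per distinct template (one dimension each)."""
    counts = Counter(templates)
    return [counts[t] for t in vocabulary]


# Two hypothetical parsing results for the same four-message log.
fine_parse = ["open <*>", "read <*>", "read <*>", "close <*>"]
coarse_parse = ["<*> <*>", "<*> <*>", "<*> <*>", "<*> <*>"]

fine_vocab = sorted(set(fine_parse))
coarse_vocab = sorted(set(coarse_parse))

fine_vector = count_vector(fine_parse, fine_vocab)
coarse_vector = count_vector(coarse_parse, coarse_vocab)

print(len(fine_vector), fine_vector)
print(len(coarse_vector), coarse_vector)
```

Under these assumptions, the number of distinct templates produced by the parser directly sets the number of dimensions the anomaly detector has to work with.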
References
Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. The MIT Press (2012), https://dl.acm.org/doi/book/10.5555/3360093
Aigner, M.: A characterization of the bell numbers. Discret. Math. 205(1), 207–210 (1999). https://doi.org/10.1016/S0012-365X(99)00108-9
Chandola, V., Banerjee, A., Kumar, V.: Anomaly detection: a survey. ACM Comput. Surv. 41(3), 1–58 (2009). https://doi.org/10.1145/1541880.1541882
Dai, H., Li, H., Chen, C.S., Shang, W., Chen, T.: Logram: efficient log parsing using n-gram dictionaries. IEEE Trans. Softw. Eng. 1 (2020). https://doi.org/10.1109/TSE.2020.3007554
Du, M., Li, F.: Spell: streaming parsing of system event logs. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 859–864. IEEE, Barcelona, Spain (2016)
El-Masri, D., Petrillo, F., Guéhéneuc, Y.G., Hamou-Lhadj, A., Bouziane, A.: A systematic literature review on automated log abstraction techniques. Inf. Softw. Technol. 122, 106276 (2020). https://doi.org/10.1016/j.infsof.2020.106276
Hamooni, H., Debnath, B., Xu, J., Zhang, H., Jiang, G., Mueen, A.: Logmine: fast pattern recognition for log analytics. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1573–1582. Association for Computing Machinery, Indianapolis, IN, USA (2016)
He, P., Zhu, J., Zheng, Z., Lyu, M.R.: Drain: an online log parsing approach with fixed depth tree. In: 2017 IEEE International Conference on Web Services (ICWS), pp. 33–40. IEEE, Honolulu, HI, USA (2017)
He, S., He, P., Chen, Z., Yang, T., Su, Y., Lyu, M.R.: A survey on automated log analysis for reliability engineering. CoRR abs/2009.07237 (2020). https://arxiv.org/abs/2009.07237
He, S., Zhu, J., He, P., Lyu, M.R.: Loghub: a large collection of system log datasets towards automated log analytics (2020)
Jiang, Z.M., Hassan, A.E., Flora, P., Hamann, G.: Abstracting execution logs to execution events for enterprise applications. In: 2008 The Eighth International Conference on Quality Software, pp. 181–186. IEEE, Oxford, UK (2008)
Liu, Z., Xia, X., Lo, D., Xing, Z., Hassan, A.E., Li, S.: Which variables should i log? IEEE Trans. Softw. Eng. 47(9), 2012–2031 (2019). https://doi.org/10.1109/TSE.2019.2941943
Luke, S.: Essentials of Metaheuristics. Lulu, second edn. (2013), available for free at http://cs.gmu.edu/~sean/book/metaheuristics/
Makanju, A.A., Zincir-Heywood, A.N., Milios, E.E.: Clustering event logs using iterative partitioning. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1255–1264. Association for Computing Machinery, New York, NY, USA (2009)
Messaoudi, S., Panichella, A., Bianculli, D., Briand, L., Sasnauskas, R.: A search-based approach for accurate identification of log message formats. In: 2018 IEEE/ACM 26th International Conference on Program Comprehension (ICPC), pp. 167–16710. Association for Computing Machinery, Gothenburg, Sweden (2018)
Mizutani, M.: Incremental mining of system log format. In: 2013 IEEE International Conference on Services Computing, pp. 595–602. IEEE, Santa Clara, CA, USA (2013)
Nagappan, M., Vouk, M.A.: Abstracting log lines to log event types for mining software system logs. In: 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), pp. 114–117. IEEE, Cape Town, South Africa (2010)
Shima, K.: Length matters: clustering system log messages using length of words (2016)
Tang, L., Li, T., Perng, C.S.: Logsig: Generating system events from raw textual logs. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 785–794. ACM, New York, NY, USA (2011)
Vaarandi, R., Pihelgas, M.: Logcluster - a data clustering and pattern mining algorithm for event logs. In: 2015 11th International Conference on Network and Service Management (CNSM), pp. 1–7. IEEE, Barcelona, Spain (2015). https://doi.org/10.1109/CNSM.2015.7367331
Vaarandi, R.: A data clustering algorithm for mining patterns from event logs. In: Proceedings of the 3rd IEEE Workshop on IP Operations & Management (IPOM 2003)(IEEE Cat. No. 03EX764), pp. 119–126. IEEE, Kansas City, MO, USA (2003)
Yuan, D., Park, S., Huang, P., Liu, Y., Lee, M.M., Tang, X., Zhou, Y., Savage, S.: Be conservative: enhancing failure diagnosis with proactive logging. In: 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pp. 293–306. USENIX Association, Hollywood, CA (October 2012). https://www.usenix.org/conference/osdi12/technical-sessions/presentation/yuan
Yuan, D., Zheng, J., Park, S., Zhou, Y., Savage, S.: Improving software diagnosability via log enhancement. ACM Trans. Comput. Syst. 30(1), 1–28 (2012). https://doi.org/10.1145/2110356.2110360
Zhao, X., Rodrigues, K., Luo, Y., Stumm, M., Yuan, D., Zhou, Y.: Log20: fully automated optimal placement of log printing statements under specified overhead threshold. In: Proceedings of the 26th Symposium on Operating Systems Principles, pp. 565–581. SOSP 2017, Association for Computing Machinery, New York, NY, USA (2017). https://doi.org/10.1145/3132747.3132778
Zhu, J., He, S., Liu, J., He, P., Xie, Q., Zheng, Z., Lyu, M.R.: Tools and benchmarks for automated log parsing. In: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 121–130. IEEE, Madrid, Spain (2019)
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Shin, D., Khan, Z.A., Bianculli, D., Briand, L. (2021). A Theoretical Framework for Understanding the Relationship Between Log Parsing and Anomaly Detection. In: Feng, L., Fisman, D. (eds) Runtime Verification. RV 2021. Lecture Notes in Computer Science, vol 12974. Springer, Cham. https://doi.org/10.1007/978-3-030-88494-9_16
DOI: https://doi.org/10.1007/978-3-030-88494-9_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-88493-2
Online ISBN: 978-3-030-88494-9
eBook Packages: Computer Science, Computer Science (R0)