Abstract
Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicates, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up the analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, improving the quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, because data distributions differ among software projects, the methods need parameters learned by machine learning algorithms to be sufficiently flexible. In this paper, we propose TraceSim, an approach for crash report deduplication that combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. We also conduct an extensive ablation study that demonstrates the importance of each element of TraceSim to its final performance and robustness. Finally, we provide the source code for all considered methods and the evaluation methodology, as well as the created datasets.
Notes
We conducted a preliminary investigation to find the best number of iterations.
These strategies were designed for techniques that consider the frame order. Since these information retrieval techniques are based on the bag-of-words model, such strategies are not effective for them.
Additional information
Communicated by: Foutse Khomh
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)
We gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Ericsson, Ciena, and EfficiOS for funding this project. Moreover, this research was enabled in part by the support provided by WestGrid (https://www.westgrid.ca/) and Compute Canada (www.computecanada.ca).
Appendix A: Additional Ablation Study Results
In this appendix, we expand the ablation study in which Global Weight, Local Weight, the diff(⋅) Function, and normalization are removed. Figs. 14–21 depict ΔAUC, ΔMAP, and ΔRR@1 between the original TraceSim and each possible configuration that has at most two components enabled.
The following configurations are not reported:
1. TraceSim without Global Weight and Local Weight. In this case, frame weights are always equal to 1. Since the normalization was designed for variable frame weights, it loses its effectiveness.
2. TraceSim without Global Weight, Local Weight, and the diff(⋅) Function. As in the previous configuration, the normalization is not effective because the frame weights are constant.
3. TraceSim without Global Weight, Local Weight, normalization, and the diff(⋅) Function. This configuration is equivalent to the NW algorithm in which the match, mismatch, and gap values are set to 1.0, 2.0, and 1.0, respectively.
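To make the last configuration concrete, the following is a minimal sketch of a Needleman-Wunsch (NW) global alignment score with match, mismatch, and gap values of 1.0, 2.0, and 1.0. It is an illustration under stated assumptions, not the authors' implementation: the function name `nw_score`, the representation of a stack trace as a list of frame names, and the sign convention (a match adds its value, while a mismatch or gap subtracts its value) are all assumptions made here.

```python
from typing import Sequence


def nw_score(
    a: Sequence[str],
    b: Sequence[str],
    match: float = 1.0,
    mismatch: float = 2.0,
    gap: float = 1.0,
) -> float:
    """Global alignment score of two frame sequences (hypothetical helper).

    Assumed convention: a match adds `match`; a mismatch subtracts
    `mismatch`; each gap subtracts `gap`.
    """
    n, m = len(a), len(b)
    # Row 0: aligning an empty prefix of `a` with b[:j] costs j gaps.
    prev = [-gap * j for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0: aligning a[:i] with an empty prefix of `b` costs i gaps.
        curr = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            same = a[i - 1] == b[j - 1]
            diag = prev[j - 1] + (match if same else -mismatch)
            up = prev[j] - gap        # frame a[i-1] left unmatched
            left = curr[j - 1] - gap  # frame b[j-1] left unmatched
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[m]


if __name__ == "__main__":
    trace_1 = ["main", "parse", "read"]
    trace_2 = ["main", "read"]
    print(nw_score(trace_1, trace_2))
```

With these values, a mismatch costs the same as two gaps, so substituting one frame for another is never cheaper than leaving both frames unmatched.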
Cite this article
Rodrigues, I.M., Khvorov, A., Aloise, D. et al. TraceSim: An Alignment Method for Computing Stack Trace Similarity. Empir Software Eng 27, 53 (2022). https://doi.org/10.1007/s10664-021-10070-w
DOI: https://doi.org/10.1007/s10664-021-10070-w