Abstract
Software developers and maintainers must address security bug reports (SBRs) before they are publicly disclosed; otherwise, their systems are left vulnerable to attack. Bug tracking systems may contain security-related reports that are not labeled as SBRs, which makes them hard for developers to identify. Finding unlabeled SBRs is therefore essential to help security experts identify these issues quickly and accurately. The goal of this paper is to help software developers better classify bug reports that describe security vulnerabilities as SBRs by using a fasttext classifier. Previous work has applied text analytics and machine learning to classify which bug reports are security related. We improve on that work, as shown by our analysis of five open-source projects. First, we collected a dataset of 45,940 bug reports from five software repositories (e.g., from the work of Peters et al. and Shu et al.). Second, we conducted an experiment on classifying SBRs with machine learning; specifically, we built fasttext classifiers. Third, we evaluated the accuracy of these classifiers in identifying SBRs. Our experimental results show that the fasttext classifier achieves an average F1 score of 0.81 when identifying SBRs. Furthermore, we examined the generalizability of SBR identification by applying cross-project validation, where the fasttext classifier achieves an average F1 score of 0.65. Finally, we made our data and results available at Alqahtani (fasttext implementation, 2023. https://github.com/isultane/fasttext_classifications) to support replication of our work.
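To make the evaluation terms in the abstract concrete, the following is a minimal sketch and not the author's implementation. It shows how a bug report could be written in fastText's supervised input format (the `__label__` prefix is fastText's default), how an F1 score is computed for the SBR class, and one common way to arrange cross-project validation (hold out one project, train on the rest). The label names `sbr` and `nsbr`, the `project` field, and the leave-one-project-out protocol are assumptions for illustration; the abstract does not state the exact protocol used.

```python
from collections import defaultdict


def to_fasttext_line(text, is_security):
    """Format one bug report as a fastText supervised training line.

    The label names 'sbr' / 'nsbr' are hypothetical; only the
    '__label__' prefix is fastText's default convention.
    """
    label = "__label__sbr" if is_security else "__label__nsbr"
    # fastText reads one example per line, so collapse all whitespace.
    return f"{label} {' '.join(text.split())}"


def f1_score(y_true, y_pred, positive=True):
    """F1 for the positive (SBR) class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_project_splits(reports):
    """Yield (held_out_project, train, test), holding out one project at a time.

    Assumes each report is a dict with a 'project' key (hypothetical schema).
    """
    by_project = defaultdict(list)
    for report in reports:
        by_project[report["project"]].append(report)
    for held_out in sorted(by_project):
        train = [r for name, rs in by_project.items() if name != held_out for r in rs]
        yield held_out, train, by_project[held_out]
```

In such a setup, the lines produced by `to_fasttext_line` for each training split would be written to a file and passed to a fastText supervised trainer, and `f1_score` would be applied to the predictions on the held-out project.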
Data availability
The experimental data and the simulation results that support the findings of this study are available on GitHub at https://github.com/isultane/fasttext_classifications/tree/master/data. The data are also available from the corresponding author upon reasonable request.
References
Floris, P., Vogt Harald, H.: How to save on software maintenance costs, Omnext white paper, vol. SOURCE 2 V (2010)
Shu, R., Xie, T., Williams, L., Menzies, T.: Better security bug report classification via hyperparameter optimization (2019). https://arxiv.org/pdf/1905.06872.pdf
Chawla, I., Singh, S.K.: Automatic bug labeling using semantic information from LSI. In: 2014 Seventh International Conference on Contemporary Computing (IC3), pp. 376–381 (2014). https://doi.org/10.1109/IC3.2014.6897203
Bozorgi, M., Saul, L.K., Savage, S., Voelker, G.M.: Beyond heuristics: learning to classify vulnerabilities and predict exploits. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’10, p. 105 (2010). https://doi.org/10.1145/1835804.1835821
Peters, F., Tun, T.T., Yu, Y., Nuseibeh, B.: Text filtering and ranking for security bug report prediction. IEEE Trans. Softw. Eng. 45(6), 615–631 (2019). https://doi.org/10.1109/TSE.2017.2787653
Wijayasekara, D., Manic, M., Wright, J.L., McQueen, M.: Mining bug databases for unidentified software vulnerabilities. In: 2012 5th International Conference on Human System Interactions, pp. 89–96 (2012). https://doi.org/10.1109/HSI.2012.22
Wu, X., Zheng, W., Xia, X., Lo, D.: Data quality matters: a case study on data label correctness for security bug report prediction. IEEE Trans. Softw. Eng. 48(7), 2541–2556 (2022). https://doi.org/10.1109/TSE.2021.3063727
Fu, W., Menzies, T.: Easy over hard: a case study on deep learning. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 49–60 (2017). https://doi.org/10.1145/3106237.3106256
Liu, Z., Xia, X., Hassan, A.E., Lo, D., Xing, Z., Wang, X.: Neural-machine-translation-based commit message generation: how far are we? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 373–384 (2018). https://doi.org/10.1145/3238147.3238190
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431 (2017). https://aclanthology.org/E17-2068
Ohira, M., et al.: A dataset of high impact bugs: manually-classified issue reports. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp. 518–521 (2015). https://doi.org/10.1109/MSR.2015.78
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). https://doi.org/10.1613/jair.953
Rahali, A., Akhloufi, M.A.: MalBERT: using transformers for cybersecurity and malicious software detection (2021). https://arxiv.org/pdf/2103.03806.pdf
Roopak, M., Yun Tian, G., Chambers, J.: Deep learning models for cyber security in IoT networks. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0452–0457 (2019). https://doi.org/10.1109/CCWC.2019.8666588
Yin, J., Tang, M., Cao, J., Wang, H.: Apply transfer learning to cybersecurity: predicting exploitability of vulnerabilities by description. Knowl. Based Syst. 210, 106529 (2020). https://doi.org/10.1016/j.knosys.2020.106529
Johnson, R., Zhang, T.: Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in Neural Information Processing Systems, vol. 28 (2015). https://proceedings.neurips.cc/paper/2015/file/acc3e0404646c57502b480dc052c4fe1-Paper.pdf
Liu, J., Chang, W.-C., Wu, Y., Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124 (2017). https://doi.org/10.1145/3077136.3080834
Alqahtani, S.S.: fasttext implementation (2023). https://github.com/isultane/fasttext_classifications. Accessed 20 June 2023
Song, Q., Guo, Y., Shepperd, M.: A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans. Softw. Eng. 45(12), 1253–1269 (2019). https://doi.org/10.1109/TSE.2018.2836442
Kallis, R., Di Sorbo, A., Canfora, G., Panichella, S.: Ticket tagger: machine learning driven issue classification. In: 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 406–409 (2019). https://doi.org/10.1109/ICSME.2019.00070
Mileva, Y.M., Dallmeier, V., Burger, M., Zeller, A.: Mining trends of library usage. In: Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) Workshops, pp. 57–62 (2009). https://doi.org/10.1145/1595808.1595821
Gegick, M., Rotella, P., Xie, T.: Identifying security bug reports via text mining: an industrial case study. In: 7th IEEE Working Conference on Mining Software Repositories, pp. 11–20 (2010). https://doi.org/10.1109/MSR.2010.5463340
Scandariato, R., Walden, J., Hovsepyan, A., Joosen, W.: Predicting vulnerable software components via text mining. IEEE Trans. Softw. Eng. 40(10), 993–1006 (2014). https://doi.org/10.1109/TSE.2014.2340398
Yang, Y., Xia, X., Lo, D., Bi, T., Grundy, J., Yang, X.: Predictive models in software engineering: challenges and opportunities. ACM Trans. Softw. Eng. Methodol. 31(3), 1–72 (2022). https://doi.org/10.1145/3503509
Sawadogo, A.D., Guimard, Q., Bissyandé, T.F., Kaboré, A.K., Klein, J., Moha, N.: Early detection of security-relevant bug reports using machine learning: how far are we? eprint arXiv:2112.10123 (2021). https://ui.adsabs.harvard.edu/abs/2021arXiv211210123S/abstract
Berrar, D.: Cross-validation. In: Encyclopedia of Bioinformatics and Computational Biology, pp. 542–545. Elsevier (2019)
Zhang, Z.: Introduction to machine learning: k-nearest neighbors. Ann. Transl. Med. 4(11), 218–218 (2016). https://doi.org/10.21037/atm.2016.03.37
Alipour, A., Hindle, A., Stroulia, E.: A contextual approach towards more accurate duplicate bug report detection. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 183–192 (2013). https://doi.org/10.1109/MSR.2013.6624026
Sharma, M., Bedi, P., Chaturvedi, K.K., Singh, V.B.: Predicting the priority of a reported bug using machine learning techniques and cross project validation. In: 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA), pp. 539–545 (2012). https://doi.org/10.1109/ISDA.2012.6416595
He, P., Li, B., Ma, Y.: Towards cross-project defect prediction with imbalanced feature sets, p. 10 (2014). https://doi.org/10.48550/arXiv.1411.4228
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Authors and Affiliations
Contributions
SSA contributed to the study conception and design, including the material preparation, data collection, and analysis. SSA also prepared the first draft of the manuscript, and a proofreader commented on previous versions of the manuscript. The author read and approved the final manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The author, Sultan S. Alqahtani, declares no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Alqahtani, S.S. Security bug reports classification using fasttext. Int. J. Inf. Secur. 23, 1347–1358 (2024). https://doi.org/10.1007/s10207-023-00793-w
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10207-023-00793-w