Security bug reports classification using fasttext

Alqahtani, Sultan S.

doi:10.1007/s10207-023-00793-w

Regular Contribution
Published: 22 December 2023

Volume 23, pages 1347–1358 (2024)

International Journal of Information Security

Sultan S. Alqahtani¹

Abstract

Software developers and maintainers must address security bug reports (SBRs) before they are publicly disclosed and the affected system is left vulnerable to attack. Bug tracking systems may contain security-related reports that are not labeled as SBRs, which makes them hard for developers to identify. Finding unlabeled SBRs is therefore essential to help security experts identify these issues quickly and accurately. The goal of this paper is to help software developers better classify bug reports that describe security vulnerabilities as SBRs by using a fasttext classifier. Previous work has applied text analytics and machine learning to classify which bug reports are security related. We improve on that work, as shown by our analysis of five open-source projects. First, we collected a dataset of 45,940 bug reports from five software repositories (e.g., the work of Peters et al. and Shu et al.). Second, we conducted an experiment on the classification of SBRs using a machine learning technique; in particular, we built fasttext classifiers. Finally, we investigated the accuracy of these classifiers in identifying SBRs. Our experimental results show that our fasttext classifier achieves an average F1 score of 0.81 when used to identify SBRs. Furthermore, we examined the generalizability of SBR identification by applying cross-project validation, and our results show that the fasttext classifier achieves an average F1 score of 0.65. We have made our data and results available at Alqahtani (fasttext implementation, 2023. https://github.com/isultane/fasttext_classifications) to support replication of our work.
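The workflow summarized in the abstract (label reports, score with F1, validate across projects) can be illustrated with a minimal sketch. It is not the author's implementation: the label names `__label__sbr` and `__label__nsbr`, the helper functions, and the reading of cross-project validation as "train on one project, test on another" are all assumptions made for illustration. Only the `__label__` prefix reflects fastText's default convention for supervised training data.

```python
from typing import List, Tuple

# Hypothetical label names; only the "__label__" prefix is fastText's default.
SBR = "__label__sbr"    # security bug report
NSBR = "__label__nsbr"  # any other bug report


def to_fasttext_line(text: str, is_security: bool) -> str:
    """Format one bug report as a single line of supervised training data."""
    label = SBR if is_security else NSBR
    # One example per line: lowercase and collapse whitespace and newlines.
    cleaned = " ".join(text.lower().split())
    return f"{label} {cleaned}"


def f1_score(gold: List[str], predicted: List[str], positive: str = SBR) -> float:
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_project_splits(projects: List[str]) -> List[Tuple[str, str]]:
    """Assumed protocol: every ordered (train, test) pair of distinct projects."""
    return [(a, b) for a in projects for b in projects if a != b]


if __name__ == "__main__":
    print(to_fasttext_line("Buffer overflow\n in parser", True))
    print(f1_score([SBR, SBR, NSBR, NSBR], [SBR, NSBR, SBR, NSBR]))
    # Placeholder project names, not the projects studied in the paper.
    print(len(cross_project_splits(["p1", "p2", "p3", "p4", "p5"])))
```

Lines produced by `to_fasttext_line` could be written to a text file and passed to `fasttext.train_supervised(input=...)` from the fastText Python bindings; that step is left out so the sketch runs without the library installed.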

Data availability

The experimental data and the simulation results that support the findings of this study are available on GitHub at https://github.com/isultane/fasttext_classifications/tree/master/data. Further data supporting the findings of this study are available from the corresponding author upon reasonable request.

References

Floris, P., Vogt Harald, H.: How to save on software maintenance costs, Omnext white paper, vol. SOURCE 2 V (2010)
Shu, R., Xie, T., Williams, L., Menzies, T.: Better security bug report classification via hyperparameter optimization (2019). https://arxiv.org/pdf/1905.06872.pdf
Chawla, I., Singh, S.K.: Automatic bug labeling using semantic information from LSI. In: 2014 Seventh International Conference on Contemporary Computing (IC3), pp. 376–381 (2014). https://doi.org/10.1109/IC3.2014.6897203
Bozorgi, M., Saul, L.K., Savage, S., Voelker, G.M.: Beyond heuristics: learning to classify vulnerabilities and predict exploits. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’10, p. 105 (2010). https://doi.org/10.1145/1835804.1835821
Peters, F., Tun, T.T., Yu, Y., Nuseibeh, B.: Text filtering and ranking for security bug report prediction. IEEE Trans. Softw. Eng. 45(6), 615–631 (2019). https://doi.org/10.1109/TSE.2017.2787653
Wijayasekara, D., Manic, M., Wright, J.L., McQueen, M.: Mining bug databases for unidentified software vulnerabilities. In: 2012 5th International Conference on Human System Interactions, pp. 89–96 (2012). https://doi.org/10.1109/HSI.2012.22
Wu, X., Zheng, W., Xia, X., Lo, D.: Data quality matters: a case study on data label correctness for security bug report prediction. IEEE Trans. Softw. Eng. 48(7), 2541–2556 (2022). https://doi.org/10.1109/TSE.2021.3063727
Fu, W., Menzies, T.: Easy over hard: a case study on deep learning. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 49–60 (2017). https://doi.org/10.1145/3106237.3106256
Liu, Z., Xia, X., Hassan, A.E., Lo, D., Xing, Z., Wang, X.: Neural-machine-translation-based commit message generation: how far are we? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 373–384 (2018). https://doi.org/10.1145/3238147.3238190
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431 (2017). https://aclanthology.org/E17-2068
Ohira, M., et al.: A dataset of high impact bugs: manually-classified issue reports. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp. 518–521 (2015). https://doi.org/10.1109/MSR.2015.78
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). https://doi.org/10.1613/jair.953
Rahali, A., Akhloufi, M.A.: MalBERT: using transformers for cybersecurity and malicious software detection (2021). https://arxiv.org/pdf/2103.03806.pdf
Roopak, M., Yun Tian, G., Chambers, J.: Deep learning models for cyber security in IoT networks. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0452–0457 (2019). https://doi.org/10.1109/CCWC.2019.8666588
Yin, J., Tang, M., Cao, J., Wang, H.: Apply transfer learning to cybersecurity: predicting exploitability of vulnerabilities by description. Knowl. Based Syst. 210, 106529 (2020). https://doi.org/10.1016/j.knosys.2020.106529
Johnson, R., Zhang, T.: Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in Neural Information Processing Systems, vol. 28 (2015). https://proceedings.neurips.cc/paper/2015/file/acc3e0404646c57502b480dc052c4fe1-Paper.pdf
Liu, J., Chang, W.-C., Wu, Y., Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124 (2017). https://doi.org/10.1145/3077136.3080834
Alqahtani, S.S.: fasttext implementation (2023). https://github.com/isultane/fasttext_classifications. Accessed 20 June 2023
Song, Q., Guo, Y., Shepperd, M.: A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans. Softw. Eng. 45(12), 1253–1269 (2019). https://doi.org/10.1109/TSE.2018.2836442
Kallis, R., Di Sorbo, A., Canfora, G., Panichella, S.: Ticket tagger: machine learning driven issue classification. In: 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 406–409 (2019). https://doi.org/10.1109/ICSME.2019.00070
Mileva, Y.M., Dallmeier, V., Burger, M., Zeller, A.: Mining trends of library usage. In: Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) Workshops, pp. 57–62 (2009). https://doi.org/10.1145/1595808.1595821
Gegick, M., Rotella, P., Xie, T.: Identifying security bug reports via text mining: an industrial case study. In: 7th IEEE Working Conference on Mining Software Repositories, pp. 11–20 (2010). https://doi.org/10.1109/MSR.2010.5463340
Scandariato, R., Walden, J., Hovsepyan, A., Joosen, W.: Predicting vulnerable software components via text mining. IEEE Trans. Softw. Eng. 40(10), 993–1006 (2014). https://doi.org/10.1109/TSE.2014.2340398
Yang, Y., Xia, X., Lo, D., Bi, T., Grundy, J., Yang, X.: Predictive models in software engineering: challenges and opportunities. ACM Trans. Softw. Eng. Methodol. 31(3), 1–72 (2022). https://doi.org/10.1145/3503509
Sawadogo, A.D., Guimard, T.F., Bissyandé, Q., Kader Kaboré, J., Klein, A., Moha, N.: Early detection of security-relevant bug reports using machine learning: how far are we? arXiv:2112.10123 (2021). https://ui.adsabs.harvard.edu/abs/2021arXiv211210123S/abstract
Berrar, D.: Cross-validation. In: Encyclopedia of Bioinformatics and Computational Biology, pp. 542–545. Elsevier (2019)
Zhang, Z.: Introduction to machine learning: k-nearest neighbors. Ann. Transl. Med. 4(11), 218 (2016). https://doi.org/10.21037/atm.2016.03.37
Alipour, A., Hindle, A., Stroulia, E.: A contextual approach towards more accurate duplicate bug report detection. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 183–192 (2013). https://doi.org/10.1109/MSR.2013.6624026
Sharma, M., Bedi, P., Chaturvedi, K.K., Singh, V.B.: Predicting the priority of a reported bug using machine learning techniques and cross project validation. In: 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA), pp. 539–545 (2012). https://doi.org/10.1109/ISDA.2012.6416595
He, P., Li, B., Ma, Y.: Towards cross-project defect prediction with imbalanced feature sets, p. 10 (2014). https://doi.org/10.48550/arXiv.1411.4228

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Computer and Information Sciences College, Al-Imam Mohammad Ibn Saud Islamic University, Riyadh, Saudi Arabia
Sultan S. Alqahtani

Contributions

SSA contributed to the study conception and design, including material preparation, data collection, and analysis. SSA also prepared the first draft of the manuscript, and a proofreader commented on previous versions of the manuscript. The author read and approved the final manuscript.

Corresponding author

Correspondence to Sultan S. Alqahtani.

Ethics declarations

Conflict of interest

The author, Sultan S. Alqahtani, declares that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alqahtani, S.S. Security bug reports classification using fasttext. Int. J. Inf. Secur. 23, 1347–1358 (2024). https://doi.org/10.1007/s10207-023-00793-w

Accepted: 21 November 2023
Published: 22 December 2023
Issue Date: April 2024
DOI: https://doi.org/10.1007/s10207-023-00793-w
