Security bug reports classification using fasttext

  • Regular Contribution
International Journal of Information Security

Abstract

Software developers and maintainers must address security bug reports (SBRs) before they are publicly disclosed and their systems are left vulnerable to attack. Bug tracking systems may contain security-related reports that are not labeled as SBRs, which makes them hard for developers to identify. Finding unlabeled SBRs is therefore essential to help security experts identify these issues quickly and accurately. The goal of this paper is to help software developers better classify bug reports that describe security vulnerabilities as SBRs by using a fasttext classifier. Previous work has applied text analytics and machine learning to classify which bug reports are security related. We improve on that work, as shown by our analysis of five open-source projects. First, we collected a dataset of 45,940 bug reports from five software repositories (e.g., from the work of Peters et al. and Shu et al.). Second, we conducted an experiment on the classification of SBRs using a machine learning technique; in particular, we built fasttext classifiers. Finally, we investigated the accuracy of these classifiers in identifying SBRs. Our experimental results show that the fasttext classifier achieves an average F1 score of 0.81 when used to identify SBRs. Furthermore, we examined the generalizability of SBR identification by applying cross-project validation, and the fasttext classifier achieved an average F1 score of 0.65. We have made our data and results available at Alqahtani (fasttext implementation, 2023. https://github.com/isultane/fasttext_classifications) to support replication of our work.
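The evaluation setup described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the label names `SBR` and `NSBR`, the `project` and `text` dictionary keys, and the leave-one-project-out reading of cross-project validation are all assumptions made for illustration. The only fastText-specific detail relied on is the `__label__` prefix that fastText's supervised mode uses to mark labels in its input files.

```python
from collections import defaultdict


def to_fasttext_line(text, is_security):
    """Format one bug report as a line of fastText supervised training input.

    The label names SBR/NSBR are hypothetical; fastText only requires the
    "__label__" prefix and one example per line.
    """
    label = "__label__SBR" if is_security else "__label__NSBR"
    # Collapse newlines and repeated spaces so the report stays on one line.
    return f"{label} {' '.join(text.split())}"


def f1_score(y_true, y_pred, positive="SBR"):
    """Compute F1 for the positive class from parallel label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_project_splits(reports):
    """Yield (held_out_project, train, test), holding out one project at a time.

    This is one possible reading of cross-project validation; the paper's
    exact protocol may differ.
    """
    by_project = defaultdict(list)
    for report in reports:
        by_project[report["project"]].append(report)
    for held_out in sorted(by_project):
        train = [r for name, rs in by_project.items() if name != held_out for r in rs]
        yield held_out, train, by_project[held_out]
```

In such a setup, the lines produced by `to_fasttext_line` for each training split would be written to a file and passed to a fastText supervised trainer, and `f1_score` would be computed on the held-out project's predictions and averaged across projects.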

Data availability

The experimental data and the simulation results that support the findings of this study are available on GitHub at https://github.com/isultane/fasttext_classifications/tree/master/data. The data that support the findings of this study are also available from the corresponding author upon reasonable request.

Notes

  1. https://cwe.mitre.org/.

  2. https://docs.python.org/3/c-api/utilities.html.

  3. https://cwe.mitre.org/.

  4. https://fasttext.cc/docs/en/options.html.

References

  1. Floris, P., Vogt Harald, H.: How to save on software maintenance costs, omnext white paper, vol. SOURCE 2 V (2010)

  2. Shu, R., Xia, T., Williams, L., Menzies, T.: Better security bug report classification via hyperparameter optimization (2019). https://arxiv.org/pdf/1905.06872.pdf

  3. Chawla, I., Singh, S.K.: Automatic bug labeling using semantic information from LSI. In: 2014 Seventh International Conference on Contemporary Computing (IC3), pp. 376–381 (2014). https://doi.org/10.1109/IC3.2014.6897203.

  4. Bozorgi, M., Saul, L.K., Savage, S., Voelker, G.M.: Beyond heuristics: learning to classify vulnerabilities and predict exploits. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’10, p. 105 (2010). https://doi.org/10.1145/1835804.1835821

  5. Peters, F., Tun, T.T., Yu, Y., Nuseibeh, B.: Text filtering and ranking for security bug report prediction. IEEE Trans. Softw. Eng. 45(6), 615–631 (2019). https://doi.org/10.1109/TSE.2017.2787653

  6. Wijayasekara, D., Manic, M., Wright, J.L., McQueen, M.: Mining bug databases for unidentified software vulnerabilities. In: 2012 5th International Conference on Human System Interactions, pp. 89–96 (2012). https://doi.org/10.1109/HSI.2012.22

  7. Wu, X., Zheng, W., Xia, X., Lo, D.: Data quality matters: a case study on data label correctness for security bug report prediction. IEEE Trans. Softw. Eng. 48(7), 2541–2556 (2022). https://doi.org/10.1109/TSE.2021.3063727

  8. Fu, W., Menzies, T.: Easy over hard: a case study on deep learning. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 49–60 (2017). https://doi.org/10.1145/3106237.3106256

  9. Liu, Z., Xia, X., Hassan, A.E., Lo, D., Xing, Z., Wang, X.: Neural-machine-translation-based commit message generation: how far are we? In: Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 373–384 (2018). https://doi.org/10.1145/3238147.3238190

  10. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431 (2017). https://aclanthology.org/E17-2068

  11. Ohira, M., et al.: A dataset of high impact bugs: manually-classified issue reports. In: 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp. 518–521 (2015). https://doi.org/10.1109/MSR.2015.78

  12. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). https://doi.org/10.1613/jair.953

  13. Rahali, A., Akhloufi, M.A.: MalBERT: using transformers for cybersecurity and malicious software detection (2021). https://arxiv.org/pdf/2103.03806.pdf

  14. Roopak, M., Yun Tian, G., Chambers, J.: Deep learning models for cyber security in IoT networks. In: 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), pp. 0452–0457 (2019). https://doi.org/10.1109/CCWC.2019.8666588

  15. Yin, J., Tang, M., Cao, J., Wang, H.: Apply transfer learning to cybersecurity: predicting exploitability of vulnerabilities by description. Knowl. Based Syst. 210, 106529 (2020). https://doi.org/10.1016/j.knosys.2020.106529

  16. Johnson, R., Zhang, T.: Semi-supervised convolutional neural networks for text categorization via region embedding. In: Advances in Neural Information Processing Systems, vol. 28 (2015). https://proceedings.neurips.cc/paper/2015/file/acc3e0404646c57502b480dc052c4fe1-Paper.pdf

  17. Liu, J., Chang, W.-C., Wu, Y., Yang, Y.: Deep learning for extreme multi-label text classification. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124 (2017). https://doi.org/10.1145/3077136.3080834

  18. Alqahtani, S.S.: fasttext implementation (2023). https://github.com/isultane/fasttext_classifications. Accessed 20 June 2023

  19. Song, Q., Guo, Y., Shepperd, M.: A comprehensive investigation of the role of imbalanced learning for software defect prediction. IEEE Trans. Softw. Eng. 45(12), 1253–1269 (2019). https://doi.org/10.1109/TSE.2018.2836442

  20. Kallis, R., Di Sorbo, A., Canfora, G., Panichella, S.: Ticket tagger: machine learning driven issue classification. In: 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 406–409 (2019). https://doi.org/10.1109/ICSME.2019.00070

  21. Mileva, Y.M., Dallmeier, V., Burger, M., Zeller, A.: Mining trends of library usage. In: Proceedings of the Joint International and Annual ERCIM Workshops on Principles of Software Evolution (IWPSE) and Software Evolution (Evol) Workshops, pp. 57–62 (2009). https://doi.org/10.1145/1595808.1595821

  22. Gegick, M., Rotella, P., Xie, T.: Identifying security bug reports via text mining: an industrial case study. In: 7th IEEE Working Conference on Mining Software Repositories, pp. 11–20 (2010). https://doi.org/10.1109/MSR.2010.5463340

  23. Scandariato, R., Walden, J., Hovsepyan, A., Joosen, W.: Predicting vulnerable software components via text mining. IEEE Trans. Softw. Eng. 40(10), 993–1006 (2014). https://doi.org/10.1109/TSE.2014.2340398

  24. Yang, Y., Xia, X., Lo, D., Bi, T., Grundy, J., Yang, X.: Predictive models in software engineering: challenges and opportunities. ACM Trans. Softw. Eng. Methodol. 31(3), 1–72 (2022). https://doi.org/10.1145/3503509

  25. Sawadogo, A.D., Guimard, T.F., Bissyandé, Q., Kader Kaboré, J., Klein, A., Moha, N.: Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We? eprint arXiv:2112.10123 (2021). https://ui.adsabs.harvard.edu/abs/2021arXiv211210123S/abstract

  26. Berrar, D.: Cross-validation. In: Encyclopedia of Bioinformatics and Computational Biology, pp. 542–545. Elsevier (2019)

  27. Zhang, Z.: Introduction to machine learning: k-nearest neighbors. Ann. Transl. Med. 4(11), 218–218 (2016). https://doi.org/10.21037/atm.2016.03.37

  28. Alipour, A., Hindle, A., Stroulia, E.: A contextual approach towards more accurate duplicate bug report detection. In: 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 183–192 (2013). https://doi.org/10.1109/MSR.2013.6624026

  29. Sharma, M., Bedi, P., Chaturvedi, K.K., Singh, V.B.: Predicting the priority of a reported bug using machine learning techniques and cross project validation. In: 2012 12th International Conference on Intelligent Systems Design and Applications (ISDA), pp. 539–545 (2012). https://doi.org/10.1109/ISDA.2012.6416595

  30. Peng, H., Bing, L., Yutao, M.: Towards cross-project defect prediction with imbalanced feature sets, p. 10 (2014). https://doi.org/10.48550/arXiv.1411.4228

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Contributions

SSA contributed to the study conception and design, including the material preparation, data collection, and analysis. SSA also prepared the first draft of the manuscript, and a proofreader commented on previous versions of the manuscript. The author read and approved the final manuscript.

Corresponding author

Correspondence to Sultan S. Alqahtani.

Ethics declarations

Conflict of interest

The author, Sultan S. Alqahtani, declares that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Alqahtani, S.S. Security bug reports classification using fasttext. Int. J. Inf. Secur. 23, 1347–1358 (2024). https://doi.org/10.1007/s10207-023-00793-w

  • DOI: https://doi.org/10.1007/s10207-023-00793-w
