Vulnerability discovery based on source code patch commit mining: a systematic literature review

Zuo, Fei; Rhee, Junghwan

doi:10.1007/s10207-023-00795-8

Regular Contribution
Published: 06 January 2024

Volume 23, pages 1513–1526, (2024)

International Journal of Information Security

Fei Zuo¹ & Junghwan Rhee¹

Abstract

In recent years, there has been a remarkable surge in the adoption of open-source software (OSS). However, with the growing use of OSS components in both free and proprietary software, vulnerabilities present within them can spread to a vast array of applications built on them. Even worse, a myriad of vulnerabilities are fixed secretly via patch commits, which leaves other software that reuses the vulnerable code snippets in the dark. Thus, source code patch commit mining for vulnerability discovery is receiving immense attention, and a variety of approaches have been proposed. Despite that, no comprehensive survey summarizes and discusses the current progress in this field. To fill this gap, we survey, evaluate, and systematize the relevant literature and provide the community with our insights on both successes and remaining issues in this space. Special attention is paid to work on vulnerability discovery. In this paper, we also provide an introductory panorama with our replicable hands-on experience, which can help readers quickly understand and enter the field. Our empirical study reveals noteworthy challenges that need to be highlighted and addressed. We also discuss potential directions for future work. To the best of our knowledge, we provide the first literature review to study source code patch commit mining in the vulnerability discovery context. The systematic framework, hands-on practices, and list of potential challenges provide new knowledge for mining source code patch commits toward a more robust software ecosystem. The research gaps found in this literature review show the need for future research on issues such as data quality, high false alarm rates, and the significance of textual information.

Data Availability

Data sharing is not applicable to this article as no datasets were generated during the study. The papers discussed in this review are listed at https://github.com/fzuo/Patch-Commits-Study.

Author information

Authors and Affiliations

Department of Computer Science, University of Central Oklahoma, Edmond, OK, 73034, USA
Fei Zuo & Junghwan Rhee

Corresponding author

Correspondence to Fei Zuo.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zuo, F., Rhee, J. Vulnerability discovery based on source code patch commit mining: a systematic literature review. Int. J. Inf. Secur. 23, 1513–1526 (2024). https://doi.org/10.1007/s10207-023-00795-8

Published: 06 January 2024
Issue Date: April 2024
DOI: https://doi.org/10.1007/s10207-023-00795-8
