VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes

Zamani, Mahmoud; Irtiza, Saquib; Khan, Latifur; Hamlen, Kevin W.

doi:10.1007/978-3-031-57537-2_12

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14551))

Included in the following conference series:

International Symposium on Foundations and Practice of Security

90 Accesses

Abstract

The first graph masked auto-encoder (GraphMAE) model for software vulnerability detection is designed and developed, with a comparative evaluation against other self-supervised learning (SSL) methods. Evaluation of the domain-specific GraphMAE model (VulMAE) for the vulnerability detection task shows exceptional promise, outperforming all other baseline models in the study. The approach is particularly well-suited for cybersecurity applications where gathering substantial real-world labeled samples is difficult, since graph SSL methods (e.g., contrastive and generative models) offer data classification in AI tasks without requiring vast amounts of labeled data for effective training.

The study fills a key gap in the literature on automated and machine-assisted discovery and patching of software security vulnerabilities, which has become increasingly critical with the dramatic increase in modern software complexity, but for which graph neural network (GNN) approaches are understudied relative to traditional processes, such as manual source code auditing and fuzzing. To conduct the study, the evaluation applies models to source and binary software components sourced from the National Vulnerability Database (NVD). A new dataset is curated by extracting vulnerable code fragments from six applications with NVD-documented security flaws and converting them to four graph types using specialized tools based on code property graphs and binary semantics lifting. The data is used to train contrastive and generative learning models for comparison. VulMAE achieves a weighted F1 score of 0.936 and a weighted Recall of 0.938, which is the highest of all tested methods.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 79.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
https://github.com/Saquibirtiza/VulMAE.git.

References

Booth, H., Rike, D., Witte, G.A.: The national vulnerability database (NVD): Overview. ITL Bulletin, National Institute of Standards and Technology (2013)
Google Scholar
Brumley, D., Jager, I., Avgerinos, T., Schwartz, E.J.: BAP: a binary analysis platform. In: Proceedings of International Conference on Computer Aided Verification, pp. 463–469 (2011)
Google Scholar
Chakraborty, S., Krishna, R., Ding, Y., Ray, B.: Deep learning based vulnerability detection: are we there yet. IEEE Trans. Softw. Eng. 48, 3280–3296 (2022)
Article Google Scholar
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artifi. Intell. Res. 16(1), 321–357 (2002)
Article Google Scholar
Croft, R., Newlands, D., Chen, Z., Babar, M.A.: An empirical study of rule-based and learning-based approaches for static application security testing. In: Proceedings of ACM/IEEE International Symposium Empirical Software Engineering and Measurement (2021)
Google Scholar
DevNest: How to bypass sudo – exploit CVE-2023-22809 vulnerability. Medium (2023). https://medium.com/@dev.nest/how-to-bypass-sudo-exploit-cve-2023-22809-vulnerability-296ef10a1466
Hassani, K., Khasahmadi, A.H.: Contrastive multi-view representation learning on graphs. In: Proceedings of International Conference on Machine Learning, pp. 4116–4126 (2020)
Google Scholar
Hin, D., Kan, A., Chen, H., Babar, M.A.: LineVD: statement-level vulnerability detection using graph neural networks. In: Proceedings of International Conference on Mining Software Repositories, pp. 596–607 (2022)
Google Scholar
Hjelm, R.D., et al.: Learning deep representations by mutual information estimation and maximization. In: Proceedings of International Conference on Learning Representation (2019)
Google Scholar
Hohnka, M.J., Miller, J.A., Dacumos, K.M., Fritton, T.J., Erdley, J.D., Long, L.N.: Evaluation of compiler-induced vulnerabilities. J. Aerospace Inform. Syst. 16(10), 409–426 (2019)
Article Google Scholar
Hou, Z., Liu, X., Cen, Y., Dong, Y., Yang, H., Wang, C., Tang, J.: GraphMAE: self-supervised masked graph autoencoders. In: Proceedings of ACM Conference on Knowledge Discovery and Data Mining, pp. 594–604 (2022)
Google Scholar
Kazius, J., McGuire, R., Bursi, R.: Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem. 48(1), 312–320 (2005)
Article Google Scholar
Kipf, T.N., Welling, M.: Variational graph auto-encoders. arXiv:1611.07308 (2016)
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: Proceedings of International Conferen on Learning Representation (Poster) (2017)
Google Scholar
Le, T., et al.: Maximal divergence sequential autoencoder for binary software vulnerability detection. In: Proceedings of International Conference on Learning Representation (2019)
Google Scholar
Li, X., Feng, B., Li, G., Li, T., He, M.: A vulnerability detection system based on fusion of assembly code and source code. Sec. Commun. Netw. 2021 (2021)
Google Scholar
Li, Z., Zou, D., Xu, S., Chen, Z., Zhu, Y., Jin, H.: VulDeeLocator: a deep learning-based fine-grained vulnerability detector. IEEE Trans. Dependable Sec. Comput. 19(4), 2821–2837 (2021)
Article Google Scholar
Li, Z., Zou, D., Xu, S., Jin, H., Qi, H., Hu, J.: Vulpecker: an automated vulnerability detection system based on code similarity analysis. In: Proceedings of Annual Computer Security Applications Conference, pp. 201–213 (2016)
Google Scholar
Li, Z., Zou, D., Xu, S., Jin, H., Zhu, Y., Chen, Z.: SySeVR: a framework for using deep learning to detect software vulnerabilities. IEEE Trans. Dependable Sec. Comput. 19(4), 2244–2258 (2021)
Article Google Scholar
Li, Z., et al.: Vuldeepecker: a deep learning-based system for vulnerability detection. In: Proceedings of Annual Network & Distributed System Security Symposium (2018)
Google Scholar
Lin, G., Zhang, J., Luo, W., Pan, L., Xiang, Y.: POSTER: vulnerability discovery with function representation learning from unlabeled projects. In: Proceedings of ACM Conference on Computer and Communications Security, pp. 2539–2541 (2017)
Google Scholar
Lin, G.: Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Indus. Inform. 14(7), 3289–3297 (2018)
Article Google Scholar
Lipp, S., Banescu, S., Pretschner, A.: An empirical study on the effectiveness of static C code analyzers for vulnerability detection. In: Proceedings of ACM International Symposium on Software Testing and Analysis, pp. 544–555 (2022)
Google Scholar
Ma, R., Jian, Z., Chen, G., Ma, K., Chen, Y.: ReJection: a AST-based reentrancy vulnerability detection method. In: Proceedings of Chinese Conference on Trusted Computing and Information Security, pp. 58–71 (2020)
Google Scholar
Mizrahi, Y.: OpenSSH pre-auth double free CVE-2023-25136 – writeup and proof-of-concept. JFrog (2023). https://jfrog.com/blog/openssh-pre-auth-double-free-cve-2023-25136-writeup-and-proof-of-concept
NIST: CVSS severity distribution over time. https://nvd.nist.gov/general/visualizations/vulnerability-visualizations/cvss-severity-distribution-over-time#CVSSSeverityOverTime, (Accessed 12 Sep 2023)
Pinconschi, E., Abreu, R., Adão, P.: A comparative study of automatic program repair techniques for security vulnerabilities. In: Proceedings of IEEE International Symposium on Software Reliability Engineering, pp. 196–207 (2021)
Google Scholar
Russell, R., et al.: klM.: Automated vulnerability detection in source code using deep representation learning. In: Proceedings of IEEE International Conference on Machine Learning and Applications, pp. 757–762 (2018)
Google Scholar
Schlichtkrull, M., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: Proceedings of European Semantic Web Conference, pp. 593–607 (2018)
Google Scholar
Shervashidze, N., Schweitzer, P., Leeuwen, E.J.V., Mehlhorn, K., Borgwardt, K.M.: Weisfeiler-Lehman graph kernels. J. Mach. Learn. Res. 12(9) (2011)
Google Scholar
Shimchik, N., Ignatyev, V., Belevantsev, A.: Improving accuracy and completeness of source code static taint analysis. In: Ivannikov Ispras Open Conference, pp. 61–68 (2021)
Google Scholar
Sun, F.Y., Hoffmann, J., Verma, V., Tang, J.: Infograph: unsupervised and semi-supervised graph-level representation learning via mutual information maximization. In: Proceedings of International Conference on Learning Representations (2020)
Google Scholar
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: Proceedings of International Conference on Learning Representation (2017)
Google Scholar
Veličković, P., Fedus, W., Hamilton, W.L., Liò, P., Bengio, Y., Hjelm, R.D.: Deep graph infomax. In: Proceedings of International Conference on Learning Representation (2019)
Google Scholar
Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? In: Proceedings of International Conference on Learning Representation (2019)
Google Scholar
Xu, L., Sun, F., Su, Z.: Constructing precise control flow graphs from binaries. The University of California, Davis, Tech. rep. (2009)
Google Scholar
Yamaguchi, F., Golde, N., Arp, D., Rieck, K.: Modeling and discovering vulnerabilities with code property graphs. In: Proceedings IEEE Symposium on Security & Privacy, pp. 590–604 (2014)
Google Scholar
Yamaguchi, F., Lindner, F.F., Rieck, K.: Vulnerability extrapolation: assisted discovery of vulnerabilities using machine learning. In: Proceedings of USENIX Workshop Offensive Technologies, pp. 118–127 (2011)
Google Scholar
Yanardag, P., Vishwanathan, S.: Deep graph kernels. In: Proceedings of ACM International Conference on Knowledge Discovery and Data Mining, pp. 1365–1374 (2015)
Google Scholar
You, Y., Chen, T., Sui, Y., Chen, T., Wang, Z., Shen, Y.: Graph contrastive learning with augmentations. In: Proceedings of Conference on Neural Information Processing Systems, pp. 5812–5823 (2020)
Google Scholar
Zhang, H., Wu, Q., Yan, J., Wipf, D., Yu, P.S.: From canonical correlation analysis to self-supervised graph neural networks. In: Proceedings of Conference on Neural Information Processing Systems, pp. 76–89 (2021)
Google Scholar
Zhou, M., et al.: A method for software vulnerability detection based on improved control flow graph. Wuhan University J. Nat. Sci. 24(2), 149–160 (2019)
Article Google Scholar
Zhou, Y., Liu, S., Siow, J., Du, X., Liu, Y.: Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In: Proceedings of Conference on Neural Information Processing Systems, pp. 10197–10207 (2019)
Google Scholar
Zhu, Q., Du, B., Yan, P.: Self-supervised training of graph convolutional networks. In: Proceedings of International Conference on Machine Learning, Online (2020)
Google Scholar

Download references

Acknowledgments

This research was supported in part by DARPA Award N66001-21-C-4024, ONR Award N00014-21-1-2654, and ARO Award W911NF-21-1-0032. All recommendations, opinions, and conclusions are those of the authors and not necessarily of the above supporters.

Author information

Authors and Affiliations

The University of Texas at Dallas, Richardson, TX, 75080, USA
Mahmoud Zamani, Saquib Irtiza, Latifur Khan & Kevin W. Hamlen

Authors

Mahmoud Zamani
View author publications
You can also search for this author in PubMed Google Scholar
Saquib Irtiza
View author publications
You can also search for this author in PubMed Google Scholar
Latifur Khan
View author publications
You can also search for this author in PubMed Google Scholar
Kevin W. Hamlen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Saquib Irtiza .

Editor information

Editors and Affiliations

University of Bordeaux, Bordeaux, France
Mohamed Mosbah
Toulouse III - Paul Sabatier University, Toulouse, France
Florence Sèdes
Université Laval, Québec, QC, Canada
Nadia Tawbi
University of Bordeaux, Bordeaux, France
Toufik Ahmed
Polytechnique Montréal, Montreal, QC, Canada
Nora Boulahia-Cuppens
Telecom SudParis, Palaiseau, France
Joaquin Garcia-Alfaro

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zamani, M., Irtiza, S., Khan, L., Hamlen, K.W. (2024). VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes. In: Mosbah, M., Sèdes, F., Tawbi, N., Ahmed, T., Boulahia-Cuppens, N., Garcia-Alfaro, J. (eds) Foundations and Practice of Security. FPS 2023. Lecture Notes in Computer Science, vol 14551. Springer, Cham. https://doi.org/10.1007/978-3-031-57537-2_12

Download citation

DOI: https://doi.org/10.1007/978-3-031-57537-2_12
Published: 25 April 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-57536-5
Online ISBN: 978-3-031-57537-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

VulMAE: Graph Masked Autoencoders for Vulnerability Detection from Source and Binary Codes