Detecting vulnerable software functions via text and dependency features

  • Data analytics and machine learning
Soft Computing

Abstract

Detecting vulnerabilities in software is crucial to guaranteeing the security of software systems. Most previous methods train a classification or regression model on text features of the source code to predict vulnerabilities. However, labeled vulnerabilities are not always easy to obtain in practical applications, and text features alone are insufficient to find vulnerabilities in complex software systems. To address these problems, we propose an unsupervised method for detecting vulnerable software functions that uses both text and dependency features of the source code to improve detection accuracy. Specifically, we first extract the text and dependency features from the source code and concatenate them into a combined feature. We then train a deep autoencoder to transform the combined feature into a low-dimensional embedding. Finally, we apply an outlier detection method to the embedding to predict the vulnerable functions. We extensively evaluated the proposed method on seven C/C++ program datasets. The results show that our method improves the F1 score by 88% and 66% on average over the comparison methods Rats and Joern, respectively, which verifies its effectiveness.
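The three steps described in the abstract (feature extraction and concatenation, autoencoder embedding, outlier detection) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy functions, the vocabulary, the use of token counts and call-graph degrees as features, the single-hidden-layer autoencoder standing in for the deep autoencoder, and the distance-from-centroid outlier score are all assumptions made for illustration only.

```python
import math
import random
import re
from collections import Counter

# Hypothetical toy input: function name -> (source text, called function names).
FUNCS = {
    "copy_name": ("void copy_name(char *d, char *s) { strcpy(d, s); }", ["strcpy"]),
    "safe_copy": ("void safe_copy(char *d, char *s, int n) { strncpy(d, s, n); }", ["strncpy"]),
    "add": ("int add(int a, int b) { return a + b; }", []),
    "sub": ("int sub(int a, int b) { return a - b; }", []),
    "calc": ("int calc(int a, int b) { return add(a, b) + sub(a, b); }", ["add", "sub"]),
    "main": ("int main() { char b[8]; copy_name(b, gets(b)); return calc(1, 2); }",
             ["copy_name", "gets", "calc"]),
}
VOCAB = ["char", "int", "return", "strcpy", "strncpy", "gets"]  # assumed vocabulary

def text_features(source):
    """Text feature (assumed form): count of each vocabulary token."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", source))
    return [float(counts[t]) for t in VOCAB]

def dependency_features(name):
    """Dependency feature (assumed form): call-graph out-degree and in-degree."""
    out_degree = len(FUNCS[name][1])
    in_degree = sum(name in callees for _, callees in FUNCS.values())
    return [float(out_degree), float(in_degree)]

def scale_columns(rows):
    """Divide each column by its maximum so all features lie in [0, 1]."""
    maxima = [max(col) or 1.0 for col in zip(*rows)]
    return [[v / m for v, m in zip(row, maxima)] for row in rows]

def train_autoencoder(rows, hidden=2, epochs=500, lr=0.01, seed=0):
    """Tiny tanh encoder / linear decoder trained by SGD on squared error."""
    rng = random.Random(seed)
    d = len(rows[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden)]

    def encode(x):
        return [math.tanh(sum(x[i] * w_enc[i][j] for i in range(d)))
                for j in range(hidden)]

    def decode(h):
        return [sum(h[j] * w_dec[j][i] for j in range(hidden)) for i in range(d)]

    def loss():
        return sum(sum((y - v) ** 2 for v, y in zip(r, decode(encode(r))))
                   for r in rows) / len(rows)

    initial = loss()
    for _ in range(epochs):
        for x in rows:
            h = encode(x)
            err = [y - v for y, v in zip(decode(h), x)]  # reconstruction error
            # Backpropagate through the decoder, then through tanh
            # (the constant factor 2 is absorbed into the learning rate).
            grad_h = [sum(err[i] * w_dec[j][i] for i in range(d)) * (1 - h[j] ** 2)
                      for j in range(hidden)]
            for j in range(hidden):
                for i in range(d):
                    w_dec[j][i] -= lr * err[i] * h[j]
                    w_enc[i][j] -= lr * grad_h[j] * x[i]
    return encode, initial, loss()

def outlier_scores(embeddings):
    """Outlier score (assumed form): distance from the embedding centroid."""
    k = len(embeddings[0])
    center = [sum(e[j] for e in embeddings) / len(embeddings) for j in range(k)]
    return [math.dist(e, center) for e in embeddings]

names = list(FUNCS)
# Step 1: extract both feature types and concatenate them per function.
combined = scale_columns(
    [text_features(FUNCS[n][0]) + dependency_features(n) for n in names])
# Step 2: learn a low-dimensional embedding of the combined feature.
encode, loss_before, loss_after = train_autoencoder(combined)
# Step 3: score each function; higher scores are more anomalous.
scores = outlier_scores([encode(x) for x in combined])
for name, score in sorted(zip(names, scores), key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

In this sketch the functions with the highest scores would be the candidates reported as potentially vulnerable. A realistic setup would likely use a deeper network and an established outlier detector, and would choose the reporting threshold on held-out data.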

Data Availability

Enquiries about data availability should be directed to the authors.

Notes

  1. https://cve.mitre.org/cve/data_feeds.html.

  2. https://www.microfocus.com/en-us/assets/cyberres/application-security-risk-report.

  3. https://samate.nist.gov/SRD/.

  4. https://nvd.nist.gov/.

  5. http://cwe.mitre.org/.

  6. https://code.google.com/archive/p/rough-auditing-tool-for-security/downloads.

  7. https://joern.io/.

Acknowledgements

This work was supported by the Natural Science Foundation of YunNan Provincial Department of Education (2019J0942).

Funding

The authors have not disclosed any funding.

Author information

Contributions

WX was involved in conceptualization, methodology, experiments, and writing (original draft). TL was involved in writing (review and editing). JW was involved in writing (review and editing). YT was involved in data curation, resources, and writing (review).

Corresponding author

Correspondence to Wenlin Xu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This material is the authors’ own original work, which has not been previously published elsewhere. The paper is not currently being considered for publication elsewhere. The paper reflects the authors’ own research and analysis in a truthful and complete manner.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Xu, W., Li, T., Wang, J. et al. Detecting vulnerable software functions via text and dependency features. Soft Comput 27, 5425–5435 (2023). https://doi.org/10.1007/s00500-022-07775-5

  • DOI: https://doi.org/10.1007/s00500-022-07775-5
