A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

Bagheri, Amirreza; Hegedűs, Péter

doi:10.1007/978-3-030-85347-1_20


Part of the book series: Communications in Computer and Information Science (CCIS, volume 1439)

Included in the following conference series:

International Conference on the Quality of Information and Communications Technology


Abstract

In the age of big data and machine learning, with software development techniques and methods evolving rapidly, programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can rely on automatic techniques, such as machine-learning-based prediction models, to detect such issues. An inherent property of these approaches is that they take numeric vectors (i.e., feature vectors) as input. The source code must therefore be transformed into such feature vectors, a step often referred to as code embedding. A popular approach to code embedding is to adapt natural language processing techniques, such as text representation, to derive the necessary features from the source code automatically. However, the suitability of different text representation techniques for solving Software Engineering (SE) problems is rarely studied or compared systematically. In this paper, we present a comparative study of three popular text representation methods (word2vec, fastText, and BERT) applied to the SE task of detecting vulnerabilities in Python code. Using a data mining approach, we collected a large volume of Python source code in both vulnerable and fixed forms, embedded it into vectors with word2vec, fastText, and BERT, and trained a Long Short-Term Memory (LSTM) network on the resulting vectors. Because the same LSTM architecture was used throughout, we could compare the efficiency of the different embeddings in deriving meaningful feature vectors. Our findings show that all three text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising: it is the least time consuming, and the LSTM model based on it achieved the best overall accuracy (93.8%) in predicting Python source code vulnerabilities.
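The abstract does not spell out how source code is turned into the sequences that an embedding model consumes. As a rough illustration only, the sketch below assumes a token-level view of Python code: it splits source text into lexical tokens with the standard-library `tokenize` module, maps tokens to integer ids, and pads or truncates each sequence to a fixed length so that it could be fed to an embedding layer followed by an LSTM. The helper names (`code_tokens`, `build_vocab`, `encode`), the `<pad>`/`<unk>` conventions, and the sample snippets are hypothetical and are not taken from the paper.

```python
import io
import tokenize

# Token types that carry layout rather than code content (an assumption of
# this sketch; the paper's actual filtering is not described in the abstract).
_SKIP = {tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
         tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}


def code_tokens(source: str) -> list[str]:
    """Split Python source into lexical tokens, dropping layout-only tokens."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in _SKIP]


def build_vocab(token_lists: list[list[str]]) -> dict[str, int]:
    """Assign an integer id to every distinct token; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens: list[str], vocab: dict[str, int], length: int) -> list[int]:
    """Map tokens to ids, then truncate or right-pad to a fixed length."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in tokens][:length]
    return ids + [vocab["<pad>"]] * (length - len(ids))


# Hypothetical vulnerable/fixed pair (string concatenation vs. a
# parameterised query), standing in for mined code samples.
vulnerable = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)\n'
fixed = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))\n'

samples = [code_tokens(vulnerable), code_tokens(fixed)]
vocab = build_vocab(samples)
encoded = [encode(tokens, vocab, length=12) for tokens in samples]
```

In such a setup, the integer sequences in `encoded` would be replaced by (or looked up in) word2vec, fastText, or BERT vectors before reaching the LSTM; only that embedding step would differ between the compared configurations.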



Notes

1. https://doi.org/10.5281/zenodo.4703996.
2. https://radimrehurek.com/gensim/.
3. Project no. 2018-1.2.1-NKP-2018-00004 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the 2018-1.2.1-NKP funding scheme.


Acknowledgment

The presented work was carried out within the SETIT Project (2018-1.2.1-NKP-2018-00004; see Note 3) and supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program (MILAB). The research was partly supported by the EU-funded project AssureMOSS (Grant no. 952647).

Furthermore, Péter Hegedűs was supported by the Bolyai János Scholarship of the Hungarian Academy of Sciences and the ÚNKP-20-5-SZTE-650 New National Excellence Program of the Ministry for Innovation and Technology.

Author information

Authors and Affiliations

Software Engineering Department, University of Szeged, Szeged, Hungary
Amirreza Bagheri
MTA-SZTE Research Group on Artificial Intelligence, ELKH, Szeged, Hungary
Péter Hegedűs
FrontEndART Ltd., Szeged, Hungary
Péter Hegedűs


Corresponding author

Correspondence to Péter Hegedűs.

Editor information

Editors and Affiliations

Faculty of Engineering of the University of Porto, Porto, Portugal
Ana C. R. Paiva
Institut Polytechnique de Paris, Paris, France
Ana Rosa Cavalli
University of Algarve, Faro, Portugal
Paula Ventura Martins
University of Castilla-La Mancha, Ciudad Real, Spain
Ricardo Pérez-Castillo


About this paper

Cite this paper

Bagheri, A., Hegedűs, P. (2021). A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python. In: Paiva, A.C.R., Cavalli, A.R., Ventura Martins, P., Pérez-Castillo, R. (eds) Quality of Information and Communications Technology. QUATIC 2021. Communications in Computer and Information Science, vol 1439. Springer, Cham. https://doi.org/10.1007/978-3-030-85347-1_20


DOI: https://doi.org/10.1007/978-3-030-85347-1_20
Published: 25 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85346-4
Online ISBN: 978-3-030-85347-1
eBook Packages: Computer Science, Computer Science (R0)
