A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python

Conference paper in: Quality of Information and Communications Technology (QUATIC 2021)

Abstract

In the age of big data and machine learning, with software development techniques and methods evolving rapidly, programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can rely on automatic techniques, such as machine learning-based prediction models, to detect such issues. An inherent property of these approaches is that they take numeric vectors (i.e., feature vectors) as input. The source code must therefore be transformed into such feature vectors, a step often referred to as code embedding. A popular approach to code embedding is to adapt natural language processing techniques, such as text representation, to derive the necessary features from the source code automatically. However, the suitability of different text representation techniques for Software Engineering (SE) problems is rarely studied or compared systematically. In this paper, we present a comparative study of three popular text representation methods (word2vec, fastText, and BERT) applied to the SE task of detecting vulnerabilities in Python code. Using a data mining approach, we collected a large volume of Python source code in both vulnerable and fixed forms, embedded it into vectors with word2vec, fastText, and BERT, and trained a Long Short-Term Memory (LSTM) network on the resulting vectors. Because the LSTM architecture was the same in every case, we could compare how efficiently the different embeddings derive meaningful feature vectors. Our findings show that all three text representation methods are suitable for code representation in this particular task, but BERT is the most promising: it is the least time-consuming, and the LSTM model based on it achieved the best overall accuracy (93.8%) in predicting Python source code vulnerabilities.
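
To make the code embedding step concrete, the following is a minimal sketch of how a Python snippet might be turned into a sequence of feature vectors. It is not the pipeline used in the paper. The function names (code_tokens, build_embedding_table, embed), the sample snippet, and the vector size are all hypothetical, and the randomly initialised lookup table is only a stand-in for a trained word2vec, fastText, or BERT model. Only the Python standard library is assumed.

```python
import io
import random
import tokenize

# Token types that carry layout rather than content; dropping them is an
# assumption of this sketch, not a step confirmed by the paper.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def code_tokens(source):
    """Split Python source into lexical tokens using the stdlib tokenizer."""
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in _SKIPPED]


def build_embedding_table(tokens, dim=8, seed=0):
    """Assign each distinct token a random vector.

    Placeholder only: a real study would train word2vec, fastText, or BERT
    here instead of sampling random numbers.
    """
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for tok in sorted(set(tokens))}


def embed(tokens, table, dim=8):
    """Map a token sequence to a sequence of vectors (zeros for unknown tokens)."""
    unknown = [0.0] * dim
    return [table.get(tok, unknown) for tok in tokens]


# Hypothetical snippet resembling an SQL injection pattern.
SAMPLE = (
    "query = 'SELECT * FROM users WHERE id=' + user_id\n"
    "cursor.execute(query)\n"
)

tokens = code_tokens(SAMPLE)
table = build_embedding_table(tokens)
vectors = embed(tokens, table)
print(len(tokens), "tokens ->", len(vectors), "vectors of size", len(vectors[0]))
```

The resulting list of equally sized vectors is the kind of sequence input a recurrent model such as an LSTM consumes. Swapping the placeholder table for a trained embedding model changes the vector values but not the shape of the data.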

Notes

  1. https://doi.org/10.5281/zenodo.4703996

  2. https://radimrehurek.com/gensim/

  3. Project no. 2018-1.2.1-NKP-2018-00004 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the 2018-1.2.1-NKP funding scheme.

Acknowledgment

The presented work was carried out within the SETIT Project (2018-1.2.1-NKP-2018-00004; see Note 3) and supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program (MILAB). The research was partly supported by the EU-funded project AssureMOSS (Grant no. 952647).

Furthermore, Péter Hegedűs was supported by the Bolyai János Scholarship of the Hungarian Academy of Sciences and the ÚNKP-20-5-SZTE-650 New National Excellence Program of the Ministry for Innovation and Technology.

Corresponding author

Correspondence to Péter Hegedűs.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

Bagheri, A., Hegedűs, P. (2021). A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python. In: Paiva, A.C.R., Cavalli, A.R., Ventura Martins, P., Pérez-Castillo, R. (eds) Quality of Information and Communications Technology. QUATIC 2021. Communications in Computer and Information Science, vol 1439. Springer, Cham. https://doi.org/10.1007/978-3-030-85347-1_20

  • DOI: https://doi.org/10.1007/978-3-030-85347-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85346-4

  • Online ISBN: 978-3-030-85347-1

  • eBook Packages: Computer Science (R0)
