Abstract
In the age of big data and machine learning, software development techniques and methods are evolving rapidly, and programmers can no longer detect all the security flaws and vulnerabilities in their code manually. To overcome this problem, developers can rely on automatic techniques, such as machine-learning-based prediction models, to detect such issues. An inherent property of these approaches is that they take numeric vectors (i.e., feature vectors) as input. The source code must therefore be transformed into such feature vectors, a step often referred to as code embedding. A popular approach to code embedding is to adapt natural language processing techniques, such as text representation, to derive the necessary features from the source code automatically. However, the suitability of different text representation techniques for solving Software Engineering (SE) problems is rarely studied or compared systematically. In this paper, we present a comparative study of three popular text representation methods, word2vec, fastText, and BERT, applied to the SE task of detecting vulnerabilities in Python code. Using a data mining approach, we collected a large volume of Python source code in both vulnerable and fixed forms, embedded it into vectors with word2vec, fastText, and BERT, and trained a Long Short-Term Memory (LSTM) network on the resulting vectors. Because the LSTM architecture was the same in every case, we could compare how well the different embeddings derive meaningful feature vectors. Our findings show that all three text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising: it is the least time-consuming, and the LSTM model based on it achieved the best overall accuracy (93.8%) in predicting Python source code vulnerabilities.
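To make the notion of turning source code into numeric input concrete, the following is a minimal sketch of the preprocessing that typically precedes an embedding layer. It is not the pipeline used in the paper. It only tokenizes Python code with the standard library and maps each token to an integer id, whereas word2vec, fastText, or BERT would map tokens to learned dense vectors. The two code snippets, the helper names (`code_tokens`, `build_vocab`, `encode`), and the sequence length of 16 are assumptions chosen for illustration.

```python
import io
import tokenize

# Token types that carry layout or comments rather than code content.
SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENCODING, tokenize.ENDMARKER}


def code_tokens(source):
    """Split Python source into lexical tokens, dropping comments and layout."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.type not in SKIP]


def build_vocab(token_lists):
    """Assign an integer id to every distinct token (0 = padding, 1 = unknown)."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, length):
    """Turn a token list into a fixed-length list of ids, truncating or padding."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))


# Hypothetical vulnerable/fixed pair (string-concatenated vs. parameterized SQL).
vulnerable = 'cursor.execute("SELECT * FROM users WHERE id = " + user_id)\n'
fixed = 'cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))\n'

corpus = [code_tokens(vulnerable), code_tokens(fixed)]
vocab = build_vocab(corpus)
vectors = [encode(tokens, vocab, length=16) for tokens in corpus]
```

Each entry of `vectors` is a fixed-length integer sequence of the kind a sequence model such as an LSTM can consume once an embedding replaces the ids with dense vectors. The choice of embedding is the variable the study compares.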
Notes
- 3. Project no. 2018-1.2.1-NKP-2018-00004 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the 2018-1.2.1-NKP funding scheme.
Acknowledgment
The presented work was carried out within the SETIT Project (2018-1.2.1-NKP-2018-00004; see Note 3) and supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program (MILAB). The research was partly supported by the EU-funded project AssureMOSS (Grant no. 952647).
Furthermore, Péter Hegedűs was supported by the Bolyai János Scholarship of the Hungarian Academy of Sciences and the ÚNKP-20-5-SZTE-650 New National Excellence Program of the Ministry for Innovation and Technology.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bagheri, A., Hegedűs, P. (2021). A Comparison of Different Source Code Representation Methods for Vulnerability Prediction in Python. In: Paiva, A.C.R., Cavalli, A.R., Ventura Martins, P., Pérez-Castillo, R. (eds) Quality of Information and Communications Technology. QUATIC 2021. Communications in Computer and Information Science, vol 1439. Springer, Cham. https://doi.org/10.1007/978-3-030-85347-1_20
DOI: https://doi.org/10.1007/978-3-030-85347-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85346-4
Online ISBN: 978-3-030-85347-1
eBook Packages: Computer Science, Computer Science (R0)