Abstract
Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper we perform a comparative empirical analysis of Copilot-generated code from a security perspective, with the aim of determining whether Copilot is just as likely as human developers to introduce the same software vulnerabilities. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that previously led human developers to introduce vulnerabilities. The suggestions are inspected and categorized in a two-stage process based on whether the original vulnerability or its fix is reintroduced. We find that Copilot replicates the original vulnerable code about 33% of the time, while replicating the fixed code at a 25% rate. However, this behaviour is not consistent: Copilot is more likely to introduce some types of vulnerabilities than others, and is also more likely to generate vulnerable code in response to prompts that correspond to older vulnerabilities. Overall, since in a significant number of cases Copilot did not replicate the vulnerabilities previously introduced by human developers, we conclude that, despite performing differently across vulnerability types, Copilot is not as bad as human developers at introducing vulnerabilities in code.
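The two-stage categorization described above can be illustrated with a minimal sketch. This is not the study's actual tooling: the function name, the similarity metric (Python's `difflib`), and the 0.9 threshold are all assumptions chosen for illustration, and the study's second stage (manual inspection of near-matches) is only noted in a comment, not automated.

```python
import difflib

def classify_suggestion(suggestion, vulnerable_code, fixed_code, threshold=0.9):
    """Hypothetical stage-1 filter: flag a Copilot suggestion that closely
    matches the known vulnerable or fixed code. Stage 2 in the study is a
    manual inspection of the flagged suggestions and is not automated here."""
    sim_vuln = difflib.SequenceMatcher(None, suggestion, vulnerable_code).ratio()
    sim_fix = difflib.SequenceMatcher(None, suggestion, fixed_code).ratio()
    if sim_vuln >= threshold and sim_vuln >= sim_fix:
        return "vulnerable"  # suggestion reintroduces the original bug
    if sim_fix >= threshold:
        return "fixed"       # suggestion reproduces the patched code
    return "other"           # neither; would need manual review

# Toy example: a classic unbounded copy and its bounded fix.
vuln = 'strcpy(buf, user_input);'
fix = 'strncpy(buf, user_input, sizeof(buf) - 1);'
print(classify_suggestion('strcpy(buf, user_input);', vuln, fix))  # vulnerable
```

A real pipeline would compare token sequences rather than raw strings and would need to handle whitespace and identifier renaming, which is why the study relies on manual inspection for its second stage.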
Data Availability Statement
The dataset used in this study is available in the MSR_20_Code_vulnerability_CSV_Dataset repository, https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset
Additional information
Communicated by: Jin Guo
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Mining Software Repositories (MSR).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Asare, O., Nagappan, M. & Asokan, N. Is GitHub’s Copilot as bad as humans at introducing vulnerabilities in code?. Empir Software Eng 28, 129 (2023). https://doi.org/10.1007/s10664-023-10380-1