
Is GitHub’s Copilot as bad as humans at introducing vulnerabilities in code?

Empirical Software Engineering

Abstract

Several advances in deep learning have been successfully applied to the software development process. Of recent interest is the use of neural language models to build tools, such as Copilot, that assist in writing code. In this paper we perform a comparative empirical analysis of Copilot-generated code from a security perspective. The aim of this study is to determine whether Copilot is just as likely as human developers to introduce the same software vulnerabilities. Using a dataset of C/C++ vulnerabilities, we prompt Copilot to generate suggestions in scenarios that led to the introduction of vulnerabilities by human developers. The suggestions are inspected and categorized in a two-stage process based on whether the original vulnerability or its fix is reintroduced. We find that Copilot replicates the original vulnerable code about 33% of the time, while replicating the fixed code at a 25% rate. However, this behaviour is not consistent: Copilot is more likely to introduce some types of vulnerabilities than others, and is more likely to generate vulnerable code in response to prompts that correspond to older vulnerabilities. Overall, given that in a significant number of cases it did not replicate the vulnerabilities previously introduced by human developers, we conclude that Copilot, despite performing differently across vulnerability types, is not as bad as human developers at introducing vulnerabilities in code.
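The two-stage categorization described above can be illustrated with a small screening sketch: compare each Copilot suggestion against both the known vulnerable hunk and its fix, and label the suggestion by whichever it more closely resembles. This is a hypothetical illustration, not the paper's actual tooling; the `classify_suggestion` helper and the 0.6 similarity threshold are assumptions.

```python
import difflib

def classify_suggestion(suggestion, vulnerable_code, fixed_code, threshold=0.6):
    """Label a suggestion by its textual similarity to the known
    vulnerable hunk versus the known fixed hunk.

    Returns "vulnerable", "fixed", or "other". A manual review pass
    would still be needed to confirm borderline cases.
    """
    sim_vuln = difflib.SequenceMatcher(None, suggestion, vulnerable_code).ratio()
    sim_fix = difflib.SequenceMatcher(None, suggestion, fixed_code).ratio()
    if max(sim_vuln, sim_fix) < threshold:
        return "other"  # resembles neither the vulnerability nor the fix
    return "vulnerable" if sim_vuln >= sim_fix else "fixed"

# Example: a suggestion that reproduces an unbounded strcpy is flagged
# as replicating the original vulnerability.
vuln = "strcpy(buf, user_input);"
fix = "strncpy(buf, user_input, sizeof(buf) - 1);"
print(classify_suggestion("strcpy(buf, user_input);", vuln, fix))  # vulnerable
```

A similarity-based screen like this only automates the first pass; deciding whether a non-matching suggestion is itself secure requires inspection.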


Data Availability Statement

The dataset used in this study is available in the MSR_20_Code_vulnerability_CSV_Dataset repository: https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset


Author information

Corresponding author

Correspondence to Owura Asare.

Additional information

Communicated by: Jin Guo


This article belongs to the Topical Collection: Special Issue on Mining Software Repositories (MSR).

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Asare, O., Nagappan, M. & Asokan, N. Is GitHub’s Copilot as bad as humans at introducing vulnerabilities in code? Empir Software Eng 28, 129 (2023). https://doi.org/10.1007/s10664-023-10380-1

