
Clones in deep learning code: what, where, and why?

Published in: Empirical Software Engineering

Abstract

Deep learning applications are becoming increasingly popular worldwide. Developers of deep learning systems, like developers in every other software development context, strive to write code that is efficient in terms of performance, complexity, and maintenance. The continuous evolution of deep learning systems, which imposes tighter development timelines, together with their increasing complexity, may lead developers to make bad design decisions. Moreover, because they rely on common frameworks and repeatedly implement similar tasks, deep learning developers are likely to copy-paste code, leading to clones in deep learning code. Code cloning is considered a bad software development practice, since developers can inadvertently fail to properly propagate changes to all clone fragments during a maintenance activity. However, to the best of our knowledge, no study has investigated code cloning practices in deep learning development; the majority of research on deep learning systems focuses on improving the dependability of the models. Given the negative impacts of clones on software quality reported in studies of traditional systems, and the inherent complexity of maintaining deep learning systems (e.g., bug fixing), it is very important to understand the characteristics and potential impacts of code clones on deep learning systems. This paper examines the frequency, distribution, and impacts of code clones, and the code cloning practices, in deep learning systems. To accomplish this, we use the NiCad clone detection tool to detect clones in 59 Python, 14 C#, and 6 Java deep learning systems, and in an equal number of traditional software systems. We then compare the frequency and distribution of code clones in deep learning systems and traditional ones. Further, we study the distribution of the detected code clones using a location-based taxonomy. In addition, we study the correlation between bugs and code clones to assess the impact of clones on the quality of the studied systems. Finally, we introduce a code clone taxonomy for deep learning programs, based on 6 of the 59 DL systems, and identify the deep learning development phases in which cloning carries the highest risk of faults. Our results show that code cloning is a frequent practice in deep learning systems and that deep learning developers often clone code from files located in distant directories of the system. We also found that code cloning occurs more frequently during DL model construction, model training, and data pre-processing, and that hyperparameter setting is the phase of deep learning model construction during which cloning is the riskiest, since it most often leads to faults.




Acknowledgements

This work is supported by the Fonds de Recherche du Québec (FRQ) and the Natural Sciences and Engineering Research Council of Canada (NSERC). We would like to thank Dr. Amin Nikanjam for his valuable comments on the manuscript.

Author information


Corresponding author

Correspondence to Hadhemi Jebnoun.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Denys Poshyvanyk

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Study Design

Table 22 shows the name, URL, number of source lines of code (SLOC), number of commits, and size of each of the 6 selected DL repositories.

Table 22 Details of the 6 analyzed DL repositories

Appendix B: RQ1 Additional Results

B.1 Results of Clone Detection Using a Threshold of 20%

In this section, we provide additional results for Java and C# obtained with a dissimilarity threshold of 20%, in order to explore the impact of the threshold on clone detection (our main analysis uses a 30% threshold). Figure 29 shows the code clone occurrences in DL and traditional Java projects for both clone granularities. Figure 30 shows the same analysis for C# projects.
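The density measure underlying these figures can be sketched as follows. The project names and counts below are hypothetical; in the study, LOCC values come from NiCad's clone reports and SLOC from a line counter:

```python
# Hypothetical per-project measurements. In the study, LOCC (Lines Of
# Code Clones) comes from NiCad's clone reports and SLOC (Source Lines
# Of Code) from a line counter; the values here are illustrative only.
projects = {
    "dl_project":   {"locc": 1200, "sloc": 48000},
    "trad_project": {"locc": 500,  "sloc": 52000},
}

def clone_density(locc, sloc):
    """Fraction of source lines that belong to detected clone fragments."""
    return locc / sloc

densities = {name: clone_density(m["locc"], m["sloc"])
             for name, m in projects.items()}
print(densities["dl_project"])  # 0.025
```

Comparing this ratio across DL and traditional projects, at each granularity and threshold, is what the figures visualize.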

Fig. 29
figure 29

Code Clone Occurrences in DL and Traditional Java Projects Using a 20% Threshold for Both Clone Granularities: (a) Function, (b) Block. LOCC: Lines of Code Clones, SLOC: Source Lines of Code

Fig. 30
figure 30

Code Clone Occurrences in DL and Traditional C# Projects Using a 20% Threshold for Both Clone Granularities: (a) Function, (b) Block. LOCC: Lines of Code Clones, SLOC: Source Lines of Code

We further extend our analysis by comparing clone types. Figures 31 and 32 illustrate the clone density in DL and traditional projects by clone type and granularity for Java and C#, respectively.

Fig. 31
figure 31

Clone Density in DL and Traditional Java Projects by Clone Type and Granularity Using a 20% Threshold. LOCC: Lines of Code Clones

Fig. 32
figure 32

Clone Density in DL and Traditional C# Projects by Clone Type and Granularity Using a 20% Threshold. LOCC: Lines of Code Clones

Appendix C: RQ2 Additional Results

C.1 Results for Other Programming Languages

We study the distribution of the different clone types by clone location in DL and traditional code in Java projects (Fig. 33) and in C# projects (Fig. 34).

Fig. 33
figure 33

Distribution of Different Types of Clones by Clone Location in DL and Traditional Code (Java)

Fig. 34
figure 34

Distribution of Different Types of Clones by Clone Location in DL and Traditional Code (C#)

C.2 Results of Clone Detection Using a Threshold of 20%

In this section, we present the additional analysis we performed to address RQ2. We examine the code clone distribution by location in DL and traditional Java (Fig. 35) and C# (Fig. 36) systems, using a 20% dissimilarity threshold.
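The per-location percentage reported in these figures, (LOCC / total LOCC) × 100, can be computed as in this minimal sketch; the location counts are hypothetical:

```python
# Hypothetical LOCC counts per clone location, following the paper's
# location-based taxonomy (same file / same directory / different directories).
locc_by_location = {
    "same file": 300,
    "same directory": 450,
    "different directories": 750,
}

total_locc = sum(locc_by_location.values())
# Multiply before dividing to keep the percentages exact for integer counts.
percentages = {loc: locc * 100 / total_locc
               for loc, locc in locc_by_location.items()}
print(percentages)
# {'same file': 20.0, 'same directory': 30.0, 'different directories': 50.0}
```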

Fig. 35
figure 35

Code Clone Distribution by Location in DL and Traditional Java Systems Using a 20% Threshold, as a Percentage of Lines of Code Clones (LOCC), i.e., (LOCC / total LOCC) × 100

Fig. 36
figure 36

Code Clone Distribution by Location in DL and Traditional C# Systems Using a 20% Threshold, as a Percentage of Lines of Code Clones (LOCC), i.e., (LOCC / total LOCC) × 100

We further study the percentage of the average number of code clone fragments by clone location in both deep learning and traditional systems, using a 20% dissimilarity threshold, for Java (Fig. 37) and C# (Fig. 38).

Fig. 37
figure 37

Percentage of the Average Number of Code Clone Fragments by Clone Location in Deep Learning and Traditional Java Systems Using a 20% Threshold

Fig. 38
figure 38

Percentage of the Average Number of Code Clone Fragments by Clone Location in Deep Learning and Traditional C# Systems Using a 20% Threshold

We then study the distribution of the different types of clones across the clone locations (same file, same directory, and different directories), using a 20% dissimilarity threshold, for Java (Fig. 39) and C# (Fig. 40).

Fig. 39
figure 39

Distribution of Different Types of Clones by Clone Location in DL and Traditional Code (Java) Using a 20% Threshold

Fig. 40
figure 40

Distribution of Different Types of Clones by Clone Location in DL and Traditional Code (C#) Using a 20% Threshold

Appendix D: RQ3 Additional Results

In this section, we provide additional analysis of the distribution of the sizes of cloned and non-cloned functions in DL and traditional systems (Fig. 41). This is done to understand whether size plays an important confounding role in identifying bug-fixing commits related to clones. We study the distribution of the mean size of cloned and non-cloned functions per system in DL and traditional Python projects (Fig. 42), and in Java and C# projects (Fig. 43).
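When comparing two such size distributions, a nonparametric effect size like Cliff's delta is a natural choice. The following is a minimal sketch with hypothetical function sizes:

```python
def cliffs_delta(xs, ys):
    """Cliff's delta effect size: P(x > y) - P(x < y) over all pairs.

    Ranges from -1 to 1; values near 0 indicate largely overlapping
    distributions (e.g., similar sizes of cloned and non-cloned functions).
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Hypothetical function sizes (in lines) for cloned vs. non-cloned functions.
cloned_sizes = [12, 18, 25, 30, 41]
non_cloned_sizes = [10, 15, 22, 28, 35]
print(cliffs_delta(cloned_sizes, non_cloned_sizes))  # 0.2
```

This quadratic-time formulation is fine for per-system samples; for very large samples an implementation based on rank sums would be preferable.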

Fig. 41
figure 41

Distribution of the Size of Cloned and Non-cloned Functions in DL and Traditional Systems

Fig. 42
figure 42

Distribution of Mean Size of Cloned and Non-cloned Functions Per System in DL and Traditional Systems

Fig. 43
figure 43

Distribution of Mean Size of Cloned and Non-cloned Functions Per System in Java and C# DL Systems

Appendix E: RQ4 Additional Results

Complementing the percentages reported in RQ4, Table 23 shows the absolute number of code clones attributed to each DL phase. The total number of manually analyzed code clones is 595.

Table 23 Total number of occurrences of code clones in DL phases


About this article


Cite this article

Jebnoun, H., Rahman, M.S., Khomh, F. et al. Clones in deep learning code: what, where, and why?. Empir Software Eng 27, 84 (2022). https://doi.org/10.1007/s10664-021-10099-x
