Cross-project code clones in GitHub


Code reuse has well-known benefits for code quality, coding efficiency, and maintenance. Open Source Software (OSS) programmers gladly share their own code and happily reuse others’. Social programming platforms like GitHub have normalized code foraging via their common platforms, enabling code search and reuse across different projects. Removing project borders may facilitate more efficient code foraging, and consequently faster programming. But looking for code across projects takes longer and, once found, the code may be more challenging to tailor to one’s needs. Learning how much code reuse goes on across projects, and identifying emerging patterns in past cross-project search behavior, may help future foraging efforts. Our contribution is twofold. First, to understand cross-project code reuse, we present an in-depth empirical study of cloning in GitHub. Using Deckard, a popular clone finding tool, we identified copies of code fragments across projects and investigated their prevalence and characteristics using statistical and network science approaches, and with multiple case studies. By triangulating findings from different analysis methods, we find that cross-project cloning is prevalent in GitHub, ranging from cloning a few lines of code to whole project repositories. Some of the projects serve as popular sources of clones, and others seem to contain more clones than their fair share. Moreover, we find that ecosystem cloning follows an onion model: most clones come from the same project, then from projects in the same application domain, and finally from projects in different domains. Second, we utilized these results to develop a novel tool named CLONE-HUNTRESS that streamlines finding and tracking code clones in GitHub. The tool is GitHub integrated, built around a user-friendly interface, and runs efficiently over a modern database system. We describe the tool and make it publicly available at



  1. See StackExchange question:

  3. Since shorter exact clones can capture, to some extent, more variability during longer code evolution.

  4. All these operations are done using MySQL Server and SQL queries.

  5. The project domain identification process was implemented with Python.

  15. We used R and the “igraph” package for all graph constructions, comparisons and analyses.

  16. The graphs are created using Gephi.

  17. We also investigated the correlation between clone density and project size. The results were similar to those in Fig. 2.

  18. This number is among the set of the projects that had any clones at all. So the total sum of all domain sizes adds up to the first row numbers of Table 3.

  19. This number is derived from the implementation of queries described in Section 4.1 and using GHTorrent’s 2018-04-01 dump of GitHub projects.




We thank Prof. Prem Devanbu and members of the DECAL lab at UC Davis for valuable discussions. We also thank Mr. Seyed Mohammad Masoud Sadrnezhaad for his help in updating CLONE-HUNTRESS’s database.

Author information



Corresponding author

Correspondence to Mohammad Gharehyazie.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by: Abram Hindle and Lin Tan

Appendix A: CLONE-HUNTRESS tool description and use


Here we describe CLONE-HUNTRESS, our online tool for (1) identifying clones between a user-selected source project and a target list of Java-based GitHub projects, and (2) tracking changes to the clones over time. Our design goal was to provide a GitHub-integrated, comprehensive, and efficient tool that users can interact with transparently, without needing to deal with the mechanics of the clone search process. We wanted users to be able to come back to the tool over time and monitor the changes to the cloned code. The tool is available at

Finding clones among the many projects that exist in GitHub is very time consuming, and becomes computationally infeasible when constrained by a reasonable response time limit. Also, as per our findings in the main text of this paper, clones are often found between projects in the same domain. Hence, to speed up the search, CLONE-HUNTRESS allows users to search and track clones between projects in the same domain.

We selected 39,422 Java-based GitHub projects as an initial preset list, which will grow over time through automatic addition of users’ projects. This number is derived from implementing the queries described in Section 4.1 and applying them to the April 1st, 2018 GHTorrent MySQL dump. In other words, we selected Java projects that had at least 2 developers, were at least 1 year old, and had more than 10 commits. We also eliminated projects that were forked.
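The selection criteria above can be sketched as a simple predicate. This is only an illustration, not the queries from Section 4.1: the `Project` record and its field names are hypothetical, and "at least 1 year old" is approximated here as 365 days before the dump date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Project:
    # Hypothetical record; the field names are assumptions, not the GHTorrent schema.
    name: str
    language: str
    developers: int
    created: date
    commits: int
    is_fork: bool


# Date of the GHTorrent dump named in the text.
SNAPSHOT = date(2018, 4, 1)


def is_eligible(p: Project, snapshot: date = SNAPSHOT) -> bool:
    """Return True if the project meets the stated selection criteria."""
    age_days = (snapshot - p.created).days
    return (
        p.language == "Java"
        and p.developers >= 2      # at least 2 developers
        and age_days >= 365        # at least 1 year old (approximation)
        and p.commits > 10         # more than 10 commits
        and not p.is_fork          # forked projects are eliminated
    )
```

A project list would then be reduced with something like `[p for p in projects if is_eligible(p)]`.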

The front page of the tool is shown in Fig. 12.

Fig. 12

Users of the tool will encounter the front page when they access the tool link

A.1 Login, registration and settings

CLONE-HUNTRESS is GitHub integrated. To use CLONE-HUNTRESS, a user must first authenticate through GitHub. Once authenticated, CLONE-HUNTRESS automatically pulls the list of the user’s publicly available projects and adds them to their profile within the tool. Users can choose one of these projects, or add other projects manually, as described later, as the source project for clone detection.

By clicking on the user’s GitHub name, email, or avatar on the dashboard, the Profile page is shown, where users can change the tool’s tracking frequency settings. As shown in Fig. 13, there are two options that govern CLONE-HUNTRESS’s behavior. The first one is the update frequency of the tracked clones. This frequency determines how often the tool should update the changes that are taking place on the tracked clone code. The second one is the frequency at which clone detection is executed from scratch. This option exists because after a sufficiently long time, many of the tracked clones may change via commits, and thus may not be similar anymore to the original clone in the user’s project.

Fig. 13

Code clone and project update tracking frequencies can be changed on this page. It is accessible from the user’s dashboard, by clicking on the user’s GitHub name
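To make the interplay of the two frequency settings concrete, the following is a minimal sketch of how a scheduler might decide which action is due. It is an assumption about one possible design, not the tool's actual implementation; `TrackingSettings` and `due_actions` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TrackingSettings:
    # Hypothetical container for the two options on the Profile page.
    update_every: timedelta     # how often tracked clones are checked for changes
    redetect_every: timedelta   # how often clone detection is re-run from scratch


def due_actions(settings: TrackingSettings,
                last_update: datetime,
                last_detection: datetime,
                now: datetime) -> List[str]:
    """List the actions whose interval has elapsed at time `now`."""
    actions = []
    if now - last_update >= settings.update_every:
        actions.append("update")
    if now - last_detection >= settings.redetect_every:
        actions.append("redetect")
    return actions
```

The two checks are kept independent here; whether a full re-detection should also count as an update is a design choice the text does not specify.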

A.2 Detecting and tracking clones

The main functionality of CLONE-HUNTRESS, i.e., tracking clones, is accessible through the “Add project” button in the top right corner of the dashboard (Fig. 14), which redirects the user to the corresponding page (Fig. 15, top) where users can select a project from their list of GitHub projects. In addition to the list of the user’s GitHub projects, any other project of interest can be selected as the source by providing its URL directly, as illustrated in Fig. 15 (top). Once a project is specified, the tool will ask for the project’s application domain, and once it is specified and “Get projects” is pressed, it will present a list of all projects (within its current project list) in that application domain (Fig. 15, bottom).

Fig. 14

The main tool dashboard. The results of all clone detections are shown here as a list which allows navigation to all of the tracked clone instances and change reports

Fig. 15

Top: Users can select from their own GitHub projects or any other GitHub project by providing its URL. Bottom: The tool proposes a list of target projects to the user

Users can select up to 20 target projects from the given list, to detect clones between them and the source project. This limit is imposed for two reasons: (1) hardware resource limitations and response time limits, and (2) the fact that tracking a large number of projects eventually leads to confusion rather than providing benefits. Users are also able to add any other GitHub project to the target list by specifying the project link directly, using the “Add other project” button below the list, as illustrated in Fig. 16. The target list can be reset to its original form using the “Reset project list” button at the bottom of the list.

Fig. 16

Users can add projects to the list directly via their URLs
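The target selection step described above (same-domain candidates, manually added projects, and the cap of 20) can be sketched as follows. The function names and the in-memory `projects_by_domain` mapping are hypothetical; the tool itself runs over a database, so this only illustrates the logic under those assumptions.

```python
from typing import Dict, List

MAX_TARGETS = 20  # limit on target projects stated in the text


def candidate_targets(projects_by_domain: Dict[str, List[str]],
                      source: str, domain: str) -> List[str]:
    """Propose all known projects in the given domain, excluding the source."""
    return [p for p in projects_by_domain.get(domain, []) if p != source]


def validate_targets(selected: List[str], source: str,
                     max_targets: int = MAX_TARGETS) -> List[str]:
    """Check a user's selection (which may include manually added projects)."""
    unique = list(dict.fromkeys(selected))  # drop duplicates, keep order
    if not unique:
        raise ValueError("select at least one target project")
    if source in unique:
        raise ValueError("the source project cannot also be a target")
    if len(unique) > max_targets:
        raise ValueError(f"at most {max_targets} target projects are allowed")
    return unique
```

Rejecting an oversized selection, rather than silently truncating it, keeps the user in control of which projects are compared.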

With the source and target projects chosen, clone detection is initiated by pressing the “Detect-Clones” button at the bottom of the page. It could take the tool a few minutes to show the results of clone detection. When done, CLONE-HUNTRESS will redirect the user to the result page, which will resemble Fig. 17. If any clones are found, the results will show the clone instances from the source project and those from the target projects.

Fig. 17

Clone detection results are shown as a list of trackable clone instances

Users can choose to track any clone instances they want by selecting them and clicking on the “Save and track” button, and over time see the changes that occur in the selected instances. The number of trackable clones is limited: users can track up to 20 clone instances, for the reasons given above. After choosing some clone instances to track, users are returned to their dashboard. Every clone detection that the user has run is displayed as a row in a table on the dashboard page, as shown in Fig. 14.

A.3 Tracking reports

CLONE-HUNTRESS provides View, Edit, and Delete functions in each row of the clone detection table (see the buttons in the ACTION column in Fig. 14). The View button reports the changes made to the respective clone instances. Our tool checks at pre-specified intervals whether or not the clone instances have changed, and if so, the number of changes is displayed as a notification on the View button. The intervals are determined by the update frequency of tracked clones, set on the Profile page, as mentioned before. Clicking on the View button redirects users to an “Alerts and Reports” page for that clone, similar to Fig. 18. There, clones from the user’s source project are shown, and below each are the tracked clone instances and links to the actual code. Changed clone instances are marked, and users can visit the changed files. Users can also untrack any clones or clone instances from this page.

Fig. 18

The Alerts And Reports page
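The periodic change check can be illustrated with a small sketch. The text does not say how CLONE-HUNTRESS decides that a clone instance has changed; comparing content fingerprints, as done here, is an assumption made for illustration, and the instance identifiers and function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(code: str) -> str:
    """Hash a code fragment, ignoring trailing whitespace on each line."""
    normalized = "\n".join(line.rstrip() for line in code.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_instances(saved: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return the tracked instances whose code differs from the saved fingerprint.

    saved:   instance id -> fingerprint recorded at the previous check
    current: instance id -> code text fetched now (missing if the code is gone)
    """
    changed = []
    for instance_id, old_fp in saved.items():
        code = current.get(instance_id)
        if code is None or fingerprint(code) != old_fp:
            changed.append(instance_id)
    return changed
```

Under this sketch, the notification on the View button would show `len(changed_instances(saved, current))`, counting changed instances rather than individual commits.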

Edit directs users to a page similar to the first page of the process (Fig. 19), where users can repeat the steps of clone detection. The tool shows them all the steps they have already taken, and they can change anything they want and re-run clone detection. Through the Delete button, the corresponding entry can be deleted, and the results of clone detection for that specific project will then disappear.

Fig. 19

Clone detections that have already been completed can be edited by the user

A.4 Future improvements

While we have tried our best to provide a polished and useful product, there are many ways in which our tool can be improved. First and foremost, its hardware resources should be improved so that clone detection and checking for updates take less time and users are able to check for clones across more projects. The second area of improvement is to provide documentation for, and access to, CLONE-HUNTRESS’s web services, so that other developers may integrate its functionality into other tools and environments such as Eclipse.


Cite this article

Gharehyazie, M., Ray, B., Keshani, M. et al. Cross-project code clones in GitHub. Empir Software Eng 24, 1538–1573 (2019).



  • Clone detection
  • Cross-project cloning
  • Deckard
  • GitHub