Cross-project code clones in GitHub

  • Mohammad GharehyazieEmail author
  • Baishakhi Ray
  • Mehdi Keshani
  • Masoumeh Soleimani Zavosht
  • Abbas Heydarnoori
  • Vladimir Filkov


Code reuse has well-known benefits on code quality, coding efficiency, and maintenance. Open Source Software (OSS) programmers gladly share their own code and they happily reuse others’. Social programming platforms like GitHub have normalized code foraging via their common platforms, enabling code search and reuse across different projects. Removing project borders may facilitate more efficient code foraging, and consequently faster programming. But looking for code across projects takes longer and, once found, may be more challenging to tailor to one’s needs. Learning how much code reuse goes on across projects, and identifying emerging patterns in past cross-project search behavior may help future foraging efforts. Our contribution is two fold. First, to understand cross-project code reuse, here we present an in-depth empirical study of cloning in GitHub. Using Deckard, a popular clone finding tool, we identified copies of code fragments across projects, and investigate their prevalence and characteristics using statistical and network science approaches, and with multiple case studies. By triangulating findings from different analysis methods, we find that cross-project cloning is prevalent in GitHub, ranging from cloning few lines of code to whole project repositories. Some of the projects serve as popular sources of clones, and others seem to contain more clones than their fair share. Moreover, we find that ecosystem cloning follows an onion model: most clones come from the same project, then from projects in the same application domain, and finally from projects in different domains. Second, we utilized these results to develop a novel tool named CLONE-HUNTRESS that streamlines finding and tracking code clones in GitHub. The tool is GitHub integrated, built around a user-friendly interface and runs efficiently over a modern database system. We describe the tool and make it publicly available at


Clone detection Cross-project cloning Deckard GitHub 



We thank Prof. Prem Devanbu and members of the DECAL lab at UC Davis for valuable discussions. We also thank Mr. Seyed Mohammad Masoud Sadrnezhaad for his help in updating CLONE-HUNTRESS’s database.


  1. Al-Ekram R, Kapser C, Holt R, Godfrey M (2005) Cloning by accident: an empirical study of source code cloning across software systems. In: 2005 international symposium on Empirical software engineering. IEEE, pp 10–ppGoogle Scholar
  2. Bajracharya S, Ngo T, Linstead E, Dou Y, Rigor P, Baldi P, Lopes C (2006) Sourcerer: a search engine for open source code supporting structure-based search. In: Companion to the 21st ACM SIGPLAN symposium on object-oriented programming systems, languages, and applications. ACM, pp 681–682Google Scholar
  3. Barr ET, Brun Y, Devanbu P, Harman M, Sarro F (2014) The plastic surgery hypothesis. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, pp 306–317Google Scholar
  4. Bogdan V, Posnett D, Ray B, Brand Mvd, Filkov AS, Premkumar D, Filkov V (2015) Gender and tenure diversity in github teams. CHI ’15 ACMGoogle Scholar
  5. Dabbish L, Stuart C, Tsay J, Herbsleb J (2012) Social coding in github: transparency and collaboration in an open software repository. In: Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work. ACM, pp 1277–1286Google Scholar
  6. Duala-Ekoko E, Robillard MP (2008) Clonetracker: tool support for code clone management. In: Proceedings of the 30th international conference on Software engineering. ACM, pp 843–846Google Scholar
  7. Gabel M, Su Z (2010) A study of the uniqueness of source code. In: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, pp 147–156Google Scholar
  8. Gharehyazie M, Posnett D, Vasilescu B, Filkov V (2015) Developer initiation and social interactions in oss: a case study of the apache software foundation. Empir Softw Eng 20(5):1318–1353CrossRefGoogle Scholar
  9. Gharehyazie M, Ray B, Filkov V (2017) Some from here, some from there: cross-project code reuse in github. In: Proceedings of the 14th International Conference on Mining Software Repositories. IEEE Press, pp 291–301Google Scholar
  10. Goues CL, Nguyen T, Forrest S, Weimer W (2012) Genprog: a generic method for automatic software repair. IEEE Trans Softw Eng 38(1):54–72CrossRefGoogle Scholar
  11. Gousios G (2013) The ghtorent dataset and tool suite. In: Proceedings of the 10th Working Conference on Mining Software Repositories. IEEE Press, pp 233–236Google Scholar
  12. Jiang L, Misherghi G, Su Z, Glondu S (2007) Deckard: scalable and accurate tree-based detection of code clones. In: Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, pp 96–105Google Scholar
  13. Juergens E, Deissenboeck F, Hummel B, Wagner S (2009) Do code clones matter?. In: Proceedings of the 31st International Conference on Software Engineering, ICSE ’09. IEEE Computer Society, Washington, pp 485–495Google Scholar
  14. Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Softw Eng 28 (7):654–670CrossRefGoogle Scholar
  15. Kim M, Bergman L, Lau T, Notkin D (2004) An ethnographic study of copy and paste programming practices in oopl. In: 2004 Proceedings of the International Symposium on Empirical Software Engineering, ISESE’04. IEEE, pp 83–92Google Scholar
  16. Kim M, Sazawal V, Notkin D, Murphy G (2005) An empirical study of code clone genealogies. In: ACM SIGSOFT Software engineering notes, vol 30. ACM, pp 187–196Google Scholar
  17. Li J, Ernst MD (2012) Cbcd: cloned buggy code detector. In: Proceedings of the 34th International Conference on Software Engineering. IEEE Press, pp 310–320Google Scholar
  18. Lv F, Zhang H, Lou J-G, Wang S, Zhang D, Zhao J (2015) Codehow: effective code search based on api understanding and extended boolean model (e). In: 2015 30th IEEE/ACM International Conference on Automated software engineering (ASE). IEEE, pp 260–270Google Scholar
  19. Meng N, Kim M, McKinley KS (2011) Systematic editing: generating program transformations from an example. In: ACM SIGPLAN Notices, vol 46. ACM, pp 329–342Google Scholar
  20. Meng N, Kim M, McKinley KS (2013) Lase: locating and applying systematic edits by learning from examples. In: Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, pp 502–511Google Scholar
  21. Nguyen HA, Nguyen AT, Nguyen TT, Nguyen TN, Rajan H (2013) A study of repetitiveness of code changes in software evolution. In: Proceedings of the 28th International Conference on Automated Software Engineering. ASEGoogle Scholar
  22. Ossher J, Sajnani H, Lopes C (2011) File cloning in open source java projects: the good, the bad, and the ugly. In: 2011 27th IEEE International Conference on Software Maintenance (ICSM). IEEE, pp 283–292Google Scholar
  23. Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining stackoverflow to turn the ide into a self-confident programming prompter. In: Proceedings of the 11th Working Conference on Mining Software Repositories. ACM, pp 102–111Google Scholar
  24. Rattan D, Bhatia R, Singh M (2013) Software clone detection: a systematic review. Inf Softw Technol 55(7):1165–1199CrossRefGoogle Scholar
  25. Ray B, Kim M (2012) A case study of cross-system porting in forked projects. In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering. ACM, p 53Google Scholar
  26. Ray B, Nagappan M, Bird C, Nagappan N, Zimmermann T (2014) The uniqueness of changes: characteristics and applications. Technical report, Microsoft Research Technical ReportGoogle Scholar
  27. Ray B, Posnett D, Filkov V, Devanbu P (2014) A large scale study of programming languages and code quality in github. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, pp 155–165Google Scholar
  28. Reiss SP (2009) Semantics-based code search. In: Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, pp 243–253Google Scholar
  29. Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74 (7):470–495MathSciNetCrossRefzbMATHGoogle Scholar
  30. Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) Sourcerercc: scaling code clone detection to big-code. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, pp 1157–1168Google Scholar
  31. Scacchi W (2010) Collaboration practices and affordances in free/open source software development. In: Collaborative software engineering. Springer, pp 307–327Google Scholar
  32. Sim SE, Clarke CL, Holt RC (1998) Archetypal source code searches: a survey of software developers and maintainers. In: 1998 Proceedings of the 6th international workshop on Program comprehension, IWPC’98. IEEE, pp 180–187Google Scholar
  33. Su F-H, Bell J, Harvey K, Sethumadhavan S, Kaiser G, Jebara T (2016) Code relatives: detecting similarly behaving software. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, pp 702–714Google Scholar
  34. Thummalapenta S, Xie T (2007) Parseweb: a programmer assistant for reusing open source code on the web. In: Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering. ACM, pp 204–213Google Scholar
  35. Vasilescu B, Blincoe K, Xuan Q, Casalnuovo C, Damian D, Devanbu P, Filkov V (2016) The sky is not the limit: multitasking on GitHub projects. In: International Conference on Software Engineering, ICSE. to appearGoogle Scholar
  36. Xuan Q, Okano A, Devanbu P, Filkov V (2014) Focus-shifting patterns of oss developers and their congruence with call graphs. In: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, pp 401–412Google Scholar
  37. Zhang H, Jain A, Khandelwal G, Kaushik C, Ge S, Hu W (2016) Bing developer assistant: improving developer productivity by recommending sample code. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, pp 956–961Google Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Department of Computer ScienceUniversity of CaliforniaDavisUSA
  2. 2.AICT Innovation CenterSharif University of TechnologyTehranIran
  3. 3.Columbia UniversityNew YorkUSA
  4. 4.Sharif University of TechnologyTehranIran
  5. 5.Department of Computer EngineeringSharif University of TechnologyTehranIran

Personalised recommendations