Siamese: scalable and incremental code clone search via multiple code representations


This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. To improve clone search performance, the technique combines three components: multiple code representations (transforming code into several representations to capture different types of clones), query reduction (selecting clone search keywords based on their uniqueness), and a customised ranking function (allowing a specific clone type to be ranked at the top of the search results). We implemented the technique in a clone search tool called Siamese and evaluated its search accuracy and scalability on three established clone data sets. Compared with seven state-of-the-art clone detection tools, Siamese achieves the highest mean average precision, 95% and 99%, on two clone benchmarks, and it reports the largest number of Type-3 clones compared with three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese’s incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses how Siamese can facilitate software development and research through two use cases: online code clone detection and clone search with automated license analysis.
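
The query reduction idea mentioned above can be illustrated with a minimal Python sketch. It assumes that a keyword's uniqueness is approximated by its document frequency in a small in-memory corpus. The function names, the toy corpus, and the `max_df_ratio` threshold are hypothetical choices for illustration, not Siamese's actual implementation or parameters.

```python
from collections import Counter


def document_frequency(corpus):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df


def reduce_query(query_tokens, df, num_docs, max_df_ratio=0.5):
    """Keep query terms that occur in at most max_df_ratio of the documents.

    Duplicate query terms are dropped and the result is ordered rarest first.
    The threshold is a hypothetical parameter chosen for this sketch.
    """
    seen = set()
    kept = []
    for term in query_tokens:
        if term in seen:
            continue
        seen.add(term)
        if df.get(term, 0) / num_docs <= max_df_ratio:
            kept.append(term)
    return sorted(kept, key=lambda t: (df.get(t, 0), t))


# Toy corpus: each document is a token list from one code fragment.
corpus = [
    ["int", "i", "=", "0", ";", "return", "i", ";"],
    ["int", "sum", "=", "a", "+", "b", ";", "return", "sum", ";"],
    ["String", "s", "=", "readLine", "(", ")", ";"],
    ["int", "x", "=", "parseInt", "(", "s", ")", ";"],
]
df = document_frequency(corpus)
query = ["int", "x", "=", "parseInt", "(", "readLine", "(", ")", ")", ";"]
reduced = reduce_query(query, df, num_docs=len(corpus))
print(reduced)
```

In this toy setting, frequent tokens such as `int`, `=`, and `;` are dropped from the query, while rarer tokens such as `parseInt` and `readLine` are kept as search keywords.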


  1. Tool and data sets used are available at

  2. Detailed explanation: a query normalisation factor, queryNorm(q), enables a comparison between results of different queries; query coordination, coord(q,d), gives higher scores to documents that contain a high percentage of terms in the query; query boosting, t.getBoost(), gives a boosted term more importance than another; and field length normalisation, norm(t,d), gives higher weight to a shorter field than to a longer field in case a document is represented by more than one field, e.g., title and body.

  3.

  4. We could not include the FaCoY tool in our other RQs because the tool is released as a virtual machine image with an existing clone database. The only way to use the tool is via a web interface, and there are no instructions on how to switch FaCoY to analyse a new data set (such as OCD or SOCO, used in our study).

  5.

  6. A few Stack Overflow posts contained more than one code snippet, and the table only presents the results from the largest code snippet in each post. The full results can be found on our study’s website:

  7. We were deterred from using the well-known mean Average Precision (MAP) or Normalised Discounted Cumulative Gain (NDCG), which are suitable for assessing the quality of ranked results. MAP and NDCG need a complete ground truth of relevant documents, which is not available for BigCloneBench.

  8. We also tried NiCad, CCFinderX, iClones, DECKARD, and PMD-CPD, but they failed to analyse incomplete code snippets on Stack Overflow or took too long to report clones. Simian and SourcererCC also have the benefit of having two different clone granularity levels. They complement each other, as SourcererCC’s clone fragments are always confined to method boundaries while Simian’s fragments are not.

  9. We also tried integrating Ninka (German et al. 2010), a license identification tool, into Siamese but found that it dramatically slowed down indexing and querying.

  10. GitHub license type:

  11.

  12.
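
The four factors described in note 2 are components of Lucene's classic practical scoring function (Elasticsearch BV 2012). For reference, that function is commonly written as shown below. The term frequency tf(t,d) and inverse document frequency idf(t) factors are not described in the note and are included here from the Lucene documentation; any customisation that Siamese applies on top of this function is not shown.

```latex
\mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2} \cdot
  t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
```

Here q is the query, d is a document, and the sum runs over the terms t of the query.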


  1. Abdalkareem R, Shihab E, Rilling J (2017) On code reuse from StackOverflow: an exploratory study on android apps. Inf Softw Technol 88:148–158

  2. Acar Y, Backes M, Fahl S, Kim D, Mazurek ML, Stransky C (2016) You get where you’re looking for: the impact of information sources on code security. In: SP ’16, pp 289–305

  3. An L, Mlouki O, Khomh F, Antoniol G (2017) Stack Overflow: a code laundering platform? In: SANER ’17, pp 283–293

  4. Aragon Consulting Group Inc (2018) Krugle. Online; accessed 23-April-2018

  5. Aversano L, Cerulo L, Di Penta M (2007) How clones are maintained: an empirical study. In: Proceedings of the 11th European conference on software maintenance and reengineering (CSMR ’07), IEEE, Los Alamitos, California, USA, pp 81–90

  6. Bajracharya SK, Ossher J, Lopes CV (2010) Leveraging usage similarity for effective retrieval of examples in code repositories. In: Proceedings of the 18th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’10), p 157

  7. Balasubramanian N, Kumaran G, Carvalho VR (2010) Exploring reductions for long web queries. In: SIGIR ’10, p 571

  8. Baltes S, Diehl S (2018) Usage and attribution of Stack Overflow code snippets in GitHub projects. Empir Softw Eng, pp 1–37

  9. Bauer V, Volke T, Eder S (2016) Combining clone detection and latent semantic indexing to detect re-implementations. In: Proceedings of the IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER ’16), pp 23–29

  10. Baxter I, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM ’98, vol 98, pp 368–377

  11. Beckman NE, Kim D, Aldrich J (2011) An empirical study of object protocols in the wild. In: ECOOP ’11, pp 2–26

  12. Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. TSE 33(9):577–591

  13. Bendersky M, Croft WB (2008) Discovering key concepts in verbose queries. In: SIGIR’08, p 491

  14. BlackDuck (2016) OpenHub. Online; accessed 18-May-2016

  15. Boyter B (2018) Searchcode. Online; accessed 23-April-2018

  16. Burrows S, Tahaghoghi SMM, Zobel J (2007) Efficient plagiarism detection for large code repositories. Software: Practice and Experience 37(2):151–175

  17. Chatterji D, Carver JC, Kraft NA (2016) Code clones and developer behavior: results of two surveys of the clone research community. Empir Softw Eng 21(4):1476–1508

  18. Craswell N (2009) Encyclopedia of Database Systems. Springer, Berlin

  19. Davey N, Barson P, Field S, Frank R, Tansley D (1995) The development of a software clone detector. Int J Appl Softw Technol 1:3–4

  20. Elasticsearch BV (2012) Lucene’s practical scoring function. Online; accessed 20-March-2017

  21. Elasticsearch BV (2016) Elasticsearch. Online; accessed 25-Jun-2016

  22. Flores E, Rosso P, Moreno L, Villatoro-Tello E (2014) Detection of source code re-use. Online; accessed 2016-02-14

  23. Fowler M (1999) Refactoring: improving the design of existing code. Addison-Wesley, Boston

  24. Gallardo-Valencia RE, Sim SE (2009) Internet-scale code search. SUITE ’09, pp 49–52

  25. German DM, Manabe Y, Inoue K (2010) A sentence-matching method for automatic license identification of source code files. In: ASE ’10, p 437

  26. Göde N, Koschke R (2009) Incremental clone detection. In: CSMR’09, pp 219–228

  27. Grechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: ICSE ’10, pp 475–484

  28. Gu X, Zhang H, Kim S (2018) Deep code search. In: Proceedings of the 40th international conference on software engineering (ICSE ’18), pp 933–944

  29. Harris S (2015) Simian – similarity analyser, version 2.4. Online; accessed 2016-02-14

  30. Hummel B, Juergens E, Heinemann L, Conradt M (2010) Index-based code clone detection: incremental, distributed, scalable. In: ICSM’10, pp 1–9

  31. Inoue K, Sasaki Y, Xia P, Manabe Y (2012) Where does this code come from and where does it go? — integrated code history tracker for open source systems. In: ICSE ’12, pp 331–341

  32. Ishio T, Sakaguchi Y, Ito K, Inoue K (2017) Source file set search for clone-and-own reuse analysis. In: Proceedings of the IEEE/ACM 14th international conference on mining software repositories (MSR ’17), pp 257–268

  33. Jiang L, Misherghi G, Su Z, Glondu S (2007) DECKARD: scalable and accurate tree-based detection of code clones. In: ICSE’07. IEEE, Minneapolis, pp 96–105

  34. Juergens E, Deissenboeck F, Hummel B (2011) Code similarities beyond copy & paste. In: Proceedings of the 15th European conference on software maintenance and reengineering (CSMR ’11), IEEE, pp 78–87

  35. Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. TSE 28(7):654–670

  36. Kapser C, Godfrey MW (2006) Cloning considered harmful considered harmful. In: Proceedings of the 13th Working Conference on Reverse Engineering (WCRE ’06), Benevento, Italy, pp 19–28

  37. Kawaguchi S, Yamashina T, Uwano H, Fushida K, Kamei Y, Nagura M, Iida H (2009) SHINOBI: a tool for automatic code clone detection in the IDE. In: WCRE ’09, pp 313–314

  38. Ke Y, Stolee KT, Goues CL, Brun Y (2015) Repairing programs with semantic code search. In: ASE’15, pp 295–306

  39. Keivanloo I, Rilling J, Charland P (2011a) Internet-scale real-time code clone search via multi-level indexing. In: WCRE ’11, pp 23–27

  40. Keivanloo I, Rilling J, Charland P (2011b) SeClone – a hybrid approach to internet-scale real-time code clone search. In: ICPC ’11, pp 223–224

  41. Keivanloo I, Forbes C, Rilling J (2012) Similarity search plug-in: Clone detection meets internet-scale code search. In: SUITE ’12, pp 21–22

  42. Keivanloo I, Rilling J, Zou Y (2014) Spotting working code examples. In: ICSE ’14, pp 664–675

  43. Kim K, Kim D, Bissyandé TF, Choi E, Li L, Klein J, Traon YL (2018) FaCoY – a code-to-code search engine. In: ICSE’18

  44. Knuth DE (1971) An empirical study of FORTRAN programs. Software: Practice and Experience 1(2):105–133

  45. Koschke R (2014) Large-scale inter-system clone detection using suffix trees and hashing. Journal of Software: Evolution and Process 26(8):747–769

  46. Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: WCRE ’06, pp 253–262

  47. Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE ’01

  48. Kumaran G, Allan J (2007) A case for shorter queries, and helping users create them. In: NAACL-HLT ’07, pp 220–227

  49. Kumaran G, Carvalho VR (2009) Reducing long queries using query quality predictors. In: SIGIR ’09, p 564

  50. Lavoie T, Eilers-Smith M, Merlo E (2010) Challenging cloning related problems with GPU-based algorithms. In: Proceedings of the 4th international workshop on software clones (IWSC ’10), ACM, Cape Town, South Africa, pp 25–32

  51. Lee MW, Roh JW, Hwang SW, Kim S (2010) Instant code clone search. In: FSE ’10, p 167

  52. Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) CCLearner: a deep learning-based clone detection approach. In: ICSME’17, pp 249–260

  53. Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2009) Sourcerer: mining and searching internet-scale software repositories, vol 18, pp 300–336

  54. Livieri S, German DM, Inoue K (2010) A needle in the stack: efficient clone detection for huge collections of source code. Tech. rep., Osaka University

  55. Lopes CV, Maj P, Martins P, Saini V, Yang D, Zitny J, Sajnani H, Vitek J (2017) DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages (OOPSLA) 1:1–28

  56. Manning CD, Raghavan P, Schutze H (2009) An introduction to information retrieval, vol 21. Cambridge University Press, Cambridge

  57. Martie L, Hoek AVD, Kwak T (2017) Understanding the impact of support for iteration on code search. In: ESEC/FSE ’17, pp 774–785

  58. McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usages. In: ICSE ’11, p 111

  59. Miller GA (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol Rev 63(2):81–97

  60. Myles G, Collberg C (2005) K-gram based software birthmarks. In: SAC ’05, p 314

  61. Nasehi SM, Sillito J, Maurer F, Burns C (2012) What makes a good code example?: a study of programming Q&A in StackOverflow. In: ICSM’12, pp 25–34

  62. Nguyen TT, Nguyen HA, Al-Kofahi JM, Pham NH, Nguyen TN (2009) Scalable and incremental clone detection for evolving software. In: ICSM ’09, pp 491–494

  63. Nishi MA, Damevski K (2018) Scalable code clone detection and search based on adaptive prefix filtering. J Syst Softw 137:130–142

  64. Niu H, Keivanloo I, Zou Y (2017) Learning to rank code examples for code search engines. Empir Softw Eng 22(1):259–291

  65. Ohmann T, Rahal I (2014) Efficient clustering-based source code plagiarism detection using PIY. Knowl Inf Syst 43(2):445–472

  66. Omar C, Yoon YS, LaToza TD, Myers BA (2012) Active code completion. In: ICSE ’12, pp 859–869

  67. Park JW, Lee MW, Roh JW, Hwang SW, Kim S (2014) Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst 41(3):727–759

  68. Parr T, Harwell S, Kochurkin I (2017) Grammars written for ANTLR v4. Online; accessed 2017-11-21

  69. Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack Overflow in the IDE. In: ICSE ’13, pp 1295–1298

  70. Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining StackOverflow to turn the IDE into a self-confident programming prompter. In: MSR ’14, pp 102–111

  71. Prechelt L, Malpohl G, Philippsen M (2002) Finding plagiarisms among a set of programs with JPlag. J Univ Comput Sci 8(11):1016–1038

  72. Ragkhitwetsagul C, Krinke J, Clark D (2018) A comparison of code similarity analysers. Empir Softw Eng 23(4):2464–2519

  73. Ragkhitwetsagul C, Krinke J, Paixao M, Bianco G, Oliveto R (2019) Toxic code snippets on Stack Overflow. Transactions on Software Engineering (Early Access)

  74. Rajaraman A, Ullman JD (2011) Mining of massive datasets, vol 67. Cambridge University Press, Cambridge

  75. Rilling J, Keivanloo I, Forbes C, Erfani M (2018) IJaDataset 2.0. Online; accessed 13-March-2018

  76. Robertson S (1990) On term selection for query expansion. J Doc 46(4):359–364

  77. Roy CK, Cordy JR (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In: ICPC ’08, pp 172–181

  78. Roy CK, Cordy JR (2009) Near-miss function clones in open source software: an empirical study. J Softw Maint Evol Res Pract 26(12):165–189

  79. Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495

  80. Sadowski C, Stolee KT, Elbaum S (2015) How developers search for code: a case study. In: ESEC/FSE ’15, pp 191–201

  81. Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes C (2018) Oreo: detection of clones in the twilight zone. In: The 26th ACM joint European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE ’18)

  82. Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: ICSE’16, pp 1157–1168

  83. Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  84. Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03, ACM, p 76

  85. Sim SE, Gallardo-Valencia RE (2013) Finding source code on the web for remix and reuse. Springer, Berlin

  86. Sim SE, Umarji M, Ratanotayanon S, Lopes CV (2011) How well do search engines support code retrieval on the web? ACM Trans Softw Eng Methodol 21(1):1–25

  87. Slaney M, Casey M (2008) Locality-sensitive hashing for finding nearest neighbors. IEEE Signal Proc Mag 25(2):128–131

  88. Smucker MD, Allan J, Carterette B (2007) A comparison of statistical significance tests for information retrieval evaluation. In: CIKM ’07, p 623

  89. Svajlenko J, Roy CK (2014) Evaluating modern clone detection tools. In: Proceedings of the 30th international conference on software maintenance and evolution (ICSME ’14), IEEE, pp 321–330

  90. Svajlenko J, Roy CK (2015) Evaluating clone detection tools with BigCloneBench. In: ICSME’15, pp 131–140

  91. Svajlenko J, Roy CK (2016) BigCloneEval: a clone detection tool evaluation framework with BigCloneBench. In: Proceedings of the international conference on software maintenance and evolution (ICSME ’16), vol 1, pp 596–600

  92. Svajlenko J, Roy CK (2017) Fast and flexible large-scale clone detection with CloneWorks. In: Proceedings of the IEEE/ACM 39th international conference on software engineering companion (ICSE-C ’17), pp 27–30

  93. Svajlenko J, Islam JF, Keivanloo I, Roy CK, Mia MM (2014) Towards a big data curated benchmark of inter-project code clones. In: ICSME’14, pp 476–480

  94. Tamersoy A, Roundy K, Chau DH (2014) Guilt by association: Large scale malware detection by mining. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’14). ACM, New York, pp 1524–1533

  95. Taube-Schock C, Walker RJ, Witten IH (2011) Can we avoid high coupling? In: ECOOP ’11, pp 204–228

  96. Tempero E, Anslow C, Dietrich J, Han T, Li J, Lumpe M, Melton H, Noble J (2010) Qualitas corpus: a curated collection of Java code for empirical studies. In: APSEC ’10, pp 336–345

  97. van Bruggen D (2017) JavaParser – process Java code programmatically. Online; accessed 2017-11-21

  98. Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132

  99. Vasilescu B, Serebrenik A, van den Brand M (2011) You can’t control the unfamiliar: a study on the relations between aggregation techniques for software metrics. In: ICSM ’11, pp 313–322

  100. Wang T, Harman M, Jia Y, Krinke J (2013) Searching for better configurations: a rigorous approach to clone evaluation. In: ESEC/FSE ’13, pp 455–465

  101. White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: ASE ’16, pp 87–98

  102. Yang D, Martins P, Saini V, Lopes C (2017) Stack Overflow in Github: any snippets there? In: MSR ’17

  103. Zhang F, Niu H, Keivanloo I, Zou Y (2017) Expanding queries for code search using semantically related API class-names. TSE

  104. Zhang H (2008) Exploring regularity in source code: software science and Zipf’s law. In: WCRE’08, pp 101–110

  105. Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press, Cambridge

Author information



Corresponding author

Correspondence to Chaiyong Ragkhitwetsagul.

Additional information

Communicated by: Yasutaka Kamei

About this article

Cite this article

Ragkhitwetsagul, C., Krinke, J. Siamese: scalable and incremental code clone search via multiple code representations. Empir Software Eng 24, 2236–2284 (2019).


  • Code clone search
  • Code search engine