Siamese: scalable and incremental code clone search via multiple code representations

Empirical Software Engineering

Abstract

This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. Our technique combines three components to improve clone search performance: multiple code representations (transforming code into various representations to capture different types of clones), query reduction (selecting clone search keywords based on their uniqueness), and a customised ranking function (allowing a specific clone type to be ranked at the top of the search results). We implemented the technique in a clone search tool called Siamese and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision, 95% and 99%, on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reports the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese’s incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses applications of Siamese that facilitate software development and research through two use cases: online code clone detection and clone search with automated license analysis.
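
As a rough illustration of two of the ideas named in the abstract, the sketch below derives two representations of a code fragment (raw tokens and normalised tokens, each turned into n-grams) and then reduces a query to its rarest terms. It is a minimal sketch under stated assumptions, not Siamese's implementation: the tokeniser, the placeholder symbols `W` and `V`, the keyword subset, the n-gram size, and the `keep` cut-off are all hypothetical choices made for this example.

```python
import re

# Hypothetical, deliberately tiny tokeniser: identifiers, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")
# Illustrative subset of keywords that normalisation should leave untouched.
KEYWORDS = {"int", "return", "for", "if", "else", "while"}


def tokenize(code):
    """Split source text into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def normalize(tokens):
    """Second representation: identifiers become 'W', integer literals become 'V'."""
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif IDENT_RE.fullmatch(tok):
            out.append("W")
        elif tok.isdigit():
            out.append("V")
        else:
            out.append(tok)
    return out


def ngrams(tokens, n=4):
    """Concatenate every run of n consecutive tokens into one search term."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def reduce_query(terms, doc_freq, keep=3):
    """Keep the `keep` rarest distinct terms, judged by document frequency."""
    unique = sorted(set(terms), key=lambda t: (doc_freq.get(t, 0), t))
    return unique[:keep]


if __name__ == "__main__":
    tokens = tokenize("int x = 1;")
    print(ngrams(tokens))             # terms from the raw representation
    print(ngrams(normalize(tokens)))  # terms from the normalised representation
    # Document frequencies would come from the search index; these are made up.
    print(reduce_query(["a", "b", "c", "d"], {"a": 100, "b": 2, "c": 50, "d": 1}, keep=2))
```

Because renamed identifiers and changed literals collapse to the same placeholders, the normalised n-grams still match after such edits, while the raw n-grams match only exact copies. Keeping only rare terms shortens the query and steers it away from n-grams that occur in most of the corpus.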

Notes

  1. Tool and data sets used are available at https://github.com/UCL-CREST/Siamese.

  2. Detailed explanation: a query normalisation factor, queryNorm(q), enables a comparison between results of different queries; query coordination, coord(q,d), gives higher scores to documents that contain a high percentage of terms in the query; query boosting, t.getBoost(), gives a boosted term more importance than another; and field length normalisation, norm(t,d), gives higher weight to a shorter field than a long field in case a document is represented by more than one field, e.g., title and body.

  3. https://www.elastic.co/use-cases/github

  4. We could not include the FaCoy tool in our other RQs because the tool is released as a virtual machine image with an existing clone database. The only way to use the tool is via a web interface and there is no instruction on how to switch FaCoy to analyse a new data set (such as OCD, SOCO used in our study).

  5. JavaParser: http://javaparser.org

  6. A few Stack Overflow posts contained more than one code snippet; the table presents only the results from the largest code snippet in each post. The full results can be found on our study’s website: https://github.com/UCL-CREST/Siamese

  7. We were deterred from using the well-known mean Average Precision (MAP) or Normalised Discounted Cumulative Gain (NDCG), which are suitable for assessing the quality of ranked results. MAP and NDCG need a complete ground truth of relevant documents, which is not available for BigCloneBench.

  8. We also tried NiCad, CCFinderX, iClones, DECKARD, and PMD-CPD, but they either failed to analyse incomplete code snippets on Stack Overflow or took too long to report clones. Simian and SourcererCC also have the benefit of working at two different clone granularity levels. They complement each other, as SourcererCC’s clone fragments are always confined to method boundaries while Simian’s fragments are not.

  9. We also tried integrating Ninka (German et al. 2010), a license identification tool, into Siamese but found that it dramatically slowed down indexing and querying.

  10. GitHub license type: https://help.github.com/articles/licensing-a-repository

  11. http://code.openhub.net

  12. http://sel.ist.osaka-u.ac.jp/SPARS/index.html.en
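
Note 2 names four factors of Lucene's practical scoring function. For orientation, the form given in the Elasticsearch guide listed in the references (Elasticsearch BV 2012) combines them roughly as below. The term frequency factor tf and the inverse document frequency factor idf are not mentioned in the note; they are included here from that documentation, and the formula is shown as background rather than as the customised ranking function used by Siamese.

```latex
\mathrm{score}(q,d) \;=\;
  \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2}
    \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
```

Here q is the query, d is a document, and the sum runs over the terms t of the query.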

References

  • Abdalkareem R, Shihab E, Rilling J (2017) On code reuse from StackOverflow: an exploratory study on Android apps. Inf Softw Technol 88:148–158

  • Acar Y, Backes M, Fahl S, Kim D, Mazurek ML, Stransky C (2016) You get where you’re looking for: the impact of information sources on code security. In: SP ’16, pp 289–305

  • An L, Mlouki O, Khomh F, Antoniol G (2017) Stack Overflow: a code laundering platform? In: SANER ’17, pp 283–293

  • Aragon Consulting Group Inc (2018) Krugle. http://krugle.com, Online; Access 23-April-2018

  • Aversano L, Cerulo L, Di Penta M (2007) How clones are maintained: an empirical study. In: Proceedings of the 11th European conference on software maintenance and reengineering (CSMR ’07), IEEE, Los Alamitos, California, USA, pp 81–90

  • Bajracharya SK, Ossher J, Lopes CV (2010) Leveraging usage similarity for effective retrieval of examples in code repositories. In: Proceedings of the 18th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’10), p 157

  • Balasubramanian N, Kumaran G, Carvalho VR (2010) Exploring reductions for long web queries. In: SIGIR ’10, p 571

  • Baltes S, Diehl S (2018) Usage and attribution of Stack Overflow code snippets in GitHub projects. Empir Softw Eng:1–37

  • Bauer V, Volke T, Eder S (2016) Combining clone detection and latent semantic indexing to detect re-implementations. In: Proceedings of the IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER ’16), pp 23–29

  • Baxter I, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM ’98, vol 98, pp 368–377

  • Beckman NE, Kim D, Aldrich J (2011) An empirical study of object protocols in the wild. In: ECOOP ’11, pp 2–26

  • Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. TSE 33(9):577–591

  • Bendersky M, Croft WB (2008) Discovering key concepts in verbose queries. In: SIGIR’08, p 491

  • BlackDuck (2016) OpenHub. http://code.openhub.net, online; access 18-May-2016

  • Boyter B (2018) Searchcode. https://searchcode.com, online; access 23-April-2018

  • Burrows S, Tahaghoghi SMM, Zobel J (2007) Efficient plagiarism detection for large code repositories. Software: Practice and Experience 37(2):151–175

  • Chatterji D, Carver JC, Kraft NA (2016) Code clones and developer behavior: results of two surveys of the clone research community. Empir Softw Eng 21(4):1476–1508

  • Craswell N (2009) Encyclopedia of Database Systems. Springer, Berlin

  • Davey N, Barson P, Field S, Frank R, Tansley D (1995) The development of a software clone detector. Int J Appl Softw Technol 1:3–4

  • Elasticsearch BV (2012) Lucene’s practical scoring function. https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html, online; access 20-March-2017

  • Elasticsearch BV (2016) Elasticsearch. https://www.elastic.co/products/elasticsearch, online; access 25-Jun-2016

  • Flores E, Rosso P, Moreno L, Villatoro-Tello E (2014) Detection of source code re-use. http://users.dsic.upv.es/grupos/nle/soco/, accessed: 2016-02-14

  • Fowler M (1999) Refactoring: improving the design of existing code. Addison-Wesley, Boston

  • Gallardo-Valencia RE, Sim SE (2009) Internet-scale code search. SUITE ’09, pp 49–52

  • German DM, Manabe Y, Inoue K (2010) A sentence-matching method for automatic license identification of source code files. In: ASE ’10, p 437

  • Göde N, Koschke R (2009) Incremental clone detection. In: CSMR’09, pp 219–228

  • Grechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: ICSE ’10, pp 475–484

  • Gu X, Zhang H, Kim S (2018) Deep code search. In: Proceedings of the 40th international conference on software engineering (ICSE ’18), pp 933–944

  • Harris S (2015) Simian – similarity analyser, version 2.4. http://www.harukizaemon.com/simian/, accessed: 2016-02-14

  • Hummel B, Juergens E, Heinemann L, Conradt M (2010) Index-based code clone detection: incremental, distributed, scalable. In: ICSM’10, pp 1–9

  • Inoue K, Sasaki Y, Xia P, Manabe Y (2012) Where does this code come from and where does it go? — integrated code history tracker for open source systems. In: ICSE ’12, pp 331–341

  • Ishio T, Sakaguchi Y, Ito K, Inoue K (2017) Source file set search for clone-and-own reuse analysis. In: Proceedings of the IEEE/ACM 14th international conference on mining software repositories (MSR ’17), pp 257–268

  • Jiang L, Misherghi G, Su Z, Glondu S (2007) DECKARD: scalable and accurate tree-based detection of code clones. In: ICSE ’07. IEEE, Minneapolis, pp 96–105

  • Juergens E, Deissenboeck F, Hummel B (2011) Code similarities beyond copy & paste. In: Proceedings of the 15th European conference on software maintenance and reengineering (CSMR ’11), IEEE, pp 78–87

  • Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. TSE 28(7):654–670

  • Kapser C, Godfrey MW (2006) Cloning considered harmful considered harmful. In: Proceedings of the 13th Working Conference on Reverse Engineering (WCRE ’06), Benevento, Italy, pp 19–28

  • Kawaguchi S, Yamashina T, Uwano H, Fushida K, Kamei Y, Nagura M, Iida H (2009) SHINOBI: a tool for automatic code clone detection in the IDE. In: WCRE ’09, pp 313–314

  • Ke Y, Stolee KT, Goues CL, Brun Y (2015) Repairing programs with semantic code search. In: ASE’15, pp 295–306

  • Keivanloo I, Rilling J, Charland P (2011a) Internet-scale real-time code clone search via multi-level indexing. In: WCRE ’11, pp 23–27

  • Keivanloo I, Rilling J, Charland P (2011b) SeClone – a hybrid approach to internet-scale real-time code clone search. In: ICPC ’11, pp 223–224

  • Keivanloo I, Forbes C, Rilling J (2012) Similarity search plug-in: Clone detection meets internet-scale code search. In: SUITE ’12, pp 21–22

  • Keivanloo I, Rilling J, Zou Y (2014) Spotting working code examples. In: ICSE ’14, pp 664–675

  • Kim K, Kim D, Bissyandé TF, Choi E, Li L, Klein J, Traon YL (2018) FaCoY – a code-to-code search engine. In: ICSE’18

  • Knuth DE (1971) An empirical study of FORTRAN programs. Software: Practice and Experience 1(2):105–133

  • Koschke R (2014) Large-scale inter-system clone detection using suffix trees and hashing. Journal of Software: Evolution and Process 26(8):747–769

  • Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: WCRE ’06, pp 253–262

  • Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE ’01

  • Kumaran G, Allan J (2007) A case for shorter queries, and helping users create them. In: NAACL-HLT ’07, pp 220–227

  • Kumaran G, Carvalho VR (2009) Reducing long queries using query quality predictors. In: SIGIR ’09, p 564

  • Lavoie T, Eilers-Smith M, Merlo E (2010) Challenging cloning related problems with GPU-based algorithms. In: Proceedings of the 4th international workshop on software clones (IWSC ’10), ACM, Cape Town, South Africa, pp 25–32

  • Lee MW, Roh JW, Hwang SW, Kim S (2010) Instant code clone search. In: FSE ’10, p 167

  • Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) CCLearner: a deep learning-based clone detection approach. In: ICSME’17, pp 249–260

  • Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2009) Sourcerer: mining and searching internet-scale software repositories, vol 18, pp 300–336

  • Livieri S, German DM, Inoue K (2010) A needle in the stack: Efficient clone detection for huge collections of source code. Tech. rep., Osaka University

  • Lopes CV, Maj P, Martins P, Saini V, Yang D, Zitny J, Sajnani H, Vitek J (2017) DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages (OOPSLA) 1:1–28

  • Manning CD, Raghavan P, Schutze H (2009) An introduction to information retrieval, vol 21. Cambridge University Press, Cambridge

  • Martie L, Hoek AVD, Kwak T (2017) Understanding the impact of support for iteration on code search. In: ESEC/FSE ’17, pp 774–785

  • McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usages. In: ICSE ’11, p 111

  • Miller GA (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol Rev 63(2):81–97

  • Myles G, Collberg C (2005) K-gram based software birthmarks. In: SAC ’05, p 314

  • Nasehi SM, Sillito J, Maurer F, Burns C (2012) What makes a good code example?: a study of programming Q&A in StackOverflow. In: ICSM’12, pp 25–34

  • Nguyen TT, Nguyen HA, Al-Kofahi JM, Pham NH, Nguyen TN (2009) Scalable and incremental clone detection for evolving software. In: ICSM ’09, pp 491–494

  • Nishi MA, Damevski K (2018) Scalable code clone detection and search based on adaptive prefix filtering. J Syst Softw 137:130–142

  • Niu H, Keivanloo I, Zou Y (2017) Learning to rank code examples for code search engines. Empir Softw Eng 22(1):259–291

  • Ohmann T, Rahal I (2014) Efficient clustering-based source code plagiarism detection using PIY. Knowl Inf Syst 43(2):445–472

  • Omar C, Yoon YS, LaToza TD, Myers BA (2012) Active code completion. In: ICSE ’12, pp 859–869

  • Park JW, Lee MW, Roh JW, Hwang SW, Kim S (2014) Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst 41(3):727–759

  • Parr T, Harwell S, Kochurkin I (2017) Grammars written for ANTLR v4. https://github.com/antlr/grammars-v4, accessed: 2017-11-21

  • Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack Overflow in the IDE. In: ICSE ’13, pp 1295–1298

  • Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining StackOverflow to turn the IDE into a self-confident programming prompter. In: MSR ’14, pp 102–111

  • Prechelt L, Malpohl G, Philippsen M (2002) Finding plagiarisms among a set of programs with JPlag. J Univ Comput Sci 8(11):1016–1038

  • Ragkhitwetsagul C, Krinke J, Clark D (2018) A comparison of code similarity analysers. Empir Softw Eng 23(4):2464–2519

  • Ragkhitwetsagul C, Krinke J, Paixao M, Bianco G, Oliveto R (2019) Toxic code snippets on Stack Overflow. Transactions on Software Engineering (Early Access)

  • Rajaraman A, Ullman JD (2011) Mining of massive datasets, vol 67. Cambridge University Press, Cambridge

  • Rilling J, Keivanloo I, Forbes C, Erfani M (2018) IJaDataset 2.0. https://sites.google.com/site/asegsecold/projects/seclone, online; access 13-March-2018

  • Robertson S (1990) On term selection for query expansion. J Doc 46(4):359–364

  • Roy CK, Cordy JR (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In: ICPC ’08, pp 172–181

  • Roy CK, Cordy JR (2009) Near-miss function clones in open source software: an empirical study. J Softw Maint Evol Res Pract 26(12):165–189

  • Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495

  • Sadowski C, Stolee KT, Elbaum S (2015) How developers search for code: a case study. In: ESEC/FSE ’15, pp 191–201

  • Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes C (2018) Oreo: detection of clones in the twilight zone. In: The 26th ACM joint European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE ’18)

  • Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: ICSE’16, pp 1157–1168

  • Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620

  • Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03, ACM, p 76

  • Sim SE, Gallardo-Valencia RE (2013) Finding source code on the web for remix and reuse. Springer, Berlin

  • Sim SE, Umarji M, Ratanotayanon S, Lopes CV (2011) How well do search engines support code retrieval on the web? ACM Trans Softw Eng Methodol 21(1):1–25

  • Slaney M, Casey M (2008) Locality-sensitive hashing for finding nearest neighbors. IEEE Signal Proc Mag 25(2):128–131

  • Smucker MD, Allan J, Carterette B (2007) A comparison of statistical significance tests for information retrieval evaluation. In: CIKM ’07, p 623

  • Svajlenko J, Roy CK (2014) Evaluating modern clone detection tools. In: Proceedings of the 30th international conference on software maintenance and evolution (ICSME ’14), IEEE, pp 321–330

  • Svajlenko J, Roy CK (2015) Evaluating clone detection tools with BigCloneBench. In: ICSME’15, pp 131–140

  • Svajlenko J, Roy CK (2016) BigCloneEval: a clone detection tool evaluation framework with BigCloneBench. In: Proceedings of the international conference on software maintenance and evolution (ICSME ’16), vol 1, pp 596–600

  • Svajlenko J, Roy CK (2017) Fast and flexible large-scale clone detection with CloneWorks. In: Proceedings of the IEEE/ACM 39th international conference on software engineering companion (ICSE-C ’17), pp 27–30

  • Svajlenko J, Islam JF, Keivanloo I, Roy CK, Mia MM (2014) Towards a big data curated benchmark of inter-project code clones. In: ICSME’14, pp 476–480

  • Tamersoy A, Roundy K, Chau DH (2014) Guilt by association: Large scale malware detection by mining. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’14). ACM, New York, pp 1524–1533

  • Taube-Schock C, Walker RJ, Witten IH (2011) Can we avoid high coupling? In: ECOOP ’11, pp 204–228

  • Tempero E, Anslow C, Dietrich J, Han T, Li J, Lumpe M, Melton H, Noble J (2010) Qualitas corpus: a curated collection of Java code for empirical studies. In: APSEC ’10, pp 336–345

  • van Bruggen D (2017) JavaParser – process Java code programmatically. http://javaparser.org, accessed: 2017-11-21

  • Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132

  • Vasilescu B, Serebrenik A, van den Brand M (2011) You can’t control the unfamiliar: a study on the relations between aggregation techniques for software metrics. In: ICSM ’11, pp 313–322

  • Wang T, Harman M, Jia Y, Krinke J (2013) Searching for better configurations: a rigorous approach to clone evaluation. In: ESEC/FSE ’13, pp 455–465

  • White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: ASE ’16, pp 87–98

  • Yang D, Martins P, Saini V, Lopes C (2017) Stack Overflow in Github: any snippets there? In: MSR ’17

  • Zhang F, Niu H, Keivanloo I, Zou Y (2017) Expanding queries for code search using semantically related API class-names. TSE https://doi.org/10.1109/TSE.2017.2750682

  • Zhang H (2008) Exploring regularity in source code: software science and Zipf’s law. In: WCRE’08, pp 101–110

  • Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press, Cambridge

Author information

Corresponding author

Correspondence to Chaiyong Ragkhitwetsagul.

Additional information

Communicated by: Yasutaka Kamei

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ragkhitwetsagul, C., Krinke, J. Siamese: scalable and incremental code clone search via multiple code representations. Empir Software Eng 24, 2236–2284 (2019). https://doi.org/10.1007/s10664-019-09697-7

  • DOI: https://doi.org/10.1007/s10664-019-09697-7
