Abstract
This paper presents a novel code clone search technique that is accurate, incremental, and scalable to hundreds of millions of lines of code. To improve clone search performance, our technique combines multiple code representations (i.e., transforming code into various representations to capture different types of clones), query reduction (i.e., selecting clone search keywords based on their uniqueness), and a customised ranking function (i.e., allowing a specific clone type to be ranked at the top of the search results). We implemented the technique in a clone search tool, called Siamese, and evaluated its search accuracy and scalability on three established clone data sets. Siamese offers the highest mean average precision, 95% and 99%, on two clone benchmarks compared to seven state-of-the-art clone detection tools, and reports the largest number of Type-3 clones compared to three other code search engines. Siamese is scalable and can return cloned code snippets within 8 seconds for a code corpus of 365 million lines of code. Using an index of 130,719 GitHub projects, we demonstrate that Siamese’s incremental indexing capability dramatically decreases the index preparation time for large-scale data sets with multiple releases of software projects. The paper discusses the applications of Siamese to facilitate software development and research with two use cases: online code clone detection and clone search with automated license analysis.
Notes
Tool and data sets used are available at https://github.com/UCL-CREST/Siamese.
Detailed explanation: the query normalisation factor, queryNorm(q), enables a comparison between the results of different queries; query coordination, coord(q,d), gives higher scores to documents that contain a high percentage of the terms in the query; query boosting, t.getBoost(), gives a boosted term more importance than other terms; and field length normalisation, norm(t,d), gives higher weight to a shorter field than to a longer field when a document is represented by more than one field, e.g., title and body.
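As a rough illustration of how these four factors combine, the following Python sketch follows the classic Lucene TF-IDF formula described in the Elasticsearch guide (Elasticsearch BV 2012). The function name practical_score, its parameters, and the single-field document model are assumptions made for this example; it is not Siamese's customised ranking function.

```python
import math


def practical_score(query_terms, doc_terms, doc_freq, num_docs, boosts=None):
    """Classic Lucene-style TF-IDF score of a single-field document (illustrative only)."""
    boosts = boosts or {}
    unique_terms = sorted(set(query_terms))

    def idf(term):
        # Inverse document frequency: rarer terms receive a higher weight.
        return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

    # queryNorm(q): makes scores of different queries comparable.
    sum_sq = sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in unique_terms)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0

    # coord(q,d): fraction of the query terms that occur in the document.
    matched = [t for t in unique_terms if t in doc_terms]
    coord = len(matched) / len(unique_terms) if unique_terms else 0.0

    # norm(t,d): a shorter field gets a higher weight than a longer one.
    field_norm = 1.0 / math.sqrt(len(doc_terms)) if doc_terms else 0.0

    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))  # term frequency in the document
        total += tf * idf(t) ** 2 * boosts.get(t, 1.0) * field_norm
    return query_norm * coord * total
```

With the same query, a document that matches more query terms, or matches them in a shorter field, receives a higher score, and boosting a term raises the score of documents that contain it.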
We could not include the FaCoy tool in our other RQs because the tool is released as a virtual machine image with an existing clone database. The only way to use the tool is via a web interface, and there are no instructions on how to switch FaCoy to analyse a new data set (such as OCD and SOCO, which are used in our study).
JavaParser: http://javaparser.org
A few Stack Overflow posts contained more than one code snippet, and the table only presents the results from the largest code snippet in each post. The full results can be found on our study’s website: https://github.com/UCL-CREST/Siamese
We could not use the well-known mean Average Precision (MAP) or Normalised Discounted Cumulative Gain (NDCG), which are suitable for assessing the quality of ranked results. MAP and NDCG need a complete ground truth of relevant documents, which is not available for BigCloneBench.
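To see why MAP depends on a complete ground truth, consider the standard definition sketched below in Python. The function names and the toy document identifiers are assumptions made for illustration; they are not taken from the study's evaluation scripts.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    # The denominator is the size of the ground truth, so an unknown
    # relevant document changes the score even if it is never retrieved.
    return precision_sum / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For the same ranked list, adding one more relevant document to the ground truth lowers the average precision, so an incomplete ground truth yields an inflated and unreliable score.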
We also tried NiCad, CCFinderX, iClones, DECKARD, and PMD-CPD, but they failed to analyse incomplete code snippets on Stack Overflow or took too long to report clones. Simian and SourcererCC also have the benefit of offering two different clone granularity levels. They complement each other, as SourcererCC’s clone fragments are always confined to method boundaries while Simian’s fragments are not.
We also tried integrating Ninka (German et al. 2010), a license identification tool, into Siamese but found that it dramatically slowed down indexing and querying.
GitHub license type: https://help.github.com/articles/licensing-a-repository
References
Abdalkareem R, Shihab E, Rilling J (2017) On code reuse from StackOverflow: an exploratory study on android apps. Inf Softw Technol 88:148–158
Acar Y, Backes M, Fahl S, Kim D, Mazurek ML, Stransky C (2016) You get where you’re looking for: the impact of information sources on code security. In: SP ’16, pp 289–305
An L, Mlouki O, Khomh F, Antoniol G (2017) Stack Overflow: a code laundering platform? In: SANER ’17, pp 283–293
Aragon Consulting Group Inc (2018) Krugle. http://krugle.com, online; access 23-April-2018
Aversano L, Cerulo L, Di Penta M (2007) How clones are maintained: an empirical study. In: Proceedings of the 11th European conference on software maintenance and reengineering (CSMR ’07), IEEE, Los Alamitos, California, USA, pp 81–90
Bajracharya SK, Ossher J, Lopes CV (2010) Leveraging usage similarity for effective retrieval of examples in code repositories. In: Proceedings of the 18th ACM SIGSOFT international symposium on foundations of software engineering (FSE ’10), p 157
Balasubramanian N, Kumaran G, Carvalho VR (2010) Exploring reductions for long web queries. In: SIGIR ’10, p 571
Baltes S, Diehl S (2018) Usage and attribution of Stack Overflow code snippets in GitHub projects. Empir Softw Eng:1–37
Bauer V, Volke T, Eder S (2016) Combining clone detection and latent semantic indexing to detect re-implementations. In: Proceedings of the IEEE 23rd international conference on software analysis, evolution, and reengineering (SANER ’16), pp 23–29
Baxter I, Yahin A, Moura L, Sant’Anna M, Bier L (1998) Clone detection using abstract syntax trees. In: ICSM ’98, vol 98, pp 368–377
Beckman NE, Kim D, Aldrich J (2011) An empirical study of object protocols in the wild. In: ECOOP ’11, pp 2–26
Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. TSE 33(9):577–591
Bendersky M, Croft WB (2008) Discovering key concepts in verbose queries. In: SIGIR’08, p 491
BlackDuck (2016) OpenHub. http://code.openhub.net, online; access 18-May-2016
Boyter B (2018) Searchcode. https://searchcode.com, online; access 23-April-2018
Burrows S, Tahaghoghi SMM, Zobel J (2007) Efficient plagiarism detection for large code repositories. Software: Practice and Experience 37(2):151–175
Chatterji D, Carver JC, Kraft NA (2016) Code clones and developer behavior: results of two surveys of the clone research community. Empir Softw Eng 21(4):1476–1508
Craswell N (2009) Encyclopedia of Database Systems. Springer, Berlin
Davey N, Barson P, Field S, Frank R, Tansley D (1995) The development of a software clone detector. Int J Appl Softw Technol 1:3–4
Elasticsearch BV (2012) Lucene’s practical scoring function. https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html, online; access 20-March-2017
Elasticsearch BV (2016) Elasticsearch. https://www.elastic.co/products/elasticsearch, online; access 25-Jun-2016
Flores E, Rosso P, Moreno L, Villatoro-Tello E (2014) Detection of source code re-use. http://users.dsic.upv.es/grupos/nle/soco/, accessed: 2016-02-14
Fowler M (1999) Refactoring: improving the design of existing code. Addison-Wesley, Boston
Gallardo-Valencia RE, Sim SE (2009) Internet-scale code search. SUITE ’09, pp 49–52
German DM, Manabe Y, Inoue K (2010) A sentence-matching method for automatic license identification of source code files. In: ASE ’10, p 437
Göde N, Koschke R (2009) Incremental clone detection. In: CSMR’09, pp 219–228
Grechanik M, Fu C, Xie Q, McMillan C, Poshyvanyk D, Cumby C (2010) A search engine for finding highly relevant applications. In: ICSE ’10, pp 475–484
Gu X, Zhang H, Kim S (2018) Deep code search. In: Proceedings of the 40th international conference on software engineering (ICSE ’18), pp 933–944
Harris S (2015) Simian – similarity analyser, version 2.4. http://www.harukizaemon.com/simian/, accessed: 2016-02-14
Hummel B, Juergens E, Heinemann L, Conradt M (2010) Index-based code clone detection: incremental, distributed, scalable. In: ICSM’10, pp 1–9
Inoue K, Sasaki Y, Xia P, Manabe Y (2012) Where does this code come from and where does it go? — integrated code history tracker for open source systems. In: ICSE ’12, pp 331–341
Ishio T, Sakaguchi Y, Ito K, Inoue K (2017) Source file set search for clone-and-own reuse analysis. In: Proceedings of the IEEE/ACM 14th international conference on mining software repositories (MSR ’17), pp 257–268
Jiang L, Misherghi G, Su Z, Glondu S (2007) DECKARD: scalable and accurate tree-based detection of code clones. In: ICSE’07. IEEE, Minneapolis, pp 96–105
Juergens E, Deissenboeck F, Hummel B (2011) Code similarities beyond copy & paste. In: Proceedings of the 15th European conference on software maintenance and reengineering (CSMR ’11), IEEE, pp 78–87
Kamiya T, Kusumoto S, Inoue K (2002) CCFinder: a multilinguistic token-based code clone detection system for large scale source code. TSE 28(7):654–670
Kapser C, Godfrey MW (2006) Cloning considered harmful considered harmful. In: Proceedings of the 13th Working Conference on Reverse Engineering (WCRE ’06), Benevento, Italy, pp 19–28
Kawaguchi S, Yamashina T, Uwano H, Fushida K, Kamei Y, Nagura M, Iida H (2009) SHINOBI: a tool for automatic code clone detection in the IDE. In: WCRE ’09, pp 313–314
Ke Y, Stolee KT, Goues CL, Brun Y (2015) Repairing programs with semantic code search. In: ASE’15, pp 295–306
Keivanloo I, Rilling J, Charland P (2011a) Internet-scale real-time code clone search via multi-level indexing. In: WCRE ’11, pp 23–27
Keivanloo I, Rilling J, Charland P (2011b) SeClone – a hybrid approach to internet-scale real-time code clone search. In: ICPC ’11, pp 223–224
Keivanloo I, Forbes C, Rilling J (2012) Similarity search plug-in: Clone detection meets internet-scale code search. In: SUITE ’12, pp 21–22
Keivanloo I, Rilling J, Zou Y (2014) Spotting working code examples. In: ICSE ’14, pp 664–675
Kim K, Kim D, Bissyandé TF, Choi E, Li L, Klein J, Traon YL (2018) FaCoY – a code-to-code search engine. In: ICSE’18
Knuth DE (1971) An empirical study of fortran programs. Software: Practice and Experience 1(2):105–133
Koschke R (2014) Large-scale inter-system clone detection using suffix trees and hashing. Journal of Software: Evolution and Process 26(8):747–769
Koschke R, Falke R, Frenzel P (2006) Clone detection using abstract syntax suffix trees. In: WCRE ’06, pp 253–262
Krinke J (2001) Identifying similar code with program dependence graphs. In: WCRE ’01
Kumaran G, Allan J (2007) A case for shorter queries, and helping users create them. In: NAACL-HLT ’07, pp 220–227
Kumaran G, Carvalho VR (2009) Reducing long queries using query quality predictors. In: SIGIR’09, p 564
Lavoie T, Eilers-Smith M, Merlo E (2010) Challenging cloning related problems with gpu-based algorithms. In: Proceedings of the 4th international workshop on software clones (IWSC ’10), ACM, Cape Town, South Africa, pp 25–32
Lee MW, Roh JW, Hwang SW, Kim S (2010) Instant code clone search. In: FSE ’10, p 167
Li L, Feng H, Zhuang W, Meng N, Ryder B (2017) CCLearner: a deep learning-based clone detection approach. In: ICSME’17, pp 249–260
Linstead E, Bajracharya S, Ngo T, Rigor P, Lopes C, Baldi P (2009) Sourcerer: mining and searching internet-scale software repositories, vol 18, pp 300–336
Livieri S, German DM, Inoue K (2010) A needle in the stack: Efficient clone detection for huge collections of source code. Tech. rep., Osaka University
Lopes CV, Maj P, Martins P, Saini V, Yang D, Zitny J, Sajnani H, Vitek J (2017) DéjàVu: a map of code duplicates on GitHub. Proceedings of the ACM on Programming Languages (OOPSLA) 1:1–28
Manning CD, Raghavan P, Schutze H (2009) An introduction to information retrieval, vol 21. Cambridge University Press, Cambridge
Martie L, Hoek AVD, Kwak T (2017) Understanding the impact of support for iteration on code search. In: ESEC/FSE ’17, pp 774–785
McMillan C, Grechanik M, Poshyvanyk D, Xie Q, Fu C (2011) Portfolio: finding relevant functions and their usages. In: ICSE ’11, p 111
Miller GA (1956) The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol Rev 63(2):81–97
Myles G, Collberg C (2005) K-gram based software birthmarks. In: SAC ’05, p 314
Nasehi SM, Sillito J, Maurer F, Burns C (2012) What makes a good code example?: a study of programming Q&A in StackOverflow. In: ICSM’12, pp 25–34
Nguyen TT, Nguyen HA, Al-Kofahi JM, Pham NH, Nguyen TN (2009) Scalable and incremental clone detection for evolving software. In: ICSM ’09, pp 491–494
Nishi MA, Damevski K (2018) Scalable code clone detection and search based on adaptive prefix filtering. J Syst Softw 137:130–142
Niu H, Keivanloo I, Zou Y (2017) Learning to rank code examples for code search engines. Empir Softw Eng 22(1):259–291
Ohmann T, Rahal I (2014) Efficient clustering-based source code plagiarism detection using PIY. Knowl Inf Syst 43(2):445–472
Omar C, Yoon YS, LaToza TD, Myers BA (2012) Active code completion. In: ICSE ’12, pp 859–869
Park JW, Lee MW, Roh JW, Hwang SW, Kim S (2014) Surfacing code in the dark: an instant clone search approach. Knowl Inf Syst 41(3):727–759
Parr T, Harwell S, Kochurkin I (2017) Grammars written for ANTLR v4. https://github.com/antlr/grammars-v4, accessed: 2017-11-21
Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack Overflow in the IDE. In: ICSE ’13, pp 1295–1298
Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M (2014) Mining StackOverflow to turn the IDE into a self-confident programming prompter. In: MSR ’14, pp 102–111
Prechelt L, Malpohl G, Philippsen M (2002) Finding plagiarisms among a set of programs with JPlag. J Univ Comput Sci 8(11):1016–1038
Ragkhitwetsagul C, Krinke J, Clark D (2018) A comparison of code similarity analysers. Empir Softw Eng 23(4):2464–2519
Ragkhitwetsagul C, Krinke J, Paixao M, Bianco G, Oliveto R (2019) Toxic code snippets on Stack Overflow. Transactions on Software Engineering (Early Access)
Rajaraman A, Ullman JD (2011) Mining of massive datasets, vol 67. Cambridge University Press, Cambridge
Rilling J, Keivanloo I, Forbes C, Erfani M (2018) IJaDataset 2.0. https://sites.google.com/site/asegsecold/projects/seclone, online; access 13-March-2018
Robertson S (1990) On term selection for query expansion. J Doc 46(4):359–364
Roy CK, Cordy JR (2008) NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization. In: ICPC ’08, pp 172–181
Roy CK, Cordy JR (2009) Near-miss function clones in open source software: an empirical study. J Softw Maint Evol Res Pract 26(12):165–189
Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495
Sadowski C, Stolee KT, Elbaum S (2015) How developers search for code: a case study. In: ESEC/FSE ’15, pp 191–201
Saini V, Farmahinifarahani F, Lu Y, Baldi P, Lopes C (2018) Oreo: detection of clones in the twilight zone. In: The 26th ACM joint European software engineering conference and symposium on the foundations of software engineering (ESEC/FSE ’18)
Sajnani H, Saini V, Svajlenko J, Roy CK, Lopes CV (2016) SourcererCC: scaling code clone detection to big-code. In: ICSE’16, pp 1157–1168
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620
Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: local algorithms for document fingerprinting. In: SIGMOD ’03, ACM, p 76
Sim SE, Gallardo-Valencia RE (2013) Finding source code on the web for remix and reuse. Springer, Berlin
Sim SE, Umarji M, Ratanotayanon S, Lopes CV (2011) How well do search engines support code retrieval on the web? ACM Trans Softw Eng Methodol 21(1):1–25
Slaney M, Casey M (2008) Locality-sensitive hashing for finding nearest neighbors. IEEE Signal Proc Mag 25(2):128–131
Smucker MD, Allan J, Carterette B (2007) A comparison of statistical significance tests for information retrieval evaluation. In: CIKM ’07, p 623
Svajlenko J, Roy CK (2014) Evaluating modern clone detection tools. In: Proceedings of the 30th international conference on software maintenance and evolution (ICSME ’14), IEEE, pp 321–330
Svajlenko J, Roy CK (2015) Evaluating clone detection tools with BigCloneBench. In: ICSME’15, pp 131–140
Svajlenko J, Roy CK (2016) BigCloneEval: a clone detection tool evaluation framework with BigCloneBench. In: Proceedings of the international conference on software maintenance and evolution (ICSME ’16), vol 1, pp 596–600
Svajlenko J, Roy CK (2017) Fast and flexible large-scale clone detection with CloneWorks. In: Proceedings of the IEEE/ACM 39th international conference on software engineering companion (ICSE-C ’17), pp 27–30
Svajlenko J, Islam JF, Keivanloo I, Roy CK, Mia MM (2014) Towards a big data curated benchmark of inter-project code clones. In: ICSME’14, pp 476–480
Tamersoy A, Roundy K, Chau DH (2014) Guilt by association: Large scale malware detection by mining. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’14). ACM, New York, pp 1524–1533
Taube-Schock C, Walker RJ, Witten IH (2011) Can we avoid high coupling? In: ECOOP ’11, pp 204–228
Tempero E, Anslow C, Dietrich J, Han T, Li J, Lumpe M, Melton H, Noble J (2010) Qualitas corpus: a curated collection of Java code for empirical studies. In: APSEC ’10, pp 336–345
van Bruggen D (2017) JavaParser – process Java code programmatically. http://javaparser.org, accessed: 2017-11-21
Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2 (Summer, 2000)):101–132
Vasilescu B, Serebrenik A, van den Brand M (2011) You can’t control the unfamiliar: a study on the relations between aggregation techniques for software metrics. In: ICSM ’11, pp 313–322
Wang T, Harman M, Jia Y, Krinke J (2013) Searching for better configurations: a rigorous approach to clone evaluation. In: ESEC/FSE ’13, pp 455–465
White M, Tufano M, Vendome C, Poshyvanyk D (2016) Deep learning code fragments for code clone detection. In: ASE ’16, pp 87–98
Yang D, Martins P, Saini V, Lopes C (2017) Stack Overflow in Github: any snippets there? In: MSR ’17
Zhang F, Niu H, Keivanloo I, Zou Y (2017) Expanding queries for code search using semantically related API class-names. TSE https://doi.org/10.1109/TSE.2017.2750682
Zhang H (2008) Exploring regularity in source code: software science and Zipf’s law. In: WCRE’08, pp 101–110
Zipf GK (1932) Selective studies and the principle of relative frequency in language. Harvard University Press, Cambridge
Additional information
Communicated by: Yasutaka Kamei
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Ragkhitwetsagul, C., Krinke, J. Siamese: scalable and incremental code clone search via multiple code representations. Empir Software Eng 24, 2236–2284 (2019). https://doi.org/10.1007/s10664-019-09697-7
DOI: https://doi.org/10.1007/s10664-019-09697-7