1 Introduction

Assessing source code similarity is a fundamental activity in software engineering with many applications. These include clone detection, the problem of locating duplicated code fragments; plagiarism detection; software copyright infringement detection; and code search, in which developers search for similar implementations. Whilst that list covers the more common applications, similarity assessment is used in many other areas, too. Examples include finding similar bug fixes (Hartmann et al., 2010), identifying cross-cutting concerns (Bruntink et al., 2005), program comprehension (Maletic and Marcus 2001), code recommendation (Holmes and Murphy 2005), and example extraction (Moreno et al., 2015).

1.1 Motivation

The assessment of source code similarity has a co-evolutionary relationship with the modifications made to the code at the point of its creation. Although a large number of clone detectors, plagiarism detectors, and code similarity detectors have been developed in the research community, there are relatively few studies that compare and evaluate their performance. Bellon et al. (2007) proposed a framework for comparing and evaluating six clone detectors; Roy et al. (2009) evaluated a large set of clone detection tools, but only based on results reported in the tools’ published papers; Hage et al. (2010) compared five plagiarism detectors against 17 code modifications; Burd and Bailey (2002) compared five clone detectors for preventive maintenance tasks; Biegel et al. (2011) compared three code similarity measures to identify code that needs refactoring; and Svajlenko and Roy (2016) developed and used a clone evaluation framework called BigCloneEval to evaluate 10 state-of-the-art clone detectors. Although these studies cover various goals of tool evaluation and cover the different types of code modification found in the chosen data sets, they suffer from two limitations: (1) the selected tools are limited to only a subset of clone or plagiarism detectors, and (2) the results are based on different data sets, so a tool’s performance in one study cannot be compared with another tool’s in another study. To the best of our knowledge, there is no study that performs a comprehensive and fair comparison of widely-used code similarity analysers based on the same data sets.

In this paper, we fill the gap by presenting the largest extant study on source code similarity that covers the widest range of techniques and tools. We study the tools’ performances on both local and pervasive (global) code modifications usually found in software engineering activities such as code cloning, software plagiarism, and code refactoring. This study is motivated by the question:

“When source code is copied and modified, which code similarity detection techniques or tools get the most accurate results?”

To answer this question, we provide a thorough evaluation of the performance of the current state-of-the-art similarity detection techniques using several error measures. The aim of this study is to provide a foundation for the appropriate choice of a similarity detection technique or tool for a given application based on a thorough evaluation of strengths and weaknesses on source code with local and global modifications. Choosing the wrong technique or tool with which to measure software similarity or even just choosing the wrong parameters may have detrimental consequences.

We have selected as many techniques for source code similarity measurement as possible, 30 in all, covering techniques specifically designed for clone and plagiarism detection, plus the normalised compression distance, string matching, and information retrieval. In general, the selected tools require the optimisation of their parameters as these can affect the tools’ execution behaviours and consequently their results. A previous study regarding parameter optimisation (Wang et al., 2013) explored only a small set of clone detectors’ parameters using search-based techniques. Therefore, whilst including more tools in this study, we have also searched a wider range of configurations for each tool, studied their impact, and discovered the best configurations for each data set in our experiments. After obtaining the tools’ optimal configurations derived from one data set, we apply them to another data set and observe whether they can be reused effectively.

Clone and plagiarism detection use intermediate representations, such as token streams or abstract syntax trees, or transformations, such as pretty-printing or comment removal, to achieve a normalised representation (Roy et al., 2009). We integrated compilation and decompilation as a normalisation pre-processing step for similarity detection and evaluated its effectiveness.

1.2 Contributions

This paper makes the following primary contributions:

1. A broad, thorough study of the performance of similarity tools and techniques: We compare a large range of 30 similarity detection techniques and tools using five experimental scenarios for Java source code in order to measure the techniques’ performances and observe their behaviours. We apply several error measures including pair-based and query-based measures. The results show that, overall, highly specialised source code similarity detection techniques and tools can perform better than more general, textual similarity measures. However, we also observed some situations where compression-based and textual similarity tools are recommended over clone and plagiarism detectors.

The results of the evaluation can be used by researchers as guidelines for selecting techniques and tools appropriate for their problem domain. Our study confirms both that tool configurations have strong effects on tool performance and that they are sensitive to particular data sets. Poorly chosen techniques or configurations can severely affect results.

2. Normalisation by decompilation: Our study confirms that compilation and decompilation as a pre-processing step can normalise pervasively modified source code and can improve the effectiveness of similarity measurement techniques with statistical significance. Three of the similarity detection techniques and tools reported no false classifications once such normalisation was applied.

3. Data set of pervasive code modifications: The generated data set with pervasive modifications used in this study has been created to be challenging for code similarity analysers. According to the way we constructed the data set, the complete ground truth is known. We make the data set publicly available so that it can be used in future studies of tool evaluation and comparison.

Compared to our previous work (Ragkhitwetsagul et al., 2016), we have expanded the study further as follows. First, we doubled the size of the data set from 5 original Java classes to 10 classes and re-evaluated the tools. This change increased the number of pairwise comparisons quadratically, from 2,500 to 10,000. With this expanded data set, we could better observe the tools’ performances on pervasively modified source code. We found some differences in the tool rankings using the new data set when compared to the previous one. Second, besides source code with pervasive modifications, we compared the similarity analysers on an existing data set containing reused boiler-plate code, and on a data set of boiler-plate code with pervasive modifications. Since boiler-plate code is inherently different from pervasively modified code and is commonly found in software development (Kapser 2006), the findings provide guidance for choosing the right tools/techniques when measuring code similarity in the presence of boiler-plate code. Third, we investigated the effects of reusing optimal configurations from one data set on another data set. Our empirical results show that the optimal configurations are very sensitive to a specific data set and not suitable for reuse.

2 Background

2.1 Source Code Modifications

We are interested in two scenarios of code modifications in this study: pervasive code modifications (global) and boiler-plate code (local). Their definitions are as follows.

Pervasive modifications are code changes that affect the code globally across the whole file, with multiple changes applied one after another. These are code transformations that are mainly found in the course of software plagiarism, when one wants to conceal copied code by changing its appearance to avoid detection (Daniela et al., 2012). Nevertheless, they also represent code clones that are repeatedly modified over time during software evolution (Pate et al., 2013), and source code before and after refactoring activities (Fowler 2013). However, our definition of pervasive modifications excludes strong obfuscation (Collberg et al., 1997), which aims to protect code from reverse engineering by making it difficult or impossible to understand.

Most clone or plagiarism detection tools and techniques tolerate different degrees of change and still identify cloned or plagiarised fragments. However, whilst they usually have no problem in the presence of local or confined modifications, pervasive modifications that transform whole files remain a challenge (Roy and Cordy 2009). For example, when multiple methods are merged into a single method during a code refactoring activity, a clone detector focusing on method-level clones would not report the code before and after merging as a clone pair. Moreover, with multiple lexical and structural code changes applied repeatedly at the same time, the resulting source code can be totally different. When one looks at code before and after applying pervasive modifications, one might not be able to tell that both originate from the same file. We found that code similarity tools suffer the same confusion.

We define source code with pervasive modifications to contain a combination of the following code changes:

  1. Lexical changes of formatting, layout modifications (Type I clones), and identifier renaming (Type II clones).

  2. Structural changes, e.g. if to case or while to for, or insertions or deletions of statements (Type III clones).

  3. Extreme code transformations that preserve source code semantics but change its syntax (Type IV clones).

Figure 1 shows an example of code before and after applying pervasive modifications. It is a real-world example of plagiarism from a university’s programming class submission.

Boiler-plate code occurs when developers reuse a code template, usually a function or a code block, to achieve a particular task. It has been defined as one of the code cloning patterns (Kapser 2006; Kapser and Godfrey 2008). Boiler-plate code can be found when building device drivers for operating systems (Baxter et al., 1998), developing Android applications (Crussell et al., 2013), and setting programming assignments (Burrows et al., 2007; Schleimer et al., 2003). Boiler-plate code usually receives small modifications in order to adapt it to a new environment. In contrast to pervasive modifications, the modifications made to boiler-plate code are usually contained within a function or block. Figure 2 depicts an example of boiler-plate code used for creating new HTTP connection threads, which can be reused as-is or with minimal changes.

Fig. 2 Boiler-plate code to create connection threads

2.2 Code Similarity Measurement

Since the 1970s, a myriad of tools have been introduced to measure the similarity of source code. They are used to tackle problems such as code clone detection, software licence violations, and software plagiarism. The tools utilise different approaches to computing the similarity of two programs. We can classify them into metrics-based, text-based, token-based, tree-based, and graph-based approaches. Early approaches to detecting software similarity (Ottenstein 1976; Donaldson et al., 1981; Grier 1981; Berghel and Sallach 1984; Faidhi and Robinson 1987) are based on metrics or software measures. One of the earliest code similarity detection tools, created by Ottenstein (1976), was based on Halstead complexity measures (Halstead 1977) and was able to discover a plagiarised pair out of 47 programs of students registered in a programming class. Unfortunately, the metrics-based approaches have been found empirically to be less effective than other, newer approaches (Kapser and Godfrey 2003).

Text-based approaches perform similarity checking by comparing two string sequences of source code. They are able to locate exact copies of source code, but usually struggle to find similar code in the presence of syntactic and semantic modifications. Some supporting techniques are incorporated to handle syntactic changes such as variable renaming (Roy and Cordy 2008). There are several code similarity analysers that compute textual similarity. One of the widely-used string similarity methods is to find a longest common subsequence (LCS), which is adopted by the NiCad clone detector (Roy and Cordy 2008), Plague (Whale 1990), the first version of YAP (Wise 1992), and CoP (Luo et al., 2014). Other text-based tools with string matching algorithms other than LCS include, but are not limited to, Duploc (Ducasse et al., 1999), Simian (Harris 2015), and PMD’s Copy/Paste Detector (CPD) (Dangel and Pelisse 2011).
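The LCS-based similarity used by tools such as NiCad can be sketched as follows. This is a minimal character-level illustration in Python (the function names are ours); the actual tools operate on pretty-printed lines or tokens and normalise the similarity differently:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via the classic
    dynamic-programming recurrence with a rolling row."""
    m, n = len(a), len(b)
    prev = [0] * (n + 1)
    for i in range(1, m + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[n]

def lcs_similarity(a: str, b: str) -> float:
    """Normalised LCS similarity in [0, 1]: 1.0 means identical."""
    if not a and not b:
        return 1.0
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))
```

For example, `lcs_similarity("int x = 0;", "int y = 0;")` yields 0.9, reflecting a single-character rename in a ten-character statement.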

To take one step of abstraction up from the literal code text, we can transform source code into tokens (i.e. words). A stream of tokens can be used as an abstract representation of a program. The abstraction level can be adjusted by defining the types of tokens. Depending on how the tokens are defined, the token stream may normalise textual differences and capture only an abstracted sequence of a program. For example, if every word in a program is replaced by a W token, the statement int x = 0; will be similar to String s = "Hi"; because they both share the token stream W W = W;. Different similarity measurements, such as suffix trees, string alignment, Jaccard similarity, etc., can be applied to sequences or sets of tokens. Tools that rely on tokens include Sherlock (Joy and Luck 1999), BOSS (Joy et al., 2005), Sim (Gitchell and Tran 1999), YAP3 (Wise 1996), JPlag (Prechelt et al., 2002), CCFinder (Kamiya et al., 2002), CP-Miner (Li et al., 2006), MOSS (Schleimer et al., 2003), Burrows et al. (2007), and the Source Code Similarity Detector System (SCSDS) (Duric and Gasevic 2013). The token-based representation is widely used in source code similarity measurement and is very efficient at the scale of millions of SLOC. An example is the large-scale token-based clone detection tool SourcererCC (Sajnani et al., 2016).
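The W-token abstraction in the example above can be sketched in Python. This is a deliberately crude lexer of our own (no real tool tokenises this naively) that maps identifiers and literals to the abstract token W and then applies Jaccard similarity over token sets:

```python
import re

def tokenise(code: str) -> list[str]:
    """Crude lexer: string literals, identifiers, and number literals
    become the abstract token 'W'; operators/punctuation stay verbatim."""
    raw = re.findall(r'"[^"]*"|\w+|\S', code)
    return ["W" if t[0].isalnum() or t[0] in '_"' else t for t in raw]

def jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity over the sets of tokens."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
```

Under this abstraction, `int x = 0;` and `String s = "Hi";` both tokenise to `W W = W ;` and therefore have Jaccard similarity 1.0, exactly as described above.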

Tree-based code similarity measurement can avoid issues of formatting and lexical differences and focus only on locating structural similarity between two programs. Abstract syntax trees (ASTs) are a widely-used structure for computing program similarity by finding similar subtrees between two ASTs. The capability of comparing programs’ structures allows tree-based tools to locate similar code under a wider range of modifications, such as added or deleted statements. However, tree-based similarity measures have a high computational complexity. The comparison of two ASTs with N nodes can have an upper bound of O(N³) (Baxter et al., 1998). Usually an optimising mechanism or approximation is included in the similarity computation to lower the computation time (Jiang et al., 2007b). Well-known tree-based tools include CloneDR (Baxter et al., 1998) and Deckard (Jiang et al., 2007b).
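Although the study targets Java, the core idea of tree-based comparison can be sketched with Python's own ast module: fingerprint every subtree by its structure, ignoring identifiers and literal values, and compare the fingerprint sets. Real tools such as Deckard use more elaborate characteristic vectors and approximate matching; this is only an illustrative sketch with function names of our choosing:

```python
import ast

def subtree_hashes(tree: ast.AST) -> list[str]:
    """Collect a structural fingerprint (node type plus child fingerprints)
    for every subtree; identifiers and literal values are deliberately ignored."""
    hashes = []
    def visit(node: ast.AST) -> str:
        kids = [visit(c) for c in ast.iter_child_nodes(node)]
        h = node.__class__.__name__ + "(" + ",".join(kids) + ")"
        hashes.append(h)
        return h
    visit(tree)
    return hashes

def tree_similarity(src_a: str, src_b: str) -> float:
    """Jaccard similarity over the sets of subtree fingerprints."""
    ha = set(subtree_hashes(ast.parse(src_a)))
    hb = set(subtree_hashes(ast.parse(src_b)))
    return len(ha & hb) / len(ha | hb)
```

Because names and values are dropped from the fingerprints, `x = 1` and `y = 2` are structurally identical (similarity 1.0), whilst inserting a statement lowers the score rather than destroying the match.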

Graph-based structures are chosen when one wants to capture not only the structure but also the semantics of a program. However, like trees, graph similarity suffers from high computational complexity. Algorithms for graph comparison are mostly NP-complete (Liu et al., 2006; Crussell et al., 2012; Krinke 2001; Chae et al., 2013). In clone and plagiarism detection, a few specific types of graphs are used, e.g. program dependence graphs (PDGs) or control flow graphs (CFGs). Examples of code similarity analysers using graph-based approaches are those invented by Krinke (2001), Komondoor and Horwitz (2001), Chae et al. (2013), and Chen et al. (2014). Although these tools demonstrate high precision and recall (Krinke 2001), they suffer from scalability issues (Bellon et al., 2007).

Code similarity can be measured not only on source code but also on compiled code. Measuring the similarity of compiled code is useful when the source code is absent or unavailable. Moreover, it can also capture dynamic behaviours of the programs by executing the compiled code. In the last few years, there have been several studies discovering cloned and plagiarised programs (especially mobile applications) based on compiled code (Chae et al., 2013; Chen et al., 2014; Gibler et al., 2013; Crussell et al., 2012, 2013; Tian et al., 2014; Tamada et al., 2004; Myles and Collberg 2004; Hi et al., 2009; Zhang et al., 2012, 2014; McMillan et al., 2012; Luo et al., 2014).

Besides the text, token, tree, and graph-based approaches, several alternative techniques have been adapted to code similarity measurement from other fields of research, such as information theory, information retrieval, and data mining. These techniques show positive results and open further possibilities in this research area. Examples include Software Bertillonage (Davies et al., 2013), Kolmogorov complexity (Li and Vitányi 2008), Latent Semantic Indexing (LSI) (McMillan et al., 2012), and Latent Semantic Analysis (LSA) (Cosma and Joy 2012).
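Kolmogorov complexity itself is uncomputable, but it is commonly approximated by the normalised compression distance (NCD), which substitutes a real compressor for the ideal one. A minimal sketch using zlib (the function name is ours):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: approximates the Kolmogorov-
    complexity-based information distance with a real compressor.
    Smaller values mean the two inputs are more similar."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

If two files share most of their content, compressing their concatenation adds little beyond compressing one of them alone, so the distance is near zero; unrelated files yield a distance near one.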

2.3 Obfuscation and Deobfuscation

Obfuscation is a mechanism of making changes to a program whilst preserving its original functionality. It originally aimed to protect the intellectual property of computer programs from reverse engineering or malicious attack (Collberg et al., 2002) and can be achieved at both the source and binary level. Many automatic code obfuscation tools are available nowadays, both commercial (e.g. Semantic Designs Inc.’s C obfuscator (Semantic Designs 2016), Stunnix’s obfuscators (Stunnix 2016), Diablo (Maebe and Sutter 2006)) and research prototypes (Chow et al., 2001; Schulze and Meyer 2013; Madou et al., 2006; Necula et al., 2002).

Given a program P and its transformed counterpart P′, an obfuscation transformation T is written \(P \xrightarrow {T} P^{\prime }\) and requires P and P′ to exhibit the same observational behaviour (Collberg et al., 1997). Specifically, a legal obfuscation transformation requires that: 1) if P fails to terminate or terminates with errors, then P′ may or may not terminate, and 2) P′ must terminate if P terminates.

Generally, there are three approaches to obfuscation transformations: lexical (layout), control, and data transformations (Collberg et al. 1997, 2002). Lexical transformations can be achieved by renaming identifiers and changing formatting, whilst control transformations use more sophisticated methods such as embedding spurious branches and opaque predicates whose values can be deduced only at runtime. Data transformations make changes to data structures and hence make the source code difficult to reverse engineer. Similarly, binary-code obfuscators transform the content of executable files.
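A lexical transformation such as identifier renaming can be sketched in a few lines. The sketch below (class and function names are ours) renames Python variables to opaque names, analogous in spirit to what bytecode obfuscators like ProGuard do for Java identifiers:

```python
import ast

class Renamer(ast.NodeTransformer):
    """Lexical (layout) obfuscation: rename every variable to an
    opaque name v0, v1, ... while preserving program behaviour."""
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id not in self.mapping:
            self.mapping[node.id] = f"v{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node

def obfuscate(src: str) -> str:
    """Parse, rename all variables consistently, and unparse back to source."""
    return ast.unparse(Renamer().visit(ast.parse(src)))
```

For example, `obfuscate("total = price + tax")` returns `"v0 = v1 + v2"`: meaningful names are gone, but the computation is untouched, which is precisely why text-based similarity measures struggle with such transformations.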

Many obfuscation techniques have been invented and put to use in commercial obfuscators. Collberg et al. (2003) introduce several reordering techniques (e.g. of method parameters, basic block instructions, variables, and constants), splitting of classes, basic blocks, and arrays, and also merging of method parameters and classes. These techniques are implemented in their tool, SandMark. Wang et al. (2001) propose a sophisticated deep obfuscation method called control flow flattening, which is used in a commercial tool called Cloakware. ProGuard (GuardSquare 2015) is a Java bytecode obfuscator which performs obfuscation by removing existing names (e.g. class and method names), replacing them with meaningless characters, and removing all debugging information from Java bytecode. Loco (Madou et al., 2006) is a binary obfuscator capable of performing obfuscation using control flow flattening and opaque predicates on selected fragments of code.

Deobfuscation is a method aiming at reversing the effects of obfuscation, which can be achieved at either the static or the dynamic level. It can be useful in many respects, such as detection of obfuscated malware (Nachenberg 1996) or as a resiliency test for a newly developed obfuscation method (Madou et al., 2006). Whilst surface obfuscation such as variable renaming can be handled straightforwardly, deep obfuscation, which makes large changes to the structure of the program (e.g. opaque predicates or control flow flattening), is much more difficult to reverse. However, it is not totally impossible. It has been shown that one can counter control flow flattening by either cloning the portions of added spurious code to separate them from the original execution path or by using static path feasibility analysis (Udupa et al., 2005).

2.4 Program Decompilation

Decompilation of a program generates high-level code from low-level code. It has several benefits, including recovery of lost source code from compiled artefacts such as binaries or bytecode, reverse engineering, and finding similar applications (Chen et al., 2014). On the other hand, decompilation can also be used to create program clones by decompiling a program, making changes, and repacking it into a new program. An example of this malicious use of decompilation can be seen in a study by Chen et al. (2014), who found that 13.51% of all applications from five different Android markets are clones. Gibler et al. (2013) discovered that these decompiled and cloned apps can divert advertisement impressions from the original app owners by 14% and potential users by 10%.

Many decompilers for various programming languages have been proposed in the literature (Cifuentes and Gough 1995; Proebsting and Watterson 1997; Desnos and Gueguen 2011; Mycroft 1999; Breuer and Bowen 1994). Several techniques are involved in successfully decompiling a program, and the decompiled source code may differ from one decompiler to another. Conceptually, decompilers extract the semantics of programs from their executables and then, with some heuristics, generate source code based on this extraction. For example, Krakatoa (Proebsting and Watterson 1997), a Java decompiler, extracts expressions and type information from Java bytecode using symbolic execution, and creates a control flow graph (CFG) of the program representing the behaviour of the executable. Then, to generate source code, a sequencer arranges the nodes and creates an abstract syntax tree (AST) of the program. The AST is then simplified by rewriting rules and, finally, the resulting Java source code is created by traversing the AST.

It has been found that program decompilation has the additional benefit of code normalisation. An empirical study (Ragkhitwetsagul and Krinke 2017b) shows that, compared to clones in the original versions, additional clones were found after compilation/decompilation in three real-world software projects. Many of the newly discovered clone pairs contained modifications that cause difficulty for clone detectors. Compilation and decompilation canonicalise these changes, and the clone pairs became very similar after the decompilation step.

3 Empirical Study

Our empirical study consists of five experimental scenarios covering different aspects and characteristics of source code similarity. Three experimental scenarios examined tool/technique performance on three different data sets to discover any strengths and weaknesses. These three are (1) experiments on the products of the two obfuscation tools, (2) experiments on an existing data set for identification of reused boiler-plate code (Flores et al., 2014), and (3) experiments on the combinations of pervasive modifications and boiler-plate code. The fourth scenario examined the effectiveness of compilation/decompilation as a pre-processing normalisation strategy, and the fifth evaluated the use of error measures from information retrieval for comparing tool performance without relying on a threshold value.

The empirical study aimed to answer the following research questions:

RQ1 (Performance comparison): How well do current similarity detection techniques perform in the presence of pervasive source code modifications and boiler-plate code? We compare 30 code similarity analysers using a data set of 100 pervasively modified pieces of source code and a data set of 259 pieces of Java source code that incorporate reused boiler-plate code.

RQ2 (Optimal configurations): What are the best parameter settings and similarity thresholds for the techniques? We exhaustively search wide ranges of the tools’ parameter values to locate the ones that give optimal performances so that we can fairly compare the techniques. We are also interested to see if one can gain optimal performance of the tools by relying on default configurations.

RQ3 (Normalisation by decompilation): How much does compilation followed by decompilation as a pre-processing normalisation method improve detection results for pervasively modified code? We apply compilation and decompilation to the data set before running the tools. We compare the performances before and after applying this normalisation.

RQ4 (Reuse of configurations): Can we effectively reuse optimal tool configurations for one data set on another data set? We apply the optimal tool configurations obtained using one data set when using the tools with another data set and investigate whether they still offer the tools’ best performances.

RQ5 (Ranked Results): Which tools perform best when only the top n results are retrieved? Besides the set-based error measures normally used in clone and plagiarism detection evaluation (e.g. precision, recall, F-scores), we also compare and report the tools’ performances using ranked results adopted from information retrieval. This comparison has a practical benefit in terms of plagiarism detection, manual clone study, and automated software repair.

RQ6 (Local + global code modifications): How well do the techniques perform when source code containing boiler-plate code clones have been pervasively modified? We evaluate the tools on a data set combining both local and global code modifications. This question also studies which types of pervasive modifications (source code obfuscation, bytecode obfuscation, compilation/decompilation) strongly affect tools’ performances.

3.1 Experimental Framework

The general framework of our study, as shown in Fig. 3, consists of five main steps. In Step 1, we collect test data consisting of Java source code files. Next, the source files are transformed by applying pervasive modifications at the source and bytecode level. In the third step, all original and transformed source files are normalised. A simple form of normalisation is pretty-printing the source files, which is used in similarity or clone detection (Roy and Cordy 2008); we also use decompilation. In Step 4, the similarity detection tools are executed pairwise against the set of all normalised files, producing similarity reports for every pair. In the last step, the similarity reports are analysed.

Fig. 3 The experimental framework

In the analysis step, we extract a similarity value sim(x,y) from the report for every pair of files x,y, and based on the reported similarity, the pair is classified as being similar (reused code) or not according to some chosen threshold T. The set of similar pairs of files Sim(F) out of all files F is

$$ \text{Sim}(F)=\{(x,y) \in F \times F: \text{sim}(x,y) > T\} $$

We selected data sets for which we know the ground truth, allowing us to decide whether a code pair is correctly classified as a similar pair (true positive, TP), correctly classified as a dissimilar pair (true negative, TN), incorrectly classified as a similar pair whilst it is actually dissimilar (false positive, FP), or incorrectly classified as a dissimilar pair whilst it is actually a similar pair (false negative, FN). We then create a confusion matrix for every tool containing the TP, FP, TN, and FN frequencies. Subsequently, the confusion matrix is used to compute each technique’s performance.
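The classification and confusion-matrix computation described above can be sketched as follows; the sim_scores mapping and ground-truth set below are hypothetical stand-ins for a tool's similarity report and our known ground truth:

```python
def classify(sim_scores, ground_truth, threshold):
    """Build a confusion matrix from pairwise similarity scores.
    sim_scores: {(x, y): similarity}; ground_truth: set of truly similar pairs.
    A pair is predicted similar when sim(x, y) > threshold."""
    tp = fp = tn = fn = 0
    for pair, s in sim_scores.items():
        predicted = s > threshold
        actual = pair in ground_truth
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def f_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall computed from the matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Sweeping the threshold over such a matrix is exactly how a tool's optimal similarity threshold is located in the experiments that follow.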

3.2 Tools and Techniques

Several tools and techniques were used in this study. These fall into three categories: obfuscators, decompilers, and detectors. The tool set included source and bytecode obfuscators, and two decompilers. The detectors cover a wide range of similarity measurement techniques and methods including plagiarism and clone detection, compression distance, string matching, and information retrieval. All tools are open source in order to expedite the repeatability of our experiments.

3.2.1 Obfuscators

In order to create pervasive modifications in Step 2 (transformation) of the framework, we used two obfuscators that do not employ strong obfuscation: Artifice and ProGuard. Artifice (Schulze and Meyer 2013) is an Eclipse plugin for source-level obfuscation. The tool makes five different transformations to Java source code: 1) renaming of variables, fields, and methods, 2) changing assignment, increment, and decrement operations to normal form, 3) inserting additional assignment, increment, and decrement operations where possible, 4) changing while to for and vice versa, and 5) changing if to its short form. Because it is an Eclipse plugin, Artifice cannot be automated and has to be run manually. ProGuard (GuardSquare 2015) is a well-known open-source bytecode obfuscator. It is a versatile tool containing several functions, including shrinking Java class files, optimisation, obfuscation, and pre-verification. ProGuard obfuscates Java bytecode by renaming classes, fields, and variables with short, meaningless names. It also performs package hierarchy flattening, class repackaging, merging of methods and classes, and modification of package access permissions.

Using source and bytecode obfuscators, we can create pervasively modified source code that contains modifications of lexical and structural changes. We have investigated the code transformations offered by Artifice and ProGuard and found that they cover changes commonly found in both code cloning and code plagiarism as reported by Roy and Cordy (2009), Schulze and Meyer (2013), Duric and Gasevic (2013), Joy and Luck (1999), and Brixtel et al. (2010). The details of code modifications supported by our obfuscators are shown in Table 1.

Table 1 List of pervasive code modifications offered by our source code and bytecode obfuscator, and compiler/decompilers

3.2.2 Compiler and Decompilers

Our study uses compilation and decompilation for two purposes: transformation (obfuscation) and normalisation.

One can use a combination of compilation and decompilation as a method of source code obfuscation or transformation. Luo et al. (2014) use GCC/G++ with different optimisation options to generate 10 different binary versions of the same program. However, if the desired final product is source code, a decompiler is also required in the process in order to transform the bytecode back into its source form. The only compiler deployed in this study is the standard Java compiler (javac).

Decompilation is a method for reversing the process of program compilation. Given a low-level language program such as an executable file, a decompiler generates a high-level language counterpart that resembles the (original) source code. This has several applications, including recovery of lost source code, migrating a system to another platform, upgrading an old program to a newer programming language, restructuring poorly-written code, finding bugs or malicious code in binary programs, and program validation (Cifuentes and Gough 1995). An example of using a decompiler to reuse code is the well-known lawsuit between Oracle and Google (United States District Court 2011). It seems that Google decompiled a Java library to obtain the source code of its APIs and then partially reused them in their Android operating system.

Since each decompiler has its own decompiling algorithm, one decompiler usually generates source code which is different from the source code generated by other decompilers. Using more than one decompiler can also be a method of obfuscation by creating variants of the same program with the same semantics but with different source code.

We selected two open source decompilers: Krakatau and Procyon. Krakatau (Grosse 2016) is an open-source tool set comprising a decompiler, a class file disassembler, and an assembler. Procyon (Strobel 2016) is an open-source Java decompiler. It has advantages over other decompilers in handling declarations of enum, String, and switch statements, anonymous and named local classes, annotations, and method references. Both are used in the transformation (obfuscation) and normalisation post-processing steps (Steps 2 and 3) of the framework.

Using a combination of compilation and decompilation to generate code with pervasive modifications can represent source code that has been refactored (Fowler 2013) or rewritten (i.e. Type IV clones) (Roy et al., 2009). Whilst the semantics is preserved, the source code syntax, including layout, variable names, and structure, may differ. Table 1 shows the code modifications that are supported by our compiler and decompilers.

3.2.3 Plagiarism Detectors

The selected plagiarism detectors include JPlag, Sherlock, Sim, and Plaggie. JPlag (Prechelt et al., 2002) and Sim (Gitchell and Tran 1999) are token-based tools which come in versions for text (jplag-text and simtext) and Java (jplag-java and simjava), whilst Sherlock (Pike R and Loki 2002) relies on digital signatures (numbers created from series of bits converted from the source code text). Plaggie’s detection method (Ahtiainen et al., 2006) is not public, but the tool claims to offer the same functionality as JPlag. Although several other plagiarism detection tools are available, some could not be chosen for the study because the absence of a command-line version prevented them from being automated. Moreover, we require a quantitative similarity measurement so that we can compare the tools’ performances. All chosen tools report a numerical similarity value, sim(x,y), for a given file pair x,y.

3.2.4 Clone Detectors

We cover a wide spectrum of clone detection techniques including text-based, token-based, and tree-based techniques. Like the plagiarism detectors, the selected tools are command-line based and produce clone reports providing a similarity value between two files.

Most state-of-the-art clone detectors do not report similarity values. Thus, we adopted the General Clone Format (GCF) as a common format for clone reports. We modified and integrated the GCF Converter (Wang et al., 2013) to convert clone reports generated by unsupported clone detectors into GCF format.

Since a GCF report may contain several clone fragments between two files x and y, the similarity of x to y is calculated as the ratio of the total size of the clone fragments between x and y found in x (with overlaps merged), i.e. \(\mathit {frag}_{i}^{x}(x,y)\), to the size of x, and vice versa.

$$ \text{sim}_{\text{GCF}}(x,y) = \frac{\sum\nolimits_{i = 1}^{n} |\mathit{frag}_{i}^{x}(x,y)|}{|x|} $$
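A sketch of this calculation at line granularity, with overlapping fragments merged so that no line is counted twice (the fragment ranges and file size are illustrative):

```python
def sim_gcf(fragments, file_size):
    """Approximate sim_GCF(x, y): the fraction of file x covered by the
    clone fragments found between x and y.  `fragments` is a list of
    (start_line, end_line) ranges in x (inclusive); overlaps are merged
    so that overlapping fragments are not counted twice."""
    covered = set()
    for start, end in fragments:
        covered.update(range(start, end + 1))
    return len(covered) / file_size

# Two fragments covering lines 1-10 and 8-15 of a 20-line file:
# the merged coverage is lines 1-15, i.e. 15 of 20 lines.
print(sim_gcf([(1, 10), (8, 15)], 20))  # 0.75
```

The same computation applies unchanged at token or character granularity by using token or character offsets instead of line numbers.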

Using this method, we included five state-of-the-art clone detectors: CCFinderX, NICAD, Simian, iClones, and Deckard. CCFinderX (ccfx) (Kamiya et al., 2002) is a token-based clone detector detecting similarity using suffix trees. NICAD (Roy and Cordy 2008) is a clone detection tool embedding TXL for pretty-printing, and compares source code using string similarity. Simian (Harris 2015) is a purely text-based clone detection tool relying on text line comparison, with a capability for checking basic code modifications, e.g. identifier renaming. iClones (Göde and Koschke 2009) performs token-based incremental clone detection over several revisions of a program. Deckard (Jiang et al., 2007a) converts source code into an AST and computes similarity by comparing characteristic vectors generated from the AST to find cloned code based on approximate tree similarity.

Although most of the clone reports only contain clone lines, the actual implementations of the clone detection tools work at different granularities of code fragments. Measuring clone similarity at a single granularity level, such as line, may penalise some tools whilst favouring others. With this concern in mind, our clone similarity calculation varies over multiple granularity levels to avoid bias towards any particular tool. We consider three different granularity levels: line, token, and character. Computing similarity at the level of lines or tokens is common for clone detectors. Simian and NICAD detect clones based on source code lines whilst CCFinderX and iClones work at token level. However, Deckard compares clones based on ASTs, so its similarity comes from neither lines nor tokens. To make sure that we get the most accurate similarity calculation for Deckard and the other clone detectors, we also cover the most fine-grained level of source code: characters. Using these three levels of granularity (line, token, and character), we calculate three simGCF(x,y) values for each of the tools.

3.2.5 Compression Tools

Normalised compression distance (NCD) is a distance metric between two documents based on compression (Cilibrasi and Vitányi 2005). It is an approximation of the normalised information distance which is in turn based on the concept of Kolmogorov complexity (Li and Vitányi 2008). The NCD between two documents can be computed by

$$ \text{NCD}_{z}(x,y) = \frac{Z(xy)-\min\left\lbrace Z(x),Z(y)\right\rbrace }{\max\left\lbrace Z(x),Z(y)\right\rbrace } $$

where Z(x) means the length of the compressed version of document x using compressor Z. In this study, five variations of NCD tools are chosen. One is part of CompLearn (Cilibrasi et al., 2015) which uses the built-in bzlib and zlib compressors. The other four have been created by the authors as shell scripts. The first one utilises 7-Zip (Pavlov 2016) with various compression methods including BZip2, Deflate, Deflate64, PPMd, LZMA, and LZMA2. The other three rely on Linux’s gzip, bzip2, and xz compressors respectively.
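The NCD formula above can be sketched directly with Python’s built-in bz2 module standing in for the compressor Z (a sketch only, not one of the shell scripts used in the study; the code snippets compared are illustrative):

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance with bzip2 as the compressor Z:
    (Z(xy) - min(Z(x), Z(y))) / max(Z(x), Z(y))."""
    zx = len(bz2.compress(x))
    zy = len(bz2.compress(y))
    zxy = len(bz2.compress(x + y))
    return (zxy - min(zx, zy)) / max(zx, zy)

code_a = b"for (int i = 0; i < n; i++) { sum += a[i]; }\n" * 30
code_b = b"public static void main(String[] args) { run(); }\n" * 30

# A file concatenated with itself compresses almost as well as the file
# alone, so NCD is near 0; unrelated files share little information, so
# their NCD is closer to 1.
print(ncd(code_a, code_a) < ncd(code_a, code_b))  # True
```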

Lastly, we define another, asymmetric, similarity measurement based on compression called inclusion compression divergence (ICD). It is a compressor based approximation to the ratio between the conditional Kolmogorov complexity of string x given string y and the Kolmogorov complexity of x, i.e. to K(x|y)/K(x), the proportion of the randomness in x not due to that of y. It is defined as

$$ \text{ICD}_{Z}(x,y)=\frac{Z(xy)-Z(y)}{Z(x)} $$

and when C is \(\text{NCD}_{Z}\) or \(\text{ICD}_{Z}\), we use \(\text{sim}_{C}(x,y) = 1 - C(x,y)\).
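ICD and its derived similarity can be sketched with the same bz2 stand-in for the compressor Z; the code fragments below are illustrative. Note the asymmetry: a string is well "explained" by a file that contains it, but not vice versa.

```python
import bz2

def icd(x: bytes, y: bytes) -> float:
    """Inclusion compression divergence: approximates K(x|y)/K(x), the
    share of the information in x that is not explained by y."""
    return (len(bz2.compress(x + y)) - len(bz2.compress(y))) / len(bz2.compress(x))

def sim_icd(x: bytes, y: bytes) -> float:
    """Similarity derived from ICD: sim_C(x, y) = 1 - C(x, y)."""
    return 1.0 - icd(x, y)

fragment = b"int gcd(int a, int b) { return b == 0 ? a : gcd(b, a % b); }\n" * 20
containing = fragment + b"int lcm(int a, int b) { return a / gcd(a, b) * b; }\n" * 20
other = b"class Stack { private int[] items; private int top; }\n" * 20

# The fragment is fully contained in `containing`, so little of its
# information is left unexplained and the similarity is high.
print(sim_icd(fragment, containing) > sim_icd(fragment, other))  # True
```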

3.2.6 Other Techniques

We expanded our study with other techniques for measuring similarity, including a range of libraries that measure textual similarity: difflib (Python Software Foundation 2016) compares text sequences using Gestalt pattern matching, Python NGram (Poulter 2012) compares text sequences via fuzzy search using n-grams, FuzzyWuzzy (Cohen 2011) uses fuzzy string token matching, jellyfish (Turk and Stephens 2016) does approximate and phonetic matching of strings, and cosine similarity comes from scikit-learn (Pedregosa et al., 2011), a machine learning library for data mining and data analysis. We also employed diff, the classic file comparison tool, and bsdiff, a binary file comparison tool. Using diff or bsdiff, we calculate the similarity between two Java files x and y using

$$ \text{sim}_{D}(x,y)= 1-\frac{\min(|y|,|D(x,y)|)}{|y|} $$

where D(x,y) is the output of diff or bsdiff.

The result of \(\text{sim}_{D}(x,y)\) is asymmetric as it depends on the size of the denominator. Hence \(\text{sim}_{D}(x,y)\) usually differs from \(\text{sim}_{D}(y,x)\): \(\text{sim}_{D}(x,y)\) reflects the cost of editing x into y, which is different from the cost in the opposite direction.
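The diff-based similarity can be sketched with Python’s difflib.unified_diff standing in for the external diff tool (the study itself invokes the diff/bsdiff binaries; the inputs are illustrative):

```python
import difflib

def sim_d(x: str, y: str) -> float:
    """Diff-based similarity: 1 - min(|y|, |D(x,y)|) / |y|, where D(x,y)
    is the edit script turning x into y (here a unified diff, as a
    stand-in for the output of the external `diff` tool)."""
    d = "".join(difflib.unified_diff(x.splitlines(True), y.splitlines(True)))
    return 1.0 - min(len(y), len(d)) / len(y)

x = "a\nb\nc\nd\n" * 10
y = "a\nb\nc\nd\n" * 10 + "e\n"

print(sim_d(x, x))                 # identical files: empty diff, 1.0
print(sim_d(x, y) != sim_d(y, x))  # asymmetric, as noted above: True
```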

A summary of all selected tools and their respective similarity measurement methods is presented in Table 2. The default configuration of each tool, as displayed in Table 3, is extracted from (1) the values displayed in the tool’s help menu, (2) the tool’s website, or (3) the tool’s paper (e.g. Deckard (Jiang et al., 2007b)). The ranges of parameter values we searched in our study are also included in Table 3.

Table 2 Tools with their similarity measures
Table 3 Tools and their parameters with chosen value ranges (DF denotes default parameters)

4 Experimental Scenarios

To answer the research questions, five experimental scenarios have been designed and studied following the framework presented in Fig. 3. The experiment was conducted on a virtual machine with 2.67 GHz CPU (dual core) and 2 GB RAM running Scientific Linux release 6.6 (Carbon), and 24 Microsoft Azure virtual machines with up to 16 cores, 56 GB memory running Ubuntu 14.04 LTS. The details of each scenario are explained below.

4.1 Scenario 1 (Pervasive Modifications)

Scenario 1 studies tool performance against pervasive modifications (as simulated through source and bytecode obfuscation). At the same time, the best configuration for every tool is discovered. For this data set, we completed all the 5 steps of the framework: data preparation, transformation, post-processing, similarity detection, and analysing the similarity report. However, post-processing is limited to pretty printing and no normalisation through decompilation is applied.

4.1.1 Preparation, Transformation, and Normalisation

This section follows Steps 1 and 2 in the framework. The original data consists of 10 Java classes: BubbleSort, EightQueens, GuessWord, TowerOfHanoi, InfixConverter, Kapreka_Transformation, MagicSquare, RailRoadCar, SLinkedList, and, finally, SqrtAlgorithm. We downloaded them from two programming websites as shown in Table 4 along with the class descriptions. We selected only the classes that can be compiled and decompiled without any required dependencies other than the Java SDK. All of them are short Java programs with less than 200 LOC and they illustrate issues that are usually discussed in basic programming classes. The process of test data preparation and transformation is illustrated in Fig. 5. First, we selected each original source code file and obfuscated it using Artifice. This produced the first type of obfuscation: source-level obfuscation (No. 1). An example of a method before and after source-level obfuscation by Artifice is displayed on the top of Fig. 4 (formatting has been adjusted due to space limits).

Fig. 4

The same code fragments, a constructor of MagicSquare, after pervasive modifications, and compilation/decompilation

Fig. 5

Test data generation process

Table 4 Descriptions of the 10 original Java classes in the generated data set

Next, both the original and the obfuscated versions were compiled to bytecode, producing two bytecode files. Then, both bytecode files were obfuscated once again by ProGuard, producing two more bytecode files.

All four bytecode files were then decompiled by either Krakatau or Procyon giving back eight additional obfuscated source code files. For example, No. 1 in Fig. 5 is a pervasively modified version via source code obfuscation with Artifice. No. 2 is a version which is obfuscated by Artifice, compiled, obfuscated with ProGuard, and then decompiled with Krakatau. No. 3 is a version obfuscated by Artifice, compiled and then decompiled with Procyon. Using this method, we obtained 9 pervasively modified versions for each original source file, resulting in 100 files for the data set. The only post-processing step in this scenario is normalisation through pretty printing.

4.1.2 Similarity Detection

The generated data set of 100 Java code files is used for pairwise similarity detection in Step 4 of the framework in Fig. 3, resulting in 10,000 pairs of source code files with their respective similarity values. We denote each pair and their similarity as a triple (x,y,sim). Since each tool can have multiple parameters to adjust and we aimed to cover as many parameter settings as possible, we repeatedly ran each tool with different settings in the ranges listed in Table 3. Hence, the number of reports generated by one tool equals the number of combinations of its parameter values. A tool with two parameters \(p_{1} \in P_{1}\) and \(p_{2} \in P_{2}\) has \(|P_{1}| \times |P_{2}|\) different settings. For example, sherlock has two parameters N ∈ {1,2,3,...,8} and Z ∈ {0,1,2,...,8}. We therefore performed 8 × 9 × 10,000 = 720,000 pairwise comparisons and generated 72 similarity reports for sherlock alone. To cover the 30 tools with all of their possible configurations, we performed 14,880,000 pairwise comparisons in total and analysed 1,488 reports.
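The configuration counting above can be sketched with itertools; the parameter grid is the sherlock grid quoted from Table 3:

```python
from itertools import product

# sherlock's parameter grid as described above: N in 1..8, Z in 0..8
N_values = range(1, 9)
Z_values = range(0, 9)

# One similarity report per combination of parameter values.
settings = list(product(N_values, Z_values))
pairs_per_report = 100 * 100  # 100 files compared pairwise

print(len(settings))                      # 72 similarity reports
print(len(settings) * pairs_per_report)   # 720,000 pairwise comparisons
```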

4.1.3 Analysing the Similarity Reports

In Step 5 of the framework, the results of the pairwise similarity detection are analysed. The 10,000 pairwise comparisons result in 10,000 (x,y,sim) entries. As in (1), all pairs x,y are considered to be similar when the reported similarity sim is larger than a threshold T. Such a threshold must be set in an informed way to produce sensible results. However, as the results of our experiment will be extremely sensitive to the chosen threshold, we want to use the optimal threshold, i.e. the threshold that produces the best results. Therefore, we vary the cut-off threshold T between 0 and 100.

As shown in Table 5, the ground truth of the generated data set contains 1,000 positives and 9,000 negatives. The positive pairs are the pairs of files generated from the same original code. For example, all pairs that are the derivatives of InfixConverter.java must be reported as similar. The other 9,000 pairs are negatives since they come from different original source code files and must be classified as dissimilar. Using this ground truth, we can count the number of true and false positives in the results reported for each of the tools. We choose the F-score as the method to measure the tools’ performance. The F-score is preferred in this context since the sets of similar files and dissimilar files are unbalanced and the F-score does not take true negatives into account.
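The positive/negative counts follow directly from the data set’s construction: 10 originals with 10 variants each, where a pair is positive exactly when both files derive from the same original. A minimal sketch with numeric stand-ins for the file names:

```python
from itertools import product

# 10 original programs, each with 10 pervasively modified variants
# (including the original itself); names replaced by indices.
files = [(orig, ver) for orig in range(10) for ver in range(10)]

# A pair is a true positive iff both files derive from the same original.
positives = [(a, b) for a, b in product(files, files) if a[0] == b[0]]
negatives = [(a, b) for a, b in product(files, files) if a[0] != b[0]]

print(len(positives), len(negatives))  # 1000 9000
```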

The F-score is the harmonic mean of precision (the ratio of correctly identified reused pairs to all retrieved pairs) and recall (the ratio of correctly identified reused pairs to all actual reused pairs):

$$\text{precision}=\frac{\mathit{TP}}{\mathit{TP}+\mathit{FP}}\hspace{1cm} \text{recall}=\frac{\mathit{TP}}{\mathit{TP}+\mathit{FN}} $$
$$\mathrm{F-score}=\frac{2 \times\text{precision}\times\text{recall}}{\text{precision}+\text{recall}} $$

Using the F-score we can search for the best threshold T under which each tool has its optimal performance with the highest F-score. For example in Fig. 6, after varying the threshold from 0 to 100, ncd-bzlib has the best threshold T = 37 with the highest F-score of 0.846. Since each tool may have more than one parameter setting, we call the combination of the parameter settings and threshold that produces the highest F-score the tool’s “optimal configuration”.
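A minimal sketch of this threshold search, assuming a similarity report represented as a dictionary from file pairs to similarity values in [0, 100] (the pair names and scores below are illustrative):

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall, as defined above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_threshold(results, ground_truth):
    """results: {(x, y): sim}; ground_truth: set of truly similar pairs.
    Sweep T from 0 to 100 and keep the threshold with the highest F-score."""
    best = (0.0, 0)
    for t in range(101):
        reported = {pair for pair, s in results.items() if s > t}
        tp = len(reported & ground_truth)
        fp = len(reported - ground_truth)
        fn = len(ground_truth - reported)
        best = max(best, (f_score(tp, fp, fn), t))
    return best  # (highest F-score, a threshold achieving it)

# Toy report: two true pairs scored high, one false pair scored low.
results = {("a", "a1"): 90, ("a", "a2"): 60, ("a", "b"): 30}
truth = {("a", "a1"), ("a", "a2")}
print(best_threshold(results, truth))  # best F-score is 1.0
```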

Fig. 6

The graph shows the F-score and the threshold values of ncd-bzlib. The tool reaches the highest F-score when the threshold equals 37

4.2 Scenario 2 (Reused Boiler-Plate Code)

In this scenario, we analyse the tools’ performance on an existing data set that contains files in which fragments of boiler-plate code are reused with or without modifications. We chose the data set provided by the Detection of SOurce COde Re-use competition for discovering monolingual re-used source code amongst a given set of programs (Flores et al., 2014), which we call the SOCO data set. We found that many of its files share the same or very similar boiler-plate code fragments which perform the same task. Some of the boiler-plate fragments have been modified to adapt to the environment in which they are re-used. Since we reused the data set from another study (Flores et al., 2014), we merely needed to format the source code files by removing comments and applying pretty-printing in Step 1 of our experimental framework (see Fig. 3). We skipped the pervasive modification Steps 2 and 3 and followed only Step 4 (similarity detection) and Step 5 (analysing the similarity reports) of our framework.

We selected the Java training set containing 259 files for which the answer key of true clone pairs is provided. The answer key contains 84 file pairs that share boiler-plate code. Using the provided pairs, we are able to measure both false positives and false negatives. For each tool, this data set produced 259 × 259 = 67,081 pairwise comparisons. Out of these 67,081 file pairs, 259 + 2 × 84 = 427 pairs are similar. However, after manually investigating false positives in a preliminary study, we found that the provided ground truth contains errors. An investigation revealed that the provided answer key contained two large clusters in which pairs were missing and that two given pairs were wrong. After removing the wrong pairs and adding the missing pairs, the corrected ground truth contains 259 + 2 × 97 = 453 pairs.

We performed two analyses on this data set: 1) applying the derived configurations to the data set and measuring the tools’ performances, and 2) searching for the optimal configurations. Again, no transformation or normalisation has been applied to this data set as it is already prepared.

Since the SOCO data set is 2.59 times larger than the generated data set (259 Java files vs. 100 Java files), it takes much longer to run. For example, it took CCFinderX 7 hours 48 minutes to complete the 259 × 259 = 67,081 pairwise comparisons with one of its configurations on our Azure virtual machine. Completing the search space of 20 × 14 = 280 CCFinderX configurations took us 90 days. Executions of the 30 tools with all of their possible configurations cover 99,816,528 pairwise comparisons in total for this data set, compared to 14,880,000 comparisons in Scenario 1. We analysed 1,488 similarity reports in total.

4.3 Scenario 3 (Decompilation)

We are interested in studying the effects of normalisation through compilation/decompilation before performing similarity detection. This is based on the observation that compilation has a normalising effect: variable names disappear in bytecode, and nominally different kinds of control structures can be replaced by the same bytecode, e.g. for and while loops are replaced by the same if and goto structures at the bytecode level.

Likewise, changes made by bytecode obfuscators may also be normalised by decompilers. Suppose a Java program P is obfuscated (transformed, T) into Q (\(P \xrightarrow {T} Q\)), then compiled (C) to bytecode \(B_{Q}\), and decompiled (D) to source code Q′ (\(Q \xrightarrow {C} B_{Q} \xrightarrow {D} Q^{\prime }\)). This Q′ should be different from both P and Q due to the changes caused by the compiler and decompiler. However, if the same original source code P is compiled and decompiled using the same tools to create P′ (\(P \xrightarrow {C} B_{P} \xrightarrow {D} P^{\prime }\)), P′ should have some similarity to Q′ due to the analogous compiling/decompiling transformations applied to both of them. Hence, one might apply similarity detection to find sim(P′,Q′) and get more accurate results than sim(P,Q).

In this scenario, we focus on the generated data set containing pervasive code modifications of 100 source code files generated in Scenario 1. However, we added normalisation through decompilation to the post-processing (Step 3 in the framework) by compiling all the transformed files using javac and decompiling them using either Krakatau or Procyon. We then followed the same similarity detection and analysis process in Steps 4 and 5. The results are then compared to the results obtained from Scenario 1 to observe the effects of normalisation through decompilation.

4.4 Scenario 4 (Ranked Results)

In our three previous scenarios, we compared the tools’ performances using their optimal F-scores. The F-score offers a weighted harmonic mean of precision and recall. It is a set-based measure that does not consider any ordering of results. The optimal F-scores are obtained by varying the threshold T to find the highest F-score. We observed from the results of the previous scenarios that the thresholds are highly sensitive to each particular data set. Therefore, we had to repeat the process of finding the optimal threshold every time we changed to a new data set. This was burdensome but could be done since we knew the ground truth data of the data sets. The configuration problem for clone detection tools including setting thresholds has been mentioned by several studies as one of the threats to validity (Wang et al., 2001). There has also been an initiative to avoid using thresholds at all for clone detection (Keivanloo et al., 2015). Hence, we try to avoid the problem of threshold sensitivity affecting our results. Moreover, this approach also has applications in software engineering including finding candidates for plagiarism detection, automated software repair, working code examples, and large-scale code clone detection.

Instead of looking at the results as a set and applying a cut-off threshold to obtain true and false positives, we consider only a subset of the results based on their rankings. We adopt three error measures mainly used in information retrieval: precision-at-n (prec@n), average r-precision (ARP), and mean average precision (MAP) to measure the tools’ performances. We present their definitions below.

Given n as a number of top n results ranked by similarity, precision-at-n (Manning et al., 2009) is defined as:

$$\mathrm{prec@\textit{n}} = \frac{\text{TP}}{n} $$

In the presence of ground truth, we can set the value of n to the number of relevant results (i.e. true positives). With a known ground truth, precision-at-n where n equals the number of true positives is called r-precision (RP), where r stands for “relevant” (Manning et al., 2009). If the set of relevant files for a query q ∈ Q is \(R_{q}=\{\mathrm {\textit {rf}}_{q_{1}}, ..., \mathrm {\textit {rf}}_{q_{n}}\}\), then the r-precision for query q is:

$$\text{RP}_{q} = \frac{\text{TP}_{q}}{|R_{q}|} $$

In the presence of more than one query, the average r-precision (ARP) can be computed as the mean of all r-precision values (Beitzel et al., 2009):

$$\text{ARP} = \frac{1}{|Q|} {\sum}_{i = 1}^{|Q|}{\text{RP}_{q_{i}}} $$

Lastly, mean average precision (MAP) measures the quality of results across several recall levels where each relevant result is returned. It is calculated from multiple average precision-at-n values where \(n_{q_{i}}\) is the number of retrieved results after each relevant result \(rf_{q_{i}}\in R_{q}\) of a query q is found. An average precision-at-n (aprec@n) of a query q is calculated from:

$$\mathrm{aprec@n}_{q} = \frac{1}{|R_{q}|} {\sum}_{i = 1}^{|R_{q}|}{\mathrm{prec@n}_{q_{i}}} $$

Mean average precision (MAP) is then derived from the mean of all aprec@n values of all the queries in Q (Manning et al., 2009):

$$\text{MAP} = \frac{1}{|Q|} {\sum}_{i = 1}^{|Q|}{\mathrm{aprec@n}_{q_{i}}} $$
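The three ranked-result measures defined above can be sketched as follows; the ranked lists and relevance sets are illustrative:

```python
def prec_at_n(ranked, relevant, n):
    """precision-at-n: fraction of the top-n ranked items that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def r_precision(ranked, relevant):
    """prec@n with n set to the number of relevant items for the query."""
    return prec_at_n(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """Mean of prec@n taken at the rank of each relevant item retrieved."""
    total = 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            total += prec_at_n(ranked, relevant, i)
    return total / len(relevant)

def arp(queries):
    """queries: list of (ranked, relevant) pairs, one per query."""
    return sum(r_precision(r, rel) for r, rel in queries) / len(queries)

def mean_average_precision(queries):
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

ranked = ["a1", "b1", "a2", "b2"]   # results for one query, most similar first
relevant = {"a1", "a2"}
print(r_precision(ranked, relevant))        # top-2 contains 1 relevant: 0.5
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2 ≈ 0.833
```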

Precision-at-n, ARP, and MAP are used to measure how well the tools retrieve relevant results within top-n ranked items for a given query (Manning et al., 2009). We simulate a querying process by 1) running the tools on our data sets and generating similarity pairs, and 2) ranking the results based on their similarities reported by the tools. The higher the similarity value, the higher the rank. The top ranked result has the highest similarity value. If a tie happens, we resort to a ranking by alphabetical order of the file names.

For precision-at-n, the query is “what are the most similar files in this data set?” and we inspect only the top n results. Our calculation of precision-at-n in this study can be considered a hybrid between a set-based and a rank-based measure: we put the results from different original files in the same “set” and “rank” them by their similarities. This is suitable for plagiarism detection. To locate plagiarised source code files, one may not want to give a specific file as a query (since one does not know which file has been copied) but instead wants to retrieve all similar pairs in a set ranked by their similarities. JPlag uses this method to report plagiarised source code pairs (Prechelt et al., 2002). Moreover, finding the most similar files is useful in manual studies of large-scale code clones (e.g. in a study by Yang et al. (2017)) where too many clones are reported and researchers can only feasibly investigate a few of the most similar clone candidates by hand.

ARP and MAP are calculated by considering the question “what are the most similar files for each given query q?” For example, since we had a total of 100 files in our generated data set, we queried 100 times. We picked one file at a time from the data set as a query and retrieved a ranked result of 100 files (including the query itself) according to the query. An r-precision was calculated from the top 10 results. We limited results to only the top 10, since our ground truth contained 10 pervasively modified versions for each original source code file (including itself). Thus, the number of relevant results, r, is 10 in this study. We derive ARP from the average of the 100 r-precision values. The same process is repeated for MAP except using average precision-at-n instead of r-precision. The query-based approach is suitable when one does not require the retrieval of all the similar pairs of code but only the most relevant ones for a given query. This situation occurs when performing code search for automated software repair (Ke et al., 2015). One may not feasibly try all returned repair candidates but only the top-ranked ones. Another example is searching for working code examples (Keivanloo et al., 2014) when one wants to pick only the top ranked solution.

Using these three error measures, we can compare performances of the similarity detection techniques and tools without relying on the threshold at all. They also provide another aspect of evaluating the tools’ performances by observing how well the tools report correct results within the top n pairs.

4.5 Scenario 5 (Pervasive Modifications + Boiler-Plate Code)

We have two objectives for this experimental scenario. First, we are interested in a situation where local and global code modifications are combined, by applying pervasive modifications on top of reused boiler-plate code. This scenario occurs in software plagiarism when only a small fragment of code is copied and pervasive modifications are later applied to the whole source code to conceal the copied part. It also represents a situation where boiler-plate code has been reused and repeatedly modified (or refactored) during software evolution. We are interested in whether the tools can still locate the reused boiler-plate code. Second, we shift our focus from measuring how well the tools find all similar pairs of pervasively modified code, as in Scenario 1, to measuring how well they find similar pairs for each pervasive code modification type. This gives finer-grained results and provides insights into the effects of each pervasive code modification type on code similarity. The default configurations are chosen for this experimental scenario to reflect a real use case in which one does not know the optimal configurations of the tools, and to show the effect of each pervasive code modification type on the tools’ performances when they are picked off-the-shelf without any tuning. Since some threshold needs to be chosen, we used the optimal threshold for each tool.

We use a data set called SOCOgen, which is derived from the SOCO data set used in Scenario 2. We follow the 5 steps in our experimental framework (see Fig. 3), using the SOCO data set with boiler-plate code as test data (Step 1). Amongst the 259 SOCO files, 33 are successfully compiled and decompiled after code obfuscations by our framework. Each of the 33 files generates 10 pervasively modified files (including itself), resulting in 330 files available for detection (Step 4). The statistics of SOCOgen are shown in Table 5.

We change the similarity detection in Step 4 to focus only on comparing modified code to its original. Given M as the set of the 10 pervasive code modification types, the set of similar file pairs \(\text{Sim}_{m}(F)\) out of all files F with a pervasive code modification type m is

$$ \begin{array}{lcl} M & = & \{ O , A , K , P_{c} , P_{g}K , P_{g}P_{c} , AK , AP_{c} , AP_{g}K , AP_{g}P_{c} \} \\ \text{Sim}_{m}(F) & = & \{(x,y) \in F_{O} \times F_{m}: m \in M; \text{sim}(x,y) > T\} \end{array} $$

Table 6 presents the 10 pervasive code modification types, including the original (O), source code obfuscation by Artifice (A), decompilation by Krakatau (K), decompilation by Procyon (\(P_{c}\)), bytecode obfuscation by ProGuard followed by decompilation with Krakatau (\(P_{g}K\)), bytecode obfuscation by ProGuard followed by decompilation with Procyon (\(P_{g}P_{c}\)), and four other combinations (AK, \(AP_{c}\), \(AP_{g}K\), \(AP_{g}P_{c}\)), along with the ground truth for each of them. The numbers of code pairs and true positive pairs for the A to \(AP_{g}P_{c}\) types are twice as large as for the original (O) type because of asymmetric similarity between pairs, i.e. sim(x,y) and sim(y,x).

Table 6 10 pervasive code modification types

We measured the tools’ performance on each \(\text{Sim}_{m}(F)\) set. By applying the tools to pairs of original and pervasively modified code, we measure each tool against one particular type of code modification at a time. In total, we made 620,730 pairwise comparisons and analysed 330 similarity reports in this scenario.
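A minimal sketch of this per-modification-type filtering, following the definition of Sim_m(F) above (the file records and similarity values are hypothetical):

```python
# The 10 pervasive code modification types from Table 6, as short labels.
MODS = ["O", "A", "K", "Pc", "PgK", "PgPc", "AK", "APc", "APgK", "APgPc"]

def sim_m(pairs, m, threshold):
    """Pairs of an original (type O) file and a file of modification type m
    whose reported similarity exceeds the threshold T.  Each element of
    `pairs` is ((name, type), (name, type), similarity)."""
    return {(x, y) for (x, y, s) in pairs
            if x[1] == "O" and y[1] == m and s > threshold}

# Hypothetical similarity report entries:
pairs = [(("Foo", "O"), ("Foo", "AK"), 85),
         (("Foo", "O"), ("Foo", "PgK"), 40),
         (("Bar", "O"), ("Bar", "AK"), 70)]

print(len(sim_m(pairs, "AK", 50)))  # 2
```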

5 Results

We used the five experimental scenarios of pervasive modifications, reused boiler-plate code, decompilation, ranked results, and the combination of local and global code modifications to answer the six research questions. The execution of 30 similarity analysers on the data sets, along with searching for their optimal parameters, took several months to complete. We carefully observed and analysed the similarity reports, and the results are discussed below in order of the six research questions.

5.1 RQ1: Performance Comparison


The results for this research question are collected from the experimental Scenario 1 (pervasive modifications) and Scenario 2 (reused boiler-plate code).

5.1.1 Pervasively Modified Code

A summary of the tools’ performances and their optimal configurations on the generated data set is given in Table 7. We show seven error measures in the table: false positives (FP), false negatives (FN), accuracy (Acc), precision (Prec), recall (Rec), area under the ROC curve (AUC), and F-score (F1). The tools are classified into four groups: clone detection tools, plagiarism detection tools, compression tools, and other similarity analysers. We can see that the tools’ performances vary over the same data set. For clone detectors, we applied three different granularity levels of similarity calculation: line (L), token (T), and character (C). We find that measuring code similarity at different code granularity levels has an impact on the performance of the tools. For example, ccfx gives a higher F-score when measuring similarity at character level than at line or token level. We present only the results for the best granularity level in each case. The complete results of the tools, including the generated data set before and after compilation/decompilation, can be downloaded from the study website (Ragkhitwetsagul and Krinke 2017a).

Table 7 Generated data set (Scenario 1): rankings (R) by F-scores (F1) and optimal configuration of every tool and technique

In terms of accuracy and F-score, the token-based clone detector ccfx is ranked first. The top 10 tools with the highest F-scores are ccfx (0.9760), followed by fuzzywuzzy (0.876), jplag-java (0.8636), difflib (0.8629), simjava (0.8618), deckard (0.8509), bzip2ncd (0.8494), ncd-bzlib (0.8465), simian (0.8413), and ncd-zlib (0.8361) respectively. Interestingly, tools from all four groups appear in the top ten.

For clone detectors, we have a token-based tool (ccfx), an AST-based tool (deckard), and a string-based tool (simian) in the top ten. This shows that, under pervasive modifications, multiple clone detectors with different detection techniques can offer comparable results, given that their optimal configurations are provided. However, some clone detectors, e.g. iclones and nicad, did not perform well on this data set. ccfx performs the best, possibly due to its combination of a suffix-tree matching algorithm with a small minimum clone length (b = 5 tokens). This means that ccfx performs the similarity computation on one small chunk of code at a time, an approach that is flexible and effective in handling code with pervasive modifications that spread changes over the whole file. We also manually investigated the similarity reports of the poorly performing iclones and nicad and found that both tools were susceptible to code changes introduced by the two decompilers, Krakatau and Procyon. When comparing files decompiled by Krakatau to files decompiled by Procyon, with or without bytecode obfuscation, they could not find any clones and hence reported zero similarity.

For plagiarism detection tools, jplag-java and simjava, which are token-based plagiarism detectors, are the leaders. The other plagiarism detectors give acceptable performance except simtext, which is expected since that tool is intended for plagiarism detection in natural-language text rather than source code. Compression tools show promising results using NCD for code similarity detection; they are mostly ranked in the middle, from 7th to 24th, with comparable results. The NCD implementations ncd-zlib, ncd-bzlib, and bzip2ncd only slightly outperform those based on other compressors such as gzip or LZMA, so the actual compression method may not have a strong effect in this context. Other techniques for code similarity offer varied performance. Tools such as ngram, diff, cosine, jellyfish, and bsdiff perform badly; they are ranked amongst the last positions at 22nd, 26th, 28th, 29th, and 30th respectively. Surprisingly, the two Python tools using the difflib and fuzzywuzzy string matching techniques produce very high F-scores.
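As a sketch of how the compression-based detectors work, normalised compression distance (NCD) can be implemented in a few lines. The version below uses bzip2; the helper names and the 0–100 similarity rescaling are illustrative, not taken from the tools evaluated here.

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: near 0 for identical inputs,
    near 1 for unrelated ones. Uses bzip2 as the compressor."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def similarity(a: str, b: str) -> float:
    """Rescale NCD to a 0-100 similarity score, matching the
    threshold range used in this study (illustrative)."""
    return 100 * (1 - ncd(a.encode(), b.encode()))
```

The intuition is that concatenating two similar files compresses almost as well as one file alone, so cxy stays close to min(cx, cy) and the distance stays small.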

Fig. 7 The (zoomed) ROC curves of the 10 tools that have the highest area under the curve (AUC)

To find the overall performance over similarity thresholds from 0 to 100, we drew the receiver operating characteristic (ROC) curves, calculated the area under the curve (AUC), and compared them. The closer the value is to one, the better the tool’s performance. Figure 7 includes the ten tools with the highest AUC values. We can see from the figure that ccfx is again the best performing tool with the highest AUC (0.9995), followed by fuzzywuzzy (0.9772), simjava (0.9711), jplag-text (0.9658), ncd-bzlib (0.9636), bzip2ncd (0.9635), deckard (0.9585), and ncd-zlib (0.9584). The two remaining tools, jplag-java and 7zncd-BZip2, offer AUCs of 0.9563 and 0.9557.

The best tool with respect to accuracy and F-score is ccfx. The tool with the fewest false positives is difflib, and the fewest false negatives are given by diff. However, considering diff’s large number of false positives (8,810, meaning that 8,810 out of 9,000 dissimilar file pairs are treated as similar), the tool tends to judge everything as similar. The second-fewest false negatives are once again reported by ccfx.
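The AUC computation described above can be sketched as a sweep over the 0–100 similarity thresholds, collecting one (FPR, TPR) point per threshold and integrating with the trapezoidal rule. This is an illustrative stdlib-only version, not the study's actual evaluation code.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC via a threshold sweep: for each cut-off t in 0..100, a pair
    is reported similar when its score >= t. The result approximates
    the probability that a similar pair outscores a dissimilar one."""
    points = []
    for t in range(101):  # similarity thresholds 0..100
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)
        points.append((fpr, tpr))
    points.sort()
    # trapezoidal rule over the (FPR, TPR) curve
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A tool whose similar pairs all score above its dissimilar pairs yields an AUC of 1.0; a tool whose scores are indistinguishable yields 0.5.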

Compared to our previous study (Ragkhitwetsagul et al., 2016), we expanded the generated data set to twice its previous size. Although half (i.e. 50 files) of the generated data set are the same Java files as in the previous study, the 50 newly added files potentially introduce more diversity into the data set. As a result, this makes our results more generalisable, i.e. it mitigates threats to external validity.

To sum up, we found that specialised tools such as source code clone and plagiarism detectors perform well against pervasively modified code. They were better than most of the compression-based and general string similarity tools. Compression-based tools mostly give decent and comparable results for all compression algorithms. String similarity tools perform poorly and are mostly ranked amongst the last. However, we found that Python difflib and fuzzywuzzy perform surprisingly better on this expanded version of the data set than on the original data set in our previous study (Ragkhitwetsagul et al., 2016); both are ranked amongst the top 5. Lastly, ccfx performed well on both the smaller data set in our previous study and the current data set, and is ranked 1st on several error measures.

5.1.2 Boiler-plate Code

We report the complete evaluation of the tools on the SOCO data set with their optimal configurations in Table 8. Amongst the 30 tools, the top-ranked tool in terms of F-score is jplag-text (0.9692), followed by simjava (0.9682), simian (0.9593), and jplag-java (0.9576). Most of the tools and techniques perform well on this data set. We observed high accuracy, precision, recall, and an F-score of over 0.7 for every tool except diff and bsdiff. Since the data set contains source code that is copied and pasted with local modifications, the clone detectors (ccfx, deckard, nicad, and simian) and the plagiarism detectors (jplag-text, jplag-java, and simjava) performed very well, with F-scores between 0.9576 and 0.9692. ccfx and deckard produced their highest F-scores when measuring similarity at character and token level respectively, whilst the other clone detectors, iclones, nicad, and simian, achieved their highest F-scores at line level. The Python difflib and fuzzywuzzy libraries are outliers in the Others group, offering high performance against boiler-plate code with F-scores of 0.9338 and 0.9443 respectively. Once again, these two string similarity techniques show promising results. The compression-based techniques are amongst the last, although they still offer relatively high F-scores ranging from 0.8630 to 0.8776.

Table 8 SOCO data set (Scenario 3): rankings (R) by F-scores (F1) and optimal configuration of every tool and technique

Regarding the overall performance over similarity thresholds from 0 to 100, the results are illustrated as ROC curves in Fig. 8. The tool with the highest AUC is difflib (0.9999), followed by sherlock (0.9996), fuzzywuzzy (0.9989), and simjava (0.9987).

To sum up, we observed that almost every tool detected boiler-plate code effectively by reporting high scores on all error measures. jplag-text, simjava, simian, jplag-java, and deckard are the top 5 tools for this data set in terms of F-score. Similar to pervasive modifications, we found the string matching techniques difflib and fuzzywuzzy ranked amongst the top 10.

Fig. 8 The (zoomed) ROC curves of the 10 tools that have the highest area under the curve (AUC) for SOCO

5.1.3 Observations of the Tools’ Performances on the Two Data Sets

There is a clear distinction between the F-score rankings of clone/plagiarism detectors and string/compression-based tools on the SOCO data set. This is due to the nature of boiler-plate code, whose local modifications are contained within a single method or code block, on which clone and plagiarism detectors perform well. However, on the more challenging pervasively modified data set, there is no clear distinction in terms of ranking between dedicated code similarity techniques, compression-based tools, and general text similarity tools. We found that the Python difflib string matching and fuzzywuzzy token similarity techniques even outperform several clone and plagiarism detection tools on both data sets. Given that they are simple and easy-to-use Python libraries, one can adopt these two techniques to measure code similarity in situations where dedicated tools are not applicable (e.g. unparsable, incomplete methods or code blocks). Compression-based techniques are not ranked at the top in either scenario, possibly due to the small size of the source code files; NCD is known to perform better on large files.
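A minimal sketch of adopting difflib for code similarity, using only the standard library (the wrapper function and example snippets are illustrative, not the study's actual harness):

```python
import difflib

def difflib_similarity(code1: str, code2: str) -> float:
    """0-100 similarity using difflib's gestalt pattern matching,
    roughly what the 'difflib' technique in this study measures."""
    return 100 * difflib.SequenceMatcher(None, code1, code2).ratio()

# two hypothetical Java fragments with renamed identifiers
a = "int sum(int[] xs) { int s = 0; for (int x : xs) s += x; return s; }"
b = "int total(int[] arr) { int t = 0; for (int v : arr) t += v; return t; }"
```

A pair is then classified as similar when the score exceeds a tuned threshold, as elsewhere in this study; fuzzywuzzy works analogously but tokenises before matching.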

5.2 RQ2: Optimal Configurations


In experimental Scenarios 1 and 3, we thoroughly analysed various configurations of every tool and found that performance on pervasively modified and boiler-plate code is sensitive to some specific settings whilst not to others.

5.2.1 Pervasively Modified Code

The complete list of the best configurations of every tool for pervasive modifications from Scenario 1 can be found in the second column of Table 7. The optimal configurations are significantly different from the default configurations, in particular for the clone detectors. For example, using the default settings for ccfx (b = 50, t = 12) leads to a very low F-score of 0.5781 due to a very high number of false negatives. Interestingly, a previous study on agreement of clone detectors (Wang et al., 2013) observed the same difference between default and optimal configurations.

In addition, we performed a detailed analysis of ccfx’s configurations, because ccfx is a widely-used tool in several clone research studies. Two parameter settings are chosen for ccfx in this study: b, the minimum length of clone fragments in units of tokens, and t, the minimum number of kinds of tokens in clone fragments. We initially observed that the optimal F-scores of the tool were at either b = 5 or b = 19. Hence, we expanded the search space of ccfx parameters from 280 (|b| = 20 ×|t| = 14) to 392 settings (|b| = 28 ×|t| = 14) to reduce the chance of settling on a local optimum. We performed a fine-grained search of b from 3 to 25 in steps of one and a coarse-grained search from 30 to 50 in steps of 5.
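The exhaustive parameter search described above can be sketched as a simple grid search. Here `evaluate_f1` is a hypothetical stand-in for running ccfx with a given (b, t) setting and scoring its report against the ground truth; only the grid itself mirrors the study.

```python
from itertools import product

def best_setting(evaluate_f1, b_values, t_values):
    """Exhaustively evaluate every (b, t) pair and keep the one
    with the highest F-score. 'evaluate_f1' is a hypothetical
    callback that runs the tool and scores its similarity report."""
    return max(product(b_values, t_values), key=lambda bt: evaluate_f1(*bt))

# fine-grained b from 3 to 25, coarse-grained from 30 to 50 in steps of 5
b_values = list(range(3, 26)) + list(range(30, 55, 5))   # 28 values
t_values = range(1, 15)                                  # 14 values, 28 x 14 = 392 settings
```

The same pattern applies to any tool with a small discrete parameter space; for continuous parameters a coarse-then-fine sweep, as done for b, keeps the cost manageable.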

From Fig. 9, we can see that the default settings of ccfx, b = 50 and t = 12 (denoted by a × symbol), provide decent precision but very low recall. Whilst there is no setting for which ccfx obtains the optimal precision and recall at the same time, there are a few settings with which ccfx obtains both high precision and high recall, as shown in the top right corner of Fig. 9. Our derived optimal configuration is one of them. The best settings for precision and recall of ccfx are described in Table 9: ccfx gives the best precision with b = 19 and t = 7, 8, or 9, and the best recall with b = 5 and t = 12.

Fig. 9 Trade-off between precision and recall for 392 ccfx parameter settings. The default settings provide high precision but low recall against pervasive code modifications

Table 9 ccfx’s parameter settings for the highest precision and recall

The landscape of ccfx performance in terms of F-score is depicted in Fig. 10. Visually, we can distinguish the regions that form the sweet spot for ccfx’s parameter settings against pervasive modifications. There are two such regions: b = 19 with t from 7 to 9, and b = 5 with t from 11 to 12. The two regions provide F-scores ranging from 0.9589 up to 0.9760.

Fig. 10 F-scores of 392 ccfx b and t parameter settings on pervasive code modifications

5.2.2 Boiler-Plate Code

For boiler-plate code, we found another set of optimal configurations for the 30 tools by once again analysing a large search space of their configurations. The complete list of the best configurations for every tool from Scenario 3 can be found in the second column of Table 8. Similar to the generated data set, the derived optimal configurations for SOCO differ from the tools’ default configurations. For example, ccfx’s best configuration has a smaller minimum number of tokens, b = 15, compared to the default of 50, whilst jplag-java’s best configuration has a higher minimum number of tokens, t = 12, compared to the default of 9.

The results for both pervasively modified code and boiler-plate code show that the default configurations do not offer the tools’ best performance. These empirical results support the findings of Wang et al. (2013) that one cannot rely on the tools’ default configurations. We suggest that researchers and practitioners tune the tools before performing any benchmarking or comparison of the tools’ results, to mitigate threats to internal validity in their studies. Our optimal configurations can be used as guidelines for studies involving pervasive modifications and boiler-plate code. Nevertheless, they are only effective on their respective data sets and are not guaranteed to work well on other data sets.

5.3 RQ3: Normalisation by Decompilation

Fig. 11 Comparison of tool performances (F1-score) before and after decompilation

The results after adding compilation and decompilation as a normalisation step before performing similarity detection on the generated data set in experimental Scenario 2 are shown in Fig. 11. We can clearly observe that decompilation by both Krakatau and Procyon boosts the F-scores of every tool in the study.

Table 10 shows the performances of the tools after decompilation by Krakatau in terms of false positives (FP), false negatives (FN), accuracy (Acc), precision (Prec), recall (Rec), area under the ROC curve (AUC), and F-score. We can see that normalisation by compilation/decompilation has a strong effect on the number of false results reported by the tools. Every tool has its numbers of false positives and false negatives greatly reduced, and three tools, simian, jplag-java, and simjava, no longer report any false results at all. All compression-based and other techniques still report some false results. This supports the result of our previous study (Ragkhitwetsagul et al., 2016) that using compilation/decompilation as a code normalisation method can improve the F-scores of every tool.

Table 10 Optimal configuration of every tool obtained from the generated decomp data set decompiled by Krakatau in Scenario 2 and their rankings (R) by F-scores (F1)

To strengthen the findings, we performed a statistical test to see whether the performances before and after normalisation via decompilation differ with statistical significance. We chose the non-parametric two-tailed Wilcoxon signed-rank test (Wilcoxon 1945)Footnote 5 and performed the test at a 95% confidence level (i.e. α ≤ 0.05). Table 11 shows that the observed F-scores before and after decompilation differ with statistical significance for both Krakatau and Procyon. We complemented the statistical test with a non-parametric effect size measure, Vargha and Delaney’s A12 (Vargha and Delaney 2000), to measure the magnitude of the difference between two populations. We chose Vargha and Delaney’s A12 because it is robust with respect to the shape of the distributions being compared (Thomas et al., 2014). Put another way, it does not require the two populations under comparison to be normally distributed, which is the case for our tools’ F1 scores. According to Vargha and Delaney (2000), an A12 value of 0.5 means there is no difference between the two populations; a value above 0.5 means the first population outperforms the second, and a value below 0.5 means the opposite. The guideline in Vargha and Delaney (2000) interprets 0.56 as a small, 0.64 as a medium, and 0.71 as a large difference. On this scale, the F-score differences after decompilation by Krakatau (A12 = 0.969) and Procyon (A12 = 0.937) compared to the original are large. Following the interpretation of A12 in Vargha and Delaney (2000), with Krakatau’s A12 of 0.969 we can compute the probability that a random score X1 from the set of tools’ performances after decompilation by Krakatau will be greater than a random score X2 from the set of tools’ performances before decompilation as 2A12 − 1 = 2 × 0.969 − 1 = 0.938.
This A12 effect size suggests that the tools’ performance after decompilation by Krakatau is higher than the original 93.8% of the time. A similar finding applies to Procyon (87.4%). The large effect sizes clearly support the finding that compilation and decompilation is an effective normalisation technique against pervasive modifications.
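The A12 measure itself is straightforward to compute from its definition: the probability that a random value from the first sample exceeds one from the second, counting ties as half. A minimal sketch (function name ours):

```python
def vargha_delaney_a12(xs, ys):
    """Vargha and Delaney's A12 = P(X > Y) + 0.5 * P(X = Y):
    the probability that a random score from xs exceeds a
    random score from ys, with ties counted as half a win."""
    wins = sum((x > y) + 0.5 * (x == y) for x in xs for y in ys)
    return wins / (len(xs) * len(ys))
```

Identical samples give 0.5 (no difference); complete separation gives 1.0 or 0.0 depending on direction.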

Table 11 Wilcoxon signed-rank test of tools’ performances before and after decompilation by Krakatau and Procyon (α = 0.05)

To gain insight, we carefully investigated the source code after normalisation and found that the decompiled files created by Krakatau are very similar despite the applied obfuscation. As depicted in the middle of Fig. 4, the two code fragments become very similar after compilation and decompilation by Krakatau. This is because Krakatau has been designed to be robust with respect to minor obfuscations, and the transformations made by Artifice and ProGuard are not very complex. Code normalisation by Krakatau resulted in multiple optimal configurations being found for some of the tools. We selected only one optimal configuration to include in Table 10 and separately report the complete list of optimal configurations on our study website (Ragkhitwetsagul and Krinke 2017a).

Normalisation via decompilation using Procyon also improves the performance of the similarity detectors, but not as much as Krakatau (see Table 12). Interestingly, Procyon performs slightly better for deckard, sherlock, and cosine. An example of code before and after decompilation by Procyon is shown in Fig. 4 at the bottom.

Table 12 Optimal configuration of every tool obtained from the generated decomp data set (decompiled by Procyon) in Scenario 2 and their rankings (R) by F-scores (F1)

The main difference between Krakatau and Procyon is that Procyon attempts to produce much more high-level source code whilst Krakatau’s output stays nearer to the bytecode. It seems that the low-level approach of Krakatau has the stronger normalisation effect. Hence, compilation/decompilation may be used as an effective normalisation method that greatly improves similarity detection between Java source code files.

5.4 RQ4: Reuse of Configurations


We answer this research question using the results from RQ1 and RQ2 (experimental Scenarios 1 and 3). For the 30 tools from RQ1, we applied the derived optimal configurations obtained from the generated data set (denoted as C gen) to the SOCO data set. Table 13 shows that reusing these configurations has a detrimental impact on the similarity detection results for another data set, even for tools that have no parameters (e.g. ncd-zlib and ncd-bzlib) and are only influenced by the chosen similarity threshold. We noticed that the low F-scores when C gen is reused on SOCO come from a high number of false positives, possibly due to the relaxed configurations.

Table 13 The table displays the results after applying the best configurations (C gen) from Scenario 1 to the SOCO data set and the derived best configurations for the SOCO set (C soco). The selected 10 tools are compared by their F-scores

To confirm this, we refer to the best configurations (settings and threshold) for the SOCO data set discussed in RQ1 (see Table 8); the comparison of the best configurations for the two data sets is shown in Table 13. The reported F-scores are very high for the dataset-specific optimal configurations (denoted as C soco), confirming that configurations are very sensitive to the data set on which similarity detection is applied. We found the dataset-specific optimal configurations C soco to be very different from the configurations C gen for the generated data set. Although the table shows only the top 10 tools from the generated data set, the same finding applies to every tool in our study. The complete results can be found on our study website (Ragkhitwetsagul and Krinke 2017a).

Lastly, we noticed that the best thresholds for the tools are very different from one data set to another and that the chosen similarity threshold tends to have the largest impact on the performance of similarity detection. This observation provides further motivation for a threshold-free comparison using precision-at-n.

5.5 RQ5: Ranked Results


In experimental Scenario 4, we applied three error measures adopted from information retrieval, namely precision-at-n (prec@n), average r-precision (ARP), and mean average precision (MAP), to the generated and SOCO data sets. The results are discussed below.

5.5.1 Precision-at-n

As discussed in Section 4.4, we used prec@n in a pair-based manner. For the generated data set, we sorted the 10,000 pairs of documents by their similarity values from highest to lowest. Then, we evaluated the tools based on the set of top n elements. We varied the value of n from 100 to 1500. In Table 14, we report only n = 1,000 since this is the number of true positives in the data set. The tools’ optimal configurations are not included in the table but can be found on our study website (Ragkhitwetsagul and Krinke 2017a). The ccfx tool is ranked 1st with the highest prec@n of 0.976, followed by simjava and fuzzywuzzy. Compared with the rankings by F-score, the ranking of the ten tools changed slightly: simjava and simian perform better whilst jplag-java and difflib now perform worse. ncd-zlib is no longer in the top 10 and is replaced by 7zncd-BZip2 in 10th place.
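The pair-based prec@n computation above can be sketched in a few lines; the function name and pair representation are illustrative.

```python
def precision_at_n(ranked_pairs, ground_truth, n):
    """Pair-based prec@n: the fraction of the n highest-similarity
    pairs that are true positives according to the ground truth.
    'ranked_pairs' must be sorted by similarity, highest first."""
    return sum(pair in ground_truth for pair in ranked_pairs[:n]) / n
```

Sweeping n, as in Fig. 12, simply means calling this for each cut-off and plotting the results.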

Table 14 Top-10 rankings of using prec@n, ARP, and MAP over the generated data set with the tools’ optimal configurations

As illustrated in Fig. 12, varying the n value of prec@n over fifteen values from 100 to 1500, in steps of 100, gives an overview of how well the tools perform across different n sizes. The number of true positives is depicted by a dotted line. Most of the tools performed very well for the first few hundred top-n results, maintaining a steady prec@n of 1.0 up to the top 500 pairs. However, at the top 600 pairs, the performance of 7zncd-BZip2, deckard, ncd-bzlib, simian, and simjava started dropping. bzip2ncd, jplag-java, and fuzzywuzzy started reporting false positives after the top 700 pairs, whilst difflib held out until the top 800 pairs. ccfx was the only tool that maintained 100% correct results up to the top 900 pairs; after that, it too started reporting false positives. At the top 1,500 pairs, all the tools offered a prec@n of approximately 0.6 to 0.7. Due to the fairly small data set, the finding of a perfect prec@n of 1.0 for the first 500 pairs may not generalise to other data sets, as the similar performances achieved by the tools on the first 500 pairs might be due to intrinsic properties of the analysed programs.

Fig. 12 Precision-at-n of the tools according to varied numbers of n against the generated data set

For the SOCO data set, we varied the n value from 100 to 800, also in steps of 100. The results in Table 15 use n = 453, which is the number of true positives in the corrected ground truth. We can clearly see that the ranking of the 10 tools using prec@n closely resembles that using F-scores. jplag-text is the top-ranked tool, followed by simjava, jplag-java, simian, and deckard. The ranking of eight tools is exactly the same as with F-score; jplag-java and nicad perform slightly worse using prec@453 and move down one position. The overall performance of the tools across various n values is depicted in Fig. 13, with the dotted line representing the number of true positives. The chart is somewhat analogous to that for the generated data set (Fig. 12). Most of the tools started reporting false positives at the top 300 pairs, except jplag-java, difflib, fuzzywuzzy, and simjava. After the top 400 pairs, no tool could maintain 100% true positive results.

Fig. 13 Precision-at-n of the tools according to varied numbers of n against the SOCO data set

Table 15 Top-10 rankings of using prec@n, ARP, and MAP over the SOCO data set with the tools’ optimal configurations

Since prec@n is calculated from the set of the top n ranked results, its value shows how quickly a tool can retrieve correct answers within a limited set of the n most similar pairs. It also reflects how well the tool can differentiate between similar and dissimilar documents: a good tool should not be confused and should produce a large gap in similarity values between the true positive and the true negative results. In this study, ccfx and jplag-text have shown to be the best tools in terms of prec@n for pervasive modifications and boiler-plate code respectively. They are also the best tools based on F-scores in RQ1.

5.5.2 Average r-Precision

ARP is a query-based error measure that requires knowledge of the ground truth. Since we knew the ground truth for our two data sets, we did not need to vary the value of n as for prec@n; the value of n was set to the number of true positives.
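The r-precision and ARP computations can be sketched as follows; the function names and the dictionary-based query interface are illustrative, not the study's actual code.

```python
def r_precision(ranked_files, relevant):
    """r-precision: precision at cut-off r, where r is the number
    of files relevant to this query in the ground truth."""
    r = len(relevant)
    return sum(f in relevant for f in ranked_files[:r]) / r

def average_r_precision(results_per_query, ground_truth):
    """ARP: the mean r-precision over all queries, where each query
    returns a list of files ranked by similarity."""
    return sum(r_precision(results_per_query[q], ground_truth[q])
               for q in ground_truth) / len(ground_truth)
```

For the generated data set every query has r = 10 relevant files; for SOCO, r varies per query file as described below.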

For the generated data set, each file in the set of 100 files was used as a query once. Each query received 100 files ranked by their similarity values. We knew from the ground truth that each file has 10 similar files including itself (i.e. r, the number of relevant documents, equals 10). We cut off the ranked results after the top 10 and calculated an r-precision value. Finally, we computed the ARP as the average of the 100 r-precision values. The ARPs of the ten tools are reported in Table 14. We can see that ccfx is still ranked first with a perfect ARP of 1.0000, followed by fuzzywuzzy. ncd-bzlib now performs much better under ARP and is ranked third. Interestingly, the 3rd to 10th ranks are all compression-based tools. This shows that, in the presence of pervasive modifications, NCD-based code similarity performs better on query-based retrieval than most of the clone and plagiarism detectors and the string similarity tools.

For SOCO, only files with known, corrected ground truth were used as queries, because ARP can only be computed when relevant answers are retrieved. We found that the 453 pairs in the ground truth were formed by 115 unique files, and we used these as our queries. The value of r was not fixed as for the generated data set; it depended on how many relevant answers existed in the ground truth for each particular query file, and we calculated the r-precision accordingly. The ARPs for the SOCO data set are reported in Table 15. jplag-java and difflib are ranked first with an ARP of 0.998, followed by ccfx and simjava, both with an ARP of 0.989. Similar to the findings for the generated data set, compression-based tools work well with a query-based approach, with 5 NCD tools ranked in the top 10.

Since ARP is computed from means, we performed a statistical test to strengthen our results by testing the statistical significance of the differences in the sets of r-precision values between tools. We chose a one-tailed non-parametric randomisation test (i.e. a permutation test) due to its robustness in information retrieval, as shown by Smucker et al. (2007).Footnote 6 We performed the test using 100,000 random samples at a 95% confidence level (i.e. α ≤ 0.05). The statistical test results are shown in Tables 16 and 17. The tables are matrices of pairwise one-tailed statistical test results in the direction rows ≥ columns. The symbol \(\blacktriangleright \) represents statistical significance whilst the symbol \(\square \) represents no statistical significance. For example, in Table 16, the \(\blacktriangleright \) at the leftmost position of the top row [ccfx, fuzzywuzzy] shows that the mean r-precision of ccfx is higher than or equal to fuzzywuzzy’s with statistical significance. On the other hand, the mean r-precision of fuzzywuzzy is higher than ncd-bzlib’s without statistical significance, as represented by the \(\square \) at position [fuzzywuzzy, ncd-bzlib].
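A paired one-tailed randomisation test of this kind can be sketched as follows (illustrative, stdlib-only; the study used 100,000 samples):

```python
import random

def randomisation_test(xs, ys, samples=100_000, seed=0):
    """One-tailed paired randomisation (permutation) test: p-value
    for the null hypothesis that mean(xs) is not greater than
    mean(ys), estimated by randomly swapping paired values."""
    rng = random.Random(seed)
    observed = sum(x - y for x, y in zip(xs, ys)) / len(xs)
    hits = 0
    for _ in range(samples):
        # each pair keeps or swaps its two values with probability 0.5
        diff = sum((x - y) if rng.random() < 0.5 else (y - x)
                   for x, y in zip(xs, ys)) / len(xs)
        if diff >= observed:
            hits += 1
    return hits / samples
```

A small p-value indicates that the first tool's per-query scores are higher than the second's by more than chance swapping would explain.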

Table 16 One-tailed randomisation test with 100K samples of the ARP and MAP values from the generated data set
Table 17 One-tailed randomisation test with 100K samples of the ARP values from the SOCO data set

For the generated data set (Table 16), we found that ccfx is the only tool that dominates all other tools’ r-precision values with statistical significance. For the SOCO data set (Table 17), jplag-java and difflib outperform 7zncd-Deflate, 7zncd-Deflate64, and 7zncd-LZMA with statistical significance.

ARP tells us how well the tools perform when we want all the true positive results in a query-based manner. For example, in automated software repair one wants to find source code similar to some original, buggy source code that one possesses. One can use the original source code as a query and look for similar files in a set of source code files. In our study, ccfx is the best tool for this retrieval method against pervasive modifications; jplag-java and difflib are the best tools for boiler-plate code.

5.5.3 Mean Average Precision

We included MAP in this study due to its well-known discrimination and stability across several recall levels. It is also used when the ground truth of relevant documents is known. We computed MAP in a very similar way to ARP, except that instead of only looking at the top r pairs, we calculated the precision every time a new relevant source code file was retrieved. An average across all recall levels is then calculated, and lastly the mean across all queries is computed as the MAP. We used the same numbers of relevant files as in the ARP calculations for the generated and SOCO data sets. The results for MAP are reported in Tables 14 and 15.
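The MAP computation described above can be sketched as follows (function names and query interface are illustrative):

```python
def average_precision(ranked_files, relevant):
    """AP for one query: precision computed each time a relevant
    file is retrieved, averaged over all relevant files."""
    hits, total = 0, 0.0
    for rank, f in enumerate(ranked_files, start=1):
        if f in relevant:
            hits += 1
            total += hits / rank   # precision at this recall level
    return total / len(relevant)

def mean_average_precision(results_per_query, ground_truth):
    """MAP: the mean of the per-query average precisions."""
    return sum(average_precision(results_per_query[q], ground_truth[q])
               for q in ground_truth) / len(ground_truth)
```

Unlike r-precision, AP rewards relevant files found anywhere in the ranking, weighted by how early they appear.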

For the generated data set (Table 14), the rankings are very similar to those for ARP: ccfx, fuzzywuzzy, and ncd-bzlib are ranked 1st, 2nd, and 3rd. For SOCO (Table 15), the rankings are very different from those obtained using F-score and prec@n but similar to those for ARP. jplag-java and difflib become the best performers, followed by jplag-text and simjava.

Compression-based tools are again found to offer good performance with MAP: five are ranked in the top 10 for both the generated and boiler-plate code data sets.

Similarly, since MAP is also computed from means, we performed a one-tailed non-parametric randomisation statistical test on pairwise comparisons of the tools’ MAP values. The test results are shown in Tables 16 and 18. For the generated data set, we found the same result of ccfx dominating the other tools’ MAPs with statistical significance. For the SOCO data set, we found that jplag-java and difflib outperform gzipncd, ncd-zlib, sherlock, 7zncd-Deflate64, 7zncd-Deflate, and fuzzywuzzy with statistical significance.

Table 18 One-tailed randomisation test with 100K samples of the MAP values from SOCO data set

MAP is similar to ARP in that recall is taken into account. However, it differs from ARP by measuring precision at multiple recall levels. It also differs from F-score in being a query-based rather than a pair-based measure. It shows how well a tool performs on average when it has to find all true positives for each query. In this study, the best performing tool in terms of MAP is ccfx for pervasively modified code, and jplag-java and difflib for boiler-plate code.

5.6 RQ6: Local + Global Code Modifications


Using the results from experimental Scenario 5, we present the tools’ performances based on F-scores in Table 19 and show the distribution of F-scores in Fig. 14. The F-scores are grouped according to the 10 pervasive code modification types (see Table 6). Values are highlighted when the F-score is higher than 0.8.

Fig. 14 Distribution of the tools’ performance for each pervasive modification type

Table 19 F-scores of the tools on SOCO gen using the default configurations (with optimised threshold). Highlighted values have F-score higher than 0.8

5.6.1 Tools’ Performances vs. Individual Pervasive Modification Type

On the original boiler-plate code without any modification (O), every tool except iclones, nicad, bsdiff, and diff reports high F-scores ranging from 0.8 to 1.0. This shows that most tools with their default configurations have no problem detecting boiler-plate code. The tool nicad performed poorly, possibly due to default configurations aimed at clones without any variable renaming or code abstraction (i.e. renaming = none and abstract = none). iclones’ default minimum clone length of 100 tokens is too high compared to the optimal value of 40 found in RQ1. diff and bsdiff are too general to handle code with local modifications.

The tools perform worse after pervasive modifications are applied on top of the boiler-plate code. Source code obfuscation by Artifice (A) has strong effects on ccfx, iclones, nicad, simian, bsdiff, and diff, as shown by low F-scores of 0.0 to 0.2. deckard, jplag-java, plaggie, simjava, difflib, fuzzywuzzy, and ngram maintained their high F-scores of over 0.9. Interestingly, jplag-java reported a perfect F-score of 1.0, possibly because it is designed to detect plagiarised code, which is usually pervasively modified at the source code level.

According to the boxplot in Fig. 14, code decompiled by Krakatau (K) results in lower F-scores than code decompiled by Procyon. Since the Krakatau decompilation process generates source code that is close to Java bytecode and mostly structurally different from the original, its generated code is challenging for tools that are based on lexical and syntactic similarity. In the group of clone detectors, ccfx, iclones, and nicad did not report any correct results at all (F-score = 0.0), whilst deckard and simian reported very low F-scores of 0.1667 and 0.0357 respectively. Code decompiled by Procyon (Pc) suffered milder effects than with Krakatau or Artifice. The tool fuzzywuzzy is the best for both K and Pc with F-scores of 0.9259 and 0.9636 respectively.

A combination of ProGuard and either Krakatau or Procyon (PgK, PgPc) produced the lowest F-scores, as can be clearly seen in Fig. 14. This is due to the bytecode modifications (e.g. renaming classes, fields, and variables; package hierarchy flattening; class repackaging; merging classes; and modifying package access permissions) performed by ProGuard, combined with a decompilation process that greatly changed both the lexemes and the structure of the code. It is interesting to see that difflib and fuzzywuzzy, which rely on string and token matching, are the highest performing tools, with F-scores of 0.4790 and 0.5116 for PgK and PgPc respectively. Thus, in the presence of pervasive modifications that heavily or completely change code structure, a simpler, general text similarity technique may have a higher chance of finding similar code than dedicated code similarity tools.
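The kind of general-purpose string similarity that difflib computes can be illustrated with Python’s standard library; the two code fragments below are hypothetical examples of a renamed pair, not taken from the data set:

```python
import difflib

original = """
int total = 0;
for (int i = 0; i < items.length; i++) {
    total += items[i].price;
}
return total;
"""

# Identifiers renamed, as a layout obfuscator or plagiarist might do.
modified = """
int s = 0;
for (int k = 0; k < arr.length; k++) {
    s += arr[k].price;
}
return s;
"""

# SequenceMatcher.ratio() returns a similarity in [0, 1] based on
# the longest matching subsequences of the two character strings.
similarity = difflib.SequenceMatcher(None, original, modified).ratio()
```

Because such a measure has no notion of code structure, it degrades gracefully when that structure is destroyed, which is consistent with difflib’s relatively strong results on PgK and PgPc.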

Code after source code obfuscation by Artifice followed by decompilation by Krakatau or Procyon (AK, APc) shows comparable results to K and Pc, with marginal differences. The tools fuzzywuzzy and jplag-java are the best for this modification type.

Lastly, the two combinations of obfuscation and decompilation (APgK, APgPc) also produce almost identical F-score results to PgK and PgPc. This suggests that the pervasive modifications made by source code obfuscation are no longer effective once decompilation is included. Conversely, the modifications made by bytecode obfuscation persist through the compilation and decompilation process. The tools difflib and fuzzywuzzy are the best for this modification type.

To sum up, we found that most of the tools perform well at detecting boiler-plate code, but their performance drops when pervasive modifications are added. Some clone detection tools can tolerate pervasive modifications made by source code obfuscators, but all are susceptible to pervasive changes made by decompilers or by a combination of a bytecode obfuscator and decompilers. Plagiarism detectors offer decent results across the 10 modification types. Interestingly, fuzzywuzzy and difflib, which use token and string matching respectively, outperformed dedicated tools on heavily modified code produced by a combination of obfuscators and decompilers.

5.7 Overall Discussion

In summary, we have answered the six research questions through five experimental scenarios. We found that the state-of-the-art code similarity analysers perform differently on pervasively modified code. Properly configured, a well-known and often-used clone detector, ccfx, performed the best, closely followed by a Python string matching library, fuzzywuzzy. A comparison of the tools on boiler-plate code in the SOCO data set found that the jplag-text plagiarism detector performed the best, followed by simjava, simian, jplag-java, and deckard.

The experiment using compilation/decompilation for normalisation showed that compilation/decompilation is effective and improves similarity detection techniques with statistical significance. Therefore, future implementations of clone or plagiarism detection tools or other similarity detection approaches could consider using compilation/decompilation for normalisation.

However, every technique and tool turned out to be extremely sensitive to its configuration, consisting of several parameter settings and a similarity threshold. Moreover, for some tools the optimal configuration turned out to be very different from the default, showing that one cannot simply reuse (default) configurations.

Finding an optimal configuration is naturally biased by the particular data set. One cannot obtain optimal results by directly applying parameter settings and similarity thresholds derived as optimal for one data set to another data set. The SOCO data set, to which we applied the optimal configurations from the generated data set, clearly shows that configurations that work well for a specific data set are not guaranteed to work for future data sets. Researchers have to consider this limitation whenever they use similarity detection techniques in their studies.

The chosen similarity threshold has the strongest impact on the results of similarity detection. We investigated the use of three information retrieval measures, precision-at-n, r-precision, and mean average precision, to remove the threshold completely and rely only on ranked pairs. These measures are often used in information retrieval research but are rarely seen in code similarity research such as clone or plagiarism detection. Using them, we can see how successful the different techniques and tools are at distinguishing similar from dissimilar code based on ranked results. The tool rankings can serve as guidelines for selecting tools in real-world scenarios of similar code search or code plagiarism detection, for example when one is interested in only the top n most similar source code pairs due to limited time for manual inspection, or when one uses a file to query for the most similar other files.
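As an illustration of the ranked measures, precision-at-n and r-precision can be computed from a ranked candidate list as follows (a sketch with hypothetical relevance labels):

```python
def precision_at_n(ranked_relevance, n):
    """Fraction of the top-n ranked pairs that are true positives.
    ranked_relevance is a list of booleans in rank order."""
    return sum(ranked_relevance[:n]) / n

def r_precision(ranked_relevance, total_relevant):
    """Precision at rank r, where r is the number of relevant pairs
    known to exist for the query."""
    return precision_at_n(ranked_relevance, total_relevant)

# Hypothetical ranked output: True marks a genuinely similar pair.
ranked = [True, True, False, True, False, False, True, False]
p_at_5 = precision_at_n(ranked, 5)          # 3 of the top 5 are relevant
rp = r_precision(ranked, total_relevant=4)  # 3 of the top 4 are relevant
```

Neither measure requires a similarity threshold; both depend only on the order in which a tool ranks the candidate pairs.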

Lastly, we compared the tools on a data set of pervasively modified boiler-plate code. We found that whilst most tools performed well on boiler-plate code, they performed much worse after pervasive modifications were applied. Pervasively modified code combining bytecode obfuscation by ProGuard with either of the two decompilers had the strongest effect on the tools’ F-scores.

5.8 Threats to Validity

Construct validity: We carefully chose the data sets for our experiment. We created the first data set (generated) ourselves to obtain the ground truth for positive and negative results. We investigated whether our obfuscators (Artifice and ProGuard), compiler (javac), and decompilers (Krakatau and Procyon) perform code modifications that are commonly found in code cloning and code plagiarism (see Table 1). However, they may not represent all possible pervasive modifications found in software. The SOCO data set has been used in a competition for detecting reused code, and a careful manual investigation revealed errors in the provided ground truth, which have been corrected.

Internal validity: Although we have attempted to use the tools with their best parameter settings, we cannot guarantee that we have done so successfully, and it may be possible that the poor performance of some detectors is due to wrong usage rather than the techniques used in the detector. Moreover, in this study we compared the tools’ performances based on several standard measures: precision, recall, accuracy, F-score, AUC, prec@n, ARP, and MAP. However, there might be situations where other measures are required, which might produce different results.

External validity: The tools used in this study were restricted to be open-source or at least be freely available, but they do cover several areas of similarity detection (including string-, token-, and tree-based approaches) and some of them are well-known similarity measurement techniques used in other areas such as normalised compression (information theory) and cosine similarity (information retrieval). Nevertheless, they might not be completely representative of all available techniques and tools.

The generated (100 Java files) and SOCO (259 Java files) data sets are fairly small, and each file contains a single class with one or a few methods. They might not adequately represent real software projects. Hence, our results are limited to pervasive modifications and boiler-plate code at file level, not whole software projects. The optimal configurations presented in this paper are relative to the data set of code modifications from which they were derived and may not generalise to all types of code modifications. In addition, the two decompilers (Krakatau, Procyon) are only a subset of all available decompilers, so they may not fully represent the performance of other decompilers or of other source code normalisation techniques. However, we chose two instead of only one so that we could compare their behaviours and performances. As we exploit features of Java source and bytecode, our findings only apply to Java code.

6 Related Work

Plagiarism is obviously a problem of serious concern in education. Similarly, in industry, the copying of code or programs is copyright infringement. Both affect the originality of one’s ideas, one’s credibility, and the quality of one’s organisation. The problem of software plagiarism has existed for several decades in schools and universities (Cosma and Joy 2008; Daniela et al. 2012) and in law, where one of the more visible cases regarding copyright infringement of software is the ongoing lawsuit between Oracle and Google (United States District Court 2011).

To detect plagiarism or copyright infringement of source code, one has to measure the similarity of two programs. Two programs can be similar at the level of purpose, algorithm, or implementation (Zhang et al. 2012). Most software plagiarism tools and techniques focus on the level of implementation since it is most likely to be plagiarised. The process of code plagiarism involves pervasive modifications to hide the plagiarism, which often include obfuscation. The goal of code obfuscation is to make the modified code harder for humans to understand and harder to reverse engineer whilst preserving its semantics (Whale 1990; Collberg et al. 1997, 2002, 2003). Deobfuscation attempts to reverse engineer obfuscated code (Udupa et al. 2005). Because Java bytecode is comparatively high-level and easy to decompile, obfuscation of Java bytecode has focused on preventing decompilation (Batchelder and Hendren 2007), whilst decompilers like Krakatoa (Proebsting and Watterson 1997), Krakatau (Grosse 2016), and Procyon (Strobel 2016) attempt to decompile even in the presence of obfuscation.

Although a large number of clone detectors, plagiarism detectors, and code similarity detectors have been invented in the research community, there are relatively few studies that compare and evaluate their performances. Bellon et al. (2007) proposed a framework for comparing and evaluating clone detectors; six tools (Dup, CloneDr, CCFinder, Duplix, CLAN, Duploc) were chosen for the studies. Later, Roy et al. (2009) performed a thorough evaluation of clone detection tools and techniques covering a wider range of tools. However, they compared the tools and techniques using the evaluation results reported in the tools’ published papers, without any real experimentation. In a separate study, the recall of 11 modern clone detectors was evaluated against four different code clone benchmark frameworks, including Bellon’s (Svajlenko and Roy 2014). Hage et al. (2010) compare five plagiarism detectors in terms of their features and their performance against 17 code modifications. Burd and Bailey (2002) compare five clone detectors for preventive maintenance tasks. Biegel et al. (2011) compare three code similarity measures to identify code that needs refactoring. Roy and Cordy (2009) use a mutation-based approach to create a framework for the evaluation of clone detectors. However, their framework was mostly limited to locally confined modifications, including only systematic renaming as a pervasive modification. Due to these limitations, we have not included their framework in our study. Moreover, they used their framework for a comparison limited to three variants of their own clone detector NICAD (Roy and Cordy 2008). Svajlenko and Roy (2016) developed a clone evaluation framework called BigCloneEval that aims to automatically measure clone detectors’ recall on the BigCloneBench data set.
BigCloneBench’s manually-confirmed clone oracle is built from IJaDataset, a repository of 25,000 Java open source projects, by searching for methods containing keywords and source code patterns representing 43 functionalities. Whilst BigCloneEval offers the benefit of manually confirmed clones from a large set of real-world Java software projects, its clone oracle, and hence the measured recall, is limited to the selected functionalities. If a tool reports clone pairs beyond these 43 functionalities, the framework does not take them into account. Although our two data sets are much smaller than BigCloneBench, we were able to measure both precision and recall. Since we created one data set using code obfuscators, a compiler, and decompilers, and reused the other from a competition, we had complete knowledge of the ground truth for both and could take all possible similar code pairs, i.e. clones, into account.

Several code obfuscation methods can be found in the work of Luo et al. (2014). The techniques utilised include obfuscation by different compiler optimisation levels or by using different compilers. Obfuscating tools exist at either the source code level (e.g. Semantic Designs Inc.’s C obfuscator, Stunnix’s CXX-obfuscator) or the binary level (e.g. Diablo, Loco (Madou et al. 2006), CIL (Necula et al. 2002)).

An evaluation of code obfuscation techniques has been performed by Ceccato et al. (2009). They evaluated how layout obfuscation by identifier renaming affects participants’ comprehension of, and ability to modify, two given programs. They found that obfuscation by identifier renaming could slow down an attack to two to four times the time needed for clear, un-obfuscated programs. Their later study (Ceccato et al. 2013) confirms that identifier renaming is an effective obfuscation technique, even better than control-flow obfuscation by opaque predicates. The two obfuscators chosen in this study also perform layout obfuscation, including identifier renaming. However, instead of measuring human understanding of obfuscated programs, we measure how well code similarity analysers perform on obfuscated code, which we use as a kind of pervasive code modification. We also decompiled obfuscated bytecode and compared the tools’ performances on the resulting source code.

There are studies that try to enhance the performance of clone detectors by looking for more clones in the code’s intermediate representation, such as Jimple code (Selim et al., 2010), bytecode (Chen et al., 2014; Kononenko et al., 2014), or assembler code (Davis and Godfrey 2010). Using an intermediate representation for clone detection gives satisfying results, mainly by increasing the recall of the tools. Our study differs in that we apply decompilation as another code modification step before applying code similarity analysers. Our decompiled code is also Java source code, so we can use any source-based similarity analyser directly out of the box. Our results show that compilation/decompilation can also help to improve the tools’ performances. An empirical study of using compilation/decompilation to enhance clone detection in three real-world systems found similar results to ours (Ragkhitwetsagul and Krinke 2017b).

Keivanloo et al. (2015) discussed the problem of using a single threshold for clone detection over several repositories and proposed a solution using threshold-free clone detection based on unsupervised learning. The method mainly utilises k-means clustering with the Friedman quality optimisation method. Our investigation of precision-at-n, ARP, and MAP addresses the same problem, but our goal is to compare the performance of several similarity detection tools rather than to boost the performance of a single tool as in their study.
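The threshold-free idea can be illustrated with a toy one-dimensional 2-means split of similarity scores; this is a simplified sketch, not Keivanloo et al.’s actual method (which also involves Friedman quality optimisation), and the scores are hypothetical:

```python
def two_means_split(scores, iters=50):
    """Cluster similarity scores into 'dissimilar' and 'similar'
    groups with 1-D k-means (k=2), avoiding a manual threshold.
    Returns the scores assigned to the high ('similar') cluster."""
    lo, hi = min(scores), max(scores)  # initial centroids
    high_c = []
    for _ in range(iters):
        # Assign each score to its nearest centroid.
        low_c = [s for s in scores if abs(s - lo) <= abs(s - hi)]
        high_c = [s for s in scores if abs(s - lo) > abs(s - hi)]
        # Recompute the centroids as cluster means.
        new_lo = sum(low_c) / len(low_c)
        new_hi = sum(high_c) / len(high_c)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return high_c

# Hypothetical similarity scores for candidate pairs.
scores = [0.12, 0.18, 0.21, 0.15, 0.83, 0.91, 0.88, 0.25, 0.79]
similar = two_means_split(scores)
```

The split adapts to the score distribution of each repository instead of relying on a single fixed cut-off.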

The work closest to ours is an empirical study of the effectiveness of current detection tools against code obfuscation (Schulze and Meyer 2013). The authors created the Artifice source code obfuscator and measured the effects of obfuscation on clone detectors. However, the study was limited to only three detectors (JPlag, CloneDigger, and Scorpio), and bytecode obfuscation was not considered. The study showed that token-based clone detection outperformed text-, tree-, and graph-based clone detection, similar to our findings.

7 Conclusions

This study of source code similarity analysers is the largest existing similarity detection study covering the widest range (30) of similarity detection techniques and tools to date. We found that the techniques and tools achieve varied performances when run against five different scenarios of modifications on source code. Our analysis provides a broad, thorough, performance-based evaluation of tools and techniques for similarity detection.

Our experimental results show that highly specialised source code similarity detection techniques and tools can perform better than more general textual similarity measures. ccfx offers the highest performance on pervasively modified code and jplag-text on boiler-plate code. However, general string matching techniques, fuzzywuzzy and difflib, outperform dedicated code similarity tools in some cases especially for code with heavy structural changes. Moreover, we confirmed that compilation and decompilation can be used as an effective normalisation method that greatly improves similarity detection between Java source code, leading to three clone and plagiarism tools not reporting any false classification on our generated data set. The evaluation of ranked results provides a guideline for tool selections when one wants to retrieve only the highly similar results to the code query.

Once again, our study showed that similarity detection techniques and tools are very sensitive to their parameter settings. One cannot simply use default settings, or reuse on one data set settings that have been optimised for another.

Importantly, the results of the study can be used as a guideline for researchers to select a proper technique with appropriate configurations for their data sets.