Semantic Clone Detection via Probabilistic Software Modeling

Semantic clone detection is the process of finding program elements with similar or equal runtime behavior. An example is detecting the semantic equality between the recursive and the iterative implementation of the factorial computation. Semantic clone detection is the de facto technical boundary of clone detectors. In recent years, this boundary has been tested using interesting new approaches. This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity. We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a stable and precise solution to semantic clone detection. PSM builds a probabilistic model of a program that is capable of evaluating and generating runtime data. SCD-PSM leverages this model and its model elements for finding behaviorally equal model elements. This behavioral equality is then generalized to semantic equality of the original program elements. It uses the likelihood between model elements as a distance metric. Then, it employs the likelihood ratio significance test to decide whether this distance is significant, given a pre-specified and controllable false-positive rate. The output of SCD-PSM consists of pairs of program elements (i.e., methods), their distance, and a decision on whether they are clones or not. SCD-PSM yields excellent results with a Matthews Correlation Coefficient greater than 0.9. These results are obtained on classical semantic clone detection problems such as detecting recursive and iterative versions of an algorithm, but also on complex problems used in coding competitions.


Introduction
Copying and pasting source code fragments leads to code clones, which are considered an anti-pattern. Code clones increase maintenance costs [31,32], promote bad software design [29,13,17], and introduce or propagate bugs [4,28,14]. However, duplicating code fragments also allows faster adaptation to requirements, the re-use of stable and well-tested solutions [25,26], and helps to overcome language limitations [21,35], thereby lowering development costs. The impact of code clones and the contradicting evidence various studies provide are the topics of an ongoing discussion in the community. Meanwhile, it is certain that developers will continue duplicating source code to leverage its benefits, despite its drawbacks. The key is the awareness and management of clones to maximize efficiency while balancing quality.

The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry of Science, Research and Economy, and the Province of Upper Austria in the frame of the COMET center SCCH. This research was funded in part by the Austrian Science Fund (FWF) [P25513].
Traditionally, the clone taxonomy distinguishes between four types of clones [35,2,34]. Type 1-3 describe code clones caused by copying and pasting the source code with or without changes. Type 4 clones describe code clones that do not have any syntactic similarity but implement the same functionality (semantic equivalence). For example, the recursive and iterative implementation of an algorithm (e.g., Fibonacci computation) have no syntactic similarity while implementing the same functionality. Existing tools have limited or no capabilities to detect Type 4 clones [19]. Most current studies exclude them because of the lack of tool support [23,35,2,39,11]. Nevertheless, Type 4 clones exist, and recent research efforts have tried to deepen the understanding of them [19,49,20]. This article provides a significant contribution to semantic clone detection in the form of novel concepts and a prototype implementing them.
We present Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM). SCD-PSM extends our work on Probabilistic Software Modeling (PSM) [43] via a semantic clone detection pipeline. PSM builds Probabilistic Models (PMs) from programs. It analyzes the static structure and dynamic runtime behavior and replicates the program in the form of a generative probabilistic model. These models allow developers to reason about the semantics of a program. SCD-PSM extends this work by leveraging the PMs and causal reasoning to find semantically (i.e., behaviorally) equivalent code elements. SCD-PSM allows full quantification of the behavioral distance of code elements via likelihoods. Furthermore, the likelihood evaluation via PMs allows for statistical significance tests to decide whether a pair of code elements are clones. SCD-PSM detects semantic clones with no textual similarity, such as the iterative and recursive version of an algorithm. The average performance of the approach reaches a Matthews Correlation Coefficient of 0.965 on a complex problem set, indicating a robust method for semantic clone detection. This work extends our previous work [41] with a full evaluation and the theoretical foundation.
Section 2 provides the background needed to understand SCD-PSM, including the basics of PSM. Section 3 clarifies what semantic clones are in the context of this work. Section 4 presents the approach in which representation, search space, and the various similarity stages are described. Section 5 evaluates the approach while Section 6 discusses the results. Limitations of the approach and possible threats are given in Section 7 and Section 8. Section 9 compares the work to the state of the art, and Section 10 concludes this article.

Background
The clone detection research community has a long history and defines many concepts, algorithms, and tools. In contrast, Probabilistic Software Modeling (PSM) is relatively new and combines software engineering and probabilistic modeling. Some terms need clarification; others require an introduction if they diverge from their traditional names.

Clone Detection
Clone detection is the process of finding two similar program fragments. Listings 1.1 to 1.4 are four different implementations of the factorial function (n!). Listing 1.1 is a for-loop implementation, Listing 1.2 uses a while-loop, and Listing 1.3 is recursively defined. Finally, Listing 1.4 delegates its implementation to fc() from Listing 1.3 but may also return −1 in case of invalid inputs (including n = 0).
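Since the listings themselves are not reproduced here, a minimal Python sketch of the four variants may help. The method names fa-fd follow the text; the exact guard condition in fd is an assumption pieced together from Section 3, not the original listing:

```python
def fa(n):
    """Factorial via for-loop (cf. Listing 1.1)."""
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

def fb(n):
    """Factorial via while-loop (cf. Listing 1.2)."""
    result = 1
    while n > 1:
        result *= n
        n -= 1
    return result

def fc(n):
    """Factorial defined recursively (cf. Listing 1.3)."""
    return 1 if n <= 1 else n * fc(n - 1)

def fd(n, guard="other"):
    """Delegates to fc() but may return -1 for inputs it deems invalid
    (cf. Listing 1.4); the guard condition is a hedged reading of Sect. 3."""
    if guard == "val" and n <= 1:
        return -1
    return fc(n)
```

All four compute the same function on valid inputs (fa(5) == fb(5) == fc(5) == 120) while sharing essentially no syntactic structure, which is exactly the Type 4 situation SCD-PSM targets.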
Representation, pairing, similarity evaluation, and clone decision are the core concepts of clone detection. Representations describe on which artifact the detector operates, such as text, graphs (e.g., AST), or probabilistic models. Pairing describes the selection of two code fragments that are potentially clones (e.g., fa() and fb()). Each pair is called a candidate clone pair (or candidate pair). The similarity evaluation measures the similarity of a candidate pair, e.g., by counting the number of different characters. Finally, the clone decision labels the candidate pair as a clone given a criterion on the similarity, e.g., less than ten different characters.
The properties of the similarity metric split clones into two groups [35]. Type 1-3 clones capture textual similarity while Type 4 clones capture semantic similarity [2,23,24,35,34,44]. Type 1 (Exact Clones) clones are program fragments that are identical except for variations in white-space and comments. Type 2 (Parameterized Clones) clones are program fragments that are structurally or syntactically similar except for changes in identifiers, literals, types, and comments. Type 3 (Near-Miss Clones) clones are program fragments that include insertions or deletions in addition to changes in identifiers, literals, types, and layouts. Type 4 (Semantic Clones) clones are program fragments that are functionally or semantically similar (i.e., perform the same computation) without textual similarities. These types are increasingly challenging to detect, with Type 4 being the most complex one. Note that the definition of Semantic Clones is often relaxed, where up to 50% syntactic similarity of the code fragments is allowed (e.g., BigCloneBench [39]). However, we consider these clones as complex Type 3 clones (additions, deletions, reordering) and not as semantic clones. This means that semantic clones in the context of this work are clones with no syntactic similarity except for per-chance similarities.
We will use a ≈ b to denote that a is a clone of b. Furthermore, a ≉ b denotes that a is not a clone of b.

Programs & Code Elements
PSM generalizes object-oriented terms to describe code elements in a program. Code elements are types (T), properties (Pr), and executables (Ex) that refer to, e.g., classes, fields, and methods in Java [1], or classes, properties, and functions in Python [45]. Additional code elements are parameters (Pa) and results (Re) of executables that refer to parameters and return values of a method. Properties, parameters, and results are atomic code elements that have identifiable states at runtime. Types and executables are compositional elements that act as a collection of atomic elements. Types declare properties and executables, capturing structural relationships. Executables have behavioral relationships that are categorized into inputs (I) and outputs (O). Inputs are received parameters (Pa_I), read properties (Pr_I), and requested invocation results (Re_I). Outputs are returned executable results (Re_O), written properties (Pr_O), and provided parameters (Pa_O). We will denote atomic elements in lowercase, and compositional elements in bold-face lowercase, e.g., n and fa in Listing 1.1. Executable results are named after their executables, e.g., fa in Listing 1.1.
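A hypothetical fragment may make the taxonomy concrete; all identifiers here are illustrative and not taken from the studied programs:

```python
class Order:                       # type (T)
    tax_rate = 0.2                 # property (Pr)

    def total(self, net):          # executable (Ex); net is a parameter (Pa)
        # Inputs of total: received parameter net (Pa_I)
        # and read property tax_rate (Pr_I).
        gross = net * (1 + self.tax_rate)
        # Output of total: the returned executable result (Re_O).
        return gross
```

The atomic elements net, tax_rate, and the result of total have identifiable runtime states; the compositional elements Order and total collect them.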

Probabilistic Software Modeling
Probabilistic Software Modeling (PSM) [40] is a data-driven modeling paradigm that transforms a program into a Probabilistic Model (PM). PSM extracts the structure and behavior of a program. Code elements and their dependency graph represent the structure as described in Section 2.2. All observable events at runtime represent the behavior. The resulting PM and its model elements are a probabilistic copy of the program.
Model elements in the PM are the equivalents of code elements in the program. P(x) denotes the probability distribution of variable x, e.g., P_fa(n) denotes the probability distribution of input parameter n of the fa-method. p(x) denotes the probability of a specific event of a variable, e.g., p_fa(n = 2). This extends the notation of code elements with probabilistic quantities. However, the notation reasons about the probabilistic behavior of code elements instead of their structural properties.
Each model element is a flow-based latent variable model [7] that learns an invertible mapping f : X → Z between the original observations and an isotropic unit-norm Gaussian N(0, 1). An example for x ∈ X may be n ∈ fa, with n_z ∈ fa_z being its latent Gaussian representation. The Gaussian latent space enables the model elements to generate new samples and evaluate the likelihood of samples.
Generation (or sampling) draws, either marginally or conditionally, observations from a model element, simulating the execution of the corresponding code element. For example, drawing 100 observations from fa ∼ P_fa(n, fa), i.e., values for n_I and fa_O, simulates 100 program executions of this method. An example for conditional generation would be fa|_{n<10} ∼ P_fa(fa | n < 10), which only draws observations where n < 10. The process involves sampling from the latent Gaussian variables and inverting the Gaussian samples to the original domain via the flow f^{-1}(z) = x. Evaluation takes observations and evaluates their likelihood under a model element. For example, P_fa(n = 4, fa = 24) evaluates the likelihood of input 4 and output 24 under the fa model element. Evaluation involves mapping a given sample into the latent space and evaluating it under the Gaussians, p_{N(0,1)}(f(x)). Generation and evaluation are the core of any PSM application and of SCD-PSM. A detailed description is given in our previous work [43].
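A toy sketch may illustrate generation and evaluation. We substitute a simple affine map for the NVP flow, so the model structure and its parameters (mu, sigma) are assumptions for illustration, not PSM's actual implementation:

```python
import math
import random

class AffineFlow:
    """Toy stand-in for an NVP flow: f(x) = (x - mu) / sigma maps the
    data domain to a unit Gaussian latent space."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def f(self, x):                     # data -> latent
        return (x - self.mu) / self.sigma

    def f_inv(self, z):                 # latent -> data
        return z * self.sigma + self.mu

    def sample(self, n, rng):
        """Generation: draw z ~ N(0, 1), invert via f^{-1}(z) = x."""
        return [self.f_inv(rng.gauss(0.0, 1.0)) for _ in range(n)]

    def log_likelihood(self, x):
        """Evaluation: log p_N(0,1)(f(x)) plus the log-det Jacobian
        of the affine map (here simply -log sigma)."""
        z = self.f(x)
        log_pz = -0.5 * (z * z + math.log(2 * math.pi))
        return log_pz - math.log(self.sigma)

model = AffineFlow(mu=4.0, sigma=2.0)
rng = random.Random(0)
draws = model.sample(100, rng)          # simulates 100 executions
ll = model.log_likelihood(4.0)          # highest likelihood at the mean
```

Observations near the mean evaluate to a higher log-likelihood than distant ones, which is the mechanism the later similarity stages exploit.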

Semantic Clones
A clear understanding of what SCD-PSM defines as a semantic clone is essential for understanding the approach and its design choices.

Definition 1. A semantic clone is a pair of executables whose (partial) input and output relationships exhibit significant (conditional) similarities.
Definition 1 defines semantic clones over the similarity between IO relationships of executables. This holds even if the IO relationships are only partially similar, i.e., not all combinations of IO pairs between executables have to be similar. For example, fd in Listing 1.4 has two IO pairs (fd_IO), while fa in Listing 1.1 has one IO pair (fa_IO = {(n, fa)}). According to the definition, at least one IO pair comparison needs to be similar for both executables to be declared a semantic clone (e.g., (n, fd) ≈ (n, fa)).
Furthermore, the similarities between IO pairs may only be conditional, i.e., the similarity of matching IO pairs may depend on the state of any other code element in the comparison context. For example, the IO pair (n, fd) ≈ (n, fa) is only a perfect clone in case fd.guard != "val". If fd.guard == "val", the IO behavior would differ for n = 1 (fd(1) ↦ −1 while fa(1) ↦ 1). According to the definition, at least parts of the behavior need to be similar, capturing complex multidimensional behavioral patterns in IO relationships.
The rationale behind the comparison of IO relationships is one of cause and effect. If a pair of executables exhibits similar effects given similar causes, then their computational behavior is identical. Extending this rationale to multiple inputs and outputs leads to partial conditional similarity.

Approach
The approach represents a rejecting filter pipeline that candidate pairs must traverse in order to be declared a clone (Figure 1, leading from the source code to the probabilistic model). Static, dynamic, and model similarity represent filter stages of increasing complexity.
The main contribution of this work is the implementation of a semantic clone detection pipeline on top of PSM. Further, we provide an effective process of traversing the potentially large search space of candidate pairs. Finally, we show that the behavioral equivalence of model elements generalizes to the semantic equivalence of code elements.

Modeling
Starting from the Source Code in Figure 1, PSM builds a Probabilistic Model (PM) [40] of the program (1). The PM is also called the Inference Graph (IG), which is a cluster graph [22] with Non-Volume Preserving Flows (NVPs) [7] as clusters. SCD-PSM uses this PM as a representation for the clone detection, similar to text-based clone detectors that use text fragments. The PM is the output of PSM and is considered as given in the context of SCD-PSM.
Executable model elements in the PM act as surrogates for the executables in the program. SCD-PSM pairs these model elements and computes their similarity. If a behaviorally equivalent model element pair is found, then it can be seen as a semantically equivalent code element pair. In conclusion, SCD-PSM allows for method-level semantic clone detection based on PMs representing the original executables in the program. The detection operates on multiple abstraction levels. Figure 2 illustrates these levels, starting with the program and ending with the inputs and outputs of an executable. The second step in Figure 1 builds a within- and between-executable space that SCD-PSM searches for clones. The Between-Executable Space (BES) is the set of executable combinations

Search Space
BES = { {ex_a, ex_b} | ex_a, ex_b ∈ Ex, ex_a ≠ ex_b },    (1)

where {ex_a, ex_b} is a candidate pair (or executable pair), and Ex is the set of all executables in the current analysis (illustrated in Figure 2). The theoretical size of the between-executable space is the number of 2-length combinations without replacement,

|BES| = |Ex| (|Ex| − 1) / 2,    (2)

where |·| describes the size of the underlying set. Note that the size of the BES is smaller than the Cartesian product since {a, b} = {b, a}. Figure 1 shows this pairing process in the Search Space aspect (2). The Within-Executable Space (WES) of a candidate pair {a, b} is the product of its IO pairs,

WES_ab = (I_a × O_a) × (I_b × O_b).    (3)

Figure 2 illustrates the WES and one IO pair from the WES that we also call a link. The theoretical size of the within-executable space is |WES_ab| = |I_a| |O_a| |I_b| |O_b|. For the sake of visualization, IO pairs are not shown in Figure 1 but are abstracted in their executable elements. The maximum theoretical search space is the sum of |wes(BES_i)| over all i, given that wes describes a construction function according to Equation (3), and BES_i is the i'th candidate pair. In practice, SCD-PSM evaluates only a fraction of possible combinations because of the skip evaluation. The skip evaluation consists of two search-space-limiting factors: greedy evaluation and transitive similarity. Greedy evaluation stops the search through the WES once a similar pair is found. The initial detection process only confirms the similarity of a candidate pair; a post-analysis can then extract all possible IO similarities for potential actions. Transitive similarity skips evaluations in the BES because, given a ≈ b ≈ c, also a ≈ c holds. In conclusion, SCD-PSM compares IO pairs of executable model elements and uses skip evaluation to traverse the search space efficiently.
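The pairing and the transitive-similarity skip can be sketched as follows. Here `are_clones` is a hypothetical stand-in for the full filter pipeline, and the union-find bookkeeping is our own illustration of transitive skipping, not Gradient's implementation:

```python
from itertools import combinations

def bes_size(n_executables):
    """|BES| = n(n-1)/2: 2-length combinations without replacement."""
    return n_executables * (n_executables - 1) // 2

def detect(executables, are_clones):
    """Traverse the BES; transitive similarity (a ~ b ~ c => a ~ c)
    skips pairs whose clone class is already known."""
    parent = {e: e for e in executables}   # union-find forest
    def find(e):
        while parent[e] != e:
            e = parent[e]
        return e
    evaluated = 0
    for a, b in combinations(executables, 2):
        if find(a) == find(b):
            continue                       # skipped: transitively known
        evaluated += 1
        if are_clones(a, b):
            parent[find(b)] = find(a)      # merge clone classes
    return parent, evaluated
```

With 108 implementation variants, bes_size(108) yields the 5778 candidate pairs reported in Section 5; for four mutually equivalent executables only 3 of the 6 pairs need a full evaluation.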

Static Similarity
The static similarity stage is a filter that accepts candidate pairs based on their data type, as shown in Figure 1. Data types in a PSM model are integers, floats, and text.
Input (3) of the stage are the IO pairs WES_ab = wes({a, b}) of a candidate. The filter criterion (4) accepts a candidate pair if at least one link (i.e., IO pair) has a matching data type, i.e., both the input and the output have a matching data type. Output (5) is a boolean decision whether the candidate pair is a clone or not from a static viewpoint. If positive, the candidate pair is moved to the next pipeline stage, i.e., the dynamic similarity evaluation (see Figure 1). If negative, the candidate pair is marked as not a clone (a ≉ b) and no further processing is conducted. For example, the IO pairs (n, fa) and (n, fb) would be statically accepted as clones as both inputs and outputs have the same data type (integer). A counterexample is given by (n, fa) and (guard, fd), where the input data types are integer and text.
The static similarity stage presupposes that the analyzed program is written in a programming language that allows for static type analysis. Programs written in languages without static typing cannot make use of this filter stage. In conclusion, the static similarity stage filters candidates based on their data types.
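A minimal sketch of this filter, assuming links are represented as pairs of (input type, output type) tuples over the PSM data types integer, float, and text:

```python
def static_similarity(wes_ab):
    """Accept a candidate pair if at least one link has matching input
    and output data types. Each link is a pair of (in_type, out_type)
    tuples, one per executable; the encoding is an assumption."""
    return any(types_a == types_b for types_a, types_b in wes_ab)

# (n, fa) vs (n, fb): integer -> integer on both sides, accepted.
accepted = static_similarity([(("int", "int"), ("int", "int"))])
# (n, fa) vs (guard, fd): integer vs text input, rejected.
rejected = static_similarity([(("int", "int"), ("text", "int"))])
```

Because `any` short-circuits, a single matching link suffices, mirroring the "at least one link" criterion (4).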

Dynamic Similarity
The dynamic similarity stage is a filter that accepts candidate pairs based on the runtime data, as shown in Figure 1. Candidate pairs are accepted if at least one IO pair (6) has an insignificantly diverging runtime distribution (7). This boolean decision is evaluated via a Kolmogorov-Smirnov test [30] and determines whether a pair is a clone from a dynamic viewpoint (8). For example, the IO pair (n, fa) vs. (n, fd) with guard == true would be excluded by the filter given that runtime events with n = 0 reach a majority. In comparison, (n, fa) ≈ (n, fb) would be accepted by the stage.
A requirement is that the candidates use a synthetic trigger. Otherwise, the comparison of the data distributions may fail because of the different modi operandi of the program. For example, running fa and fb where n_fa ∼ U(0, 4) and n_fb ∼ U(5, 10) would cause the dynamic stage to fail even if the implementations are equivalent. Property-based [12] or random testing can be used to generate diverse synthetic inputs.
In conclusion, the dynamic similarity stage pre-filters candidates based on univariate tests on the input and output events.
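The univariate test can be sketched with a hand-rolled two-sample Kolmogorov-Smirnov statistic. The critical-value approximation (coefficient 1.36 for alpha = 0.05) is the textbook asymptotic formula, not necessarily the threshold used by the prototype:

```python
def ks_statistic(xs, ys):
    """Two-sample KS statistic: the maximum distance between the
    empirical CDFs of the two samples."""
    values = sorted(set(xs) | set(ys))
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in values)

def dynamic_similarity(xs, ys, coeff=1.36):
    """Accept the link if the divergence is insignificant; 1.36 is the
    asymptotic critical coefficient for alpha = 0.05."""
    n, m = len(xs), len(ys)
    critical = coeff * ((n + m) / (n * m)) ** 0.5
    return ks_statistic(xs, ys) <= critical
```

Two executables triggered on disjoint input ranges (the U(0, 4) vs. U(5, 10) example above) are rejected by this test even if semantically equal, which is why a shared synthetic trigger is required.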

Model Similarity
The model similarity stage is a filter that accepts candidate pairs based on the models, as shown in Figure 1. This stage conducts a multivariate test by sampling from the executable models and cross-evaluating them. This test includes the evaluation of conditional influences caused by elements that are not actively participating in an IO pair. For example, (n, fd) ≈ (n, fa) holds but is conditionally dependent on guard. The model similarity can factor guard into its decision, while the dynamic stage can only evaluate the average behavior of an IO pair.
Input (9) are the IO pairs of a candidate WES_ab = wes({a, b}). The crosswise log-likelihood ratio of the models is computed by (conditional) generation and evaluation. Output is a boolean decision on whether the candidate pair is a clone or not, from a model viewpoint. Figure 1 illustrates the entire process of the model similarity. The roles between the null and alternative models are then swapped, and the process is repeated. Both log-likelihood ratios are then combined by a pooling operator to produce the clone decision (14).
The role swap is needed to avoid sub-model relationships. For example, if M_null = N(0, 3) and M_alt = N(0, 1), then LL_alt will be very high because M_alt is a sub-model of M_null. Reversing the roles highlights the differences between the models.
The final decision is based on the Generalized Likelihood Ratio Test (GLRT) [10]. It measures whether the log-likelihoods are significantly different from 0, where λ is the test statistic. The null hypothesis is rejected for small ratios λ ≤ c, where c is set to an appropriate false-positive rate. For example, λ < log(0.01) allows 1 out of 100 candidates to be a false positive, i.e., wrongly rejecting semantic equivalence. The pooling operator combines the link results either via hard or soft pooling. Hard pooling conducts a GLRT for both links, yielding a positive decision if both links are positive. Soft pooling averages the link log-likelihood ratios and then computes the GLRT, yielding a positive decision if the joint GLRT is positive. Hard pooling does not allow any sub-model relationships, while soft pooling relaxes this constraint.
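The cross-evaluation with role swap and hard pooling might look as follows for one-dimensional Gaussian stand-ins. Real model elements are flows over multivariate IO pairs, so this is only a structural sketch; the particle count n = 50 follows Section 5.3:

```python
import math
import random

def gauss_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * (((x - mu) / sigma) ** 2 + math.log(2 * math.pi)) - math.log(sigma)

def llr(null, alt, n=50, seed=0):
    """Sample |D| = n particles from the null model and return the mean
    log-likelihood ratio of the alternative vs. the null model."""
    rng = random.Random(seed)
    xs = [rng.gauss(*null) for _ in range(n)]
    return sum(gauss_logpdf(x, *alt) - gauss_logpdf(x, *null) for x in xs) / n

def model_similarity(model_a, model_b, c=math.log(0.01)):
    """Hard pooling: both role-swapped ratios must exceed the critical
    value c (corresponding to a 1 % false-positive rate)."""
    return llr(model_a, model_b) > c and llr(model_b, model_a) > c
```

Identical models yield a ratio of exactly 0 and pass; clearly different models (e.g., N(0, 1) vs. N(10, 1)) produce strongly negative ratios and are rejected.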
In conclusion, the model similarity conducts a multivariate significance test between two models, including possible conditional dependencies.

Study
This study answers the following research questions. Q1: Does behavioral equality between model elements generalize to semantic equality of code elements? Q2: Does the skip evaluation significantly reduce the computational demand of SCD-PSM? Q3: Does the skip evaluation negatively impact the detection performance (i.e., precision, recall, and MCC)? Q1 answers whether semantic clones can be detected via SCD-PSM. Q2 answers whether the search space can be efficiently processed using skip evaluation. Q3 answers how the skip evaluation influences the performance of the detection process. This is important because candidate pairs might be skipped based on false positives or false negatives.

Setup
We implemented a prototype for SCD-PSM on top of Gradient [40], a prototype for PSM. The elements and data flow of the detection process are shown in Figures 1 and 2. The input source code comprised 13 different clone classes with a total of 108 implementation variants. This includes classical algorithms implemented recursively and iteratively, such as bubble sort, as well as hard problems from the programming competition Google Code Jam. Candidates that passed the entire filter pipeline were marked as clones.

Dataset
The study uses three well-known algorithms and 10 Google Code Jam 2017 (GCJ) problems. The total dataset contains 108 implementation variants across 13 clone classes, described by Instance. Each clone class was differentially tested to verify the behavior across instances. Factorial, Fibonacci, and Sort do not need any further explanation. The GCJ problems are well-specified complex optimization problems packaged in an everyday theme.
The dataset contains in total 5778 (see Equation (2)) candidate pairs of which 458 are semantic clones and 5320 are not. This yields a positive to negative ratio of 1 : 11.6, indicating a highly imbalanced distribution. An even more pronounced imbalance is to be expected in real-world applications.
Each instance was triggered with input data to allow PSM to model the different implementations. Factorial, Fibonacci, and Sort were triggered by sampling from a uniform distribution U(0, 20). GCJ problems were triggered by the input data provided by the competition. Each instance received the same trigger.
GCJ problems read from and write to the standard stream, which is impractical in terms of reproducibility. Our dataset is constructed such that each implementation has a run-method representing the cloned executable. The study results are limited to the run-method even if the solutions use helper methods.
Helper methods may, for example, be methods that compute parts of the final solution, or reorganize the data. This guarantees a proper problem scope, a well-defined recall and precision, and a clearly defined benchmark for future reproducibility.

Controlled Variables
The study controls for the search space traversal, i.e., whether the skip evaluation is used or an exhaustive search is conducted. An additional fixed parameter is the number of particles, which defines the sample size that is generated during the model similarity, |D| = 50.

Response Variables
The response variables are the detection quality and the duration in seconds. The Matthews Correlation Coefficient (MCC) measures the quality of the clone detection in the form of a correlation ranging from −1 to 1, with 0 being a random selection. The MCC will be the reference performance metric as it is the most robust metric in an imbalanced binary classification setting [3]. It is a correlation coefficient that may be interpreted by the guidelines proposed by Evans [9].
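For reference, the MCC is computed from the confusion matrix as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A small sketch using the class counts of the dataset from Section 5.2:

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from a confusion matrix;
    returns 0.0 for a degenerate denominator by convention."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# A detector that finds nothing on the imbalanced dataset (458 clones,
# 5320 non-clones) scores 0.0, i.e., no better than random selection.
no_detection = mcc(tp=0, fp=0, tn=5320, fn=458)
perfect = mcc(tp=458, fp=0, tn=5320, fn=0)
```

Unlike accuracy, which would already reach 0.92 for the do-nothing detector on this 1 : 11.6 imbalance, the MCC stays at 0, which is why it serves as the reference metric.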

Comparison of Clone Detectors
In total, eight alternative approaches are used to contextualize the performance of SCD-PSM. The alternatives vary widely in terms of internal representation and clone detection capabilities, as listed in Table 3. ASTNN (8) and ASTNN Leaky (9) are the same approach but have different evaluation methods. ASTNN Leaky (9) uses a random split of the dataset as reported by the authors [50]. It overestimates the performance of the approach via a lack of isolation between training and test dataset. For example, fa ≈ fb and fa ≈ fc might be in the train split while fb ≈ fc might be in the test split. ASTNN (8) uses a group-wise Cross-Validation (CV), where clone classes are entirely isolated either into the training or the test proportion of the dataset. This represents a real-world situation where the detector is first fitted and then applied to a new system with unknown code fragments. Detectors that report lines instead of methods may produce more results (TP, FP, TN, FN) than present in the dataset. A similar situation is given by ASTNN Leaky, which runs multiple evaluations via the cross-validation.

Experiment Results
Creating the PSM model with Gradient took 2134.38 s, resulting in an average modeling time of 19.75 s for the 195 executables. This includes 87 helper methods. Table 1 contains the aggregate results of the top-5 experiments along with the results of the worst experiment. The bottom line in Table 1 is the average across the experiments. Of the evaluation time, the dynamic stage accounts for 10.6 % and the model stage for 89.3 % (see Table 2). Table 3 lists the detection results of eight alternative clone detectors. Simian, NiCad, and CCAligner found no clones in the dataset. PMD, SourcererCC, Oreo, and iClones found some clones (< 20) with a low recall (4 %). Each of these detectors has a very weak performance below an MCC of 0.20. ASTNN with the leaky evaluation has a very strong performance with an MCC of 0.976. ASTNN 3-Group CV has a strong performance with an MCC of 0.711. The longest computational duration is given by ASTNN with 1034 min.

Discussion
The goal of the study was to provide evidence of whether behavioral equality of model elements generalizes to semantic equality of code elements (Q1). Furthermore, we were interested in the skip evaluation and its performance implications (Q2 and Q3). Table 1 and Table 2 present strong results in favor of Q1. The MCC for the top-5 experiments was very strong, with all MCCs being above 0.9. Even the worst experiment still yielded a moderate performance of 0.749. Table 3 provides additional context by presenting the detection results of alternative clone detectors. As expected, tools relying heavily on the textual representation of clones have very low recall on the dataset (Simian, NiCad, CCAligner, PMD). Most clones found by the alternative tools span only a few lines of code. In contrast, iClones finds large clones that include array accesses and manipulations. ASTNN is the best comparison tool and finds many clones with good precision. The approach is sensitive to hyper-parameters and to the training and test split, leading in some cases to a test performance close to an MCC of 0. The low recall of the Type 1-3 detectors indicates the high quality of the dataset. The moderate recall of the Type 3/4 detectors indicates the high quality of SCD-PSM. Given this evidence, we conclude that Q1 holds. Q1 - Behavioral equality between model elements generalizes to semantic equality of code elements, allowing for semantic clone detection via probabilistic software modeling.

Research Question 2 -Skip Evaluation Scalability
The goal of the static and dynamic stages is to reduce the number of evaluations that the model stage must conduct. Each stage incurs an increasing cost of evaluation per candidate, with the model stage taking the largest share of the evaluation time (89 %). Every TP has to pass the model stage to be declared a clone (rejecting pipeline). The skip evaluation avoided, on average, the recomputation of 74 % (340) of the TP candidate pairs. The greedy evaluation avoided, on average, the evaluation of 37 % of IO pairs. This offloads most of the evaluation time to the earlier stages, which are computationally inexpensive, while shortcutting the model stage. In comparison to the alternative detectors, SCD-PSM needs substantially more time to compute (29 min vs. 1.32 min). An exception is ASTNN, which has a similar runtime to SCD-PSM. Most of the runtime of SCD-PSM is caused by operational overhead, e.g., loading the model from the database. Optimizing this overhead could, as a theoretical maximum, reduce the overall runtime on the dataset to 6.49 min, given the average durations for each stage in Table 2. In conclusion, the skip evaluation reduces the number of model evaluations, which are responsible for most of the evaluation time, down to a quarter.
Q2 -Skip evaluation reduces the number of evaluations for the most expensive stage (model) in the SCD-PSM pipeline significantly.

Research Question 3 -Skip Evaluation E ects
Skip evaluation can cause cascading errors given an FP. Once an FP is introduced, every semantic clone related to the FP has a chance to become an FP in the same (wrong) clone class itself. These cascading FPs are potential sources of serious performance degradation. Skip evaluation experiments are ranked higher and are significantly better than experiments that conducted an exhaustive search. However, the absolute performance gain is only an MCC of 0.056, hinting at a per-chance significance introduced by the small sample size (16 experiments). Nevertheless, given the evidence in Table 1 and Section 5.6, we can conclude that skip evaluation does not affect the performance of the detector.
Q3 -The skip evaluation has no negative impact on the performance of the detector given low false-positive rates.

Limitations
SCD-PSM inherits the limitations of PSM, such as its need for a runnable program to build the model. PSM only models the application structure and its data, not references. References are changing addresses with no meaningful relation to the behavior of the running program; hence, they have no underlying distribution that can be modeled. However, once references are dereferenced, e.g., by accessing a field, the accessed data will be part of the model and therefore usable in SCD-PSM. Nevertheless, algorithms whose sole purpose is manipulating references do not work with SCD-PSM.
PSM explodes lists into singular values, since distributions do not contain any order information. This means executables that change the order of sequences are matched based on the values, not their order. As a consequence, an ascending and a descending sorting algorithm are considered semantically equivalent, leading to a false positive. Extending PSM to distributions over sequences would alleviate the issue but is not a trivial task.
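The order-blindness can be demonstrated directly: after exploding lists into singular values, an ascending and a descending sort produce identical value multisets. The toy comparison via `Counter` below is illustrative and stands in for PSM's actual distribution model:

```python
from collections import Counter

def sort_asc(xs):
    return sorted(xs)

def sort_desc(xs):
    return sorted(xs, reverse=True)

def value_distribution(xs):
    # "exploded" view in the spirit of PSM: keep values, drop order
    return Counter(xs)

data = [3, 1, 2]
# Different outputs as sequences...
assert sort_asc(data) != sort_desc(data)
# ...but identical once order is discarded -> a false positive
assert value_distribution(sort_asc(data)) == value_distribution(sort_desc(data))
```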
SCD-PSM cannot detect Type 2-3 clones, since textual similarities represent a different problem set. A proof can easily be constructed by adding an arbitrary number of statements that do not influence the behavior of the program but mislead text-based detectors. Inversely, changing a single character, e.g., a multiplication to a division, may alter the entire behavior while preserving the overall textual similarity.
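Both directions of this argument can be made concrete with hypothetical snippets: padding with dead statements leaves the behavior unchanged while destroying textual similarity, and a one-character edit preserves the text while changing the behavior:

```python
def scale(x):
    return x * 2

def scale_padded(x):
    # dead statements mislead text-based detectors,
    # but the behavior is identical to scale()
    unused = [i for i in range(3)]
    del unused
    return x * 2

def scale_divided(x):
    # one character changed (* -> /): textually near-identical
    # to scale(), behaviorally different
    return x / 2

assert scale(8) == scale_padded(8) == 16
assert scale(8) != scale_divided(8)
```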
We employed a controlled laboratory evaluation strategy that allowed us to evaluate the performance metrics exactly and compare them fairly between different clone detectors. This follows a recent trend [38,46,48] in light of criticism of opportunistic evaluations on arbitrary open-source projects. The controlled laboratory evaluation provides purely functional performance results given a fixed and controlled sample of programs. The generalizability of results obtained from laboratory evaluations is limited; an opportunistic evaluation strategy avoids this problem. However, that strategy is prone to biases caused by human oracles (often the authors themselves) or proxy oracles that judge the clones. Moreover, a fair comparison between detectors is hardly possible because the true recall of clones is in general unknown. A combination of both evaluation strategies may yield precise and generalizable results; this extension of the study is part of our future work.

Threats to Validity
A threat to validity in any semantic clone detection study lies in the programs and code fragments used in the evaluation: supposed semantic clones may not exhibit the same functional behavior, or may share too many lexicographical similarities. This study tested every clone class for its behavioral equality. Furthermore, we evaluated text-, token-, graph-, and model-based detectors capable of detecting Type 1-3 clones. The low performance of these Type 1-3 detectors confirmed the high quality of the semantic clones in the benchmark.

Related Work
We started this article by defining what semantic clones mean in the context of our approach (Section 3). While our definition is motivated by the capabilities of our approach, we see strong similarities to the definition of Juergens [19]. Both definitions capture behavioral similarity via IO relationships. Also, Juergens already discussed a notion of partial and conditional similarity. This understanding of Type 4 clones can be seen in multiple more recent studies [8,6,27]. In this, we see the progress of the community on Type 4 clones as the definition becomes more specific.
Many studies have evaluated textual clones. However, only a few studies have reported results on semantic clones without relaxing the definition of Type 4. Rattan et al. [34] provided a review of clone detection studies, including approaches focused on Type 4 clones. They concluded that some approaches solve approximations (i.e., complex Type 3 clones) of Type 4 clones.
Test-based methods randomly trigger the execution of candidates and measure whether equal inputs cause similar outputs. Jiang and Su [18] were able to find semantically equivalent methods without any syntactic similarities. A similar approach was presented by Deissenboeck et al. [6]. One issue with test-based clone detection is that candidates need a similar signature; differences in data types or the number of parameters cannot be handled effectively. SCD-PSM works similarly to test-based methods in that it observes the runtime and compares the resulting behavior. However, SCD-PSM builds generative models from the observed behavior, capable of generating, conditioning, and evaluating data. This allows SCD-PSM to bridge signature mismatches by imputing missing code elements and by using a generalized type system. Zhao and Huang [51] developed DeepSim, which phrases the problem as a binary classification task. DeepSim uses neural networks to learn encodings of the control and data flow without observing the program's runtime. PSM also uses neural networks but learns an underlying representation of the data flow and runtime. DeepSim was also evaluated on a Google Code Jam dataset: it reached an F1 score of 0.76 on the GCJ 2016 competition, while SCD-PSM reached 0.967 on GCJ 2017. While not entirely comparable, the results are a good approximation given the similarity of the datasets.

Conclusions and Future Work
In this article, we presented Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM). PSM builds a Probabilistic Model (PM) from a program that can be used to simulate or evaluate a program. We used these PMs to detect semantic clones in programs that have 0 % syntactic similarity.
We discussed the representation, the search space, and the static-, dynamic-, and model-similarity stages forming the main aspects of SCD-PSM. The study evaluated SCD-PSM in great detail, resulting in an average MCC greater than 0.9. The study also showed the capability to control the false-positive rate, which is important for industry adoption. Finally, we concluded that behavioral equality of model elements generalizes to semantic equality of code elements.
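For reference, the Matthews Correlation Coefficient used as the headline metric is computed from the confusion-matrix counts; this is the standard definition, and the counts in the example are illustrative only:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient over confusion-matrix counts
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# illustrative counts: a detector this accurate exceeds an MCC of 0.9
assert mcc(96, 96, 4, 4) > 0.9
```

Unlike the F1 score, MCC accounts for true negatives as well, which makes it robust on imbalanced candidate sets where non-clones dominate.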
Our future work focuses on constructing a comprehensive benchmark covering controlled and real-world systems for improved generalizability of clone detection studies. Furthermore, semantic clone detection has the potential to enable new methods for fault localization applications [42].
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.