1 Introduction

Mutation testing has been shown to be one of the most effective techniques with respect to fault revelation (Titcheu Chekam et al. 2017). Researchers typically use mutation as an assessment mechanism (measuring effectiveness) for their techniques (Papadakis et al. 2018a), but it can be used like every other test criterion. To this end, mutation can be used to assess the effectiveness of test suites or to guide test generation (Ammann and Offutt 2008; Fraser and Zeller 2012; Petrovic and Ivankovic 2018; Papadakis et al. 2018b; Titcheu Chekam et al. 2017).

Unfortunately, mutation testing is expensive, due to the large number of mutants that require analysis. An important cost parameter is the so-called equivalent mutants, i.e., mutants that form program versions equivalent to the original program (Papadakis et al. 2015; Ammann and Offutt 2008). These need to be inspected manually by testers, since their automatic identification is not always possible (Budd and Angluin 1982).

While the problem of equivalent mutants has been partly addressed by recent methods such as Trivial Compiler Equivalence (TCE) (Papadakis et al. 2015), the problem of the large number of mutants remains challenging. Yet, addressing this problem will in turn help address the equivalent mutant problem: any approach that effectively reduces the large number of mutants indirectly reduces the equivalent mutant problem, since fewer equivalent mutants will remain.

Nevertheless, producing a large number of mutants is impractical. The mutants need to be analyzed, compiled, executed and killed by test cases. Perhaps more importantly, testers need to manually analyse them in order to design effective test cases. The scalability, or lack thereof, of mutation testing, with respect to the number of mutants to be processed, is thus a key factor that hinders its wide applicability and broad adoption (Papadakis et al. 2018a). Consequently, if we can find a lightweight and reasonably effective way to diminish the number of mutants without sacrificing the power of the method, we can significantly improve its scalability. Since the early days of mutation testing, researchers have attempted to find such solutions by forming many mutant reduction strategies (Papadakis et al. 2018a), such as selective mutation (Offutt et al. 1993; Wong and Mathur 1995a) and random mutant selection (T Acree et al. 1979).

Our goal is to form a mutant selection technique that identifies killable mutants that are fault revealing, prior to any mutant execution. We consider as fault revealing any mutant (i.e., test objective) that leads to test cases capable of revealing the faults in the program under test. We argue that such mutants are program specific and can be identified by a set of static program features. In this respect, we need features that are simultaneously generic, in order to be widely applicable, and powerful enough to approximate well the program and mutant semantics.

We advance in this research direction by proposing a machine learning-based approach, named FaRM, which learns on code and mutant properties, such as the mutant type and the mutation location in the program control-flow graph, as well as code complexity and program control and data dependencies, to (statically) classify mutants as likely killable/equivalent and likely fault revealing. This approach is inspired by the prediction modelling line of research, which has recorded high performance by using machine learning to triage likely error-prone characteristics of code (Menzies et al. 2007; Kamei and Shihab 2016).

The use case scenario of FaRM is a standard testing scenario where mutants are used as test objectives, guiding test generation. To achieve this, we train on a set of faulty programs that have been tested with mutation testing, prior to any testing or test case design for the particular system under analysis. Then, we predict the killable and fault revealing mutants, based on which we test the particular system under analysis. The training corpus can include previously developed projects (related to the targeted application domain) or previous releases of the tested software. In a sense, we train on some system(s), say x, and select mutants on the system under test, say y, where x ≠ y.

Experimental results using 10-fold cross validation on 1,692 + 45 faulty program versions show a high performance of FaRM in yielding an adequately selected set of mutants. In particular, our method achieves statistically significantly better results than the random, selective mutation and defect prediction (mutating the code areas predicted by defect prediction) mutant selection baselines, revealing 23% to 34% more faults than any of them. Similarly, our mutant prioritization method achieves statistically significantly higher Average Percentage of Faults Detected (APFD) (Henard et al. 2016) values than random prioritisation (4% to 9% higher in the median case). With respect to test execution, we show that our selection method requires less execution time than random selection.

We also demonstrate that our method is capable of selecting killable (non-equivalent) mutants. In particular, by building an equivalent mutant classification method using our features, we achieve an AUC value of 0.88, with 95% precision and 35% recall. These results indicate drastic reductions in the effort required for the analysis of equivalent mutants. A combined approach, named FaRM*, achieves fault revelation similar to that of FaRM, but potentially at a lower cost (fewer equivalent mutants), indicating the capabilities of our method.

In summary, our paper makes the following contributions:

  • It introduces the fault revealing mutant selection and fault revealing mutant prioritization problems.

  • It demonstrates that the killability and fault revealing utility of mutants can be captured by simple static source code metrics.

  • It presents FaRM , a mutant selection technique that learns to select and rank mutants using standard machine learning techniques and source code metrics.

  • It provides empirical evidence suggesting that FaRM outperforms the current state-of-the-art mutant selection and mutant prioritization methods by revealing 23% to 34% more faults and achieving 4% to 9% higher average percentage of revealed faults, respectively.

  • It provides a publicly available dataset of feature metrics, kill and fault revelation matrices that can support reproducibility, replication and future research.

The paper is organized as follows. Section 2 provides background information on mutation testing, the mutant selection problem and defines the targeted problem(s). Section 3 overviews the proposed approach. Evaluation research questions are enumerated in Section 4, while experimental setup is described in Section 5 and experimental results are presented in Section 6. A detailed discussion on the applicability of our approach and the threats to validity are given in Section 7, and related work is discussed in Section 8. Section 9 concludes this work.

2 Context

2.1 Mutation Testing

Mutation testing (DeMillo et al. 1978) is a test adequacy criterion that sets the revelation of artificial defects, called mutants, as the requirements of testing. Like every test criterion, mutation assists the testing process by defining the test requirements that should be fulfilled by the designed test cases, i.e., by defining when to stop testing.

Software testing research has shown that designing tests that are capable of revealing mutant-faults results in strong test suites that in turn reveal real faults (Frankl et al. 1997; Li et al. 2009; Titcheu Chekam et al. 2017; Papadakis et al. 2018a; Just et al. 2014b) and are capable of subsuming or almost subsuming all other structural testing criteria (Offutt et al. 1996b; Frankl et al. 1997; Ammann and Offutt 2008).

Mutants form artificially-generated defects that are introduced by making changes to the program syntax. The changes are introduced based on specific syntactic transformation rules, called mutation operators. The syntactically changed program versions form the mutant-faults and pose the requirement of distinguishing their observable behaviour from that of the original program. A mutant is said to be killed, if its execution distinguishes it from the original program. In the opposite case it is said to be alive.

Mutation quantifies test thoroughness, or test adequacy (DeMillo et al. 1978, 1991; Frankl and Iakounenko 1998), by measuring the number of mutants killed by the candidate test suites. In particular, given a set of mutants, the ratio of those that are killed by a test suite is called mutation score. Although all mutants differ syntactically from the original program, they do not always differ semantically. This means that there are some mutants that are semantically equivalent to the original program, while being syntactically different (Offutt and Craft 1994; Papadakis et al. 2015). These mutants are called equivalent mutants (DeMillo et al. 1978; Offutt and Craft 1994) and have to be removed from the test requirement set.

Mutation score denotes the degree of achievement of the mutation testing requirements (Ammann and Offutt 2008). Intuitively, the score measures the confidence in the test suites (in the sense that mutation score reflects the fault revelation ability). Unfortunately, previous research has shown that the relation between killed mutants and fault revelation is not linear (Frankl et al. 1997; Titcheu Chekam et al. 2017), as fault revelation improves significantly only when test suites reach high mutation score levels.

2.2 Problem Definition

Our goal is to select among the many mutants the (few) ones that are fault revealing, i.e., mutants that lead to test cases that reveal existing, but unknown, faults. This is a challenging goal since only 2% (according to our data) of the killable mutants are fault revealing.

The fault revealing mutant selection goal is different from that of the “traditional” mutant reduction techniques, which is to reduce the number of mutants (Offutt et al. 1996a; Wong and Mathur 1995b; Ferrari et al. 2018; Papadakis et al. 2018a). Mutant reduction strategies focus on selecting a small set of mutants that is representative of the larger set. This means that every test suite that kills the mutants of the smaller set also kills the mutants of the large set. Figure 1 illustrates our goal and contrasts it with the “traditional” mutant reduction problem. The blue (and smallest) rectangle on the figure represents the targeted output for the fault revealing mutant selection problem.

Fig. 1

Fault revealing mutant selection. Contrast between sufficient mutant set selection and fault revealing mutant selection. Sufficient mutant set selection aims at selecting a minimal subset of mutants that is killed by tests that also kill the whole set of mutants. Fault revealing mutant selection aims at selecting a minimal subset of mutants that is killed by tests that reveal the same underlying faults as the tests that kill the whole set of mutants

Previous research (Papadakis et al. 2018b, c) has shown that the majority of the mutants, even in the best case, are “irrelevant” to the sought faults. This means that testers need to analyse a large number of mutants before they can find the actually useful ones (the fault revealing ones), wasting time and effort. According to our data, only 17% of the minimal mutants (ideal mutant reduction), i.e., subsuming mutants (a set of mutants with minimal overlap that are sufficient for preserving test effectiveness Jia and Harman 2009; Kintis et al. 2010; Ammann et al. 2014), are fault revealing. We therefore claim that mutation testing should be performed only with the mutants that are most likely to be fault revealing. This makes possible a best-effort application of the method.

Formally, we consider two aspects of this selection problem: the mutant selection one and the mutant prioritization one.

The fault revealing mutant selection problem is defined as:

Given:

A set of mutants M for program P.

Problem:

Subset selection. Select a subset of mutants \(S \subseteq M\), such that \(F(S) = F(M)\) and \((\forall m \in S)\ F(S \setminus \{m\}) \ne F(M)\).

S represents a subset of M; F(X) represents the number of faults in P that are revealed by the test suites that kill all the mutants of the set X. In practice, the challenge is to approximate well S, statically and prior to any test execution, by finding a relatively good trade-off between the number of selected mutants (to minimise) and the number of faults revealed by their killing (to maximize).
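
To make the definition concrete, the following minimal Python sketch (synthetic data; all names are illustrative and the suite construction is a deliberate simplification of the definition above) evaluates F(X) from boolean kill and fault revelation matrices:

```python
import numpy as np

# Illustrative matrices: kills[t, m] = test t kills mutant m;
# reveals[t, f] = test t reveals fault f.
rng = np.random.default_rng(0)
kills = rng.random((50, 200)) < 0.2    # 50 tests, 200 mutants
reveals = rng.random((50, 3)) < 0.05   # 50 tests, 3 faults

def F(X):
    """Faults revealed by a test suite that kills all (killable) mutants
    of X; here the suite takes, per mutant, the first test killing it."""
    suite = {int(np.flatnonzero(kills[:, m])[0])
             for m in X if kills[:, m].any()}
    return int(reveals[sorted(suite)].any(axis=0).sum()) if suite else 0

S = list(range(20))
print(F(S), F(range(200)))  # compare a candidate subset against F(M)
```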

Similarly, the fault revealing mutant prioritization problem is defined as:

Given:

A set of mutants M, and the set of permutations of M, PM, for program P.

Problem:

Find \(Pm^{\prime} \in PM\) such that \(\forall Pm^{\prime\prime} \in PM,\ Pm^{\prime\prime} \ne Pm^{\prime} : f(Pm^{\prime}) \geq f(Pm^{\prime\prime})\).

PM represents the set of all possible mutant orderings of M, and f(X) represents the average percentage of faults revealed by the test cases that kill the selected mutants in the given order X (it measures the area under the curve of the faults revealed by the killing of each of the mutants in the order). The challenge is to rank the mutants, statically and prior to any test execution, so that the fault revealing potential is maximized when killing any (arbitrary) number of them. The idea is that fault revelation is maximized whenever the tester decides to stop killing mutants.

2.3 Mutant Selection

In the literature, many mutant selection methods have been proposed (Papadakis et al. 2018a; Ferrari et al. 2018) that restrict the considered mutants according to their types, i.e., apply one or more mutation operators. Empirical studies (Kurtz et al. 2016; Deng et al. 2013) have shown that the most successful strategies are statement deletion (Deng et al. 2013) and the E-Selective mutant set (Offutt et al. 1993, 1996a). We therefore compare our approach with these methods. We also consider random mutant selection (T Acree et al. 1979), since there is evidence demonstrating that it is particularly effective (Zhang et al. 2010; Papadakis and Malevris 2010b).

2.3.1 Random Mutant Selection

Random mutant sampling (T Acree et al. 1979) forms the simplest mutant selection technique, which can be considered as a natural baseline method. Interestingly, previous studies found it particularly effective (Zhang et al. 2010; Papadakis and Malevris 2010b). Therefore, we compare with it.

We use two random selection techniques, named SpreadRandom and DummyRandom. SpreadRandom iteratively goes through all program statements (in random order) and selects one mutant among the mutants of each statement, while DummyRandom selects mutants from the set of all possible mutants. The first approach is expected to select mutants residing on most of the program statements, while the second one is expected to make a uniform selection.
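
The following sketch (illustrative names and data structures; not the evaluation harness itself) shows how the two baselines differ:

```python
import random

def dummy_random(mutants, k, seed=0):
    """DummyRandom: uniform sample of k mutants from the full set."""
    rnd = random.Random(seed)
    return rnd.sample(mutants, min(k, len(mutants)))

def spread_random(mutants_by_stmt, k, seed=0):
    """SpreadRandom: visit statements in random order, taking one mutant
    per statement per round, until k mutants are selected."""
    rnd = random.Random(seed)
    pools = {s: list(ms) for s, ms in mutants_by_stmt.items() if ms}
    for ms in pools.values():
        rnd.shuffle(ms)
    selected = []
    while pools and len(selected) < k:
        stmts = list(pools)
        rnd.shuffle(stmts)
        for s in stmts:
            if len(selected) == k:
                break
            selected.append(pools[s].pop())
            if not pools[s]:
                del pools[s]   # statement exhausted
    return selected
```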

2.3.2 Statement Deletion Mutant Selection

Mutant selection based on statement deletion is a simple approach that, as the name suggests, deletes every program statement (one at a time). To avoid introducing compilation issues (mutants that do not compile) and to introduce relatively strong mutants, statement deletion is usually applied on parts of a statement (deleting parts of expressions, e.g., the expression a + b becomes a or b). Empirical studies have shown that statement deletion mutant selection is powerful (it achieves a very good trade-off between the number of selected mutants and test effectiveness) and has the advantage of introducing few equivalent mutants (Deng et al. 2013).

2.3.3 E-selective Mutant Selection

E-Selective refers to the 5-operator mutant set introduced by Offutt et al. (1993, 1996a). This is the most popular operator set (Papadakis et al. 2018a) and is included in most modern mutation testing tools. The set includes mutants related to relational, logical (including conditional), arithmetic, unary and absolute-value mutations. According to the study of Offutt et al. (1996a), this set has the same strength as a much larger, comprehensive set of operators. Although there is empirical evidence demonstrating that the E-Selective set is weaker than a more comprehensive set of operators (Kurtz et al. 2016), it still provides a very good trade-off between the number of selected mutants and strength (Kurtz et al. 2016).

2.4 Mutant Prioritization

Mutant prioritization has received little or even no attention in the literature (refer to the Related Work, Section 8, for details). Given the absence of other methods, we compare our approach with the random baselines. We also consider alternative schemes, such as defect prediction prioritization.

2.4.1 Random Mutant Prioritization

Random mutant prioritization forms a natural baseline for our approach. Comparing with random orderings is a common practice in test case prioritization studies (Rothermel et al. 2001; Henard et al. 2016) and shows the ability of a prioritization method to systematically order the sought elements. Similarly to mutant selection, we applied two random ordering techniques, SpreadRandom and DummyRandom. SpreadRandom orders mutants by iteratively going through all program statements (in random order) and selecting one mutant among the mutants of each statement (statement-based orders), while DummyRandom orders them by sampling uniformly from the mutant set (uniform orders).

2.4.2 Defect Prediction Mutant Prioritization

Naturally, one of the main attributes determining the utility of mutants is their location. Thus, instead of selecting mutants based on other properties, one could select them based on their location alone. To this end, we form a prioritization method that predicts and orders the error-prone code locations, i.e., the code parts that are most likely to be faulty. We then mutate the predicted code areas and form a baseline method. Such an approach is in a sense equivalent to applying mutation testing to the results of defect prediction. Moreover, such a comparison demonstrates that mutant utility depends on the attributes (features) we train on, and not solely on mutant location.

3 Approach

Our objective is to select mutants that lead to effective test cases. In view of this, we aim at selecting and prioritizing mutants so that we reveal most of the faults by analysing the smallest possible number of mutants.

We conjecture that mutant selection strategies should account for the properties that make mutants killable and fault revealing. Defect prediction studies (Menzies et al. 2007; Kamei and Shihab 2016) investigated properties related to error-prone code locations, but not properties related to mutants. Mutation testing is a behaviour-oriented criterion and requires mutants that introduce small and useful semantic deviations. Therefore, we propose building a model that captures the essential properties that make mutants valuable (in terms of their utility to reveal faults).

Figure 2 depicts the FaRM approach, which learns to rank mutants according to their fault revealing potential (likelihood to reveal (unknown) faults). Initially, FaRM applies supervised learning on the mutants generated from a corpus of faulty program versions, and builds a prediction model. This model is then used to predict the mutants that should be used to test the particular instance of the program under test. This means that at the time of testing and prior to any mutant execution, testers can use and focus only on the most important mutants.

Fig. 2

Overview of the FaRM approach. Initially, FaRM applies supervised learning on the mutants generated from a corpus of faulty program versions, and builds a prediction model that learns the fault revealing mutant characteristics. This model is then used to predict the mutants that should be used to test other program versions. This means that at the time of testing and prior to any mutant execution, testers can use and focus only on the most important mutants

During FaRM's supervised learning (training) phase, when the prediction model is built, the features of the faulty programs' mutants are extracted and used as the training data features, while the mutants' utilities are computed and used as the training data's expected output. The mutant utility used for fault revealing and killable mutant prediction is, respectively, the mutants' fault revelation and killability information. During the validation phase, the features of the mutants of the program under test are extracted and used to predict the mutants' utilities with the trained model. Mutants with high predicted utility are the useful ones.

Definition 1

For a given problem, we define a classifier's performance as its prediction performance, i.e., the accuracy of its predictions (the precision, recall, F-measure and Area Under Curve metrics detailed in Section 5.3) for the given problem.

ML-based measurement of mutant utility.

The selection process in FaRM is based on training a predictor that assesses the probability of a mutant to reveal faults. To that end, we explore the capability of several features designed to reflect specific code properties that may discriminate a useful mutant from another. Let us consider a mutant M associated with a code statement SM on which the mutation was applied. This mutant can be characterized from various perspectives with respect to (1) the complexity of the mutated statement, (2) the position of the mutated code in the control-flow graph, (3) dependencies with other mutants, and (4) the nature of the code block where SM is located.

ML features for characterizing mutants.

Recently, the studies of Wen et al. (2018), Just et al. (2017), and Petrovic and Ivankovic (2018) found a strong connection between mutants' utility and the surrounding code (captured by the AST father and child nodes). Therefore, in addition to the mutant types typically considered by selective mutation approaches (Offutt et al. 1996a; Namin et al. 2008; Papadakis et al. 2018a), we also consider the information encoded in the program AST. We include three such features in our machine learning classification scheme: the data type at the mutant location, the parent AST node of the mutated expression, and the child AST node of the mutated expression.

Let BM be the control-flow graph (CFG) basic block associated with the mutated statement SM containing the mutated expression EM. Table 1 provides the list of all 28 features that we extract from each mutant. The features named TypeAstParent, TypeMutant, TypeStmtBB, AstParentMutantType, OutDataDepMutantType, InDataDepMutantType, OutCtrlDepMutantType, InCtrlDepMutantType, DataTypesOfOperands and DataTypesOfValue are categorical; we represent them using one-hot encoding. All other features are numerical; their values are normalized between 0 and 1 using feature scaling, more precisely min-max normalization.
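
As an illustration of this encoding step, here is a minimal scikit-learn sketch (made-up feature rows; column names follow Table 1, values are hypothetical) combining one-hot encoding for categorical features with min-max scaling for numerical ones:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Illustrative mutant feature rows (three mutants).
X = pd.DataFrame({
    "TypeMutant": ["()-- -> ()++", "a+b -> a-b", "a+b -> a"],
    "TypeStmtBB": ["WhileCond", "Plain", "IfCond"],
    "Complexity": [72, 10, 10],
    "CfgDepth":   [1, 3, 2],
})
encoder = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     ["TypeMutant", "TypeStmtBB"]),                       # categorical
    ("minmax", MinMaxScaler(), ["Complexity", "CfgDepth"]),  # numerical
])
features = encoder.fit_transform(X)  # matrix fed to the classifier
```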

Table 1 Description of the static code features

A demonstrating example of how mutant features are computed is given in the following subsection (Section 3.2). After extracting the feature values, we feed them to a machine learning classification algorithm along with the killability and fault revelation information of each mutant for a set of faults. The training process then produces two classifiers (one for the equivalent and one for the fault revealing mutants) which, given the feature values of a mutant, compute the utility probabilities for this mutant, i.e., its probability of being killable and its probability of being fault revealing.

By using these two classifiers we form three approaches, two of them using each classifier alone and one combining them. The first two, named FaRM and PredKillable, classify mutants according to their probability of being fault revealing and killable, respectively. The third one, named FaRM*, divides the mutant set into two subsets, likely killable and likely equivalent (based on the PredKillable predictions), separately ranks each subset according to the fault revealing probability, and concatenates them, putting the likely killable subset first. Figure 3 shows an example of mutant ranking by FaRM*. The motivation for FaRM* stems from the hypothesis that equivalent mutants could be noise to FaRM, while PredKillable performs better at filtering out equivalent mutants (or predicting killable mutants). Given that fault revealing mutants are killable, we expect them to have a high predicted utility value with both FaRM and PredKillable. Therefore, FaRM* gives priority to the most likely fault revealing mutants that are also most likely killable.

Fig. 3

Example of the mutant ranking procedure of FaRM*. The ranking is a concatenation of the ranked predicted killable mutants and the ranked predicted equivalent mutants

We implement a prioritization scheme by ranking all mutants according to the values of the developed probability measure. This forms our mutant prioritization approach. Our mutant selection strategy sets a threshold probability value (e.g., 0.5), or a cut-off point according to the number of top-ranked mutants, and keeps only the mutants with higher utility probability scores in the selected set. This forms our mutant selection approach. For the combined approach (FaRM*), we divide the mutant set into the killable and equivalent subsets by using a cut-off point of 0.5.
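
The following minimal sketch (illustrative names; the probability maps are assumed to come from the two classifiers) captures the FaRM* ranking and the threshold-based selection just described:

```python
def farm_star_rank(mutants, p_kill, p_reveal, cutoff=0.5):
    """FaRM* ranking: split on predicted killability at the cut-off, rank
    each subset by fault revealing probability, likely-killable first."""
    killable = [m for m in mutants if p_kill[m] >= cutoff]
    likely_equiv = [m for m in mutants if p_kill[m] < cutoff]
    by_reveal = lambda m: -p_reveal[m]
    return sorted(killable, key=by_reveal) + sorted(likely_equiv, key=by_reveal)

def select(mutants, p_reveal, threshold=0.5):
    """FaRM selection: keep only mutants whose predicted fault revealing
    probability exceeds the threshold."""
    return [m for m in mutants if p_reveal[m] >= threshold]
```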

3.1 Implementation

We implemented FaRM as a collection of tools in C++. We leverage stochastic gradient boosting (Friedman 2002) of decision trees to perform supervised learning. Gradient boosting is a powerful ensemble learning technique that combines several trained weak models to perform classification. Unlike common ensemble techniques, such as random forests (Breiman 2001), which simply average the models in the ensemble, boosting methods follow a constructive strategy of ensemble formation where models are added to the ensemble sequentially. At each iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far (Natekin and Knoll 2013). We use the FastBDT (Keck 2016) implementation, setting the number of trees to 1,000 and the tree depth to 5.
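
For illustration, here is a rough scikit-learn equivalent of this training setup, used here as a stand-in for FastBDT (synthetic features and labels; the real pipeline feeds the encoded features of Table 1):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
features = rng.random((500, 28))      # 28 features, as in Table 1
labels = np.zeros(500, dtype=int)
labels[:25] = 1                       # a small minority class, as with
                                      # fault revealing mutants

# subsample < 1 makes the boosting stochastic (Friedman 2002).
clf = GradientBoostingClassifier(n_estimators=1000, max_depth=5,
                                 subsample=0.5)
clf.fit(features, labels)
scores = clf.predict_proba(features)[:, 1]   # predicted mutant utility
```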

3.2 Demonstrating Example

Here we provide an example of how the features of Table 1 are computed. We consider the program in Fig. 4 (extracted from the Codeflaws benchmark, ID: 598-B-bug-17392756-17392766), on which mutation is applied. We present the feature extraction for a mutant M, created by replacing the postfix decrement operator with the postfix increment operator on line 16 (m-- becomes m++). Figure 5a shows the mutant, Fig. 5b the abstract syntax tree (AST) of the mutated statement (the while condition) and Fig. 5c the control flow graph (CFG) of the function containing the mutated statement.

Fig. 4

Example program where mutation is applied. The C language comments on each line show the number of mutants generated on the line

Fig. 5

a An example of mutant M from the example program from Fig. 4, b the abstract syntax tree of the mutated statement and c the control flow graph of the function containing the mutated statement

The features, for mutant M, are computed as follows (a small sketch of the CFG-based features is given after the list):

  • The complexity feature value is the number of mutants generated on the statement containing the mutant M (Line 16). In this case 72 mutants. Thus, the complexity is 72.

  • The CfgDepth feature value is the minimum number of basic blocks to follow, along the CFG, from main function’s entry point to the basic block containing M (BB2). In this case 1 basic block as shown in Fig. 5c. Thus, the CfgDepth is 1.

  • The CfgPredNum feature value is the number of basic blocks directly preceding the basic block containing M (BB2) on the control flow graph. In Fig. 5c there are 2 basic blocks (BB1 and BB3). Thus, the CfgPredNum is 2.

  • The CfgSuccNum feature value is the number of basic blocks directly following the basic block containing M (BB2) on the control flow graph. In Fig. 5c there are 2 basic blocks (BB3 and BB4). Thus, the CfgSuccNum is 2.

  • The AstNumParents feature value is the number of AST parents of the mutated expression. In this case, the only AST parent is the relational expression, in Fig. 5b, whose sub-tree is rooted on the greater than sign (>). Thus the feature value is 1.

  • The NumOutDataDeps feature value is the number of mutants on expressions that are data dependent on the mutated expression. In this case, looking at Fig. 4, the value of variable m written in the mutated expression m-- is only used in the same expression. Thus the feature value is the number of mutants on the mutated expression m--.

  • The NumInDataDeps feature value is the number of mutants on expressions on which the mutated expression is data dependent. In this case, looking at Fig. 4, the value of variable m used in the mutated expression m-- is written either in the scanf statement at line 15 or in the same expression. Thus the feature value is the sum of the number of mutants on the statement at line 15 and the number of mutants on the mutated expression m--.

  • The NumOutCtrlDeps feature value is the number of mutants on statements that are control dependent on the mutated expression. In this case, looking at Fig. 4, no statement is control dependent on the mutated expression m--. Thus the feature value is 0.

  • The NumInCtrlDeps feature value is the number of mutants on expressions on which the mutated statement is control dependent. In this case, looking at Fig. 4, no expression controls the mutated expression. Thus the feature value is 0.

  • The NumTieDeps feature value is the number of mutants on the postfix decrement expression (the mutated expression).

  • The AstParentsNumOutDataDeps feature value is the number of mutants on expressions that are data dependent on the AST parent of the mutated expression. In this case, looking at Figs. 4 and 5b, the value of the relational expression (the AST parent of m--) is not used in other expressions. Thus the feature value is 0.

  • The AstParentsNumInDataDeps feature value is the number of mutants on expressions on which the AST parent of the mutated expression is data dependent. In this case, looking at Figs. 4 and 5b, the value of the relational expression (the AST parent of m--) only depends on the value of the expression m--. Thus the feature value is the number of mutants on the expression m--.

  • The AstParentsNumOutCtrlDeps feature value is the number of mutants on statements that are control dependent on the AST parent of the mutated expression. In this case, looking at Figs. 4 and 5b, all the statements in basic block BB3 are control dependent on the relational expression (the AST parent of m--). Thus the feature value is the sum of the numbers of mutants on lines 17, 18 and 19 of the code in Fig. 4.

  • The AstParentsNumInCtrlDeps feature value is the number of mutants on expressions on which the AST parent of the mutated expression is control dependent. In this case, looking at Figs. 4 and 5b, no expression controls the relational expression (the AST parent of the mutated expression m--). Thus the feature value is 0.

  • The AstParentsNumTieDeps feature value is the number of mutants on the relational expression, the AST parent of the mutated postfix decrement expression. The feature value here is the number of mutants on the relational expression of the greater-than operator.

  • The TypeAstParents feature value is the AST type of the AST parent expression of the mutated expression. Here, that is the AST type of the relational expression with the greater-than operator.

  • The TypeMutant feature value is the type of the mutant, a string representing the matched and replaced pattern. The feature value is “()-- → ()++”.

  • The TypeStmtBB feature value is the type of the basic block containing the mutated statement. The feature value here is the type of BB2 (see Fig. 5c), which is “While Condition”.

  • The AstParentMutantType feature value is the aggregation of the types of the mutants on the AST parents of the mutated expression. That is the aggregation of the mutant types of the relational expression whose sub-tree is rooted on the greater-than sign (>), as shown in Fig. 5b. The aggregation of a set of mutant types is performed by summing up the one-hot encoding vectors of the mutant types, allowing each mutant type to be represented in the encoding.

  • The OutDataDepMutantType feature value is the aggregation (as computed for AstParentMutantType) of the mutant types of the mutants counted to compute NumOutDataDeps.

  • The InDataDepMutantType feature value is the aggregation (as computed for AstParentMutantType) of the mutant types of the mutants counted to compute NumInDataDeps.

  • The OutCtrlDepMutantType feature value is the aggregation (as computed for AstParentMutantType) of the mutant types of the mutants counted to compute NumOutCtrlDeps.

  • The InCtrlDepMutantType feature value is the aggregation (as computed for AstParentMutantType) of the mutant types of the mutants counted to compute NumInCtrlDeps.

  • The AstChildHasIdentifier feature value is the Boolean value representing whether the mutated expression has an identifier as operand. In this case, the mutated expression has the identifier m as operand. Thus, the value of the feature is 1 (True).

  • The AstChildHasLiteral feature value is the Boolean value representing whether the mutated expression has a literal as operand. In this case, the mutated expression does not have a literal as operand. Thus, the value of the feature is 0 (False).

  • The AstChildHasOperator feature value is the Boolean value representing whether the mutated expression has an operator. In this case, the mutated expression has the postfix decrement operator --. Thus, the value of the feature is 1 (True).

  • The DataTypesOfOperands feature value is the datatype of the operand of the postfix decrement operation --. That is the datatype of m, which is “int”.

  • The DataTypeOfValue feature value is the datatype of the value of the mutated expression, which is “int”, the data type of m.
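
To make the CFG-based features concrete, here is a small sketch (toy CFG mirroring Fig. 5c; block names illustrative) computing CfgDepth, CfgPredNum and CfgSuccNum:

```python
from collections import deque

# Toy CFG of Fig. 5c as a successor map. CfgDepth = length of the
# shortest path (in basic blocks) from the entry block to the block
# containing the mutant.
cfg = {"BB1": ["BB2"], "BB2": ["BB3", "BB4"], "BB3": ["BB2"], "BB4": []}

def cfg_depth(cfg, entry, target):
    """Breadth-first search from the entry block."""
    depth, frontier, seen = 0, deque([entry]), {entry}
    while frontier:
        for _ in range(len(frontier)):
            bb = frontier.popleft()
            if bb == target:
                return depth
            for succ in cfg[bb]:
                if succ not in seen:
                    seen.add(succ)
                    frontier.append(succ)
        depth += 1
    return None

assert cfg_depth(cfg, "BB1", "BB2") == 1   # matches the example above
# CfgPredNum / CfgSuccNum fall out of the same map:
preds = sum(1 for bb, succs in cfg.items() if "BB2" in succs)  # 2 (BB1, BB3)
succs = len(cfg["BB2"])                                        # 2 (BB3, BB4)
```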

4 Research Questions

When building prediction methods, the first thing to investigate is their prediction ability. Thus, our first question can be stated as:

RQ1:

How well does our machine learning method predict the killable mutants?

Similarly, our second question can be stated as:

RQ2:

How well does our machine learning method predict the fault revealing mutants?

After demonstrating that our classification method predicts satisfactorily the fault revealing mutants, we continue by investigating its ability to practically support mutant selection with respect to the actual measure of interest, the revealed faults, and with respect to the random baseline techniques. Therefore, we investigate:

RQ3:

How do our methods compare against the random strategies with respect to the fault revealing mutant selection problem?

In addition to the random strategies, we also compare with the current state-of-the-art mutant selection methods. Thus, we ask:

RQ4:

How do our methods compare against the E-Selection and SDL with respect to the fault revealing mutant selection problem?

As already discussed, an alternative mutant cost reduction technique is mutant prioritization. Hence, we ask:

RQ5:

How do our methods compare against the random strategies with respect to the fault revealing mutant prioritization problem?

In addition to the random strategies, we also compare with the defect prediction mutant prioritization baseline. Therefore, we ask:

RQ6:

How do our methods compare against the defect prediction mutant prioritization method?

Finally, having demonstrated the benefits of our approach, we turn to investigating its generalization ability on larger and more complex programs. Therefore, we conclude by asking:

RQ7:

How well does our method generalise to independently selected programs that are much larger and more complex?

5 Experimental Setup

5.1 Benchmarks: Programs and Fault(s)

For the purposes of our study we need a large number of non-trivial programs that are accompanied by real faults. The fault set has to be large and of diverse types. Unfortunately, mutation testing is costly and its experimentation requires generating strong test suites (Titcheu Chekam et al. 2017). Therefore, a trade-off is necessary between the number of faults considered, the strength of the test suites used, and the size of the subject programs.

To account for these requirements, we used the Codeflaws benchmark (Tan et al. 2017). This benchmark consists of 7,436 programs (among which 3,902 are faulty) selected from the CodeforcesFootnote 1 online database of programming contests. These contests consist of three to five problems of varying difficulty levels. Every user submits programs that solve the posed problems. In total, the benchmark involves programs from 1,653 users “with diverse level of expertise” (Tan et al. 2017).

Every fault in this benchmark has two program instances: the rejected ‘faulty’ submission and the accepted ‘correct’ submission. Overall, the benchmark contains 3,902 faulty program versions covering 40 different defect classes. Note that every faulty program instance in our dataset is unique, meaning that every program we use differs from the others (in terms of implementation). To the best of our knowledge, this is the largest number of faults used in any mutation testing study. The size of the programs varies from 1 to 322 lines of code, with an average of 36. Applying mutation testing on Codeflaws yielded 3,213,543 mutants and required a total of 8,009 CPU days for all computations.

To strengthen our results and demonstrate the ability of our approach to handle faults made by actual developers, we also used the CoREBench benchmark (Böhme and Roychoudhury 2014). CoREBench includes real-world complex faults that have been systematically isolated from the history of open source C projects. These programs range from 9 to 83 KLoC and are accompanied by developer test suites. Note that every CoREBench fault forms a single fault instance (it differs from the other faults).

We used the available test suites augmented by KLEE (Cadar et al. 2008). Although these test suites greatly increased the cost of our experiment, we considered their use of vital importance as otherwise our results could be subject to “noise effects” (Titcheu Chekam et al. 2017).

Due to the very high cost of the experiments and technical difficulties in reproducing some faults, we conducted our analysis on 45 faults (22 in Coreutils, 12 in Find and 11 in Grep). Applying mutation testing on these 45 versions yielded 1,564,614 mutants and required a total of 454 CPU days of computation (not counting the test generation and machine learning computations and evaluations). Test generation resulted in a test pool composed of 122,261 and 22,477 test cases for Codeflaws and CoREBench, respectively.

The goal of our study is to evaluate the fault revealing ability of the mutants we select. However, approximately half of our faults are trivial (triggered by most of the test cases), and their inclusion in our analysis would artificially inflate our results. Thus, we restrict our analysis to the faults that are revealed by less than 25% of the test cases in our test suites. Using such a threshold is usual in fault injection studies (SiR 2018) and ensures that our focus is on faults that are hard enough to find. Practically, taking a lower threshold would significantly reduce the number of faults to be considered, hindering our ability to train, while taking a higher threshold would make all the approaches perform similarly, as the faults would be easy to reveal. Overall, we consider 1,692 of the 3,902 Codeflaws faults (the 1,692 non-trivial ones) and 45 faults from the CoREBench benchmark.

Figure 6 shows the distribution of number of problems by number of implementations for the considered faulty programs from Codeflaws. We observe that 85% of the problems have at most 3 implementations.

Fig. 6

Distribution of Codeflaws Benchmark problems by number of implementations

Although the Codeflaws benchmark faults were mined from programming contests, they are nevertheless relatively small syntactic mistakes. We observe in Fig. 7 that 82% of the faults are fixed by modifying a single line of source code. This ensures that we are compatible with the competent programmer hypothesisFootnote 2, which is one of the basic assumptions of mutation testing (DeMillo et al. 1978).

Fig. 7

Distribution of Codeflaws Benchmark faulty programs by number of lines of code changed to fix the fault

5.2 Automated Tools Used

We used KLEE (Cadar et al. 2008) to support test generation. We ran KLEE with a relatively large timeout of two hours per program, the Random Path search strategy, Randomize Fork enabled, Max Memory of 2048 MB, a Symbolic Array Size of 4096 elements, a Symbolic Standard Input size of 20 Bytes and a Max Instruction Time of 30 seconds. This resulted in 26,229 and 1,942 test cases for Codeflaws and CoREBench, respectively. Since the automatically generated test cases do not include any test oracle, we used the programs' fixed versions as oracles. We considered as failing every test case that resulted in different observable program output when executed on the ‘faulty’ version than on the ‘correct’ (fixed) one. Similarly, we used the program output to identify the killed mutants. We deemed a mutant as killed if it resulted in a different output than the original program.

We built a mutation testing toolFootnote 3 that operates on LLVM bitcode; all our metrics and analyses were performed on the LLVM bitcode. Our tool implements 18 operators, composed of 816 transformation rules. These include all those supported by modern mutation testing tools (Offutt et al. 1996a; Papadakis et al. 2018a; Coles et al. 2016) and are detailed in Table 2.

Table 2 Mutant types

Each mutation operation consists of matching an instruction type (the original instruction type) and replacing it with another instruction type (the mutated instruction type). Thus, a mutation operator is defined as a pair of original instruction type and mutated instruction type (a minimal sketch of this matching-and-replacement scheme is given after the list below). The instruction types are defined as follows (p refers to pointer values and s refers to scalar values):

  • ANY STMT refers to matching any type of statement (only original instruction type).

  • TRAPSTMT refers to a trap, which causes the program to abort its execution (only mutated instruction type).

  • DELSTMT refers to statement deletion, i.e., replacing by the empty statement which is equivalent to deleting the original statement (applies only on the mutated instruction type).

  • CALL STATEMENT refers to a function call.

  • SWITCH STATEMENT refers to a C language like switch statement.

  • SHUFFLEARGS can only be a mutated instruction type, used when the original instruction type is a function call. It refers to the same function call as the original but with arguments of the same type swapped (e.g., f(a,b) → f(b,a)).

  • SHUFFLECASESDEST can only be used as a mutated instruction type, when the original instruction type is a switch statement. It refers to the same switch statement as the original but with the basic blocks of the cases swapped (e.g., {case a: B1; case b: B2; default: B3;} → {case a: B2; case b: B1; default: B3;}).

  • REMOVECASES can only be used as a mutated instruction type, when the original instruction type is a switch statement. It refers to the same switch statement as the original but with some cases deleted (the corresponding values will lead to executing the default basic block), e.g., {case a: B1; case b: B2; default: B3;} → {case a: B1; default: B3;}.

  • SCALAR.ATOM refers to any non pointer type variable or constant (only original instruction type).

  • POINTER.ATOM refers to any pointer type variable or constant (only original instruction type).

  • SCALAR.UNARY refers to any non-pointer unary arithmetic or logical operation (e.g., abs(s), -s, !s, s++ ...).

  • POINTER.UNARY refers to any pointer unary arithmetic operation (e.g., p++, --p ...).

  • SCALAR.BINARY refers to any non-pointer binary arithmetic, relational or logical operation (e.g., s1 + s2, s1 && s2, s1 >> s2, s1 <= s2 ...).

  • POINTER.BINARY refers to any pointer binary arithmetic or relational operation (e.g., p + s, p1 > p2 ...).

  • DEREFERENCE.UNARY refers to any combination of pointer dereference and scalar unary arithmetic operation, or combination of pointer unary operation and pointer dereference (e.g., (*p)--, *(p--) ...).

  • DEREFERENCE.BINARY refers to any combination of pointer dereference and scalar binary arithmetic operation, or combination of pointer binary operation and pointer dereference (e.g., (*p) + s, *(p + s) ...).
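
For illustration, the toy sketch below applies such original-to-mutated replacement rules on a token list; our actual tool matches LLVM instruction types rather than source tokens, so the rule table and token representation here are purely illustrative:

```python
# Toy operator table: (instruction type, original) -> replacements.
OPERATORS = {
    ("SCALAR.BINARY", "+"): ["-", "*", "/"],   # s1+s2 -> s1-s2, ...
    ("SCALAR.UNARY", "--"): ["++"],            # s-- -> s++
}

def mutate(tokens):
    """Yield (mutant_tokens, rule) pairs for every matching position."""
    for i, tok in enumerate(tokens):
        for (kind, orig), replacements in OPERATORS.items():
            if tok == orig:
                for rep in replacements:
                    yield tokens[:i] + [rep] + tokens[i + 1:], f"{orig} -> {rep}"

for mutant, rule in mutate(["m", "--"]):
    print(" ".join(mutant), "|", rule)   # prints: m ++ | -- -> ++
```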

Applying mutation testing on Codeflaws and CoREBench yielded 3,213,543 and 1,564,614 mutants, respectively.

To reduce the influence of redundant and equivalent mutants, we applied TCE (Papadakis et al. 2015; Hariri et al. 2016; Kintis et al. 2018). Since we operate on LLVM bitcode, we compared the optimized mutated LLVM code using the llvm-diff utility, a tool like the well-known Unix diff utility but for LLVM bitcode. TCE detected 1,457,512 and 715,996 mutant equivalences on Codeflaws and CoREBench. Note that the equivalent and redundant mutants detected by TCE are removed from the mutant set and are neither executed nor considered in the experiments.
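
A minimal sketch of this TCE step, assuming the LLVM opt and llvm-diff binaries are on the PATH (the file paths and the -O3 optimization level are illustrative, not our exact configuration):

```python
import subprocess

def tce_equivalent(orig_bc, mutant_bc):
    """Optimize both modules, then compare them with llvm-diff; a zero
    exit status means llvm-diff reported no differences, so the mutant
    is likely TCE-equivalent (or a TCE duplicate of another mutant)."""
    for bc in (orig_bc, mutant_bc):
        subprocess.run(["opt", "-O3", bc, "-o", bc + ".opt"], check=True)
    res = subprocess.run(["llvm-diff", orig_bc + ".opt", mutant_bc + ".opt"],
                         capture_output=True)
    return res.returncode == 0
```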

The execution of the mutants surviving TCE resulted in killing 87% and 54% of the mutants for Codeflaws and CoREBench, respectively. It is important to note that our tool applies mutant test execution optimizations by recording the coverage and program state at the mutation points, avoiding the execution of mutants that do not infect the program state (Papadakis and Malevris 2010a). This optimization enables huge test execution reductions and forms the current state of the art in test execution optimization (Papadakis et al. 2018a). Despite these optimizations, our tool required a total of 8,009 and 454 CPU days of computation for Codeflaws and CoREBench, indicating the large amount of computational resources required to perform such an experiment.

5.3 Experimental Procedure

To answer our research questions we performed an experiment composed of three parts. The first part regards the prediction ability of our classification method, answering RQ1 and RQ2; the second regards the fault revealing ability of the approaches, answering RQ3-RQ6; and the third regards the fault revealing ability of our approach on large, independently selected programs, answering RQ7. To account for our use case scenario, in our experiments we always train and evaluate our approach on different sets of programs (Codeflaws) or program versions (CoREBench).

As a first step we used KLEE to generate test cases for all the programs we study and formed a pool of test cases by joining the generated and the available test cases. We then constructed a mutation-fault matrix, which records for every test case the mutants that it kills and whether it reveals the fault or not (we construct a matrix for every single fault we study). We also record the execution time needed to execute every mutant-test pair so that we can simulate the execution cost of the approaches. We make the data availableFootnote 4.

To measure fault revelation we mutated the faulty program versions. This is important in order to avoid making any assumption about the interaction of mutants and faults, aka the Clean Program Assumption (Titcheu Chekam et al. 2017). Based on this matrix, we compute the fault revealing ratio of each mutant, i.e., the ratio of the tests that kill the mutant and reveal the fault to the total number of tests that kill the mutant.
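
Concretely, the fault revealing ratio can be computed from the mutation-fault matrix as in the following sketch (synthetic data; names illustrative):

```python
import numpy as np

# kills[t, m]: test t kills mutant m; fails[t]: test t reveals the fault.
rng = np.random.default_rng(1)
kills = rng.random((100, 30)) < 0.3
fails = rng.random(100) < 0.1

killing = kills.sum(axis=0)                       # tests killing each mutant
revealing = (kills & fails[:, None]).sum(axis=0)  # ...that also reveal the fault
with np.errstate(invalid="ignore", divide="ignore"):
    ratio = np.where(killing > 0, revealing / killing, 0.0)
```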

First experimental part: The first task of prediction modeling is to evaluate the contribution of the used features. We computed the information gain values for each of the used features. Higher information gain values represent more informative features for decision trees. Demonstrating the importance of our features helps us understand which factors most affect the utility of mutants. Having measured information gain, we then measure the prediction ability of our classification method by evaluating its ability to predict killable and fault revealing mutants. For this part of the experiment, we considered as fault revealing the mutants that have a fault revealing ratio equal to 1. We relax this constraint in the second part of the experiment.

We evaluate the trained classifiers using four commonly adopted metrics: precision, recall, F-measure and Area Under Curve (AUC). The precision of a classifier is the fraction of truly relevant items among the items that the classifier predicted to be relevant. The recall of a classifier is the fraction of items predicted to be relevant by the classifier among all the truly relevant items. The F-measure (also called F1 score) is the weighted harmonic mean of precision and recall. The Area Under Curve (AUC) of a classifier is the area under the Receiver Operating Characteristic (ROC) curve (the ROC curve shows how many true positive classifications can be gained as more and more false positives are allowed) (Zheng 2015). Here, precision represents the ratio of killable and fault revealing mutants among those classified as such, while recall represents the ratio of identified killable and fault revealing mutants among all existing ones. In classification, recall and precision are usually competing metrics, in the sense that higher values of one imply lower values of the other. To better compare classifiers, researchers use the F-measure and AUC metrics, which measure the general classification accuracy of a classifier. Higher values denote better classification.
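
These metrics can be computed with scikit-learn as follows (synthetic labels and scores, purely for illustration):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score)

# y_true: ground-truth labels; y_score: classifier probabilities;
# y_pred: labels after applying a selection cut-off (here 0.5).
rng = np.random.default_rng(2)
y_true = (rng.random(1000) < 0.1).astype(int)
y_score = np.clip(y_true * 0.4 + rng.random(1000) * 0.6, 0, 1)
y_pred = (y_score >= 0.5).astype(int)

print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))
print("F-measure", f1_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_score))
```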

To reduce the risk of overfitting, we applied 10-fold cross validation by partitioning our program set into 10 parts and iteratively training on 9 parts and evaluating on the remaining one. We report the results over all the partitions.
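
A sketch of this program-level 10-fold split (synthetic data; scikit-learn's GroupKFold is used here so that mutants of the same program never span the train/test boundary, matching the partitioning by program set):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)
program_of_mutant = rng.integers(0, 100, size=5000)  # 100 programs
X = rng.random((5000, 28))
y = (rng.random(5000) < 0.02).astype(int)

for train_idx, test_idx in GroupKFold(n_splits=10).split(
        X, y, groups=program_of_mutant):
    pass  # train on X[train_idx], evaluate on X[test_idx]
```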

This experiment part was performed on the Codeflaws programs.

Second experimental part: Our analysis requires comparing mutation-based strategies with respect to the actual value of interest, the number of faults revealed. Given that killing a mutant does not always result in revealing a fault, we train the classifier on the actual fault revealing ratios (i.e., the ratio of tests that kill a mutant and also reveal faults).

We then select and prioritise our mutants. To evaluate and compare the studied approaches with respect to fault revelation, we follow a typical procedure (Titcheu Chekam et al. 2017; Kurtz et al. 2016; Namin et al. 2008) by randomly selecting test cases, from the formed test pools, that kill the selected mutants. If none of the available test cases in our test pool kills a mutant, we treat it as equivalent. We repeat this process for each of the studied approaches. As in the first part of the experiment, we report results using 10-fold cross validation.

For the mutant selection problem, we randomly pick a mutant and then randomly pick a test case that kills it. We then remove all the killed mutants and pick another one. If the mutant is not killed by any of the test cases in our test pool, we treat it as equivalent. We repeat this process 100 times and compute the probability of revealing each of the faults.

For the mutant prioritisation case, we follow the mutant order, picking test cases that kill each mutant. We do not attempt to kill a mutant twice. Again, we repeat this process 100 times and compute the Average Percentage of Faults Detected (APFD), a typical metric used in test case prioritization studies (Henard et al. 2016). Again, we align the compared approaches with respect to their cost (the number of mutants needing manual analysis) and compare their effectiveness.
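
For reference, here is one common APFD formulation adapted to mutant orders (illustrative encoding: `revealed_by[m]` holds the faults revealed by the test(s) picked to kill mutant m):

```python
def apfd(order, revealed_by, n_faults):
    """APFD = 1 - (sum of first-detection positions)/(n*m) + 1/(2n)."""
    n = len(order)
    first = {}
    for pos, m in enumerate(order, start=1):
        for f in revealed_by.get(m, ()):
            first.setdefault(f, pos)
    # Faults never revealed are conventionally charged position n + 1.
    tf = [first.get(f, n + 1) for f in range(n_faults)]
    return 1 - sum(tf) / (n * n_faults) + 1 / (2 * n)

print(apfd(["m1", "m2", "m3"], {"m2": {0}, "m1": {1}}, n_faults=2))  # 0.667
```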

To account for coincidental results and the stochastic selection of test cases and mutants, we used the Wilcoxon test, a non-parametric test, to determine whether the Null Hypothesis (that there is no difference between the studied methods) can be rejected. If the Null Hypothesis is rejected, we have evidence that our approach outperforms the others. Even when the null hypothesis does not hold, the size of the differences might be small. To account for this effect, we also measured the Vargha-Delaney effect size \(\hat{\text{A}}_{12}\) (Vargha and Delaney 2000), which quantifies the size of the differences (aka statistical effect size). \(\hat{\text{A}}_{12} = 0.5\) suggests that the data of the two samples tend to be the same, \(\hat{\text{A}}_{12} > 0.5\) indicates that the first dataset has higher values, and \(\hat{\text{A}}_{12} < 0.5\) indicates the opposite.
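
Both statistics are straightforward to compute; the sketch below uses SciPy's rank-sum test for unpaired samples (the paired signed-rank variant, scipy.stats.wilcoxon, applies instead when measurements are matched per fault; the sample values are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def a12(x, y):
    """Vargha-Delaney effect size: probability that a value drawn from x
    exceeds one drawn from y (ties count half)."""
    x, y = np.asarray(x), np.asarray(y)
    greater = (x[:, None] > y[None, :]).sum()
    ties = (x[:, None] == y[None, :]).sum()
    return (greater + 0.5 * ties) / (len(x) * len(y))

x = [0.9, 0.8, 0.85, 0.95]   # e.g., fault revelation of method A
y = [0.7, 0.75, 0.8, 0.6]    # e.g., fault revelation of method B
print("A12 =", a12(x, y))
print("p   =", mannwhitneyu(x, y, alternative="two-sided").pvalue)
```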

This experiment part was performed on the Codeflaws programs.

Third experimental part: To further evaluate the fault revealing ability of our approach, we applied it on the CoREBench programs, again adopting 10-fold cross validation as in the Codeflaws experiments. We report results related to both fault revelation and APFD values. The CoREBench corpus is small in size, and hence FaRM might not perform particularly well. However, if the signal of our features is strong, the benefits of our method should be visible even with these few data.

5.4 Mutant Selection and Effort Metrics

When comparing methods, a comparison basis is required. In our case we measure fault revelation and effort. While measuring fault revelation based on the fault set we use is direct, measuring effort/cost is hard. Effort/cost depends on a large number of uncontrolled parameters, such as the followed procedure, the level of automation, skills, the underlying infrastructure and the learning curve. Therefore, we have to account for different scenarios, and we adopt three frequently used metrics: the number of selected mutants, the number of test cases generated, and the number of mutants requiring analysis.

The first metric (selected mutants) represents the number of mutants that one should use when applying mutation testing. This is a direct and intuitive metric, as it suggests that developers should select a particular set of mutants to generate (form actual executable code), execute and analyse. Although such a metric conforms to our working scenario, it does not capture the required test generation effort. Generating test cases is mostly a manual task (due to the test oracle problem), and so we also consider a second metric: the number of test cases that can be generated based on a selected set of mutants.

We also adopt a third metric: the number of mutants that need to be analysed (the equivalent mutants plus those we pick, i.e., those analysed in order to generate test cases). This metric reflects the effort a tester needs to invest in order to kill, or identify as equivalent, the selected mutants (under the assumption that equivalent mutants require the same effort as test generation).

To fairly compare with the random selection methods, we select mutants until we analyse the same number of mutants as analysed by our selection method. This establishes a fixed cost point for all the approaches, allowing us to compare their effectiveness.

There are other cost factors, such as the mutant-test execution cost and the analysis of equivalent mutants (for the first two metrics), which we investigate separately, as we would like to see whether our approaches are also faster to execute and require reasonably fewer equivalent mutants.

6 Results

6.1 Assessment of Killable Mutant Prediction (RQ1)

To check the prediction performance of our classifier we performed 10-fold cross-validation for three different selected sets, obtained by applying PredKillable to predict killable mutants and selecting the top-ranked 5%, 10% and 20% of the mutants. When selecting 5% of the mutants, the PredKillable classifier achieves 98.8% precision, 5.7% recall and 10.7% F-measure. For the 10% and 20% sets, it achieves 98.8% and 98.7% precision, 11.4% and 22.8% recall, and 20.4% and 37.0% F-measure, respectively. These values are higher than those obtained by randomly sampling the same number of mutants: PredKillable has 12.3%, 12.2% and 12.1% higher precision, and 0.7%, 1.4% and 2.8% higher recall, for the 5%, 10% and 20% sets, respectively.

When using PredKillable to predict non-killable mutants, the classifier achieves 95.1%, 35.0% and 51.2% precision, recall and F-measure when selecting 5% of the mutants. With respect to the 10% and 20% sets of mutants, it achieves 79.1% and 49.3% (precision), 58.6% and 73.2% (recall), and 67.3% and 58.9% (F-measure). These values are higher than those that one can get by randomly sampling the same number of mutants. In particular, PredKillable has 81.6%, 65.7% and 35.8% higher precision, and 30.1%, 48.7% and 53.3% higher recall, for the 5%, 10% and 20% sets of mutants.

Training our models required approximately 48 CPU hours, while performing the evaluation (i.e., the mutant selection) required less than a second. Since training needs to happen only occasionally, the training time is acceptable, and the cost of selecting and prioritizing mutants is practically negligible.

The receiver operating characteristic (ROC) shown in Fig. 8 further illustrates the performance variations of the classifier in terms of true positive and false positive rates when the discrimination threshold changes: the higher the area under the curve (AUC), the better the classifier. Our classifier achieves an AUC of 88%. These results establish that the code properties leveraged as features for characterizing mutants together provide good discriminative power for predicting the killability of mutants.
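
The AUC can be computed directly from the held-out labels and predicted probabilities; a sketch, again using scikit-learn as an assumed implementation:

```python
from sklearn.metrics import roc_auc_score, roc_curve

def roc_summary(y_true, prob):
    """AUC plus the ROC curve points (as plotted in Fig. 8), computed from a
    fold's held-out killability labels and predicted probabilities."""
    fpr, tpr, _ = roc_curve(y_true, prob)
    return roc_auc_score(y_true, prob), fpr, tpr
```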

Fig. 8 Receiver Operating Characteristic for Killable Mutants Prediction on Codeflaws

6.2 Assessment of Fault Revelation Prediction

ML prediction performance

Similarly to Section 6.1, we performed a 10-fold cross-validation for three different selected sets in order to check the prediction performance of our classifier. These were the results of applying FaRM and selecting the top-ranked 5%, 10% and 20% of the mutants. The FaRM classifier achieves 5.7%, 12.8% and 7.8% precision, recall and F-measure when selecting 5% of the mutants. With respect to the 10% and 20% sets of mutants, it achieves 4.9% and 3.9% (precision), 22.0% and 35.1% (recall), and 8.0% and 7.0% (F-measure). These values are higher than those that one can get by randomly sampling the same number of mutants. In particular, FaRM has 3.5%, 2.7% and 1.7% higher precision, and 7.8%, 12.1% and 15.1% higher recall, for the 5%, 10% and 20% sets of mutants.

The costs of training and evaluation are the same as those reported in Section 6.1.

The receiver operating characteristic (ROC) shown in Fig. 9 further illustrates the performance variations of the classifier in terms of true positive and false positive rates when the discrimination threshold changes: the higher the area under the curve (AUC), the better the classifier. Our classifier achieves an AUC of 62%.

Fig. 9 Receiver Operating Characteristic for Fault Revealing Mutants Prediction on Codeflaws

We believe that such a result is encouraging given the nature of developer mistakes. Developers make mistakes in a non-systematic way: for the same problem, some may make mistakes while others may not. The best we can hope for is therefore to form good heuristics, i.e., to identify mutants that maximize the chances of revealing faults, and it is hard to get much higher AUC values. Nevertheless, we expect future research to build on and improve our results by forming better predictors.

Overall, the above results demonstrate that the code properties that were leveraged as features for characterizing mutants together provide discriminative power for assessing the fault revealing potential of mutants.

Considered features

We provide in Fig. 10 the distribution of information gain values for the various features considered in this work. Information gain (IG) measures how much “information” a feature gives us about the class we want to predict; the IG values are computed by the supervised learning algorithm during the training process. These data enable the assessment of the potential contribution of every feature to a prediction model. The training process provides evidence (Fig. 10) that the suggested features (in bold) contribute significantly less than several other features that we have designed for FaRM. Interestingly, together with complexity, the features related to control and data dependencies are the most informative ones. Note that IG values do not suggest which features to select and which to discard; in fact, our results show that we need all the features.
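
As an illustration of the metric (not of the exact computation performed inside the learner), the information gain of one feature with respect to a binary class can be sketched as:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, bins=10):
    """IG = H(class) - H(class | feature); a continuous feature is binned."""
    edges = np.histogram_bin_edges(feature, bins=bins)
    binned = np.digitize(feature, edges)
    h_cond = sum((binned == b).mean() * entropy(labels[binned == b])
                 for b in np.unique(binned))
    return entropy(labels) - h_cond
```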

Fig. 10 Information Gain distributions of ML features on Codeflaws

6.3 Mutant Selection

6.3.1 Comparison with Random (RQ3)

Figure 11 shows the distribution of the fault revelation of the mutant selection strategies when selecting the top-ranked 2%, 5% and 10% of the mutants. As can be seen from the plot, both FaRM* and FaRM outperform both DummyRandom and SpreadRandom, and both DummyRandom and SpreadRandom outperform PredKillable. When selecting 2% of the mutants, the difference in median values, for both FaRM and FaRM*, is 22% and 24% over DummyRandom and SpreadRandom respectively. This difference increases when selecting 5% of the mutants, reaching 34% and 34% for FaRM, and 24% and 24% for FaRM*. When selecting 10% of the mutants, the difference becomes 20% and 17% for both FaRM and FaRM*. Regarding PredKillable, the difference with DummyRandom and SpreadRandom at the 2% mutant selection threshold is 23% and 21% respectively; this difference increases at the 5% threshold to 37% and 37%, and at the 10% threshold it is 43% and 46%.

Fig. 11 Fault revelation of the mutant selection strategies on Codeflaws. All three FaRM and FaRM* sets outperform the random baselines

To check whether the differences are statistically significant, we performed a Wilcoxon rank-sum test, a non-parametric test that measures whether the values of one sample tend to be higher than those of the second sample. We adopt a significance level of α < 0.01, below which we consider the differences as statistically significant. We also computed the Vargha-Delaney \(\hat {A}_{12}\) effect size between the approaches.
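
Both statistics are standard; a minimal sketch, with scipy providing the rank-sum test and \(\hat {A}_{12}\) computed from its definition:

```python
from scipy.stats import ranksums

def a12(x, y):
    """Vargha-Delaney effect size: probability that a value drawn from x
    exceeds one drawn from y, counting ties as one half."""
    gt = sum(1 for a in x for b in y if a > b)
    eq = sum(1 for a in x for b in y if a == b)
    return (gt + 0.5 * eq) / (len(x) * len(y))

def compare(sample_a, sample_b, alpha=0.01):
    _, p = ranksums(sample_a, sample_b)   # Wilcoxon rank-sum test
    return p < alpha, a12(sample_a, sample_b)
```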

The statistical test showed that FaRM and FaRM* outperform both DummyRandom and SpreadRandom with a statistically significant difference, and both DummyRandom and SpreadRandom outperform PredKillable with a statistically significant difference. As expected, the differences between DummyRandom and SpreadRandom are not significant. It is noted that all comparisons are aligned with respect to the number of mutants that need analysis, which, as we already explained, represents the manual effort involved. The Vargha-Delaney \(\hat {A}_{12}\) values show that for the 2% threshold, FaRM is better than DummyRandom and SpreadRandom in 60% and 63% of the cases respectively. These values are slightly higher for FaRM*, which is better than DummyRandom and SpreadRandom in 62% and 65% of the cases respectively. DummyRandom and SpreadRandom are respectively better than PredKillable in 84% and 82% of the cases. For the 5% threshold, FaRM is better than DummyRandom and SpreadRandom in 66% of the cases, while FaRM* is better than DummyRandom and SpreadRandom in 64% and 65% of the cases respectively; DummyRandom and SpreadRandom are respectively better than PredKillable in 88% and 84% of the cases. For the 10% threshold, FaRM is better than DummyRandom and SpreadRandom in 65% and 63% of the cases respectively, FaRM* is better in 64% and 61% of the cases respectively, and DummyRandom and SpreadRandom are respectively better than PredKillable in 87% and 85% of the cases.

Regarding the test execution time of the involved methods, our approach has a minor advantage. The median differences between FaRM and the random baselines were 12 seconds (DummyRandom) and 39 seconds (SpreadRandom) per program, i.e., FaRM required that much less execution time. While these differences are minor, they demonstrate that FaRM achieves its significantly higher fault revelation ability without introducing any major overhead.

Overall, our results suggest that FaRM and FaRM* significantly outperform the random baselines with practically significant differences, i.e., improvements on the ratios of revealed faults between 4% and 34%. PredKillable is outperformed by all the approaches.

6.3.2 Comparison with SDL & E-selective (RQ4)

This section aims to compare the fault revelation of our approach with that of the SDL and the E-Selective mutants sets.

To compare our approach with SDL selection, the selection size is set to the number of SDL mutants. In the Codeflaws subjects, SDL and E-Selective mutants represent, in the median, 2% and 38% of all mutants respectively, as seen in Fig. 12.

Fig. 12 Proportion of SDL and E-Selective mutants among all mutants for Codeflaws subjects

Our analysis is designed as follows. For each subject, the |SDL| top-ranked mutants of FaRM are selected (where |SDL| is the total number of SDL mutants). We also select |SDL| mutants with the random approaches. Then, the fault revelation of each approach's selected mutant set is computed for comparison and presented in Fig. 13. We observe that FaRM and FaRM* have respectively 30% and 27% higher median fault revelation than SDL, while PredKillable has 25% lower median fault revelation than SDL. We also observe that SDL has fault revelation similar to the random selections (respectively 3% and 2% lower than DummyRandom and SpreadRandom).

Fig. 13 Fault revelation of FaRM compared with SDL on Codeflaws. FaRM sets outperform the SDL selection. Approximately 2% (number of SDL mutants) of all the mutants are selected

We also performed the Wilcoxon rank-sum test as in Section 6.3. The statistical test showed that both FaRM and FaRM* outperform SDL, and SDL outperforms PredKillable. The difference between SDL and the random approaches (DummyRandom and SpreadRandom) is not statistically significant. We also computed the Vargha-Delaney \(\hat {A}_{12}\) values between the approaches and found that FaRM and FaRM* are respectively better than SDL in 54% and 55% of the cases, while SDL is better than PredKillable in 79% of the cases.

Similarly to the SDL experiment above, we performed another experiment to compare FaRM with the E-Selective selection. The fault revelation results are presented in Fig. 14. We observe that for a selection size equal to the number of E-Selective mutants, all selection approaches except PredKillable and DummyRandom achieve the highest median fault revelation. Given that the E-Selective mutants are roughly 38% of all the mutants, which is a relatively large set, we also make the comparison with the E-Selective set for smaller selection sizes, namely the 5% and 15% thresholds of the top-ranked mutants (w.r.t. all mutants). The E-Selective mutants of the given sizes are randomly selected from the whole E-Selective mutant set. The fault revelation results are presented in Figs. 15 and 16. We observe that FaRM and FaRM* have respectively 31% and 22% higher median fault revelation than E-Selective for the 5% threshold; for the 15% threshold, both have 9% higher median fault revelation. PredKillable has 38% and 47% lower median fault revelation than E-Selective for the 5% and 15% thresholds respectively. We also observe that E-Selective has fault revelation similar to the random selections (respectively 2% and 1% higher than DummyRandom and SpreadRandom for selection size 5%, and respectively 3% and 0% higher for selection size 15%).

Fig. 14 Fault revelation of FaRM compared with E-Selective on Codeflaws. Approximately 38% (number of E-Selective mutants) of all the mutants are selected

Fig. 15 Fault revelation of FaRM compared with E-Selective for selection size 5% of all mutants. FaRM and FaRM* sets outperform E-Selective selection

Fig. 16 Fault revelation of FaRM compared with E-Selective for selection size 15% of all mutants. FaRM sets outperform E-Selective selection

The Wilcoxon rank-sum test shows that both FaRM and FaRM* outperform E-Selective, and E-Selective outperforms PredKillable. The difference between E-Selective and the random approaches is not statistically significant. We also computed the Vargha-Delaney \(\hat {A}_{12}\) effect size between the approaches and found that for the 5% and 15% thresholds, FaRM is better than E-Selective in 64% and 63% of the cases respectively, FaRM* is better in 62% and 61% of the cases respectively, and PredKillable is worse in 86% and 82% of the cases respectively.

6.4 Mutant Prioritization

6.4.1 Comparison with Random (RQ5)

Selected Mutants Cost Metric

Figure 17 shows the distributions of APFD (Average Percentage of Faults Detected) values for all faults, using the five approaches under evaluation. While FaRM and FaRM* respectively yield a median APFD of 98% and 97%, and PredKillable yields a median APFD of 72%, DummyRandom and SpreadRandom reach median APFD values of 93% and 94% respectively. These results reveal that the general trend is in favour of our approach. As FaRM and FaRM* are better than the random baselines when the main cost factor (the number of mutants that need analysis) is aligned, we can infer that our approach is generally better, with practically important differences (of 4%). Note that the highest possible improvement over the random baseline is 6% (DummyRandom has a median APFD value of 94%). Nonetheless, PredKillable is worse than the random baselines.
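
APFD follows the standard formula; a sketch adapted to our setting, where the prioritized items are mutants (or tests) and TFi is the 1-based position at which fault i is first revealed:

```python
def apfd(first_reveal_positions, n_items):
    """APFD = 1 - (TF1 + ... + TFm) / (n * m) + 1 / (2n), where TFi is the
    position at which fault i is first revealed among n prioritized items."""
    m = len(first_reveal_positions)
    return 1.0 - sum(first_reveal_positions) / (n_items * m) + 1.0 / (2 * n_items)
```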

Fig. 17 APFD measurements considering all mutants for the selected mutants cost metric on Codeflaws. The FaRM prioritization outperforms the random baselines

To account for the stochastic nature of the compared methods and increase the confidence in our results, we further performed statistical tests. The Wilcoxon test yielded p-values much lower than our significance level for the compared samples, i.e., FaRM vs. DummyRandom, FaRM vs. SpreadRandom, FaRM* vs. DummyRandom, FaRM* vs. SpreadRandom, PredKillable vs. DummyRandom, and PredKillable vs. SpreadRandom. Therefore, we conclude that FaRM and FaRM* outperform random mutant selection with statistical significance, while random mutant selection outperforms PredKillable. On the other hand, as expected, the Wilcoxon test revealed no statistical difference between the performance of DummyRandom and that of SpreadRandom.

When examining mutant selection strategies, two main parameters influence the application cost: the killable and the equivalent mutants that testers need to analyse. When analysing a killable mutant, our ability to select fault revealing ones is important, while increasing the chance of getting a killable mutant is also important. Therefore, it could be that FaRM is better simply because it selects killable mutants rather than fault revealing ones. To account for this factor, we removed all non-killable mutants from our sets and recomputed our results. This eliminates the influence of non-killable mutants from both approaches.

Our results show that the performance improvement of FaRM and FaRM* over SpreadRandom and DummyRandom also holds when considering only killable mutants (approximated by our test suites). Figure 18 shows the relevant APFD distributions, which are visibly similar to the distributions for all mutants (all values are slightly higher when considering only killable mutants). This result suggests that FaRM and FaRM* are indeed capable of identifying fault revealing mutants, independently of the equivalent mutants involved.

Fig. 18 APFD measurements considering only killable mutants for the selected mutants cost metric on Codeflaws. The FaRM prioritization outperforms the random baselines, independently of non-killable mutants

To provide a general view of the trends, Fig. 19 illustrates the overall (median) effectiveness of the mutant prioritization by FaRM, FaRM* and PredKillable in comparison with the random strategies. We note that for all percentages of mutants, FaRM and FaRM* outperform random-based prioritization, while PredKillable is outperformed by it. Overall, we observe that the fault revelation benefit of FaRM over the random approaches is above 20% (the maximum difference is 34%) when selecting 2% to 8% of the mutants. FaRM reaches a plateau at around 5% of the mutants, where the median fault revelation is maximal. This suggests 5% of the mutants as a hint for the FaRM mutant selection size.
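
A curve such as the one in Fig. 19 can be sketched as follows (assuming, hypothetically, one ranked mutant list per fault with a fault_revealing flag on each record):

```python
def revelation_curve(ranked_lists, percentages):
    """Ratio of faults revealed when only the top x% of each ranked mutant
    list is used, for each x in `percentages`."""
    curve = []
    for pct in percentages:
        hits = [any(m.fault_revealing                      # hypothetical flag
                    for m in ranked[: max(1, int(pct * len(ranked)))])
                for ranked in ranked_lists]
        curve.append(sum(hits) / len(hits))
    return curve
```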

Fig. 19 Mutant prioritization performance in terms of faults revealed (median case) for the selected mutants cost metric on Codeflaws. The x-axis represents the number of considered mutants; the y-axis represents the ratio of faults revealed by the strategies

Finally, we examined the differences between the approaches in terms of execution time. Although we do not explicitly aim at reducing the test execution cost, we expect some benefits due to our methods' ability to prioritise the mutants, which results in a reduced execution time (Zhang et al. 2013). Figure 20 illustrates, in box-plot form, the overall execution time differences (in seconds) between FaRM and the random baselines with respect to the attained fault revelation. Although the differences can be significant in some (rare) cases, the expected (median) ones are -58,167 and -29,373 seconds (-16 and -8 hours) for DummyRandom and SpreadRandom respectively. This result indicates that our approach also has an advantage with respect to test execution, which sometimes becomes significant.

Fig. 20 Execution cost of prioritization schemes

Conclusively, our results demonstrate that FaRM is indeed effective as it is statistically superior to the random baselines, independently of the equivalent mutants involved. It provides 4% higher APFD values, which means that when testers analyse mutants (to strengthen their test suites) they get a 4% improvement in their fault revelation ability. Note that the highest possible improvement over the random baseline is 6% (DummyRandom has a median APFD value of 94%).

Required Tests Cost Metric

Figure 21 shows the distributions of APFD values for all faults, using the five approaches under evaluation. While both FaRM and FaRM* yield a median APFD of 81%, and PredKillable yields a median APFD of 76%, DummyRandom and SpreadRandom reach median APFD values of 77%. These results reveal that the general trend is in favour of our approach. As FaRM and FaRM* are better than the random baselines when the main cost factor (the number of tests that need to be designed and executed) is aligned, we can infer that our approach is generally better, with practically important differences (of 4%). PredKillable performs quite similarly to the random baselines.

Fig. 21 APFD measurements for the required tests cost metric on Codeflaws. The FaRM prioritization outperforms the random baselines

To account for the stochastic nature of the compared methods and increase the confidence in our results, we again performed statistical tests. The Wilcoxon test yielded p-values much lower than our significance level when comparing each of FaRM and FaRM* with each of PredKillable, DummyRandom and SpreadRandom. Therefore, we conclude that FaRM and FaRM* outperform the random baselines with statistical significance. On the other hand, the Wilcoxon test revealed no statistical difference between the performance of PredKillable, DummyRandom and SpreadRandom.

The Vargha-Delaney effect size results show that FaRM is better than DummyRandom, SpreadRandom and PredKillable in 58%, 61% and 60% of the cases respectively, while FaRM* is better than DummyRandom, SpreadRandom and PredKillable in 58%, 61% and 59% of the cases respectively.

To provide a general view of the trends, Fig. 22 illustrates the overall (median) effectiveness of the required test prioritization by FaRM, FaRM* and PredKillable in comparison with the random strategies. We note that for all percentages of tests, FaRM and FaRM* outperform random-based prioritization, while PredKillable is outperformed by it. Overall, we observe that the fault revelation benefit of FaRM over the random approaches is above 10% (the maximum difference is 15%) for the 20% to 45% top-ranked tests.

Fig. 22 Required tests prioritization performance in terms of faults revealed (median case) on Codeflaws. The x-axis represents the number of considered tests; the y-axis represents the ratio of faults revealed by the strategies

Analysed Mutants Cost Metric

The analysed mutants cost metric measures the minimum number of mutants that need to be analysed, including equivalent mutants, following a mutant prioritization approach, before the fault is revealed. A good mutant prioritization approach will minimize this cost. In the following, we compare the analysed mutants cost between our approaches and the random baselines. The metric is calculated for each approach and for each bug of the benchmark, and the approaches are compared statistically with the Wilcoxon rank-sum test and the Vargha-Delaney effect size.
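
A sketch of the metric, again over hypothetical mutant records carrying a fault_revealing flag:

```python
def analysed_mutants_cost(ranked_mutants):
    """Mutants a tester analyses, in ranking order, before reaching a fault
    revealing one; equivalent mutants inflate the cost but are never hits."""
    for cost, mutant in enumerate(ranked_mutants, start=1):
        if mutant.fault_revealing:   # hypothetical flag
            return cost
    return len(ranked_mutants)       # the fault is never revealed by this set
```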

The results show that FaRM, FaRM* and PredKillable are better than DummyRandom and SpreadRandom with statistical significance, indicated by p-values much lower than the significance level. FaRM is better than DummyRandom and SpreadRandom in 57% and 61% of the cases respectively. The performance difference is higher for FaRM*, which is better than DummyRandom and SpreadRandom in 60% and 64% of the cases respectively. PredKillable is better than DummyRandom and SpreadRandom in 60% and 65% of the cases respectively.

FaRM* shows a larger improvement than FaRM over the random baselines, but there is no statistically significant difference between FaRM and FaRM*. Furthermore, FaRM* outperforms PredKillable with a statistically significant difference, being better in 53% of the cases. There is no statistically significant difference between FaRM and PredKillable.

Conclusively, our results demonstrate that FaRM and FaRM* are indeed effective as they are statistically superior to random baselines.

6.4.2 Comparison with Defect Prediction (RQ6)

Selected Mutants Cost Metric

Figure 23 shows the distributions of APFD values for all faults, using FaRM, FaRM*, PredKillable and the random approaches. While FaRM yields a median APFD of 98.0%, defect prediction (DefectPred) reaches a median APFD value of 83.7%. These results reveal that the general trend is in favour of our approach. As our approach is much better than the defect prediction approach when the main cost factor (the number of mutants that need analysis) is aligned, we can infer that it is generally better, with practically important differences (of 14%). Even the random approaches are better than the defect prediction approach. Nevertheless, PredKillable is worse than defect prediction.

Fig. 23 APFD measurements considering all mutants. The FaRM prioritization outperforms the defect prediction

The Wilcoxon test yielded p-values much lower than our significance level for the samples of FaRM vs. DefectPred and FaRM* vs. DefectPred. Therefore, we conclude that FaRM and FaRM* outperform defect prediction with statistical significance. The Wilcoxon test also revealed statistically significant differences between the performance of DefectPred and that of DummyRandom and SpreadRandom respectively. Nonetheless, DefectPred outperforms PredKillable with statistical significance. The Vargha-Delaney \(\hat {A}_{12}\) effect size shows that FaRM and FaRM* are better than DefectPred in 76% of the cases, while DummyRandom and SpreadRandom are better than DefectPred in 71% and 70% of the cases respectively.

To provide a general view of the trends, Fig. 24 illustrates the overall (median) effectiveness of the mutant prioritization by FaRM in comparison with the defect prediction approach. We note that for all percentages of mutants, FaRM outperforms the defect prediction approach. The performance improvement ranges from around 40% to 66% more faults revealed when 2% to 8% of the mutants are executed.

Fig. 24 Mutant prioritization performance in terms of faults revealed (median case) on Codeflaws. The x-axis represents the number of considered mutants; the y-axis represents the ratio of faults revealed by the strategies

6.5 Experiments with Large Programs (RQ7)

Selected Mutants Cost Metric

In CoREBench, all APFD values are much higher than in Codeflaws, with FaRM, FaRM*, DummyRandom and SpreadRandom having a median APFD value of 99%, and PredKillable a median APFD value of 94%. The maximum possible improvement is 1% (given that the random baseline has a median of 99%). This is caused by the large number of redundant mutants involved. To demonstrate this, we check the relation between mutation score and the percentage of considered mutants. Figure 27 illustrates the overall (median) mutation score achieved (y-axis) by the tests killing the percentage of mutants recorded on the x-axis. From this graph we can see that all approaches reach their maximum median mutation score when considering more than 30% of the mutants. This implies that the benefits are reduced for every approach that considers more than 30% of the involved mutants (Figs. 25 and 26).

Fig. 25 FaRM performance in terms of faults revealed (median case) on CoREBench considering all mutants. The x-axis represents the number of considered mutants, while the y-axis represents the ratio of faults revealed by the strategies

Fig. 26 FaRM performance in terms of faults revealed (median case) on CoREBench considering only killable mutants. The x-axis represents the number of considered mutants, while the y-axis represents the ratio of faults revealed by the strategies

Interestingly, both Figs. 27 and 28 demonstrate that FaRM guides the mutant selection towards mutants that maximize neither the mutation score nor the subsuming mutation score (random mutant selection achieves higher mutation and subsuming mutation scores than FaRM). Instead, the selected mutants maximize fault revelation, as demonstrated in Figs. 25 and 26.

Fig. 27 Mutation score (median case) on CoREBench. The x-axis represents the number of considered mutants, while the y-axis represents the mutation score attained by the strategies

Fig. 28 Subsuming mutation score (median case) on CoREBench. The x-axis represents the number of considered mutants, while the y-axis represents the subsuming mutation score attained by the strategies

Given that a large proportion of the mutants are not killable (Fig. 27), we present in Fig. 29 the sensitivity of the approaches with regard to the equivalent mutants, to see how they are ranked. We observe that PredKillable does quite well at ranking the killable mutants first, and that FaRM* inherits this characteristic from PredKillable relatively well. We also observe that FaRM tends to keep equivalent mutants away from the top ranks.

Fig. 29 Ratio of equivalents (median case) on CoREBench. The x-axis represents the number of considered mutants, while the y-axis represents the proportion of equivalent mutants selected by the strategies

To provide a general view of the fault revelation trend, Figs. 25 and 26 illustrate the overall (median) effectiveness of the mutant prioritization by FaRM in comparison with the random strategies for ratios of selected mutants from 1% to 10%. We note that for all percentages of mutants, FaRM outperforms random-based prioritization. The performance improvement ranges from 0% to 10% more faults revealed when 5% and 2% of the mutants are killed, respectively. These trends are similar to those we observe on Codeflaws, suggesting that FaRM effectively learns the properties of the important mutants.

Required Tests Cost Metric

Figure 30 shows the distributions of APFD values for all faults, using the five approaches under evaluation. While both FaRM and FaRM* yield a median APFD of 92%, and PredKillable yields a median APFD of 79%, DummyRandom and SpreadRandom reach median APFD values of 83% and 81% respectively. These results reveal that the general trend is in favour of our approach. As FaRM and FaRM* are better than the random baselines when the main cost factor (the number of tests that need to be designed and executed) is aligned, we can infer that our approach is generally better, with practically important differences (of 9%). PredKillable performs slightly worse than the random baselines.

Fig. 30 APFD measurements on CoREBench for the required tests cost metric. The FaRM prioritization outperforms the random baselines

The Vargha-Delaney \(\hat {A}_{12}\) effect size results show that FaRM is better than DummyRandom, SpreadRandom and PredKillable in 74%, 77% and 86% of the cases respectively. FaRM* is better than DummyRandom, SpreadRandom and PredKillable in 70%, 74% and 81% of the cases respectively. PredKillable is worse than DummyRandom and SpreadRandom in 70% and 66% of the cases respectively.

To provide a general view of the trends, Fig. 31 illustrates the overall (median) effectiveness of the required test prioritization by FaRM, FaRM* and PredKillable in comparison with the random strategies. We note that FaRM and FaRM* outperform random-based prioritization, while PredKillable is outperformed by it. Overall, we observe that the fault revelation benefit of FaRM over the random approaches is above 30% (the maximum difference is 70%) for the 5% to 20% top-ranked tests.

Fig. 31 Required tests prioritization performance in terms of faults revealed (median case) on CoREBench. The x-axis represents the number of considered tests; the y-axis represents the ratio of faults revealed by the strategies

Analysed Mutants Cost Metric

The Vargha-Delaney \(\hat {A}_{12}\) effect size values for the analysed mutants cost metric show that FaRM and FaRM* are better than DummyRandom and SpreadRandom. FaRM is better than DummyRandom and SpreadRandom in 58% and 60% of the cases respectively. The performance difference is higher for FaRM*, which is better than DummyRandom and SpreadRandom in 61% and 63% of the cases respectively. PredKillable is better than DummyRandom and SpreadRandom in 56% and 58% of the cases respectively.

FaRM* shows a larger improvement than FaRM over the random baseline.

Taken together, our results demonstrate that FaRM and FaRM* achieve significant improvements over the random baselines on both the Codeflaws and CoREBench fault sets. Therefore, the improvements made by FaRM and FaRM* can be considered important.

7 Discussion

7.1 Working Assumptions

Our approach uses machine learning to support mutation testing. As such, it makes some assumptions that should hold in order for it to be applicable and effective. First, we assume that there are sufficient historical data from applications of the same context or from previous software releases. This means that we need a diverse and comprehensive set of defects on which mutation testing has been applied. Of course, these defects need to belong to the class of defects targeted by the testing procedure. In the absence of sufficient defects, we can relax this requirement by training on hard-to-kill or subsuming mutants. This can easily be done, in the same way we train for equivalent mutants, as long as we have a large codebase that is sufficiently tested.

Second, we assume that defect causes are repeated. This is an important assumption as in its absence machine learning cannot work. We believe that it holds given the evidence provided by the n-version programming studies (Leveson 1995; Knight and Leveson 1986) and the empirical observations in the context of Linux kernel (Palix et al. 2011).

Third, we assume that mutants are linked with targeted defects. This assumption comes with the use of mutation testing. We believe that it holds given the empirical evidence provided by recent studies (Titcheu Chekam et al. 2017; Petrovic and Ivankovic 2018; Ramler et al. 2017; Papadakis et al. 2018b; Just et al. 2014b). Finally, we assume that fault revelation utility can be captured by static features such as the ones used in this study. We are confident that this assumption holds given the reports of Petrovic and Ivankovic (Petrovic and Ivankovic 2018) on the utility of the AST features in mutant selection and the evidence we provide here.

7.2 Threats to Validity

We acknowledge the following threats that could have affected the validity of our results. One possible external validity threat lies in the nature of the test subjects we used. Individually, the majority of programs in the comparison experiments are small in size, and may not be representative of real-world programs. Our mitigation strategy is discussed in Section 7.3. Moreover, since the properties of the fault revealing mutants reside in the code parts that are control and data dependent to and from the faults, the cumulative size of the relevant code parts (based on which we get the feature values) should be small. Therefore, for such a study, the most important characteristics are the faulty code area and its dependencies. Since we have a large and diverse set of real faults, we believe that this threat is limited. Future work should validate our findings and analysis on larger programs.

Another potential threat relates to the mutation operators we used. Although we have considered a variety of operators, we cannot guarantee that they yield representative mutants. To diminish this threat we used a large number of operators (816 simple operators across 18 categories) covering the most frequently used C features. We also included all the operators adopted by modern mutation testing tools (Titcheu Chekam et al. 2017; Papadakis et al. 2018a).

Threats to internal validity lie in the use of recent machine learning algorithms to the detriment of established and widely used techniques. Nevertheless, these threats are minimized as gradient boosting is gaining momentum in both the research literature and the practice of machine learning.

Similarly, there might be some issues related to code redundancy (duplicated code) that may influence our results. We discuss our redundancy mitigation strategy in Section 7.4.

Another internal validity threat may be due to the features we use, which have not been optimized with any feature selection technique. This is not a big issue in our case as we use gradient boosting, which automatically performs feature selection. To verify this point we trained a deep learning model that also performs feature selection and checked its performance; the results showed insignificant differences from our method. Additionally, we retrained our classifiers using only the features with information gain greater than or equal to 0.02 and got results similar to random selection, suggesting that all our features are needed. Future research should shed light on this aspect by complementing and optimizing our feature set.

Other internal validity threats are due to the way we treated mutants as equivalent. To deal with this issue, we used KLEE, a state-of-the-art test generation tool, together with the accompanying test suites. As the programs we use are small, KLEE should have no problem generating effective test suites. Together these tools kill 87% of all the mutants, demonstrating that our test suites are indeed strong. Since the 13% of the mutants we treat as equivalent is in line with the results reported in the literature (Papadakis et al. 2015), we believe that this threat is not important. Unfortunately, we cannot practically do much more than that, as the problem is undecidable (Budd and Angluin 1982).

Finally, our assessment metrics may involve some threats to construct validity. Our cost measurements, the numbers of selected and analysed mutants and the number of test cases, essentially capture the manual effort involved. Automated tools may reduce this cost and hence influence our measurements. Regarding equivalent mutants, we used a state-of-the-art equivalent mutant detection technique, TCE (Papadakis et al. 2015), to remove all trivially equivalent mutants before conducting any experiment. Therefore, the remaining equivalent mutants are those that remain undetectable by the current standards. Regarding the test generation cost, we acknowledge that while automated tools manage to generate test inputs, they fail to generate test oracles. Therefore, augmenting the test inputs with test oracles remains a manual activity, which we approximate by measuring the number of tests. In our experiments we bypassed the oracle problem by using the 'correct' program versions as oracles. An alternative scenario involves the use of automated oracles, but these are rare in practice and we did not consider them. Overall, we believe that with the current standards, our cost measurements approximate well the human cost involved.

All in all, we aimed at minimizing any potential threats by using various comparison scenarios, clearly evaluating the benefit of the different steps in FaRM, and leveraging frequently used and established metrics. Additionally, to enable replication and future research, we make our data publicly available.

7.3 Representativeness of Test Subjects

Most of our results are based on Codeflaws. We used this benchmark because machine learning requires lots of data and Codeflaws is, currently, the largest benchmark of real faults in C programs. Also, because of its manageable size, we can automatically generate a relatively large and thorough test pool and apply mutation testing. Still, this required 8,009 CPU days of computation (for the mutant executions alone), indicating that we reached the experimentally achievable limits. Similarly, applying mutation testing on the 45 faults of CoREBench required 454 CPU days of computation.

The obvious differences between the sizes of the test subjects raise the question of whether our conclusions hold on other programs and faults. Fortunately, as already discussed, our results on CoREBench show trends similar to those observed on Codeflaws. Training a classifier on CoREBench yields AUC values around 0.616, which is approximately the same as (slightly lower than) the one we get from Codeflaws. This fact provides confidence that our features do capture the mutant properties we are seeking. To further cater for this issue, we also selected the harder-to-reveal faults (faults revealed by less than 25% of the tests). This quality control practice, used in fault injection studies, ensures that our faults are not trivial.

Additionally, we checked the syntactic distance of the Codeflaws faults and showed that it is small (please refer to Fig. 7), similar to the one assumed by mutation testing. This property, together with the subtle faults (faults revealed by less than 25% of the tests) we select, makes our fault set compliant with the mutation testing assumptions, i.e., the Competent Programmer Hypothesis.

Furthermore, we computed and contrasted the correlation between mutants and faults on three defect benchmarks: CoREBench, Codeflaws and the Defects4J dataset (Just et al. 2014a). Our aim is to check whether there are major differences in the relation between the faults and mutants of the three benchmarks.

Defects4J is a popular defect dataset for Java, with real faults from large open source programs. To compute the correlations on Defects4J we used the data from the study of Papadakis et al. (2018b), while for CoREBench and Codeflaws we used the data from this paper. We computed Kendall correlations with uncontrolled test suite sizes between 1% and 15% of all tests: we draw 10,000 random test sets, each with a size randomly chosen between 1% and 15% of all the tests, then compute the mutation score and the fault revelation of each test set, and compute the Kendall correlation between the mutation scores and fault revelation. Figure 32 shows the correlations for Codeflaws, CoREBench and Defects4J. As can be seen, the correlations are similar in all three cases. Therefore, since the mutant and fault relations share similar properties in all cases, we believe that our defect set provides good indications of the fault revealing ability of our approach.
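
A sketch of this sampling procedure (mutation_score and fault_revelation are hypothetical helpers over the kill and fault matrices):

```python
import random
from scipy.stats import kendalltau

def ms_fr_correlation(tests, kill_matrix, fault_matrix, repeats=10_000, seed=0):
    """Kendall correlation between mutation score and fault revelation over
    random test sets sized between 1% and 15% of the test pool."""
    rng, ms, fr = random.Random(seed), [], []
    for _ in range(repeats):
        size = rng.randint(max(1, len(tests) // 100),
                           max(1, 15 * len(tests) // 100))
        sample = rng.sample(tests, size)
        ms.append(mutation_score(sample, kill_matrix))      # hypothetical helper
        fr.append(fault_revelation(sample, fault_matrix))   # hypothetical helper
    tau, _ = kendalltau(ms, fr)
    return tau
```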

Fig. 32 Correlations between mutants and faults in three defect datasets. Similar correlations are observed in all three cases suggesting that Codeflaws provides good indications on the fault revealing ability of the mutants

7.4 Redundancy Between the Considered Faults

Code redundancy may influence our results. As depicted in Fig. 6, in Codeflaws the number of implementations for the same problem is usually higher than one. This introduces a risk that our evaluation (test) defect set may benefit from the knowledge gained during training, in case there is another implementation of the same problem in that set. Although such a case is unlikely, as all of our defects are different and form unique program versions, to remove any threat from this factor we repeated our experiment by randomly splitting the Codeflaws subjects into training and test sets in such a way that all implementations of the same problem appear either in the training set or in the evaluation set, but not both. We obtained almost identical results to the previous experiment, i.e., AUC values of 62% and 87% respectively for fault revelation and equivalence prediction when controlling for the implementations (always having different problem implementations in the training and test sets). Another threat related to code redundancy may have affected our results on CoREBench. Among the 45 CoREBench faults we consider, only 20 are on the same components (13 in Coreutils, 3 in Find, and 4 in Grep). Note that the 13 instances of Coreutils form three separate sets of 2, 5 and 6 bugs on the same component. We manually checked these defects and found that they all differ (they are located in different code parts and the code around the locations modified to fix the defects differs). Nevertheless, there is still a possibility that code similarities may impact our classifiers positively or negatively. Although such a case is compatible with our working scenario (we assume that we have similar historical data) and thus not a problem for our approach, it is still interesting to check the classifier performance on similar/dissimilar implementations.

To deal with this case, we divided our fault set (CoREBench) into two sets, one with the faults having similar faulty functions and one with dissimilar ones. To do so, we used the Deckard tool (Jiang et al. 2007), which computes the similarity between code instances (at the AST level). For each faulty function, the tool compares the vector representations of the sub-trees of small code snippets and reports similarity scores. Two code fragments are considered similar if they have code parts with high similarity scores on the utilized abstraction, i.e., above 95% (Jiang et al. 2007).

Having divided the faults into similar and dissimilar sets, we then contrast the results they provide. Overall, we found insignificant differences between the two sets. Figure 33 compares the ranking positions of the fault revealing mutants in the order provided by FaRM, using the following tool parameters: a similarity threshold of 95, 4 strides and a minimum of 50 tokens. From these results we see that there are no significant differences between the two sets, suggesting that code redundancy does not affect our results.

Fig. 33 CoREBench results on similar (RepeatedIDs) and dissimilar (Non-RepeatedIDs) implementations. We observe similar trends in both cases, suggesting a minor or no influence of code similarity on FaRM performance

To further reduce the threat related to code redundancy, concerning mutants that appear on the same line of code in the training and test data, we repeated the experiments after removing the test-data mutants that could cause this threat. Specifically, we removed all mutants of the test data for which there exists at least one mutant in the training data, located on the same component, that has the same features. This procedure removed 13% (median case) of the test-data mutants. The evaluation on the remaining mutants (the test data after dropping the “duplicates”) yielded results similar to those obtained without dropping them, i.e., an AUC value of 62%. It is noted that 18% (median case) of the dropped mutants have a different fault revelation score from their “duplicates” in the training set, which has the unfortunate effect of confusing the classifier.
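
A sketch of this filtering step, over hypothetical dict records carrying a component name and a feature vector:

```python
def drop_duplicated_mutants(test_mutants, train_mutants):
    """Drop test mutants that share both component and full feature vector
    with some training mutant."""
    seen = {(m["component"], tuple(m["features"])) for m in train_mutants}
    return [m for m in test_mutants
            if (m["component"], tuple(m["features"])) not in seen]
```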

7.5 Other Attempts

Our study demonstrates how simple machine learning approaches can help improve mutation testing. Since our goal was to demonstrate the benefits of using such an approach, we did not attempt to manipulate our data in any way (apart from the exclusion of the trivial faults). We achieved this goal, but there is still room for improvement that future research can exploit. For instance, it is likely that classification results can be improved by pre-processing the training data, e.g., excluding fault types that are problematic (Papadakis et al. 2018c), excluding fault types with few instances, excluding versions with low-strength test suites, as well as removing many other sources of noise.

The data manipulation strategies we attempted during our study were oversampling, the exclusive use of features with high information gain, the use of a deep learning classifier, and targeting irrelevant mutants (the mutants with the lowest fault revealing probability). Oversampling consists of randomly duplicating the data items of the minority class in order to obtain more balanced data for training the classifier; in our case, we applied oversampling of the minority class of mutants, which is the fault revealing class (approximately 3% of the whole data). We also attempted to replace the supervised learning algorithm used by our approach, substituting the decision tree with a deep neural network classifier. We further retrained the classifier to target irrelevant mutants (mutants not killed by fault revealing tests), the motivation being that the classifier may be better at separating irrelevant mutants than fault revealing ones. All these attempts yielded results quite similar to or worse than those we report and thus we do not detail them.
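
For reference, the oversampling step can be sketched as follows (random duplication of the fault revealing minority class until the classes balance):

```python
import random

def oversample_minority(samples, labels, minority=1, seed=0):
    """Randomly duplicate minority-class items (here: fault revealing
    mutants, ~3% of the data) until both classes have equal counts."""
    rng = random.Random(seed)
    minor = [s for s, l in zip(samples, labels) if l == minority]
    need = sum(1 for l in labels if l != minority) - len(minor)
    extra = rng.choices(minor, k=max(0, need))
    return samples + extra, labels + [minority] * len(extra)
```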

In another attempt, we trained our classifier using only the features with the highest information gain (those with IG ≥ 0.02 in Fig. 10), but achieved results similar to random mutant selection.

8 Related Work

Years of research in mutation testing have shown that designing tests that are capable of revealing mutant-faults results in strong test suites that in turn reveal real faults (Frankl et al. 1997; Li et al. 2009; Titcheu Chekam et al. 2017; Papadakis et al. 2018b; Just et al. 2014b). The technique is particularly effective and capable of revealing more faults than most of the other structural test criteria (Frankl et al. 1997; Li et al. 2009; Titcheu Chekam et al. 2017). Experiments using real faults have shown that mutation testing reveals more faults than the all-uses test criterion (Frankl et al. 1997), and also that it reveals significantly more faults than the statement, branch and weak mutation test criteria (Titcheu Chekam et al. 2017).

Although effective, mutation requires too many mutants, making the cost of generating, analysing and executing them particularly high. Recent studies have shown that only a small number of mutants is sufficient to represent them (Kintis et al. 2010; Ammann et al. 2014; Papadakis et al. 2016) and that the majority of the mutants are somehow “irrelevant” to the underlying faults (the faults that testers seek) (Papadakis et al. 2018b). Along these lines, Natella et al. (2013) experimented with fault injection and demonstrated that up to 72% of injected faults are non-representative. Papadakis et al. (2018c) analysed different types of mutants, i.e., hard to kill, subsuming, hard to propagate and fault revealing, and demonstrated that the class of fault revealing mutants is unique and differs from the other mutant sets. These studies motivated our research by indicating that it is possible to target a specific (small) set of mutants that maximizes testing effectiveness.

Since the early days of mutation testing, researchers realised that the number of mutants is one of the most important problems of the method. Therefore, several approaches have been proposed to address it. Mutant random sampling was one of the first attempts (Budd 1980; Acree 1980). Random sampling was evaluated by Wong (1993), who found that a sampling ratio of 10% results in a test effectiveness loss of approximately 16% (evaluated on Fortran programs using the Mothra mutation testing system (DeMillo et al. 1988)). More recently, Papadakis and Malevris (2010b), using the Proteum mutation testing tool (Delamaro et al. 2001), reported fault losses on C operators of approximately 26%, 16%, 13%, 10%, 7% and 6% for sampling ratios of 10%, 20%, ..., 60% respectively.

An alternative approach to reduce the number of mutants is to select them based on their types, i.e., according to the mutation operators. Mathur (1991) introduced the idea of constrained mutation (also called selective mutation), using only two mutation operators. Wong and Mathur (1995a) experimented with sets of operators and found that two operators alone lead to a test effectiveness loss of approximately 5%. Offutt et al. (1993, 1996a) extended this idea and proposed a set of 5 operators with almost no loss of test effectiveness. This 5-operator set is considered the current standard of mutation, as it has been adopted by most modern mutation testing tools and used in most recent studies (Papadakis et al. 2018a).

Many additional selective mutation approaches have been proposed. Mresa and Bottaci (1999) defined a selective mutation procedure focused on reducing the number of equivalent mutants, instead of the number of mutants alone as done by the studies of Mathur (1991) and Offutt et al. (1993, 1996a). They report significant reductions in the numbers of equivalent mutants produced by the selected operators, with marginal effectiveness loss (evaluated on Fortran with Mothra). Later, Barbosa et al. (2001) defined a selective mutation procedure aimed at reducing the computational cost of mutation testing of C programs. They found that a set of 10 operators could give almost the same results as the whole set of C operators supported by Proteum (78 operators). Namin et al. (2008) used regression analysis techniques and found that a set of 13 Proteum mutation operators could provide substantial execution cost savings without any significant effectiveness loss (mutant reductions of approximately 93% are reported).

More recently, researchers have experimented with mutation based only on deletion operators (Untch 2009). Deng et al. (2013) experimented with Java programs and the MuJava mutation operators (Ma et al. 2006) and reported reductions of 80% in the number of mutants with marginal effectiveness losses. Delamaro et al. (2014) defined deletion operators for C and reported that they significantly reduce the number of equivalent mutants, again with marginal effectiveness losses.

Other attempts have explored the identification of the program locations to be mutated. The key argument in these research directions is that program location is among the most important factors determining the utility of mutants. Sun et al. (2017) suggested selecting mutants that are diverse in terms of the static control flow graph paths that cover them. Gong et al. (2017) used dominator analysis in order to select mutants that, when covered, maximize the coverage of other mutants. This work applies weak mutation and attempts to identify dominance relations between the mutants in a static way.

Petrovic and Ivankovic (2018) identified arid nodes (special AST nodes) as a source of information related to the utility of mutants. Their work uses dynamic analysis (test execution) combined with static analysis (based on the AST) in order to identify mutants that are helpful during code reviews. We include such features in our study with the hope that they can also capture the properties of fault revealing mutants. Nevertheless, it remains interesting future work to see how our features can fit the objectives of code reviews (Petrovic and Ivankovic 2018).

Mirshokraie et al. (2015) used static (complexity) and dynamic (number of executions) analysis features to select mutants, for JavaScript programs, that reside in code parts that have low failed error propagation (i.e., are likely to propagate to the program output). Their results show that more than 93% of the selected mutants are killable, and that more than 75% of the non-trivial mutants reside in the top 30% ranked code parts.

After several years of development of various selective mutation approaches, recent research has established that the literature approaches perform similarly to random mutant sampling. Zhang et al. (2010) compared random mutant selection and selective mutation (using C programs and the Proteum mutation operators) and found no significant differences between the two approaches. The most recent study, that of Kurtz et al. (2016) (using C programs and the Proteum mutation operators), reached the same conclusion, reporting that mutant reduction approaches, both selective mutation and random sampling, perform similarly.

From the above discussion it should be clear that despite the plethora of selective mutation testing approaches, random sampling remains one of the most effective. This motivated our work, which uses machine learning techniques and source code features in order to effectively tackle the problem. Moreover, as most of the methods use only one feature, the mutant type, which according to our information gain results has relatively poor prediction power, they should be expected to perform poorly. More importantly, our approach differs from previous work in the evaluation metrics used. All previous work measured test effectiveness in terms of artificial faults (i.e., mutant kills or seeded faults found), while we used real faults. We believe that this is an important difference as our target (dependent variable) is the actual measurement of interest, i.e., real fault revelation, and not a proxy, i.e., the number of mutants killed.

The studies closest to ours are the “predictive mutation” work by Zhang et al. (2016, 2018) and Mao et al. (2019), and the “fault representativeness” work on software fault injection by Natella et al. (2013). Predictive mutation testing attempts to predict the mutants killed by a given test suite without any mutant execution. It employs a classification model using both static and dynamic features (on both the test suite and the mutants) and achieves remarkable results, with an overall 10% error on the predicted mutation scores. Predictive mutation has a goal similar to that of our killable mutant prediction method; however, predictive mutation assumes the existence of test suites, while our killable mutant prediction method does not. Moreover, our method targets a different problem, the prediction and prioritization of the important mutants prior to any test execution. To do so, we use only static features (on the code under test), while predictive mutation relies heavily on test code and dynamic features (Mao et al. 2019), and we evaluate our approach using real faults (instead of mutants).

Natella et al. (2013) proposed removing injected faults to achieve meaningful 'representative' results and reduce the application cost of fault injection. This was achieved by employing classification algorithms that use complexity metrics. This approach has a goal similar to that of our fault revealing mutant selection, but in a different context, i.e., it targets emulating fault behaviour rather than fault revelation. Nevertheless, Natella et al. rely on complexity metrics, which in our case do not seem adequate (as we show in RQ6). Still, it is interesting to see how our approach performs in the fault injection context.

Another similar line of work is Evolutionary Mutation Testing (EMT) (Delgado-Pérez and Medina-Bulo 2018). EMT is a technique that attempts to select useful mutants based on dynamic features (test execution traces) and uses them to support test augmentation. EMT learns the most interesting mutation operators and locations in the code under analysis using a search algorithm and mutant execution results. Overall, EMT achieves a 45% reduction in the number of mutants. Although EMT aims at the typical mutant reduction problem (while we aim at the fault revealing one), it can complement our method. Since EMT performs mutant selection after the mutant-test executions, FaRM can provide a much better starting point. Another way to combine the two techniques is to use the search engine of EMT, together with our features, to refine the mutant rankings.

A different way to reduce the number of mutants is to rank the live mutants according to their importance, so that testers can apply customised analysis according to their available budget. Along these lines, Schuler and Zeller (2013) used the mutants' impact to rank mutants according to their likelihood of being killable. Namin et al. (2015) introduced the MuRanker approach. MuRanker uses three features: the differences that mutants introduce (a) in the control-flow-graph representation (Hamming distance between the graphs), (b) in the Jimple representation (Hamming distance between the Jimple codes) and (c) in the code coverage produced by a given set of test cases (Hamming distance between the traces of the programs). Although our mutant prioritization scheme is similar to these approaches, we target a different problem, the static detection of valuable mutants; thus, we do not assume the existence of test suites and mutant executions. The benefit of not making such assumptions is that we can reduce the number of mutants to be analysed by testers, and to be generated and executed by mutation testing tools.

9 Conclusions

The large number of mutants involved in mutation testing has long been identified as a barrier to the practical application of the method. Unfortunately, the problem of mutant reduction remains open, despite significant efforts within the community. To tackle this issue, we introduce a new perspective on the problem: fault revelation mutant selection. We claim that the valuable mutants are those most likely to reveal real faults, and we conjecture that standard machine learning techniques can help in their selection. In view of this, we have demonstrated that some simple 'static' program features capture the important properties of the fault revealing mutants, resulting in the uncovering of significantly more faults (6%-34%) than randomly selected mutants.

Our work forms a first step towards tackling fault revelation mutant selection with the use of machine learning. As such, we expect future research to extend and improve our results by building more sophisticated techniques, augmenting and optimizing the feature set, using different and potentially better classifiers, and targeting specific fault types. To support such attempts, we make our subjects (programs & tests), feature, kill and fault revelation matrices publicly available.