1 Introduction

Search-based software engineering has been applied to numerous different software development activities (Harman et al. 2012), and software testing is one of the most successful of these (McMinn 2004; Ali et al. 2010). One particular software testing task for which search-based techniques are well suited is the automated generation of unit tests. For example, there are search-based tools like AUSTIN for C programs (Lakhotia et al. 2010) or EvoSuite for Java programs (Fraser and Arcuri 2011).

In search-based software testing, the testing problem is cast as a search problem. For example, common scenarios are to generate a set of test cases maximizing their code coverage or their fault detection capability. A code coverage criterion describes a set of (typically structural) aspects of the system under test (SUT) which should be exercised by a test suite, for example all statements, lines or branches. Mutation testing is traditionally used to assess the fault detection capability of a test suite: artificial faults are inserted into the SUT one at a time, and the ability of the test suite to detect such faults is measured. In both cases, the search space consists of all possible data inputs for the SUT. A search algorithm (e.g., a genetic algorithm) is then used to explore this search space to find the input data that maximize the given objective (e.g., cover as many branches as possible or achieve the highest possible mutation score).

Traditionally, to achieve this testing objective a search is carried out for each individual coverage goal (McMinn 2004) (e.g., a branch). To guide the search, the fitness function exploits information like the approach level (Wegener et al. 2001) and branch distance (Korel 1990). During the search for a coverage goal, other goals may be “accidentally” covered, and by keeping such test data one does not need to perform a search for those accidentally covered goals. However, there are several potential issues with such an approach:

  • Search budget distribution: If a coverage goal is infeasible, then all search effort to try to cover it is wasted (except for any other coverage goals accidentally covered during the search). Unfortunately, determining whether a goal is feasible or not is an undecidable problem. If a coverage goal is trivial, then it will typically be covered by the first random input. Given a set of coverage goals and an overall available budget of computational resources (e.g., time), how should the search budget be assigned to the individual goals to maximize the overall coverage?

  • Coverage goal ordering: Unless some smart strategies are designed, the search for each coverage goal is typically independent, and potentially useful information is not shared between individual searches. For example, to cover a nested branch one first needs to cover its parent branch, and test data for the latter could be used to help the search for the nested branch (instead of starting from scratch). In this regard, the order in which coverage goals are sought can have a large impact on final performance.

To overcome these issues, previous work introduced the whole test suite approach (Fraser and Arcuri 2013b, 2015). Instead of searching for a single test for each individual coverage goal in sequence, the search problem is changed to searching for a set of tests that covers all coverage goals at the same time; accordingly, the fitness function guides the search toward covering all goals. The advantage of such an approach is that both the question of how to distribute the available search budget between the individual coverage goals and the question of in which order to target those goals disappear. With the whole test suite approach, large improvements have been reported for both branch coverage (Fraser and Arcuri 2013b) and mutation testing (Fraser and Arcuri 2015).

Despite this evidence of higher overall coverage, the question remains of how the use of whole test suite generation influences individual coverage goals. In particular:

  • Even if the whole test suite approach covers more goals, those are not necessarily a superset of those that the traditional approach would cover. Is the higher coverage due to more easy goals being covered? Is the coverage of other goals adversely affected? Are there goals that the traditional one goal at a time approach can cover and the whole approach cannot? Although higher coverage might lead to better regression test suites, for testing purposes some coverage goals might be more “valuable” than others. So, from a practical point of view, preferring the whole test suite approach over the traditional one may not necessarily lead to an improvement in testing effectiveness.

  • When generating individual tests, once a coverage goal is satisfied, it is no longer involved in test generation, and resources are invested only in uncovered goals. In whole test suite generation, all goals, including those already covered during the search, are part of the optimization until the search ends. For example, after mutation in the genetic algorithm a test suite may cover a new branch, but if the mutation also meant that two already covered goals are “lost” by that test suite, the fitness evaluation would not consider the mutated test suite an improvement. Does this affect whole test suite optimization in practice? An easy solution to overcome this problem would be to keep a test “archive” for the already covered goals, and to focus the search only on those goals not yet covered.

In this paper, we aim to empirically study these two aspects in detail. We investigate whether there are specific coverage goals for which the traditional approach is better and, if that is the case, we attempt to characterise those scenarios. Based on an empirical study performed on 100 Java classes, our study shows that there are indeed cases in which the traditional approach provides better results. However, those cases are rare (nearly one hundred times fewer) compared to the cases in which only the whole test suite approach is able to cover the goals. Using an archive does improve performance on average, but it also has negative side-effects on some testing targets.

This paper is an extension to earlier work (Arcuri and Fraser 2014). In particular, in this paper we analyze three different coverage criteria, namely line coverage, branch coverage and weak mutation, instead of just branch coverage. Furthermore, we investigate the effects of using a test archive during whole test suite generation by replicating the previous experiments (Arcuri and Fraser 2014) and performing another set of experiments on a sample of more complex classes.

The structure of this paper is as follows. Section 2 provides background information, whereas the whole test suite approach is discussed in detail in Section 3. The performed empirical study is presented in Section 4. A discussion of the threats to the validity of the study follows in Section 5. Finally, Section 6 concludes the paper.

2 Background

Search-based techniques have been successfully used for test data generation (Ali et al. 2010); McMinn (2004) provides an extensive survey of this topic. The application of search to test data generation can be traced back to the 70s (Miller and Spooner 1976); later, the key concepts of branch distance (Korel 1990) and approach level (Wegener et al. 2001) were introduced to help search techniques generate the right test data.

More recently, search-based techniques have also been applied to test object-oriented software (e.g., Tonella 2004; Fraser and Zeller 2012; Wappler and Lammermann 2005; Ribeiro 2008). One specific issue that arises in this context is that test cases are sequences of calls, and their length needs to be controlled by the search. Since the early work of Tonella (2004), researchers have tried to deal with this problem, for example by penalizing the length directly in the fitness function. However, longer test sequences can achieve higher code coverage (Arcuri 2012), yet properly handling their growth and reduction during the search requires special care (Fraser and Arcuri 2013a).

Most approaches described in the literature aim at generating test suites that achieve the highest possible branch coverage. In principle, however, any other coverage criterion is amenable to automated test generation. For example, mutation testing (Jia and Harman 2009) is often considered a worthwhile test goal, and it has been used in a search-based test generation environment (Fraser and Zeller 2012).

When test cases are sought for individual goals in such coverage-based approaches, it is important to keep track of the accidental collateral coverage of the remaining goals; otherwise, it has been proven that random testing would fare better under some scalability models (Arcuri et al. 2012). Recently, Harman et al. (2010) proposed a search-based multi-objective approach in which, although each coverage goal is still targeted individually, there is a secondary objective of maximizing the number of collateral goals that are accidentally covered. However, no particular heuristic is used to help cover these other coverage goals.

All approaches mentioned so far target a single test goal at a time – this is the predominant method. There are some notable exceptions in search-based software testing. The works of Arcuri and Yao (2008) and Baresi et al. (2010) use a single sequence of function calls to maximize the number of covered branches while minimizing the length of such a test case. A drawback of such an approach is that there can be conflicting testing goals, and it might be impossible to cover all of them with a single test sequence regardless of its length.

The whole test suite approach (Fraser and Arcuri 2013b, 2015) was devised to overcome those issues. In this approach, instead of evolving individual tests, whole test suites are evolved, with a fitness function that considers all the coverage goals at the same time. Promising results were obtained for both branch coverage (Fraser and Arcuri 2013b) and mutation testing (Fraser and Arcuri 2015).

As an alternative to the whole test suite approach, Panichella et al. (2015) recently reformulated the generation of test suites for branch coverage as a many-objective optimization problem. The idea is to simultaneously minimize the distance between a test case and each uncovered branch in the class under test. To this end, an archive of solutions is used to store test cases which cover new branches, continuing the search with only the uncovered branches as target goals. To what extent the reported benefits stem from the many-objective reformulation of the problem or from the use of this archiving mechanism remains unclear. The archive-based whole test suite approach presented in this paper opens the way for further empirical evaluations.

3 Whole Test Suite Generation

To make this paper self-contained, in this section we provide a summarized description of the traditional approach used in search-based software testing, the whole test suite approach, and the archive-based extension of the whole test suite approach. For more details on the traditional approach, the reader can for example refer to McMinn (2004) and Wegener et al. (2001). For the whole test suite approach, the reader can refer to Fraser and Arcuri (2013b, 2015).

Given a SUT, assume X to be the set of coverage goals we want to automatically cover with a set of test cases T (i.e., a test suite). Coverage goals could for example be branches if we are aiming at branch coverage, or any other element depending on the chosen coverage criterion (e.g., mutants in mutation testing). In this paper, we consider three coverage criteria. The first is branch coverage, the dominant coverage criterion in the literature. In practice, statement coverage often serves as a simpler alternative when practitioners measure the coverage of their tests; however, many modern bytecode-based tools (Li et al. 2013) measure coverage on lines of code as a proxy for statement coverage. Consequently, we use line coverage as the second criterion. The third criterion is weak mutation testing (Fraser and Arcuri 2015), where each mutant is expected to lead to a state change.

3.1 Generating Tests for Individual Coverage Goals

Given |X| = n coverage goals, traditionally there would be one search for each of them. To give more gradient to the search (instead of just a “yes/no” answer on whether a goal is covered), the fitness function usually employs the approach level \(\mathcal{A}(t,x)\) and the branch distance d(t,x) (McMinn 2004; Wegener et al. 2001). The approach level \(\mathcal{A}(t,x)\) for a given test t on a coverage goal x ∈ X is used to guide the search toward the target goal. It is defined as the minimal number of control dependent edges in the control dependency graph between the target goal and the control flow represented by the test case. The branch distance d(t,x) heuristically quantifies how far a predicate is from being evaluated to true. In this context, the considered predicate \(x_{c}\) is the one of the closest control dependent branch where the control flow diverges from the target branch.

Branch Coverage

The Branch Coverage fitness function to minimize the approach level and branch distance between a test t and a branch coverage goal x is defined as:

$$ f(t,x) = \mathcal{A}(t,x) + \nu(d(t,x_{c})) \ , $$
(1)

where ν is any normalizing function in the [0,1] range (Arcuri 2013). For example, consider this trivial function:

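The original listing is not reproduced here; the following is a hypothetical reconstruction that is consistent with the fitness values in (2) and (3) below, together with ν(x) = x/(x+1) as one common choice of normalizing function (Arcuri 2013). All names, and the outermost predicate, are illustrative assumptions.

```java
// Hypothetical reconstruction of the example method: the branch distances in (2)
// and (3) imply nested conditions on z of roughly this form. The outermost
// predicate (z > 0) is an assumption; only z > 100 and z > 200 are implied.
public class Example {

    public static void testMe(int z) {
        if (z > 0) {           // first if-condition (assumed)
            if (z > 100) {     // second if-condition: t_50 diverges here
                if (z > 200) { // target branch x_{z>200}
                    // ... code of the target branch ...
                }
            }
        }
    }

    // One common normalizing function into [0,1): nu(x) = x / (x + 1).
    static double nu(double x) {
        return x / (x + 1.0);
    }

    public static void main(String[] args) {
        // f(t_50)  = approach level 1 + nu(|50 - 100| + 1)  = 1 + nu(51)  ~ 1.98
        System.out.println(1 + nu(51));
        // f(t_101) = approach level 0 + nu(|101 - 200| + 1) = 0 + nu(100) ~ 0.99
        System.out.println(0 + nu(100));
    }
}
```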

With a test case \(t_{50}\) having the value z=50, the execution would diverge at the second if-condition; hence the resulting fitness for the target \(x_{z>200}\) would be

$$ f(t_{50},x_{z>200}) = 1 + \nu(|50-100|+1) = 1 + \nu(51) \ , $$
(2)

where the first component is the approach level (i.e., 1) and the second component is the distance to executing the “then” branch of the second if-condition. Now, the test with z=50 would have a higher fitness value (i.e., it would be worse) than the following test case with z=101, because the latter has a lower approach level (i.e., 0) and branch distance values are normalized:

$$ f(t_{101},x_{z>200}) = 0 + \nu(|101-200|+1) = 0 + \nu(100)\;. $$
(3)

Line Coverage

Once the control flow graph of the SUT is constructed, determining how close a line is to being covered boils down to determining how close its enclosing basic block is to being covered. Hence, the definition of the Line Coverage fitness function is identical to the one for Branch Coverage.

Weak Mutation Testing

Mutation testing consists of applying small changes, one at a time, to the code of the SUT and then checking whether a test distinguishes between the original SUT and the changed versions, called mutants. Weak mutation considers a mutant covered (“killed”) if the execution of the test on the mutant is observably different from its execution on the original SUT, that is, if state infection is reached by the test. Otherwise, the mutant remains uncovered (“alive”). The Weak Mutation Testing fitness function for a test t and a mutant μ is hence given by the sum of the approach level, the branch distance and the minimal infection distance function \(d_{\textit{inf}}(t,\mu)\):

$$ f(t,\mu) = \mathcal{A}(t,\mu) + \nu(d(t,\mu_{c})) + \nu(d_{\textit{inf}}(t,\mu))\ , $$
(4)

where \(d_{\textit{inf}}(t,\mu)\) estimates the distance to an execution of the mutant in which state infection occurs. Intuitively, if the mutation (i.e., the specific instruction where the mutant differs from the original SUT) is not executed, the infection distance is 1.0 (maximum). If the mutation is executed, on the other hand, the minimal state infection distance is specific to each mutation operator (Fraser and Zeller 2012). For simple operators such as Delete Statement and Insert Unary Operator, the minimal infection distance solely depends on the execution distance (that is, it is 0.0 whenever the approach level and branch distance are 0.0 as well). For other operators, the infection distance depends on a comparison of the executions of the original and mutated code; for example, the minimal infection distance for the Replace Arithmetic Operator is 0.0 only when the outcome of the original arithmetic operation and the outcome of the mutation differ, and 1.0 when the outcomes are the same.
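For illustration, a minimal sketch of such an infection distance for one Replace Arithmetic Operator mutant could look as follows; this is only a simplified example, not EvoSuite’s actual implementation, and the method name and the chosen mutation are assumptions.

```java
// Simplified sketch of the state infection distance for a Replace Arithmetic
// Operator mutant (here: '+' replaced by '-'); not EvoSuite's actual code.
class InfectionDistanceSketch {

    static double infectionDistanceRAO(int a, int b) {
        int original = a + b;  // outcome of the original operation
        int mutated  = a - b;  // outcome of the mutated operation
        // Infection requires the two outcomes to differ: the distance is 0.0 if
        // they differ, and 1.0 if they are identical (no observable state change).
        return (original != mutated) ? 0.0 : 1.0;
    }
}
```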

While implementing this traditional approach, we tried to derive a faithful representation of current practice, which means that there are some optimizations proposed in the literature which we did not include:

  • New test cases are only generated for goals that have not already been covered through collateral coverage of previously created test cases. However, we do not evaluate the collateral coverage of all individuals during the search, as this would add a significant overhead, and it is not clear what effects this would have given the fixed timeout we used in our experiments.

  • When applying the one goal at a time approach, a possible improvement could be to use a seeding strategy (Wegener et al. 2001). During the search, we could store the test data that have good fitness values on coverage goals that are not covered yet. These test data can then be used as a starting point (i.e., for seeding the first generation of a genetic algorithm) in the subsequent searches for those uncovered goals. However, we decided not to implement this, as Wegener et al. (2001) do not provide sufficient details to reimplement the technique, and there is no conclusive data regarding several open questions; for example, a seeding strategy could potentially reduce diversity in the population, and so in some cases it might in fact reduce the overall performance of the search algorithm.

  • The order in which coverage goals are selected might also influence the result. As the literature usually does not specify an order (e.g., Tonella 2004; Harman et al. 2010), we selected the branches in random order. In the context of procedural code, approaches to prioritize coverage goals have been proposed, e.g., based on dynamic information (Wegener et al. 2001). However, the goal of this paper is neither to study the impact of different orders, nor to adapt these prioritization techniques to object-oriented code.

  • In practice, when applying a single goal strategy, one might also bootstrap an initial random test suite to identify the trivial test goals, and then use a more sophisticated technique to address the difficult goals; here, a difficult, unanswered question is when to stop the random phase and start the search.

3.2 Whole Test Suite Generation

For the whole test suite approach, we used exactly the same implementation used by Fraser and Arcuri (2013b, 2015). In the Whole approach, the approach level \(\mathcal {A}(t,x)\) is not needed in the fitness function, as generally all control dependencies are included in the optimization target, and thus the approach level is optimized to 0 for all cases automatically.

The Branch Coverage fitness function to minimize for a set of test cases T on a set of branches B is:

$$ f_{BC}(T,B) = \sum\limits_{b \in B}d(T,b)\;, $$
(5)

where d(T,b) is defined as:

$$ d(T,b) = \left\{ \begin{array}{ll} 0 & \; \text{if branch \textit{b} has been covered,}\\ \nu(d_{min}(t \in T,b)) & \; \text{if the predicate has been}\\ & \; \text{executed at least twice,}\\ 1 & \; \text{otherwise.} \end{array} \right. $$
(6)
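To illustrate, the following sketch computes (5) and (6) for a suite, assuming that for each branch the execution count of its predicate and the minimal branch distance over all tests in T are already known; the Branch type and its fields are illustrative assumptions, not EvoSuite’s API.

```java
import java.util.List;

// Illustrative computation of the whole test suite fitness of (5) and (6).
class WholeSuiteBranchFitness {

    static class Branch {
        boolean covered;     // has branch b been covered by some test in T?
        int executions;      // how often the predicate of b was executed by T
        double minDistance;  // d_min(t in T, b): minimal branch distance over T
    }

    static double nu(double x) {          // normalization into [0,1)
        return x / (x + 1.0);
    }

    static double d(Branch b) {           // equation (6)
        if (b.covered) return 0.0;
        if (b.executions >= 2) return nu(b.minDistance);
        return 1.0;
    }

    static double fitness(List<Branch> branches) {  // equation (5): sum over B
        double sum = 0.0;
        for (Branch b : branches) {
            sum += d(b);
        }
        return sum;
    }
}
```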

Similarly, the Line Coverage fitness function to minimize for a set of test cases T on a set of source code lines L is:

$$ f_{LC}(T,L) = \nu(| L | - | L_{\textit{Covered}} |) + \sum\limits_{b \in B_{\text{CUT}}}d(T,b)\ , $$
(7)

where L is the set of all non-comment lines of code in the CUT, \(L_{\textit{Covered}}\) is the subset of non-comment lines of code covered with the execution of all the tests in T, and \(B_{\text{CUT}}\) is the set of branches that are control dependencies (i.e., branches that have child nodes in the control dependence tree).

The Weak Mutation fitness function optimizes test suites for weak mutation score:

$$ f_{\textit{WM}}(T,\mathcal{M}) = \sum\limits_{\mu \in \mathcal{M}}d_{w}(T,\mu)\ , $$
(8)

where \(\mathcal{M}\) is the set of all mutants generated for the SUT using a set of mutation operators (Fraser and Arcuri 2015) and \(d_{w}(T,\mu)\) guides the search by calculating the state infection distance to a mutant μ:

$$ d_{w}(T,\mu) = \left\{ \begin{array}{ll} \nu(d_{\textit{inf}}(T,\mu)) & \text{if mutant \(\mu\) was reached,}\\ 1 & \text{otherwise.} \end{array} \right. $$
(9)

Note that the total set of coverage goals for any of the fitness functions defined could be considered as different objectives. Instead of linearly combining them in a single fitness score, a multi-objective algorithm could be used. However, a typical class can have hundreds if not thousands of objectives (e.g., branches), making a multi-objective algorithm not ideal due to scalability problems. Specialized many-objective fitness functions may be applicable (Panichella et al. 2015), but have only been considered for branch coverage so far.

3.3 Archive-based Whole Test Suite Generation

The fitness functions shown above all assume that all coverage goals are targeted at the same time. That is, even when we already have a test for a coverage goal, it will still influence the search, as an optimal test suite needs to consist of tests for all goals. This can potentially have adverse effects on the effectiveness of the search: a share of the search budget will be devoted to covering goals that have already been covered in the past, and the search may at times focus less on exploring new, completely uncovered code and more on exploiting existing coverage. For example, even if a newly covered branch is an important control dependency required to cover a large section of the code, if the mutation that led to coverage of this branch “lost” coverage of several other branches already covered in the past, then the fitness function will not reward this individual. On the other hand, a change that reduces the branch distance towards several uncovered goals may result in a better fitness value even if this reduces coverage (this effect can be observed in EvoSuite’s progress bar not increasing monotonically, but sometimes jumping back to lower coverage values).

These issues can be overcome by using a concept that is common in multi-objective optimization: an archive of individuals. During fitness evaluation, each time we discover that a new branch (or line or mutant) has been covered, we add the covering test and the covered goal to an archive. The fitness function is modified to no longer take these covered goals into account. However, it is important that this modification of the fitness function is not done in the middle of the fitness evaluation of one test suite, or the creation of a new population in a standard genetic algorithm approach, as it would make fitness values between individuals inconsistent. Therefore, we modify the fitness function only at the end of an iteration, after the fitness of all individuals has been evaluated. The modification simply consists of removing already covered goals. For example, for branch coverage the fitness function becomes:

$$ f_{BC}(T,B) = \sum\limits_{b \in B \backslash C}d(T,b) \ , $$
(10)

where C is the set of goals already covered in the archive.
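The following sketch illustrates this bookkeeping: goals covered during fitness evaluation are stored together with a covering test, and the set of goals used by the fitness function is pruned only once the whole generation has been evaluated. Class and method names are illustrative assumptions, not EvoSuite’s actual implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative archive bookkeeping for archive-based whole test suite generation.
class ArchiveSketch<Goal, TestCase> {

    private final Map<Goal, TestCase> archive = new HashMap<>();
    private final Set<Goal> covered = new HashSet<>();  // the set C in (10)

    // Called during fitness evaluation whenever a test is found to cover a goal.
    void reportCovered(Goal goal, TestCase test) {
        archive.putIfAbsent(goal, test);
        covered.add(goal);
    }

    // Called once at the end of an iteration, after all individuals have been
    // evaluated, so that fitness values within one generation remain consistent.
    void removeCoveredGoals(Set<Goal> currentTargets) {
        currentTargets.removeAll(covered);  // fitness is then computed over B \ C
    }
}
```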

The impact of not taking already covered goals into account for fitness computation can be demonstrated with the following example. Assume there are two test suites \(T_{1}\) and \(T_{2}\) and a set of branch goals \(B=\left\{b_{1},b_{2}\right\}\), such that \(d(T_{1}, b_{1})=0\), \(d(T_{1}, b_{2})=0.7\), and \(d(T_{2}, b_{1})=1\), \(d(T_{2}, b_{2})=0.1\). Using the fitness function defined in (5), we have \(f_{BC}(T_{1}, B)=0+0.7=0.7\) and \(f_{BC}(T_{2}, B)=1+0.1=1.1\). Hence, \(T_{1}\), with a lower fitness value, is a fitter individual than \(T_{2}\) and has higher chances of remaining in the population in further generations, even though \(T_{2}\) is much closer than \(T_{1}\) to covering branch \(b_{2}\). In contrast, by using the archive-based fitness function definition (10), the branch \(b_{1}\) is omitted from the fitness computation and we have \(f_{BC}(T_{1}, B)=0.7\) and \(f_{BC}(T_{2}, B)=0.1\). Therefore, \(T_{2}\) is the fitter individual, and by selecting it the search is more likely to evolve a test suite covering \(b_{2}\) in future generations.

At the end of the search, by definition the best individual of the genetic algorithm population will not cover any of the remaining coverage objectives (it may cover any number of other goals already covered in the archive). Thus, the test suite used by EvoSuite for its postprocessing steps (e.g., minimization, assertion generation) is not taken from the search population, but is a test suite consisting of all the tests in the archive.

The simplicity of the fitness functions for whole test suite generation is based on the assumption that all coverage goals are targeted at the same time. That is, the approach level is not included in the fitness function because all control dependencies are part of the optimization. If this assumption no longer holds and the fitness function only targets a subset of the coverage goals, then this potentially results in a lack of guidance, for example if some of the control dependencies of code not yet covered are no longer optimized for. There are different ways to address this issue; for example, the approach level could be included in the fitness function to provide the missing guidance. A simpler solution is provided by ensuring that the genetic material covering the control dependencies is not lost, even if the coverage goals are removed from the fitness function. EvoSuite achieves this in two ways. In the simplest case, EvoSuite keeps the tests in the population even after removing a coverage goal. Moreover, EvoSuite ’s search operators further try to exploit the archive by creating, with a certain probability, new tests by mutating tests from the archive rather than always adding new random tests. These two simple mechanisms suffice, although more intelligent ways of exploiting the archive, e.g., alternative seeding strategies, could be explored in the future.
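A minimal sketch of this archive-based seeding is shown below; the sampling probability and all names are assumptions for illustration and do not reflect EvoSuite’s actual operators.

```java
import java.util.List;
import java.util.Random;

// Illustrative sketch: when a new test is needed during mutation of a suite,
// sometimes reuse genetic material from the archive instead of a random test.
class ArchiveSeedingSketch {

    interface TestFactory<T> {
        T randomTest();        // generate a fresh random test
        T mutate(T original);  // return a mutated copy of an existing test
    }

    static final double P_SAMPLE_FROM_ARCHIVE = 0.5;  // assumed probability

    static <T> T newTest(List<T> archivedTests, TestFactory<T> factory, Random rnd) {
        if (!archivedTests.isEmpty() && rnd.nextDouble() < P_SAMPLE_FROM_ARCHIVE) {
            T seed = archivedTests.get(rnd.nextInt(archivedTests.size()));
            return factory.mutate(seed);  // keep covering tests' genetic material
        }
        return factory.randomTest();      // otherwise, a fresh random test
    }
}
```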

Besides the fitness function, the search operators (e.g., mutation of tests) are also intended to optimize for all coverage goals. For example, in EvoSuite, if a new statement is inserted during mutation or generation of a random test, then with a probability of 50 % this is a call on an existing object in the test, else the new statement is a call to a method of the CUT. The choice of CUT method is made randomly with a uniform distribution, based on the assumption that all methods need to be covered. However, once all branches (or lines or mutants) in a method have been covered, then inserting a call to the method will not lead to an improvement of the fitness value (unless the call leads to a state change, for which the other 50 % probability of call insertion is intended). Therefore, as an optimization of the search operators, we modify the selection of CUT methods to a random choice out of only those methods that are not yet fully covered.
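A sketch of this modified insertion operator is given below; the types, and the fallback to a uniform choice when every method is fully covered, are illustrative assumptions rather than EvoSuite’s actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative sketch: choose the CUT method to call only among methods whose
// coverage goals (branches, lines or mutants) are not yet all covered.
class MethodSelectionSketch {

    interface CutMethod {
        boolean isFullyCovered();  // are all goals within this method covered?
    }

    static CutMethod chooseMethod(List<CutMethod> cutMethods, Random rnd) {
        List<CutMethod> candidates = new ArrayList<>();
        for (CutMethod m : cutMethods) {
            if (!m.isFullyCovered()) {
                candidates.add(m);
            }
        }
        // If everything is already covered, fall back to all methods.
        List<CutMethod> pool = candidates.isEmpty() ? cutMethods : candidates;
        return pool.get(rnd.nextInt(pool.size()));
    }
}
```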

4 Empirical Study

In this paper, we carried out an empirical study to compare the traditional one goal at a time approach (OneGoal ), the whole test suite approach (Whole) and the whole test suite approach using the archive (Archive). For each of the test coverage criteria Branch, Line and Mutation, we aim at answering the following research questions:

RQ1: Are there coverage goals in which OneGoal performs better than Whole?

RQ2: How many coverage goals found by Whole get missed by OneGoal?

RQ3: Which factors influence the relative performance of Whole and OneGoal?

RQ4: How does using an archiving solution, Archive, influence the performance of Whole?

4.1 Experimental Setup

For the experiments, we randomly chose 100 Java classes from the SF100 corpus (Fraser and Arcuri 2012), which is a collection of 100 projects randomly selected from the SourceForge open source software repository. We randomly selected from SF100 to avoid possible bias in the selection procedure, and to have higher confidence to generalize our results to other Java classes as well. The domain of the selected classes varies; for example, there are classes dealing with lexical analysis (XPathLexer in project 24_saxpath), visual components (e.g., SiteListPanel in 35_corina), multi-threading (BlockThread in 78_caloriecount) and complex differential geometry operations (LocalDifferentialGeometry in 89_jiggler). In total, the selected 100 classes contain 2,383 branches, 3,811 lines and 12,473 mutants, which we consider as test goals.

The SF100 corpus contains more than 11,000 Java classes. We only used 100 classes instead of the entire SF100 due to the type of experiments we carried out: a large number of classes, multiple configurations and a large number of repetitions. In particular, for each class in the selected sample we ran EvoSuite in three modes:

  • (a)  The one goal at a time approach (OneGoal )

  • (b)  The whole test suite approach (Whole )

  • (c)  The archive-based whole test suite approach (Archive)

The three modes were used in combination with the three test coverage criteria under study, i.e., Branch, Line and Mutation, which are implemented as fitness functions in EvoSuite, giving a total of nine configurations. Each experiment consists of running one particular configuration on one of the sampled classes. To take randomness into account, each experiment was repeated 500 times, for a total of 100×9×500=450,000 runs of EvoSuite.

When choosing how many classes to use in a case study, there is always a tradeoff between the number of classes and the number of repeated experiments. On one hand, a higher number of classes helps to generalize the results. On the other hand, a higher number of repetitions helps to study the differences on specific classes in more detail. For example, given the same budget to run the experiments, we could have used 10,000 classes and 5 repetitions. However, as we want to study the “corner cases” (i.e., when one technique completely fails while the compared one does produce results), we put more emphasis on the number of repetitions to reduce the random noise in the final results.

Each experiment was run for up to two minutes (the search on a class was also stopped once 100 % coverage was achieved). Therefore, in total the entire case study had an upper bound of 450,000×2/(24×60)=625 days of computational resources, which required a large cluster to run. When running the OneGoal approach, the search budget (i.e., the two minutes) is equally distributed among the coverage goals in the SUT. When the search for a coverage goal finishes earlier (or a goal is accidentally covered by a previous search), the remaining budget is redistributed among the other goals still to be covered.
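The following sketch illustrates this allocation scheme, assuming goals are processed sequentially and the remaining time is re-divided among the goals still to be covered; names and details are illustrative assumptions, not EvoSuite’s actual scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative per-goal budget allocation for the OneGoal approach.
class BudgetAllocationSketch {

    static void runOneGoal(List<String> goals, long totalBudgetMillis) {
        long remainingBudget = totalBudgetMillis;
        List<String> remainingGoals = new ArrayList<>(goals);

        while (!remainingGoals.isEmpty() && remainingBudget > 0) {
            String goal = remainingGoals.remove(0);
            // Split what is left evenly over the goals still to be covered.
            long perGoalBudget = remainingBudget / (remainingGoals.size() + 1);

            long start = System.currentTimeMillis();
            searchFor(goal, perGoalBudget);  // may finish (or cover the goal) early
            remainingBudget -= System.currentTimeMillis() - start;

            // Goals covered collaterally would also be removed from remainingGoals.
        }
    }

    static void searchFor(String goal, long budgetMillis) {
        // placeholder for a single-goal genetic algorithm run
    }
}
```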

Observe that the experimental setup used in this journal extension differs from the one used for the experiments reported in the original paper (Arcuri and Fraser 2014). First, the number of configurations under study increased from two (OneGoal vs Whole, both for Branch Coverage only) to nine ({OneGoal, Whole, Archive} × {Branch Coverage, Line Coverage, Weak Mutation}). To accommodate this setup, fewer repetitions were run for each configuration, namely 500 instead of 1,000. Furthermore, the search budget used for each run was reduced from three to two minutes. Importantly, the EvoSuite tool has evolved, bugs were fixed and improvements were made; hence it is expected that the exact coverage values may vary slightly with respect to the earlier study.

Given the nature of the Archive approach, it is expected to deliver larger benefits when applied to more complex classes. To investigate whether that is actually the case, we conducted a second experiment of reduced magnitude targeting a more challenging set of classes: we selected from each project in the SF100 corpus the class with the highest number of branches (the full list with details is given in Table 8 in the Appendix). The three approaches were evaluated for each of the test criteria on this case study as well, but only 30 repetitions were run. As discussed before, results obtained with a low number of repetitions may be influenced by randomness. However, in this case even this small number of repetitions is expected to suffice to demonstrate the effects of using the Archive approach, as these experiments are only used to provide additional support for our main results. In total, this added a further 100×9×30=27,000 runs of EvoSuite.

To properly analyse the randomized algorithms used in this paper, we followed the guidelines proposed by Arcuri and Briand (2014). In particular, when branch coverage values were compared, statistical differences were measured with the Wilcoxon-Mann-Whitney U-test, where the effect size was measured with the Vargha-Delaney \(\hat {A}_{12}\). An \(\hat {A}_{12}=0.5\) means no difference between the two compared algorithms.

When checking how often a goal was covered, because whether or not a goal is covered is a binary variable, we used the Fisher exact test. As effect size, we used the odds ratio, with a δ=1 correction to handle zero occurrences. When there is no difference between two algorithms, the odds ratio is equal to one. Note that in some of the graphs we show the natural logarithm of the odds ratios; this is done only to simplify their representation.
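For illustration only, a corrected odds ratio of this kind could be computed as in the sketch below, from the number of runs in which each technique did and did not cover a goal; this mirrors the description above and is not the actual analysis script used for the study.

```java
// Illustrative odds ratio with a delta correction for cells containing zero.
class OddsRatioSketch {

    static double oddsRatio(int coveredA, int missedA, int coveredB, int missedB) {
        double delta = 1.0;  // correction to handle zero occurrences
        double oddsA = (coveredA + delta) / (missedA + delta);
        double oddsB = (coveredB + delta) / (missedB + delta);
        return oddsA / oddsB;  // > 1: technique A has higher odds of covering the goal
    }

    public static void main(String[] args) {
        // e.g., one technique covers a goal in 500/500 runs, the other in 0/500:
        double or = oddsRatio(500, 0, 0, 500);
        System.out.println(or);            // 251001.0
        System.out.println(Math.log(or));  // natural logarithm, as used in the plots
    }
}
```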

4.2 RQ1: Are There Coverage Goals in Which OneGoal Performs Better than Whole ?

Tables 1, 2 and 3 show the average coverage results obtained for each of the test criteria on the selected 100 Java classes. The results in Table 1 confirm previous results (Fraser and Arcuri 2013b): the Whole test suite approach leads to higher branch coverage. In this case, the average branch coverage increases from 63 % to 78 %, with a 0.67 effect size. The Whole approach leads to significantly higher branch coverage for 45 % of the classes, and to lower coverage in none of them.

Table 1 For each class with statistically significant results, the table reports the average Branch Coverage obtained by the OneGoal approach and by the Whole approach
Table 2 For each class with statistically significant results, the table reports the average Line Coverage obtained by the OneGoal approach and by the Whole approach
Table 3 For each class with statistically significant results, the table reports the average Weak Mutation Score obtained by the OneGoal approach and by the Whole approach

Results for Line Coverage are also positive: Table 2 shows that the Whole approach leads to a 16 % average increase, from 62 % to 78 % with a 0.69 effect size. We observed an individual statistically significant improvement for 47 classes. Whereas there is no class with significantly lower line coverage, class JSListSubstitution in project 85_shop is worth looking into since it shows a decrease in coverage with an effect size of 0.47. Figure 1 lists the relevant code from this class.

Fig. 1 Excerpt of class JSListSubstitution: Using test suite sizes as a secondary optimization objective makes it harder for Whole to generate a data structure that satisfies the condition of the if-statement, hence often failing to cover the “then” branch

By default, EvoSuite uses the length of the test suite as a secondary optimization objective for Whole test suite generation. The general advantage is that resulting test suites tend to be shorter when this secondary objective is used. However, optimizing for the test suite’s size can sometimes have undesired effects, e.g., less randomness and less diversity. For class JSListSubstitution in particular, the use of the test suite size as secondary objective seems to hamper the generation of a data structure that can satisfy the condition if(s1!=null), as the fitness function in general provides no incentive to add more values into the vector data structure. To verify this conjecture, we ran EvoSuite 30 times for this configuration (class JSListSubstitution, mode Whole and criterion Line Coverage) with and without the secondary objective. The result of this comparison indeed indicates that not using the secondary objective leads to higher Line Coverage for Whole on this class, with an effect size of 0.6 and a p-value of 0.04.

The largest increase in coverage is observed for the Weak Mutation Testing criterion. As shown in Table 3, Whole significantly outperforms OneGoal on 69 classes, leading to an average increase of 0.41 (0.77−0.36) with an effect size of 0.83. Recall that OneGoal has an inherent limitation related to the number of goals and the search budget distribution. The overall search budget, two minutes in our experiments, must be evenly distributed to target each coverage goal at a time. When the number of goals is high, as in our sample where 125 mutants are generated per class on average, the small time budget assigned to each goal (on average less than one second per mutant) is often not sufficient for a genetic algorithm to find any solution at all, i.e., it results in zero coverage.

To study the difference between OneGoal and Whole at a finer grained level, Table 4 shows for how many coverage goals (i.e., branches, lines and mutants, respectively) one approach is better than the other. These results indicate that there are only three goals for which OneGoal leads to better Branch Coverage results; however, all of them are covered by Whole at least once. A closer look reveals that all three goals are in the same class JSONStringer, which in fact is fully covered by both techniques in all their runs; however, six data points for Whole are missing, which is sufficient for the Fisher test to claim a significant difference. This may happen with experiments of this type, e.g., if there are problems on the cluster computer, or competing processes. This is the reason why we used as many repetitions as possible for our analysis. As this is a random effect, one would expect the same odds regardless of the technique, which means that the number of cases where Whole appears better merely because of crashes in OneGoal runs should, on average, equal the number of cases where OneGoal appears better because of crashes in Whole runs.

Table 4 For each criterion, we report how often (number of goals) the Whole approach is better (higher effect size) than OneGoal, how often they are equivalent, and how often it is OneGoal that is better

The results for Line Coverage and Weak Mutation Testing are similar. For Line Coverage, OneGoal is significantly better than Whole at covering four lines, but in this case Whole never manages to cover two of them. Also for Weak Mutation Testing, there is one mutant that is only covered by the OneGoal approach. These lines and the mutant are all in the class BlockThread (where all techniques resulted in 500 runs), which has only a single conditional branch; all other methods contain just sequences of statements (in EvoSuite, a method without conditional statements is counted as a single branch, based on the control flow graph interpretation). However, the class spawns a new thread, and several of the methods synchronize on this thread (e.g., by calling wait() on the thread). EvoSuite uses a timeout of five seconds for each test execution, and any test case or test suite that contains a timeout is assigned the maximum (worst) fitness value and is not considered a valid solution in the final coverage analysis. In BlockThread, many tests lead to such timeouts, and a possible conjecture for the worse performance of the Whole approach is that the chances of having an individual test case without a timeout are simply higher than the chances of having an entire test suite without timeouts.


4.3 RQ2: How Many Coverage Goals Found by Whole Get Missed by OneGoal ?

Let us now quantify the goals on which Whole outperformed OneGoal. There are 11,851 goals (out of 18,666) for which Whole gives statistically better results: 1,239 branches, 1,978 lines and 8,634 mutants. For 4,055 of them (385 branches, 468 lines and 3,202 mutants), the OneGoal approach never managed to produce any result in any of the 500 runs. In other words, even if there are some (i.e., three) goals that only OneGoal can cover, there are many more goals that only Whole can cover.


4.4 RQ3: Which Factors Influence the Relative Performance of Whole and OneGoal ?

Having shown that the Whole approach leads to higher coverage, it is important to investigate the conditions under which this improvement is obtained. For each coverage criterion and for each goal, we calculated the odds ratio between Whole and OneGoal (i.e., we quantified the odds that Whole covers the goal compared to OneGoal). For each odds ratio, we studied its correlation with three different properties: (1) the \(\hat{A}_{12}\) effect size between Whole and OneGoal on the class the goal belongs to; (2) the raw average coverage obtained by OneGoal on the class the goal belongs to; and, finally, (3) the size of the class, measured as the number of coverage targets (branches, lines, mutants) in it. Table 5 shows the results of these correlation analyses for the three coverage criteria.

Table 5 For each criterion, correlation analyses between the odds ratios for each goal and three different properties

We used Pearson’s r correlation coefficient, as well as Kendall’s τ and Spearman’s ρ. The strengths of correlation are interpreted as follows: negligible (.01 to .19), weak (.20 to .29), moderate (.30 to .39), strong (.40 to .69), very strong (.70 to 1), and analogously for negative values (Kotrlik and Williams 2003). All three correlation coefficients are generally in agreement, except for the correlation with the OneGoal coverage for Branch Coverage and Line Coverage: there, it is weak/moderate for Pearson’s r and negligible for the other two.

There is a positive correlation between the odds ratios and the \(\hat {A}_{12}\) effect sizes for the three criteria. This is expected: on a class in which the Whole approach obtains higher coverage on average, it is also more likely that Whole will have higher coverage on each goal in isolation. This correlation is moderate for Branch Coverage and Line Coverage , at 53 % and 40 % respectively, and weak for Weak Mutation Testing, at only 23 %.

On classes with many infeasible branches (or branches that are too difficult to cover for both Whole and OneGoal), one could expect better results for Whole, as it is not negatively affected by infeasible branches (Fraser and Arcuri 2013b). It is not possible to determine for all branches whether they are feasible or not. However, we can estimate the difficulty of a class by the code coverage obtained for the considered coverage criterion. Furthermore, one would expect better results of the Whole approach on larger, more complex classes. This is confirmed by the negative correlation of the odds ratios with the obtained average OneGoal coverage for Branch Coverage and Line Coverage. However, this is not the case for Weak Mutation Testing, which is the criterion with the most targets. Similarly, a higher number of targets (which likely makes a class more difficult) leads to better performance for Whole. This is shown by a positive 42 % correlation for Branch Coverage and 17 % for Line Coverage. However, again we see the opposite trend for Weak Mutation Testing, i.e., a negative −19 % correlation.

The analyses presented in Table 5 numerically quantify the correlations between the odds ratios and the different studied properties. To study them in more detail, we present scatter plots for the \(\hat{A}_{12}\) effect sizes, for the OneGoal average coverage and for the number of target goals. Figure 2 presents these plots for Branch Coverage, Fig. 3 for Line Coverage and Fig. 4 for Weak Mutation Testing.

Fig. 2 For Branch Coverage, scatter plots of the (logarithm of) odds ratios compared to the \(\hat{A}_{12}\) effect sizes (a), average class coverage obtained by OneGoal (b), and the total number of coverage goals (c)

Fig. 3 For Line Coverage, scatter plots of the (logarithm of) odds ratios compared to the \(\hat{A}_{12}\) effect sizes (a), average class coverage obtained by OneGoal (b), and the total number of coverage goals (c)

Fig. 4 For Weak Mutation Testing, scatter plots of the (logarithm of) odds ratios compared to the \(\hat{A}_{12}\) effect sizes (a), average class coverage obtained by OneGoal (b), and the total number of mutants (c)

Figures 2a, 3a and 4a are in line with the positive correlation values shown in Table 5 for Whole vs. OneGoal. In these figures, there are clear major clusters at \(\hat{A}_{12}\) values close to 1 for logarithms of odds greater than 0. These are classes where Whole is very likely to lead to better results, although the number of coverage goals on which it improves is small.

Regarding the comparisons with the difficulty of a class, Fig. 2b for Branch Coverage and Fig. 3b for Line Coverage show a similar trend. Most classes are located at low values of OneGoal coverage, with high odds ratios. For higher coverage, the odds ratios decrease. However, in both cases there is a substantial number of classes for which OneGoal achieves high coverage (the clusters at the top-left borders of those figures). This is not the case for Weak Mutation Testing, as shown in Fig. 4b, where such a cluster does not appear. This explains the correlation values in Table 5. For Weak Mutation Testing, there is no simple class for which both OneGoal and Whole achieve very good results and which would skew the correlations by creating that kind of cluster. Furthermore, for many classes OneGoal achieves very low coverage (recall Table 4). By looking at the number of targets in Fig. 4c, we can see that there are many classes with a high number of mutants. On these classes, we can hence infer that OneGoal achieves low coverage regardless of whether those targets are difficult or not. This is due to how the search budget is split: trying to give the same budget to all targets results in very little budget per target when there are many of them, so little that even simple targets are not covered. It would likely be more effective to use a higher budget per target, but address only a subset of them. In other words, the effectiveness of OneGoal on Weak Mutation Testing on complex classes is not a good indicator of how Whole will perform on those classes.

Still, there is a negative −19 % correlation between the odds ratios and the number of targets for Weak Mutation Testing (recall Table 5). In Fig. 4c, there are three main clusters: (a) odds ratios between 0 and 10 for fewer than 1,000 mutants, (b) similar odds ratios for approximately 4,000 mutants, and (c) very high odds ratios (above 30) for a low number of targets. The odds ratios in cluster (b) are slightly smaller than in (a), which, together with (c), leads to the negative correlation. Although the number of mutants in these classes does not seem particularly high, it appears that the overall search budget per mutant is nevertheless too low in the OneGoal approach. For example, even a lower number of mutants can be problematic if the time EvoSuite requires to generate the tests is high, or if the execution time of the tests is high. For instance, class wheel.components.Block has only 46 mutants, yet EvoSuite struggles to create dependency objects: when creating dependency objects, EvoSuite calls methods/constructors to create them, and then recursively creates objects for the parameters of these calls. In the case of wheel.components.Block the depth of this recursion frequently hits the maximum (which is 10 by default), in which case EvoSuite aborts the recursion and starts over with creating the same object. As a result, in the time given per mutant when targeting individual mutants, EvoSuite barely manages to generate enough tests to even fill the initial population, whereas Whole can spend more time on generating the initial population.


To graphically summarize the results obtained in the experiments conducted to address RQ1-RQ3, Fig. 5 presents boxplots of the effect sizes in the comparison between OneGoal and Whole for Branch Coverage, Line Coverage and Weak Mutation Testing, which show that Whole overall performed better than OneGoal for all three criteria.

Fig. 5 For each criterion, boxplot of the effect size in the comparison between OneGoal and Whole

4.5 RQ4: How Does Using an Archiving Solution, Archive, Influence the Performance of Whole ?

In order to answer RQ4, we now compare the Archive and the Whole test suite generation approaches; Table 6 shows the results of this comparison. There are considerably more cases where using Archive is significantly better than Whole than cases where it is worse, demonstrating the beneficial effect of the archive. However, there are also many cases where Archive achieves worse results than Whole. The fact that this happens less often for Weak Mutation Testing suggests that the benefit achieved by Archive is larger for more complex classes. To verify this conjecture, we replicated our experiments comparing Archive and Whole on a selection of classes with higher complexity than the random sample used so far. The results of these analyses are shown in Table 7, and here the number of cases where Archive is significantly better is much higher in general. This is also confirmed by Fig. 6, which summarizes the effect sizes between Archive and Whole in boxplots. Clearly, Archive is generally better, and for larger classes the benefit is larger. More per-class details are given in the Appendix, in Tables 9, 10, 11, 12, 13 and 14.

Table 6 For each criterion, we report how often (number of goals) the Archive approach is better (higher effect size) than Whole, how often they are equivalent, and how often it is Whole that is better
Table 7 For each criterion, and for the selection of the most complex classes (one per SF100 project), we report how often (number of goals) the Archive approach is better (higher effect size) than Whole, how often they are equivalent, and how often it is Whole that is better

The observation that there are cases where using an archive has a negative effect suggests that the search operators need to be further adapted to accommodate the archive. In particular, when mutating test suites, EvoSuite either changes existing tests or adds new tests. New tests are generated randomly, and over time the search would be expected to focus on the more difficult coverage goals when using an archive. However, random tests are unlikely to cover these goals. The minimization used as a secondary objective would thus gradually remove these additional tests, and the search may prematurely converge on sub-optimal individuals. Using specialized search operators helps to alleviate this problem by sampling tests from the archive and mutating them instead of constantly generating random tests. Other optimizations, such as seeding strategies, could further increase the benefits of using the archive.

Fig. 6 Boxplots of effect sizes in the comparisons between Whole and Archive for the two selections of classes (random sample of 100 classes, and selection of 100 classes, one per SF100 project, with the highest number of branches)

Extended Search Budget

Intuitively, it is conceivable that the improvements observed for Archive compared to Whole might be less apparent with an increased search budget. The Archive approach builds up an artificial test suite by collecting test cases across evolving test suites. Given a more generous amount of time, could Whole converge towards a best individual with coverage similar to the one produced by Archive? To address this question, we ran an experiment on a selection of the 10 classes with the highest number of branches in SF100 (the ten largest classes in Table 8 in the Appendix). For this experiment, we used an extended search budget of 10 minutes and 50 repetitions. Figure 7 compares the performance of Archive and Whole on Branch Coverage, Line Coverage and Weak Mutation Testing over time. The plots show that the performance improvement due to the archive persists over time for Branch Coverage and Line Coverage, and increases in the case of Weak Mutation Testing. It is important to note, however, that coverage on these large classes is not saturated even with a 10 minute budget. A plausible conjecture is that Whole would have the opportunity to catch up once Archive achieves full coverage; before that, though, a steady advantage can be expected with the Archive approach.

Fig. 7 Comparison of the performance of Whole and Archive for the 10 classes with the highest number of branches overall (at most one per SF100 project), using an increased search budget of 10 min

Figure 7 shows the manifest improvements in coverage achieved by the Archive approach, but where does this overall increase in performance stem from? While the data we collected does not comprise specific timing information about the phases of the search, we take the number of generations and fitness evaluations as proxy metrics. In this experiment with extended search budget we observed that on average the Archive approach evolved more generations (4,186 vs. 3,020) and performed more fitness evaluations (209,370 vs. 151,005) than the Whole approach. Our interpretation of these results is that in the Archive approach, individuals tend to be smaller and their fitness evaluation cheaper, allowing for a more exhaustive search than in the Whole approach.


5 Threats to Validity

Threats to internal validity might come from how the empirical study was carried out. To reduce the probability of having faults in our testing framework, it has been carefully tested. But it is well known that testing alone cannot prove the absence of defects. Furthermore, randomized algorithms are affected by chance. To cope with this problem, we ran each experiment 500 times, and we followed rigorous statistical procedures to evaluate their results. To enable fair comparisons and to avoid possible confounding factors when different tools are used, the three approaches, OneGoal, Whole and Archive , were implemented in the same tool (i.e., EvoSuite ). Furthermore, the same default values were used for all relevant parameters of the tool, e.g., population size, mutation rates and test length.

When addressing the difficulty of covering certain goals in our experiments, we rely on the underlying assumption that a high number of branches in a class implies that it contains more difficult coverage goals. In practice, the larger a class, the more complex the behaviour it often represents, and the more complex the behaviour, the more difficult it gets for an automated technique to recreate the specific states and configurations needed to cover certain branches. Nevertheless, since this assumption may not always hold, it represents a threat to the construct validity of our study.

There is a threat to external validity regarding the generalization to other types of software, which is common for any empirical analysis. Because of the large number of experiments required (in the order of hundreds of days of computational resources), we only used 100 classes for our in-depth evaluations. These classes were randomly chosen from the SF100 corpus, which is itself a random selection of 100 projects from SourceForge. We only experimented on Java software using branch, line and mutation coverage. Whether our results generalize to other programming languages and testing criteria is a matter of future research.

6 Conclusions

Existing research has shown that the whole test suite approach can lead to higher code coverage (Fraser and Arcuri 2013b, 2015). However, there was reasonable doubt whether it would still perform better on particularly difficult coverage goals when compared to a more focused approach.

To shed light on this potential issue, in this paper we performed an in-depth analysis to study whether such cases do indeed occur in practice. Based on a random selection of 100 Java classes for which we aimed at automating test generation for different testing criteria (branch, line and mutation coverage) with the EvoSuite tool, we found that there are indeed coverage goals for which the whole test suite approach leads to worse results. However, these cases are very few compared to the cases in which better results are obtained (nearly two orders of magnitude in difference), and all coverage goals that were covered by OneGoal but not covered by Whole at all turned out to be special cases, rather than general deficiencies of the approach.

Our experiments also showed that the use of an archive does lead to better results on average, but it may have negative side-effects on some testing targets, as the use of an archive requires specialized search operators that exploit it; designing such operators will require further research. Furthermore, the incorporation of the archive into the Whole approach raises the question of whether it can still be regarded as an evolution of “test suites”. Whereas from the practical point of view we have demonstrated the usefulness of this optimization, there might be theoretical implications related to constructing the resulting test suite incrementally from multiple test suites instead of just taking the fittest individual in the final population. Since these aspects are beyond the scope of this paper, we plan to investigate them in future work.

The results presented in this paper provide more support to the validity and usefulness of the whole test suite approach in the context of test data generation. Whether such an approach could be successfully adapted also to other search-based software engineering problems and whether the whole approach is more suitable than the traditional per-goal approach for industrial testing practitioners will be a matter of future research.

To learn more about EvoSuite , visit our website at: http://www.evosuite.org.