1 Introduction

Various search-based techniques have been introduced to automate different white-box test generation activities, e.g., unit testing (Fraser and Arcuri 2013b, 2011), integration testing (Derakhshanfar et al. 2020), or system-level testing (Arcuri 2019). Depending on the testing level, each of these approaches utilizes dedicated fitness functions to guide the search process and produce a test suite satisfying given criteria (e.g., line coverage, branch coverage, etc.).

Fitness functions typically rely on control flow graphs (CFGs) to represent the source code of the software under test (McMinn 2004). Each node in a CFG is a basic block of code (i.e., a maximal linear sequence of statements with a single entry point, a single exit point, and no internal branch), and each edge represents a possible execution flow between two blocks. Two well-known heuristics are usually combined to achieve high line and branch coverage: the approach level and the branch distance (McMinn 2004). The former measures the distance between the execution path of the generated test and a target basic block (i.e., a basic block containing a statement to cover) in the CFG. The latter measures, using a set of rules, the distance between an execution and the coverage of a true or false branch of a particular predicate in a branching basic block of the CFG.

Both the approach level and the branch distance assume that only a limited number of basic blocks (i.e., control dependent basic blocks (Allen 1970)) can divert the execution path away from a target statement (e.g., when the target basic block is the true branch of a conditional statement). However, basic blocks are not atomic due to the presence of implicit branches (Borba et al. 2010), i.e., branches occurring due to the exceptional behavior of instructions. As a consequence, any basic block between the entry point of the CFG and the target basic block can impact the execution of the target basic block. For instance, a generated test case may stop its execution in the middle of a basic block because of a runtime exception thrown by one of the statements of that basic block. In these cases, the search process does not benefit from any further guidance from the approach level and branch distance.
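As a minimal illustration (hypothetical Java code, not taken from any subject studied in this paper), the two statements below belong to a single basic block, yet the first one introduces an implicit branch that can prevent the second one from ever being executed:

```java
// Both statements form one basic block: there is no explicit branch between them.
// Still, the call x.getName() throws a NullPointerException whenever x is null,
// creating an implicit branch that stops the execution before the target statement.
String name = x.getName();   // implicit branch: may throw NullPointerException
int length = name.length();  // target statement, never reached if the exception occurs
```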

Fraser and Arcuri (2015a) introduced testability transformation for unit testing, which instruments the code to guide the unit test generation search to cover implicit exceptions happening in the class under test. However, this approach does not guide the search process in scenarios where an implicit branch happens in another class called by the class under test. This is due to the extra cost added to the search process by calculating and monitoring implicit branches in all the classes coupled to the class under test. For instance, the class under test may be heavily coupled with other classes in the project, making the detection of implicit branches in all of these classes expensive.

In contrast, other test case generation scenarios, like crash reproduction, aim to cover only a limited number of paths, and thereby we only need to analyse a limited number of basic blocks (Chen and Kim 2015; Xuan et al. 2015; Nayrolles et al. 2015; Rößler et al. 2013; Soltani et al. 2018). Current crash reproduction approaches rely on information about a reported crash (e.g., a stack trace, a core dump, etc.) to generate a crash reproducing test case.

Among these approaches, search-based crash reproduction (Rößler et al. 2013; Soltani et al. 2018) takes as input a stack trace to guide the generation process. More specifically, the statements pointed to by the stack trace act as target statements for the approach level and branch distance. Hence, current search-based crash reproduction techniques suffer from a lack of guidance in cases where the involved basic blocks contain implicit branches (which is common when trying to reproduce a crash).

In our prior work we have introduced a novel secondary objective called Basic Block Coverage (BBC) to address the guidance problem in crash reproduction (Derakhshanfar et al. 2020). The secondary objective guides the search process to differentiate two generated tests with the same fitness values (here, same approach level and branch distance). This paper extends our prior work on BBC to the more general unit test case generation context.

BBC helps the search process to compare two generated test cases with the same distance (according to approach level and branch distance) to determine which one is closer to the target statement. In this comparison, BBC analyzes the coverage level, achieved by each of these test cases, of the basic blocks in between the closest covered control dependent basic block and the target statement.

To assess the impact of BBC on search-based unit test generation, we implemented BBC in EvoSuite (Fraser and Arcuri 2011), the state-of-the-art tool for search-based unit test generation, and evaluated its performance against the classical DynaMOSA (Panichella et al. 2018b) for various activation probabilities of BBC (11 configurations in total). We applied these eleven configurations to 219 classes under test selected from the latest version of Defects4J, v.2.0.0 (Just et al. 2014), a collection of existing faults. We compare the performance in terms of effectiveness for branch coverage, weak mutation score, output coverage, and real fault detection capabilities.

Our results show that BBC improves the branch coverage of the generated tests when activated as a secondary objective in DynaMOSA. Although small on average (from 74.5% for DynaMOSA up to 76.1% with BBC), this improvement in branch coverage leads to an increase of the average output domain coverage (from 54.2% for DynaMOSA up to 55.5% with BBC) and implicit runtime exception coverage (from 75.1% for DynaMOSA up to 80.3% with BBC), and of the diversity of runtime states (denoted by an increase of the average weak mutation score from 73.2% for DynaMOSA up to 74.6% with BBC). Our statistical analysis confirms that this improvement is systematic across all BBC configurations. Activating BBC also significantly improves, with a large effect size, the detection rate of 3 out of 92 real faults.

Similarly, to assess the impact of BBC on search-based crash reproduction, we re-implemented the existing STDistance (Rößler et al. 2013) and WeightedSum (Soltani et al. 2018) fitness functions and empirically compared their performance with and without using BBC (4 configurations in total). We applied these four crash reproduction configurations to 124 hard-to-reproduce crashes introduced in JCrashPack (Soltani et al. 2020), a crash benchmark used by previous crash reproduction studies (Derakhshanfar et al. 2020). We compare the performance in terms of effectiveness in crash reproduction ratio (i.e., the percentage of times that an approach can reproduce a crash) and efficiency (i.e., the time required to reproduce a crash).

Our results show that BBC significantly improves the crash reproduction ratio over the 30 runs in our experiment for respectively 10 and 4 crashes when compared to using STDistance and WeightedSum without any secondary objective. Also, BBC helps these two fitness functions to reproduce 3 (for STDistance) and 3 (for WeightedSum) crashes that could not be reproduced without any secondary objective. Besides, on average, BBC increases the crash reproduction ratio of STDistance and WeightedSum by 9% and 4.5%, respectively. Applying BBC also significantly reduces the time consumed for crash reproduction guided by STDistance and WeightedSum in 56 (45.1% of cases) and 54 (43.5% of cases) crashes, respectively. In cases where BBC has a significant impact on efficiency, this secondary objective improves the average efficiency of STDistance and WeightedSum by 71.7% and 68.7%, respectively.

The remainder of this paper is organized as follows: Section 2 reports the background and related work on CFG-based guidance. Section 3 describes our novel BBC secondary objective and how it can be used for search-based crash reproduction and search-based unit test generation. Section 4 describes our evaluation to assess the importance of implicit branches (RQ 0) and the impact of BBC on search-based unit test generation (RQ 1) and search-based crash reproduction (RQ 2). Section 5 presents our results on 219 classes under test selected from the latest version of Defects4J and 124 hard-to-reproduce crashes from JCrashPack. Sections 6 and 7 discuss our results and their implications for search-based test case generation, and Section 8 concludes the paper.

2 Background and Related Work

2.1 Coverage Distance Heuristics

Listing 1 Method fromMap from XWIKI version 8.1 (Soltani et al. 2020)

Many structural search-based test generation approaches combine the branch distance and approach level heuristics to achieve high line and branch coverage (McMinn 2004). These heuristics measure the distance between a test execution path and a specific statement or a specific branch in the software under test. For that, they rely on the coverage information of control dependent basic blocks, i.e., basic blocks that have at least one outgoing edge leading the execution path toward the target basic block (containing the targeted statement) and at least another outgoing edge leading the execution path away from the target basic block. As an example, Listing 1 shows the source code of the method fromMap from XWIKI, and Fig. 1 contains the corresponding CFG. In this graph, the basic block 409 is control dependent on the basic block 407-408 because the execution of line 409 depends on the condition at line 408 (i.e., line 409 will be executed only if the elements of the array formvalues are of type String).

Fig. 1 CFG for method fromMap

The approach level is the number of uncovered control dependent basic blocks of the target basic block between the closest covered control dependent basic block and the target basic block. The branch distance is calculated from the predicate of the closest covered control dependent basic block, based on a set of predefined rules. Assuming that the test t covers only lines 403 and 418, and our target line is 409, the approach level is 2 because two control dependent basic blocks (404-406 and 407-408) are not covered by t. The branch distance for the predicate in line 403 (the closest covered control dependency of node 409) is measured based on the established rules (McMinn 2004).
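For completeness, two of the commonly used branch distance rules for relational predicates (McMinn 2004) are sketched below; K denotes a positive failure constant, and the resulting distance is usually normalized before being combined with the approach level:

$$ bd(a = b) = \left\{ \begin{array}{ll} 0 & \textit{if a = b}\\ |a - b| + K & \textit{otherwise} \end{array} \right. \qquad bd(a < b) = \left\{ \begin{array}{ll} 0 & \textit{if a < b}\\ (a - b) + K & \textit{otherwise} \end{array} \right. $$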

To the best of our knowledge, no related work studies additional heuristics that complement the approach level and branch distance to improve coverage. Closest to our work, Panichella et al. (2018b) and Rojas et al. (2015) introduced two heuristics, called infection distance and propagation distance, to improve the weak mutation score of generated test cases. However, these heuristics do not help the search process to improve general statement coverage (i.e., they are effective only after covering a mutated statement).

In this paper, we introduce a new secondary objective to improve the statement coverage achieved by fitness functions based on the approach level and branch distance, and analyze the impact of this secondary objective on search-based unit test generation and search-based crash reproduction.

2.2 Search-based Unit Test Generation

Search-based software test generation (SBST) algorithms use meta-heuristic optimization techniques (e.g., genetic algorithms) to automate test generation tasks at different testing levels. One of these levels is unit testing, where the search algorithm tries to generate tests satisfying various criteria (such as line and branch coverage) for a given class under test (CUT). SBST techniques are widely used for unit test generation. Prior studies showed that the tests generated by these techniques achieve high code coverage (Panichella et al. 2018a; Campos et al. 2018) and detect real bugs (Almasi et al. 2017), hence complementing hand-written test cases.

Dynamic many-objective sorting algorithm (DynaMOSA).

Panichella et al. have recently introduced an evolutionary-based algorithm, called DynaMOSA, for unit test generation (Panichella et al. 2018b). Their study (Panichella et al. 2018a), independently confirmed by Campos et al. (2018), shows that DynaMOSA outperforms other unit test generation techniques in terms of structural coverage and mutation coverage. This approach is currently used as the default algorithm in EvoSuite, which is the state-of-the-art tool for search-based unit test generation.

DynaMOSA relies on the hierarchy of dependencies between the coverage targets (e.g., lines and branches) to perform a dynamic selection of the objectives during the search process. For instance, when applying DynaMOSA to generate tests for method fromMap (Listing 1), the algorithm first tries to cover targets that do not have any dependencies, i.e., it generates test cases to cover nodes 403 and 418. After covering node 403, it tries to cover node 404-406, which is control dependent on the covered node. DynaMOSA continuously updates the search objectives until all of the targets are covered.

Since DynaMOSA uses the approach level and branch distance heuristics to guide the search process towards achieving high line, branch, and weak mutation coverage, BBC may help this technique to cover more targets. This study performs an in-depth experiment and analysis to see whether BBC can improve DynaMOSA.

Testability Transformation (TT).

Testability transformations address the problem of implicit branches in unit test generation (Li and Fraser 2011; Fraser and Arcuri 2015a). This strategy transforms the code to make implicit branches explicit by adding extra branches for error conditions, thereby bringing more guidance to the approach level and branch distance heuristics. To transform the code of each class, TT needs extra bytecode instrumentation. Since instrumenting some classes can be difficult due to several known issues (Fraser and Arcuri 2013a), instrumenting every class coupled with the class under test may fail. Also, if we limit the testability transformations to the class under test, the search process does not receive any extra guidance when facing implicit branches in other classes.

2.3 Search-based Crash Reproduction

After a crash is reported, one of the essential steps of software debugging is to write a crash reproducing test case to make the crash observable to the developer and help them identify the root cause of the failure (Zeller 2009). Later, this crash reproducing test can be integrated into the existing test suite to prevent future regressions. Despite the usefulness of a crash reproducing test, the process of writing this test can be labor-intensive and time-consuming (Soltani et al. 2018). Various techniques have been introduced to automate the reproduction of a crash (Chen and Kim 2015; Xuan et al. 2015; Nayrolles et al. 2015; Rößler et al. 2013; Soltani et al. 2018), and search-based approaches (EvoCrash (Soltani et al. 2018) and ReCore (Rößler et al. 2013)) yielded the best results (Soltani et al. 2018).

Listing 2 XWIKI-13377 crash stack trace (Soltani et al. 2020)

EvoCrash.

This approach utilizes a single-objective genetic algorithm to generate a crash reproducing test from a given stack trace and a target frame (i.e., a frame in the stack trace whose class is used as the class under test). The crash reproducing test generated by EvoCrash throws the same stack trace as the given one up to the target frame. For example, by passing the stack trace in Listing 2 and target frame 3 to EvoCrash, it generates a test case reproducing the first three frames of this stack trace (i.e., the thrown stack trace is identical to the given one from line 0 to line 3).

EvoCrash uses a fitness function, called WeightedSum, to evaluate the candidate test cases. WeightedSum is the sum scalarization of three components: (i) the target line coverage (ds), which measures the distance between the execution trace and the target line (i.e., the line number pointed to by the target frame) using approach level and branch distance; (ii) the exception type coverage (de), determining whether the type of the triggered exception is the same as the given one; and (iii) the stack trace similarity (dtr), which indicates whether the stack trace triggered by the generated test contains all frames (from the most in-depth frame up to the target frame) in the given stack trace.

Definition 1 (WeightedSum Soltani et al. 2018)

For a given test case execution t, the WeightedSum (ws) is defined as follows:

$$ ws(t) = \left\{ \begin{array}{ll} 3 \times d_{s}(t) + 2 \times max(d_{e}) + max(d_{tr}) & \textit{if line not reached}\\ 3 \times min(d_{s}) + 2 \times d_{e}(t) + max(d_{tr}) & \textit{if line reached}\\ 3 \times min(d_{s}) + 2 \times min(d_{e}) + d_{tr}(t) & \textit{if exception thrown} \end{array} \right. $$
(1)

Where ds(t) ∈ [0,1] indicates how far t is from reaching the target line and is computed using the normalized approach level and branch distance: ds(t) = ∥approachLevel_s(t) + ∥branchDistance_s(t)∥∥, where ∥⋅∥ indicates the normalized value; de(t) ∈ {0,1} shows whether the type of the exception thrown by t is the same as in the given stack trace (0) or not (1); dtr(t) ∈ [0,1] measures the stack trace similarity between the given stack trace and the one thrown by t. max(f) and min(f) denote the maximum and minimum possible values of a function f, respectively.

In this fitness function, de(t) and dtr(t) are only considered in the satisfaction of two constraints: (i) exception type coverage is relevant only when we reach the target line and (ii) stack trace similarity is important only when we both reach the target line and throw the same type of exception.
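To make the piecewise definition concrete, the following Java-style sketch computes ws(t); the types and helper methods (approachLevel, branchDistance, sameExceptionType, stackTraceDistance) are hypothetical placeholders rather than the actual EvoCrash implementation, and the normalization shown is the one commonly used in search-based testing (an assumption on our part):

```java
// Sketch of WeightedSum (Definition 1). All helper methods and types are assumed.
double weightedSum(TestExecution t) {
    // d_s(t): normalized distance to the target line (approach level + branch distance).
    double ds = normalize(approachLevel(t) + normalize(branchDistance(t)));
    // d_e(t): 0 if t throws the same exception type as the given stack trace, 1 otherwise.
    double de = sameExceptionType(t) ? 0.0 : 1.0;
    // d_tr(t): distance between the thrown and the given stack traces, in [0, 1].
    double dtr = stackTraceDistance(t);

    if (ds > 0.0) {                     // target line not reached
        return 3 * ds + 2 * 1.0 + 1.0;  // max(d_e) = 1 and max(d_tr) = 1
    } else if (de > 0.0) {              // line reached, but wrong exception type
        return 3 * 0.0 + 2 * de + 1.0;  // min(d_s) = 0 and max(d_tr) = 1
    } else {                            // line reached and expected exception thrown
        return 3 * 0.0 + 2 * 0.0 + dtr; // min(d_s) = 0 and min(d_e) = 0
    }
}

// Normalization commonly used in search-based testing: maps [0, +inf) into [0, 1).
double normalize(double x) {
    return x / (x + 1.0);
}
```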

As an example, when applying EvoCrash on the stack trace from Listing 2 with the target frame 3, WeightedSum first checks if the test cases generated by the search process reach the statement pointed to by the target frame (line 413 in class BaseClass in this case). Then, it checks if the generated test can throw a ClassCastException or not. Finally, after fulfilling the first two constraints, it checks the similarity of frames in the stack trace thrown by the generated test case against the given stack trace in Listing 2.

EvoCrash uses guided initialization, mutation, and single-point crossover operators to ensure that the target method (i.e., the method appearing in the target frame) is always called by the different tests during the evolution process.

According to a recent study, EvoCrash outperforms other non-search-based crash reproduction approaches in terms of effectiveness in crash reproduction and efficiency (Soltani et al. 2018). This study also shows the helpfulness of tests generated by EvoCrash for developers during debugging.

In this paper, we assess the impact of BBC as the secondary objective in the EvoCrash search process.

ReCore

This approach utilizes a genetic algorithm guided by a single fitness function, which has been defined according to the core dump and the stack trace produced by the system when the crash happened. To be more precise, this fitness function is a sum scalarization of three sub-functions: (i) TestStackTraceDistance, which guides the search process according to the given stack trace; (ii) ExceptionPenalty, which indicates whether the same type of exception as the given one is thrown or not (identical to ExceptionCoverage in EvoCrash); and (iii) StackDumpDistance, which guides the search process by the given core dump.

Definition 2 (TestStackTraceDistance Rößler et al. 2013)

For a given test case execution t, the TestStackTraceDistance (STD) is defined as follows:

$$ \mathit{STD}(R,t) = |R| - lcp - (1-StatementDistance(s)) $$
(2)

Where |R| is the number of frames in the given stack trace, and lcp is the number of frames in the longest common prefix between the given stack trace and the stack trace thrown by t. Concretely, |R| − lcp is the number of frames not covered by t. Moreover, StatementDistance(s) is calculated as the sum of the approach level and the normalized branch distance to reach the statement s, which is pointed to by the first (i.e., outermost) frame not covered by t: StatementDistance(s) = approachLevel_s(t) + ∥branchDistance_s(t)∥.
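As a rough Java-style sketch (hypothetical types and helper methods, not the actual ReCore or Botsing code), the computation of STD can be read as follows:

```java
// Sketch of TestStackTraceDistance (Definition 2) for a given stack trace R and execution t.
double stdistance(StackTrace givenTrace, TestExecution t) {
    // Number of frames of the given stack trace already covered by t.
    int lcp = longestCommonPrefixSize(givenTrace, t.thrownStackTrace());
    // Statement s pointed to by the first (outermost) frame of the given trace not covered by t.
    Statement s = firstUncoveredFrame(givenTrace, t).statement();
    // Approach level plus normalized branch distance towards s (normalize as in Definition 1).
    double statementDistance = approachLevel(s, t) + normalize(branchDistance(s, t));
    return givenTrace.size() - lcp - (1 - statementDistance);
}
```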

Since using runtime data (such as core dumps) can cause significant overhead (Chen and Kim 2015) and lead to privacy issues (Nayrolles et al. 2015), the performance of ReCore in crash reproduction was not compared with EvoCrash in prior studies (Soltani et al. 2018). However, two of the three components of the ReCore fitness function use only the given stack trace to guide the search process. Hence, this paper only considers TestStackTraceDistance + ExceptionPenalty (called STDistance hereafter).

As an example, when applying ReCore with STDistance on the stack trace in Listing 2 with target frame 3, STDistance first determines if the generated test covers the statement at frame 3 (line 413 in class BaseClass). Then, it checks the coverage of frame 2 (line 615 in class PropertyClass). After the generated test case covers the first two frames, it checks the coverage of the statement pointed to by the deepest frame (line 45 in class BaseStringProperty). For measuring the coverage of each of these statements, STDistance uses the approach level and branch distance. After covering all of the frames, this fitness function checks if the generated test throws a ClassCastException in the deepest frame.

In this study, we perform an empirical evaluation to assess the performance of crash reproduction using STDistance with and without BBC as the secondary objective in terms of effectiveness in crash reproduction and efficiency.

3 Basic Block Coverage

3.1 Motivating Example

During the search process, the fitness of a test case is evaluated using a fitness function. These fitness functions differ according to the given test criteria. However, one of the main components of these fitness functions is the coverage of specific statements and branches. For instance, one of the main goals in unit test generation is achieving high structural coverage (e.g., line and branch coverage). For this goal, the search process seeks to cover all of the statements and branches in the given CUT. Similarly, the fitness functions used in search-based crash reproduction (either WeightedSum or STDistance) require the coverage of specific statements pointed to by the given stack trace.

The distance of the test case from the target statement is calculated using the approach level and branch distance heuristics. As we have discussed in Section 2.1, the approach level and branch distance cannot guide the search process if the execution stops because of implicit branches in the middle of basic blocks (e.g., a thrown NullPointerException during the execution of a basic block). As a consequence, these fitness functions may return the same fitness value for two tests, although the tests do not cover the same statements in the block of code where the implicit branching happens.

For instance, assume that one of the objectives of a search process (either for unit test generation or crash reproduction) is covering line 413 in method fromMap (presented in Listing 1). This search process generates two test cases T1 and T2 for achieving this objective in a population of solutions. However, T1 stops its execution at line 404 due to a NullPointerException thrown in method getName, and T2 throws a NullPointerException at line 405 because it passes a null value for the map input argument. Even though T2 covers more lines, the combination of approach level and branch distance returns the same fitness value for both of these test cases: the approach level is 2 (nodes 407-408 and 410), and the branch distance cannot be helpful in this case as the last covered predicate does not lead the execution path away from the target line, and the execution stops before covering the next predicate. This is because these two heuristics assume that each basic block is atomic: covering line 404 implies that lines 405 and 406 are covered as well.

3.2 Secondary Objective

The goal of the Basic Block Coverage (BBC) secondary objective is to prioritize the test cases with the same fitness value (i.e., same approach level and branch distance) according to their coverage within the basic blocks between the closest covered control dependency and the target statement. At each iteration of the search algorithm, test cases with the same fitness value are compared with each other using BBC. Listing 3 presents the pseudo-code of the BBC calculation. The inputs of this algorithm are two test cases T1 and T2, which both have the same approach level and branch distance values (calculated either using crash reproduction or unit test generation fitness functions), as well as the line number and method name of the target statement. This algorithm compares the coverage of basic blocks on the path between the last control dependent node executed by both of the given tests and the basic block that contains the target statement (called effective blocks hereafter). If T1 and T2 do not cover any control dependency of the target block, BBC instead uses the entry point of the CFG of the given method as the starting point of the effective blocks' path. If BBC determines that there is no preference between these two test cases, it returns 0. Also, it returns a value < 0 if T1 has higher coverage compared to T2, and vice versa. A higher absolute value of the returned integer indicates a bigger distance between the given test cases.

Listing 3 BBC secondary objective computation algorithm

In the first step, BBC detects the effective blocks that are fully covered by each given test case (i.e., the test covers all of the statements in the block) and saves them in two sets called FCB1 and FCB2 (lines 4 and 5 in Listing 3). Then, for each of the tests T1 and T2, it detects the closest semi-covered effective block (i.e., the closest basic block to the target statement where the test covers the first line but not the last line of the block) and stores them as SCB1 and SCB2, respectively (lines 6 and 7). The semi-covered blocks indicate the presence of implicit branches.

BBC can prioritize given tests in two scenarios: Scenario 1, both tests get stuck in the middle of the same basic block (i.e., they both have the same closest semi-covered basic block), or, Scenario 2, one of the tests throws an exception in an effective basic block while the other test fully covers this block.

Scenario 1

Line 9 in Listing 3 checks if the first scenario holds by evaluating two conditions. First, BBC checks if both tests have the same semi-covered basic block. Then, it examines if the fully covered basic blocks of one of the given tests are equal to or a subset of those of the other test. If the second condition is not fulfilled, each of these tests covers at least one block that the other one does not cover, and thereby they reach their semi-covered basic block through different paths. In this case, BBC cannot determine the better test, as we do not know which path can lead to covering the target statement. If these two conditions are fulfilled, BBC checks if one of the tests has a higher line coverage in the identified SCB (lines 10 to 13). If this is the case, BBC returns the number of lines in this block covered only by the winning test case. If the lines covered are the same for T1 and T2 (i.e., coveredLines1 and coveredLines2 have the same size), there is no difference between these two test cases and BBC returns 0 (line 13).

Scenario 2

Line 14 in Listing 3 checks if the effective blocks covered by one test are a subset of those covered by the other one. This is true if all of the fully covered blocks of one test are a subset of the fully covered blocks of the other one. Also, the semi-covered block of this test must be among the fully covered blocks of the test with more coverage (i.e., the winner test). In this case, BBC returns the number of blocks that are only fully covered by the winner test case (line 15). If BBC determines that T2 wins over T1, the returned value will be positive, and vice versa.

Finally, if each of the given tests covers a block in the given method that the other does not (i.e., the tests cover different paths in the method), BBC cannot determine the winner and returns 0 (lines 16 and 17) because we do not know which path leads to the target block. Even if T1 and T2 reach a particular basic block from different paths in the CFG and both throw exceptions at different lines, BBC returns 0 and does not select the one with more coverage in the closest basic block as the winner. The rationale behind this behavior is to give these two tests an equal chance to evolve, as we do not know which of the paths covered by these tests has more potential to help the search process get closer to the target line. If BBC always selected the test with more coverage in the nearest basic block, even when it covers another path, we would negatively impact the diversity of the tests chosen for the next generation, thereby reducing the search process's exploration ability.
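The following Java-style sketch summarizes the comparison described above; the types and helper methods are hypothetical placeholders, and its line numbering does not correspond to Listing 3. A negative return value means T1 is preferred, a positive value means T2 is preferred, and 0 means no preference:

```java
// Sketch of the BBC comparison between two tests with the same approach level and branch distance.
int bbc(TestExecution t1, TestExecution t2, String targetMethod, int targetLine) {
    Set<BasicBlock> fcb1 = fullyCoveredEffectiveBlocks(t1, targetMethod, targetLine);
    Set<BasicBlock> fcb2 = fullyCoveredEffectiveBlocks(t2, targetMethod, targetLine);
    BasicBlock scb1 = closestSemiCoveredBlock(t1, targetMethod, targetLine);
    BasicBlock scb2 = closestSemiCoveredBlock(t2, targetMethod, targetLine);

    // Scenario 1: both tests stop in the same semi-covered block, reached through the same path.
    if (scb1 != null && scb1.equals(scb2)
            && (fcb1.containsAll(fcb2) || fcb2.containsAll(fcb1))) {
        int lines1 = coveredLines(t1, scb1).size();
        int lines2 = coveredLines(t2, scb2).size();
        return lines2 - lines1; // 0 when both cover the same number of lines
    }

    // Scenario 2: the blocks covered by one test are a subset of the blocks covered by the other.
    if (fcb2.containsAll(fcb1) && (scb1 == null || fcb2.contains(scb1))) {
        return fcb2.size() - fcb1.size();    // T2 wins: number of blocks only covered by T2
    }
    if (fcb1.containsAll(fcb2) && (scb2 == null || fcb1.contains(scb2))) {
        return -(fcb1.size() - fcb2.size()); // T1 wins: number of blocks only covered by T1
    }

    // The tests cover different paths: no preference, to preserve exploration.
    return 0;
}
```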

Example

When giving two tests T1 and T2 from our motivating example, both with the same fitness value (calculated by the primary objective), to BBC with target method fromMap and line number 413, this algorithm compares their fully and semi-covered blocks with each other. In this example, both T1 and T2 cover the same basic blocks: the fully covered block is 403 and the semi-covered block is 404-406. So, here the conditions of Scenario 1 are fulfilled. Hence, BBC checks the number of lines covered by T1 and T2 in block 404-406. Since T1 stopped its execution at line 404, the number of lines covered by this test is 1. In contrast, T2 managed to execute two lines (404 and 405). Hence, BBC returns size(coveredLines2) − size(coveredLines1) = 1. The positive return value indicates that T2 is closer to the target statement, and therefore, it should have a higher chance of being selected for the next generation.

Branchless Methods

BBC can also be helpful for branchless methods. These methods do not contain any branching statement (e.g., if conditions or for loops), and thereby, theoretically, covering the first line of these methods leads to covering all of the other lines as well. In other words, by ignoring the Entry and Exit nodes, the CFG of a branchless method contains only one node (i.e., basic block) without any edges. For instance, the methods from frames 1 and 2 in Listing 2 are branchless. The absence of branches in these methods means that there are no control dependent nodes in them, and thereby approach level and branch distance cannot guide the search process if the generated tests throw implicit exceptions in the middle of these methods. However, in contrast with these two heuristics, BBC can guide the search process toward covering the most in-depth statement in these cases. As an example, if tests T1 and T2 both throw an implicit exception in the middle of the only basic block (b0) of a branchless method m(), BBC enters Scenario 1 (FCB1 = FCB2 = ∅ and SCB1 = SCB2 = {b0}) and examines if one of the tests covers more lines in b0.

Fig. 2 Distribution of the usefulness of BBC activations per fitness evaluations. The usefulness is defined as the number of BBC evaluations returning a non-zero value divided by the number of activations. Grey points denote fitness evaluations without any BBC activation

3.3 Application of BBC

The time complexity of BBC is \(\mathcal {O}(N \times E \times log V)\) where E and V are the numbers of edges and vertices of the CFG of the given method, respectively; and N is the number of semi-covered basic blocks calculated by semiCoveredBlocks method at lines 6 and 7 of Listing 3. This complexity stems from the computation of the closest semi-covered basic blocks in Line 12 of Listing 3. In this procedure, BBC measures the shortest path between each semi-covered basic block and the target basic block (i.e., the block containing the given target line) using Dijkstra’s shortest path algorithm, which has the time complexity of \(\mathcal {O}(E \times log V)\).
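For illustration, the distance computation assumed by this complexity analysis can be sketched as follows (hypothetical CFG types; a lazy-deletion Dijkstra over unit-weight edges, using only the Java standard library):

```java
// Distance, in number of CFG edges, from a semi-covered block to the target block.
int distanceToTarget(BasicBlock source, BasicBlock target, ControlFlowGraph cfg) {
    Map<BasicBlock, Integer> dist = new HashMap<>();
    PriorityQueue<Map.Entry<BasicBlock, Integer>> queue =
        new PriorityQueue<>(Map.Entry.comparingByValue());
    dist.put(source, 0);
    queue.add(Map.entry(source, 0));
    while (!queue.isEmpty()) {
        Map.Entry<BasicBlock, Integer> entry = queue.poll();
        BasicBlock current = entry.getKey();
        if (entry.getValue() > dist.getOrDefault(current, Integer.MAX_VALUE)) {
            continue; // stale queue entry, a shorter distance was already found
        }
        if (current.equals(target)) {
            return entry.getValue();
        }
        for (BasicBlock successor : cfg.successorsOf(current)) {
            int candidate = entry.getValue() + 1; // all CFG edges have unit weight
            if (candidate < dist.getOrDefault(successor, Integer.MAX_VALUE)) {
                dist.put(successor, candidate);
                queue.add(Map.entry(successor, candidate));
            }
        }
    }
    return Integer.MAX_VALUE; // target not reachable from this block
}
```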

Given the complexity of BBC, applying this secondary objective for any generated tests with the same approach level and branch distance may negatively impact the search process’s efficiency. In the following paragraphs, we discuss this potential negative impact on search-based crash reproduction and unit test generation.

3.3.1 Search-Based Crash Reproduction

The crash reproduction search process can be guided by either WeightedSum or STDistance. As discussed in Section 2.3, both of these fitness functions heavily rely on the approach level and branch distance. Hence, BBC can be helpful in the crash reproduction search process. Since the goal of the crash reproduction search process is to cover a specific path in the control dependency graph, indicated by the given stack trace, we apply BBC without any limitation whenever two test cases have the same (non-zero) approach level and branch distance.

3.3.2 Search-Based Unit Test Generation

In contrast with crash reproduction, the unit test generation search process has multiple statements and branches to cover simultaneously. In DynaMOSA, each line or branch to cover is an objective of the search. Hence, the number of times that BBC is applied as the secondary objective is higher compared to crash reproduction. Therefore, we should limit the number of times that BBC is applied in this algorithm. We introduce two parameters to impose this limitation: Sleep Time and Usage Rate.

Sleep Time

When DynaMOSA adds a target to the active search objectives, the target stays active until the search process covers it. Some of the targets are easy to cover, and thereby approach level and branch distance can cover them without BBC. However, BBC can help in harder cases, where approach level and branch distance cannot cover the target within a certain time. Sleep Time makes sure that BBC is only applied to these hard-to-cover search objectives. If we set this parameter to t seconds, DynaMOSA uses the BBC secondary objective only for search objectives that have been active for more than t seconds.

Usage Rate

Like any other evolutionary algorithm, the unit test generation search process needs to maintain a balance between exploration and exploitation. The former indicates the diversity of the solutions (i.e., generated tests execute new paths in the code); the latter indicates searching for solutions in the neighborhood of existing ones (i.e., the search process should generate tests similar to the existing ones). By applying BBC, we improve the exploitation ability of the search process. However, the over-application of BBC may negatively impact its exploration ability. Usage Rate makes sure that BBC does not hinder this balance. A higher Usage Rate means a higher chance of BBC being applied during the search process. Assume we set p ∈ [0,1] as our Usage Rate. Any time the search process generates two test cases with the same approach level and branch distance for a hard-to-cover target (i.e., a target that stays an active objective in DynaMOSA for more than the Sleep Time), BBC will be used with probability p.
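The two parameters can be combined in a simple gate, sketched below with hypothetical names; BBC is consulted only if the gate returns true:

```java
// Decide whether BBC should be applied for a given (still uncovered) search objective.
boolean shouldApplyBBC(long objectiveActiveSinceMillis, long nowMillis,
                       long sleepTimeMillis, double usageRate, java.util.Random random) {
    // Sleep Time: only objectives that stayed active longer than the threshold are hard to cover.
    boolean hardToCover = (nowMillis - objectiveActiveSinceMillis) > sleepTimeMillis;
    // Usage Rate: apply BBC with probability p = usageRate.
    return hardToCover && random.nextDouble() < usageRate;
}
```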

Moreover, by default, EvoSuite has eight types of search objectives (Rojas et al. 2015): line coverage, which aims to cover the maximum number of lines in the given CUT; branch coverage, which aims to cover the maximum number of branches in the CUT; exception coverage, which aims to maximize the number of exceptions captured by the generated tests; weak mutation, which aims to generate tests that kill the maximum number of mutants (in weak mutation, a mutant is considered killed if executing one of the generated tests on the mutant leads to a different state compared to the execution on the given CUT); output coverage, which aims to generate tests that produce the most diverse outputs; method coverage, which aims to cover all of the methods in the given CUT; no-exception method coverage, which checks whether each of the methods in the CUT is called directly by one of the tests without this invocation leading to any exception; and direct branch coverage, which makes sure that each branch in the public methods of the CUT is covered by a direct call from one of the generated tests.

Since BBC aims to help fitness functions that rely on the approach level and branch distance to cover lines and branches that the tests generated by DynaMOSA cannot execute otherwise, this secondary objective is only triggered when two tests have the same fitness value for a non-covered line coverage or branch coverage objective. Hence, BBC is not involved in the segments of the search process in which two tests obtain the same fitness value for other kinds of objectives, such as exception coverage. Thereby, even though BBC prioritizes tests that do not throw implicit exceptions, it does not have any negative impact on covering other search objectives (e.g., exception coverage), as it is not triggered for objectives other than line coverage and branch coverage.

4 Empirical Evaluation

Before evaluating the impact of BBC, we want to assess its potential usefulness by answering the following research question:

  • RQ 0 How frequent are implicit branches in a search-based test case generation process?

This research question serves as a preliminary analysis before the full evaluation of the impact of BBC on search-based unit test generation and search-based crash reproduction. To answer it, we consider a special configuration of DynaMOSA, currently the best algorithm for unit test generation, where the executions of the BBC algorithm described in Listing 3 are monitored. We choose DynaMOSA, a many-objectives algorithm, because, unlike search-based crash reproduction, it targets each line and branch of a class under test independently, allowing us to collect more data about the execution of BBC for the different objectives.

To assess the impact of BBC on search-based unit test generation, we perform an empirical evaluation to answer the following research questions:

  • RQ 1 What is the impact of BBC on search-based unit test generation?

    • RQ 1.1 What is the impact of BBC on the structural coverage effectiveness of the unit tests?

    • RQ 1.2 What is the impact of BBC on the output coverage of the unit tests?

    • RQ 1.3 What is the impact of BBC on the fault finding capabilities of the unit tests?

    • RQ 1.4 What is the impact of BBC on the structural coverage efficiency of the unit tests?

In these RQs, we want to evaluate the effect of BBC on DynaMOSA. As for other algorithms, DynaMOSA relies on the approach level and branch distance to evaluate the progress of the search process. Previous research has shown that it outperforms other search-based and guided random approaches (Campos et al. 2018; Devroey et al. 2020; Kifetew et al. 2019; Molina et al. 2018; Panichella et al. 2018a, 2018b). We compare DynaMOSA for 11 different configurations of BBC in terms of structural coverage effectiveness (RQ 1.1). Since a change in the structural coverage of a class might impact the data flow, we also study the outputs produced by the different tests (RQ 1.2). Then, we look at the fault finding capabilities using weak mutation and real faults from the Defects4J collection (RQ 1.3). Finally, we study the structural coverage efficiency of BBC (RQ 1.4).

Similarly, for search-based crash reproduction, we answer the following research questions:

  • RQ 2 What is the impact of BBC on search-based crash reproduction?

    • RQ 2.1 What is the impact of BBC on the crash reproduction effectiveness?

    • RQ 2.2 What is the impact of BBC on the crash reproduction efficiency?

In these two RQs, we want to evaluate the effect of BBC on the existing fitness functions, namely STDistance and WeightedSum, from two perspectives: the crash reproduction ratio of the different configurations (RQ 2.1) and the time required to reproduce a crash (RQ 2.2).

In Sections 4.1 and 4.2 we will detail the experimental setup for respectively the study on unit test generation (RQ 0 and RQ 1) and crash reproduction (RQ 2).

4.1 Setup for search-based unit test generation (RQ 0 and RQ 1)

4.1.1 Implementation

We implemented BBC as a secondary objective (called BBCOVERAGE) in EvoSuite (Fraser and Arcuri 2011), the state-of-the-art tool for search-based unit test generation. As discussed in Section 3.3.2, since BBC impacts the exploration-exploitation trade-off and efficiency of the search process, we also defined two additional parameters for Sleep Time (BBC_SLEEP with a default value of 60 seconds) and Usage Rate (BBC_USAGE_PERCENTAGE with a default probability of 0.5). Our implementation is openly available at https://github.com/pderakhshanfar/evosuite.

4.1.2 Classes under test selection

We selected classes under test from the latest version of Defects4J (v.2.0.0) (Just et al. 2014), a collection of reproducible failures coming from open source projects with the identification of the corresponding faulty classes. Defects4J has been used in other studies to assess the coverage and the effectiveness of unit-level test case generation (Ma et al. 2015; Panichella et al. 2018b; Shamshiri et al. 2015), program repair (Smith et al. 2015; Martinez and Monperrus 2016), fault localization (Pearson et al. 2017; Le et al. 2016), and regression testing (Noor and Hemmati 2015; Lu et al. 2016).

We selected the ten most recent bugs from the 17 available projects for a total of 225 faulty classes, used as classes under test in our evaluation. This offers a good balance between the number of repetitions (i.e., statistical power) of each configuration and number of cases (i.e., generalization) (Arcuri and Briand 2014).

Since EvoSuite may face challenges in generating tests for some particular classes (Xiao et al. 2011; McMinn 2011; Fraser and Arcuri 2014), we performed a trial run with default parameters on all of the classes to filter out the ones for which EvoSuite cannot generate any test, as recommended by related work (Campos et al. 2018; Molina et al. 2018; Panichella et al. 2018b). We filtered out six classes according to the results of this trial: for three of these classes, EvoSuite could not finish the class instrumentation; for two others, DynaMOSA could not find any search objective; and for the last one, EvoSuite failed to generate tests because of missing classes. After filtering these classes, we performed our main experiment on the 219 remaining cases. Table 1 provides more information about the classes selected for the evaluation.

Table 1 Classes under test used for the evaluation of BBC for unit testing (RQ 0 RQ 1): number of classes under test (CUTs), number of non-commented source statements per class (NCSS), number of methods per class (Methods), weighted methods per class (WMC), and cyclomatic complexity per method (CCN)

4.1.3 Parameter settings

To evaluate the impact of the BBC secondary objective on search-based unit test generation, we first need to set values for Sleep Time and Usage Rate (explained in Section 3.3.2). To find the optimum Sleep Time, we performed a pre-analysis on a subset of subjects: we randomly selected 45 classes (20% of our subjects), ran DynaMOSA on each of the sampled classes 30 times, and collected the time required by the search process to cover each objective. The collected results indicate that DynaMOSA can cover more than 85% of the objectives within 60 seconds. For this reason, we set Sleep Time to 60 seconds for our experiments.

For our pre-analysis (RQ 0), we enabled BBC (Usage Rate = 1.0) after 60 seconds (with an additional setting to record the execution results of BBC) to evaluate the number of implicit branches occurring during the search and the number of times BBC could help overcome those implicit branches. Furthermore, to compare different Usage Rate settings, we used ten different values of this parameter in our main experiment (RQ 1): Usage Rate ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.

Hence, for the main experiment, we executed DynaMOSA and eleven configurations of BBC (the pre-analysis configuration plus ten Usage Rate values) on 219 classes for 30 rounds of execution with a search budget of 10 minutes. Also, we executed DynaMOSA on 45 classes with the same number of repetitions and search budget to find the optimum Sleep Time. In total, we ran 80,190 independent executions to answer RQ 0 and RQ 1. These executions took about 12 days overall.

4.1.4 Data collection

To evaluate the potential impact of BBC (RQ 0), we collected, for each line and branch objective: the number of times its fitness has been evaluated, and the number of times BBC has been called, activated (i.e., the call effectively led to an evaluation of the BBC, line 13 or 15 in Listing 3), and useful (i.e., the call to BBC returned a non-zero value). When BBC is useful, it indicates that one or both of the tests throw an implicit exception in the middle of a basic block in the method of the search objective (i.e., a line or branch coverage objective).

We compare BBC to DynaMOSA using branch coverage for RQ 1.1 and RQ 1.4 for 30 rounds of execution. Branch coverage provides an indication about the structural coverage by looking at the percentage of branches covered by the executions of the test cases in the class under test. We recorded the value of the branch coverage every ten seconds to see how it evolves over time and answer RQ 1.4.

For RQ 1.2, we consider output coverage and implicit exceptions. Output coverage (Alshahwan and Harman 2014) denotes the diversity of the outputs of the different methods of the class under test. It provides information about the data output coverage of the generated tests by looking at how many pre-defined abstract values (i.e., partitions of the output domain) are returned by the methods of the class under test (Rojas et al. 2015). For instance, a method returning an integer value has to return negative, zero, and positive values (when the tests are executed) to satisfy the output coverage criterion.

In addition to (expected) outputs, we consider implicit exceptions by looking at the number (e) of top-level methods in the class under test throwing an undeclared (i.e., runtime) exception implicitly (i.e., without any throw new instruction). For one execution, we compute the implicit exception coverage as the ratio between e and the highest value of e among all the executions of the different BBC configurations for that class.

Since BBC addresses the challenge of handling implicit branches for search-based unit test generation, we expect it to impact both the output coverage and the number of methods throwing an implicit exception.

We rely on weak mutation and real faults to assess the fault findings capabilities of the generated tests (RQ 1.3). Weak mutation score (Howden 1982; Papadakis and Malevris 2011) gives the percentage of mutants (i.e., artificially injected faults) for which at least one test triggers a different program state, compared to the original program, directly after the execution of the mutated statement. Weak mutation is a viable cheaper alternative to strong mutation, which requires an additional propagation of the erroneous state to the output of the program (Offutt and Lee 1994). For our evaluation, weak mutation allows us to assess the diversity of runtime states, allowing to catch more faults, when using BBC. We use the default set of weak mutation operators available in EvoSuite (Fraser and Arcuri 2015b): delete call, delete field, insert unary operator, replace arithmetic operator, replace bitwise operator, replace comparison operator, replace constant, and replace variable.

Additionally, we use real faults from the Defects4J benchmark to compare the effective fault finding capabilities of tests generated using BBC. We executed all 11 configurations on the buggy versions of the software and then checked whether the tests generated by each configuration can throw the same exception as the bug-exposing stack traces indicated by Defects4J. The rationale behind running all of the configurations only on the buggy versions, and not the fixed versions, is to have a realistic scenario in which developers are neither aware of the bug nor have access to the fixed version. In this scenario, an automated test generation tool can help developers if it generates tests that throw an exception revealing the bug. Since EvoSuite can detect assertion-based failures only by running it on the fixed version (Fraser and Arcuri 2015a), we limited our comparison for fault detection to the 92 faults that a non-assertion error can expose.

4.1.5 Data analysis

For each class under test, we use the Vargha-Delaney Â12 statistic (Vargha and Delaney 2000) to examine the effect size of differences between using and not using BBC for branch, output, and implicit exception coverage, and weak mutation score (RQs 1.1-1.3). For a pair of factors (A,B) a value of Â12 > 0.5 indicates that A is more likely to achieve a higher coverage or mutation score, while a value of Â12 < 0.5 shows the opposite. Also, Â12 = 0.5 means that there is no difference between the factors. We used the standard thresholds (Vargha and Delaney 2000) for interpreting the Â12 magnitude: 0.56 (small), 0.64 (medium), and 0.71 (large). To assess the significance of effect sizes (Â12), we apply the non-parametric Wilcoxon Rank Sum test, with α = 0.01 for the Type I error.
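For reference, one common formulation of the statistic, for m observations of the first factor and n observations of the second one, is:

$$ \hat{A}_{12} = \frac{\#\{(a,b) \mid a > b\} + 0.5 \times \#\{(a,b) \mid a = b\}}{m \times n} $$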

We also rank the different configurations of BBC, based on their coverage and weak mutation score, using Friedman’s non-parametric test for repeated measurements with a significance level α = 0.05 (García et al. 2009) (RQs 1.1-1.3). This test is used to test the significance of the differences between groups (treatments) over the dependent variable (here, coverage and weak mutation score). We further complement the test for significance with Nemenyi’s post-hoc procedure (Japkowicz and Shah 2011; Panichella 2021).

Finally, since fault coverage (RQ 1.3) has a dichotomic distribution (i.e., a generated test exposes the fault or not), we use the Odds Ratio (OR) to measure the impact of each BBC configuration on the real faults coverage. A value OR > 1 in a comparison between a pair of factors (A,B) indicates that the application of factor A increases the fault coverage, while OR < 1 indicates the opposite. Also, a value of OR = 1 indicates that both of the factors have the same performance. We apply Fisher’s exact test, with α = 0.01 for the Type I error, to assess the significance of results.

4.2 Setup for search-based crash reproduction (RQ 2)

4.2.1 Implementation

Since ReCore and EvoCrash are not openly available, we implement BBC in Botsing, an extensible, well-tested, and open-source search-based crash reproduction framework that already implements the WeightedSum fitness function and the guided initialization, mutation, and crossover operators. We also implement the STDistance fitness function (the ReCore fitness function), used as a baseline in this paper, in this tool. Botsing relies on EvoSuite for code instrumentation and test case generation by using evosuite-client as a dependency.

4.2.2 Crash selection

We select crashes from JCrashPack (Soltani et al. 2020), a benchmark containing hard-to-reproduce Java crashes. We apply the two fitness functions with and without using BBC as a secondary objective to 124 crashes, which have also been used in a recent study (Derakhshanfar et al. 2020). These crashes stem from six open-source projects: JFreeChart, Commons-lang, Commons-math, Mockito, Joda-time, and XWiki. For each crash, we apply each configuration on each frame of the crash stack traces. We repeat each execution 30 times to take randomness into account, for a total of 114,120 independent executions. We run the evaluation on two servers with 40 CPU-cores, 128 GB memory, and 6 TB hard drive. In total, these executions took about 5 days.

4.2.3 Parameter settings

We run each search process with a five-minute time budget and set the population size to 50 individuals, as suggested by previous studies on search-based test generation (Panichella et al. 2018b). Moreover, as recommended in prior studies on search-based crash reproduction (Soltani et al. 2018), we use the guided mutation with a probability pm = 1/n (where n is the length of the generated test case) and the guided crossover with a probability pc = 0.8 to evolve test cases. We do note that prior studies do not investigate the sensitivity of crash reproduction to these probabilities. Tuning these parameters should be undertaken as future work.

4.2.4 Data collection

To evaluate the crash reproduction ratio (i.e., the ratio of success in crash reproduction in 30 rounds of runs) of different assessed configurations (RQ 2.1), we follow the same procedure as previous studies (Derakhshanfar et al. 2020; Soltani et al. 2018): for each crash C, we detect the highest frame that can be reproduced by at least one of the configurations (rmax). We examine the crash reproduction ratio of each configuration for crash C targeting frame rmax.

To evaluate the efficiency of the different configurations (RQ 2.2), we analyze the time spent by each configuration on generating a crash reproducing test case. We do note that the extra pre-analysis and basic block coverage computation performed by BBC are included in the measured time. Since measuring efficiency is only possible for the reproduced crashes, we compare the efficiency of the algorithms on the crashes that are reproduced at least once by one of the algorithms. We assume that an algorithm reached the maximum allowed budget (5 minutes) in case it failed to reproduce a crash.

4.2.5 Data analysis

Since, as for real fault coverage (RQ 1.3), the crash reproduction data (RQ 2.1) has a dichotomic distribution (i.e., an algorithm reproduces a crash C from its rmax or not), we use the Odds Ratio (OR) to measure the impact of each algorithm on the crash reproduction ratio for each crash. A value OR > 1 in a comparison between a pair of factors (A,B) indicates that the application of factor A increases the crash reproduction ratio, while OR < 1 indicates the opposite. Also, a value of OR = 1 indicates that both factors have the same performance. We apply Fisher's exact test, with α = 0.01 for the Type I error, to assess the significance of results.

For RQ 2.2, we use the Vargha-Delaney Â12 statistic (Vargha and Delaney 2000) with the non-parametric Wilcoxon Rank Sum test to examine differences between using and not using BBC for efficiency. For a pair of factors (A,B) a value of Â12 > 0.5 indicates that A reproduces the target crash in a longer time, while a value of Â12 < 0.5 shows the opposite. Also, Â12 = 0.5 means that there is no difference between the factors. We used the standard thresholds (Vargha and Delaney 2000) for interpreting the Â12 magnitude: 0.56 (small), 0.64 (medium), and 0.71 (large).

4.3 Replicability

We enable the replicability of our results by providing replication packages on Zenodo (https://zenodo.org) for RQ 0 and RQ 1 (Derakhshanfar and Devroey 2021) and RQ 2 (Derakhshanfar and Devroey 2020). Those replication packages include the classes under test and crashes used for the evaluation, the evaluation infrastructure (including documentation and scripts to re-run the evaluation), and the data analysis procedure used to produce the graphs, tables, and numbers reported in this paper.

5 Results

5.1 Potential impact of BBC (RQ 0)

Table 2 provides the general statistics of the preliminary analysis answering RQ 0 per project. The number of branch and line objectives ranges from 526 for Codec to 8,108 for JacksonCore. In total, the number of fitness evaluations per objective ranges between 1 and 1,143,620 with an average of 30,111.81 evaluations. BBC has been called between 1 and 1,681,329 times per objective with an average of 34,988.58 calls. It is interesting to note that, since the evaluation of an objective may require comparing multiple test cases, BBC can be called multiple times for each fitness evaluation. BBC has been effectively activated up to 1,365,526 (average of 9,472.140) times per objective, and has been useful up to 798,005 (average of 354) times per objective.

Table 2 Statistics about the number of objectives (Obj.), fitness evaluations (Fitness eval.), calls to BBC evaluations (BBC calls), calls effectively leading to an evaluation of the BBC (BBC active), and evaluations returning a non-zero value (BBC useful)

Figure 2 provides a summary of the usefulness of BBC. Each data point corresponds to the percentage of useful calls to BBC per fitness evaluation, measured for one objective and one execution out of 30. On average, BBC has been useful 2.5 times (σ = 3.17 times) per fitness evaluation, with a maximum of 40,145 times for a single fitness evaluation (which happens when multiple test cases have to be compared).

Summary (RQ 0)

Implicit branches are quite common. Our results show that, on average, BBC has been activated (i.e., the call to BBC effectively led to an evaluation) 9,472.140 times with a standard deviation σ = 40,567.40, denoting large variations of the activation among the different objectives. The usefulness rate per activation is 2.39% on average (σ = 12.09%), confirming that not all activations effectively lead to a distinction between two test cases w.r.t. their partial coverage of basic blocks. Those results tend to confirm our design choice to parameterize the activation of BBC using an activation probability.

5.2 Search-based unit test generation (RQ 1)

We first discuss the results of applying BBC as a secondary objective for unit test generation using DynaMOSA. Contrary to crash reproduction, which seeks to cover only a small number of branches, unit test generation targets all the branches in a class under test.

Branch coverage effectiveness (RQ 1.1)

Figure 3a reports the branch coverage of the different classes under test for all the 30 test suites for the different configurations of BBC. Generally, the average branch coverage slightly improves when activating BBC as a secondary objective (from 74.5% for DynaMOSA up to 76.1% for BBC 0.2, 0.4, 0.6, and 1.0). Although small, this improvement is systematic across all BBC configurations according to the effect sizes reported in Fig. 3b. BBC 0.6 gives the best results with a large positive (Â12 > 0.5) effect size for 59 classes under test (against 0 large negative, Â12 < 0.5, effect sizes), followed by BBC 0.2 with 59 classes (against 1 class), and BBC 0.8 with 57 classes (against 1 class).

Fig. 3 Branch coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square (\(\square \)) denotes the arithmetic mean, the bold line (—) is the median

Figure 4 provides a graphical representation of the ranking (i.e., mean ranks with confidence intervals) of the different BBC configurations. According to Friedman’s test, the treatments BBC 0.1 to 1.0 achieve a significantly different branch coverage (p-values < 0.01) compared to DynaMOSA. Furthermore, the differences between the average ranks of BBC 0.1 to 1.0 and the average rank of the baseline are larger than the critical distance CD = 1.375 determined by Nemenyi’s post-hoc procedure. This indicates that BBC 0.1 to 1.0 achieve a significantly higher branch coverage than DynaMOSA.

Fig. 4 Non-parametric multiple comparisons of the branch coverage using Friedman’s test with Nemenyi’s post-hoc procedure

We analyzed the correlation between the effect sizes (Â12) of the best performing BBC configuration (according to Friedman’s test with Nemenyi’s post-hoc procedure) and BBC usefulness (presented in RQ 0). The result of this analysis indicates a positive correlation between the number of times that BBC could be useful (i.e., select a winner between two given tests with the same approach level and branch distance) and the effect that this secondary objective has on branch coverage improvement (Spearman’s ρ = 0.4 with a p-value < 0.6e−10). Hence, whenever BBC reveals that one generated test is closer to the target line than another test with the same approach level and branch distance (due to an implicit branch), there is a considerable chance that this information helps the search-based test generation process to generate tests with higher branch coverage.

To confirm whether this observed correlation stems from the presence of implicit branches in the middle of basic blocks, we manually analyzed some cases in which applying BBC leads to a statistically significant improvement in the branch coverage achieved by the generated tests. In this manual analysis, we identified multiple potential implicit exceptions before the target lines and branches, which are only covered by tests generated with BBC as a secondary objective.

Listing 4 method nextToken() from JacksonDatabind-106

Listing 5 method iterateChildren() in JacksonDatabind-106

For instance, for the class under test com.fasterxml.jackson.databind.node.TreeTraversingParser in JacksonDatabind-106, the tests generated by BBC configurations achieve a higher structural coverage than DynaMOSA. In the majority of runs, the tests generated with BBC managed to cover Lines 6 to 11 in method nextToken() (Listing 4), while DynaMOSA did not succeed in covering these lines. Looking at method nodeCursor.iterateChildren() (Listing 5), which is called by nextToken() in line 6 of Listing 4, we see that this method may throw an IllegalStateException at lines 4 and 12. Since DynaMOSA does not have any information about the branches in classes other than the class under test, it cannot guide the search process to execute the method iterateChildren() without raising an exception.
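
To illustrate the structure of this situation, the following simplified and hypothetical sketch (not the actual JacksonDatabind code) shows a class under test delegating to a collaborator whose implicit exception interrupts the caller’s basic block; since the exception is raised outside the class under test, the approach level and branch distance provide no guidance to avoid it:

// Hypothetical collaborator of the class under test: its exception is
// invisible to the approach level and branch distance computed on the CUT.
class NodeCursorSketch {
    private final int size;
    private int index = 0;

    NodeCursorSketch(int size) { this.size = size; }

    Object iterateChildren() {
        if (index >= size) {
            // Raising this exception ends the execution of the caller's basic
            // block before its remaining (target) lines are reached.
            throw new IllegalStateException("No more children");
        }
        index++;
        return new Object();
    }
}

// Hypothetical class under test, mirroring the nextToken() scenario above.
class TreeTraversingParserSketch {
    private final NodeCursorSketch nodeCursor = new NodeCursorSketch(0);

    Object nextToken() {
        Object child = nodeCursor.iterateChildren(); // may throw implicitly
        // ... the target lines follow here and are only reachable when the
        // call above completes normally.
        return child;
    }
}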

Output coverage and implicit exception coverage (RQ 1.2)

The improvement of branch coverage also leads to more output diversity, reported in Fig. 5a: from 54.2% for DynaMOSA up to 55.5% for BBC 0.8. This improvement is also systematic across all BBC configurations according to the effect sizes reported in Fig. 5b. BBC 0.6 gives the best results with a large positive (Â12 > 0.5) effect size for 57 classes under test (against 2 large negative, Â12 < 0.5, effect sizes), followed by BBC 0.1 and 0.5 with 54 classes each (against 2 classes), and BBC 0.4 with 53 classes (against 2 classes).

Fig. 5 Output coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square (\(\square \)) denotes the arithmetic mean, the bold line (—) is the median

The two target classes with large negative effect sizes on the output coverage are the same classes for the different configurations of BBC: i.e., different versions of the class org.apache.commons.cli.HelpFormatter in Cli-31 and Cli-32. Interestingly, all BBC configurations achieve a statistically significant higher implicit runtime exception coverage (i.e., undeclared runtime exceptions not explicitly thrown by a throw new instruction) with a large effect size for the same class on the same buggy versions of Cli, indicating that for this particular class, the loss of coverage of output values is compensated by a higher number of methods throwing implicit runtime exceptions.

This could be explained by the fact that BBC favors test cases with a higher coverage of basic blocks, but that are not able to reach the return statements of the methods under test (e.g., if the values used by the test cause implicit runtime exceptions). There is however no general correlation between the output coverage and the implicit exception coverage (Spearman’s ρ = − 0.008 with a p-value < 0.001).

As for RQ 1.1, we evaluated the correlation between the improvement brought by BBC in terms of output coverage and BBC usefulness (presented in RQ 0). This analysis shows a positive correlation between these two metrics (Spearman’s ρ = 0.3 with a p-value < 0.1e−5). As explained above, this observation stems from the correlation between the branch coverage and the output coverage achieved by each test: covering more lines and branches increases the chance of observing more diverse outputs from the class under test. To support this hypothesis, we also checked whether there is a correlation between branch coverage and output coverage. Our analysis shows that branch coverage and output coverage are strongly correlated (Spearman’s ρ = 0.6 with a p-value < 0.3e−16).

Figure 6a reports the implicit runtime exception coverage of the generated tests. Implicit exceptions are not declared in the method under test and are triggered when implicit branches are executed. Results show that the average exception coverage increases when using BBC as a secondary objective: from 75.1% when using DynaMOSA up to 80.3% for BBC 0.1 and 0.6. BBC 0.9 gives the best results with a large positive (Â12 > 0.5) effect size for 67 classes under test (against 8 large negative, Â12 < 0.5, effect sizes), followed by BBC 0.6 with 66 classes (against 8 classes), and BBC 0.1 with 64 classes (against 7 classes).

Fig. 6 Exception coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square (\(\square \)) denotes the arithmetic mean, the bold line (—) is the median

The rankings in Fig. 7 indicate that BBC 0.1 to 1.0 perform well, with an average rank much smaller than the baseline, both for output and exception coverage. The differences between the configurations’ average ranks and the average rank of the baseline are larger than the critical distance CD = 1.375 determined by Nemenyi’s post-hoc procedure.

Fig. 7 Non-parametric multiple comparisons of the coverage using Friedman’s test with Nemenyi’s post-hoc procedure

In contrast with branch coverage and output coverage, Spearman’s test does not show any general correlation between BBC usefulness and implicit exception coverage (Spearman’s ρ = 0.04 with a p-value = 0.5). This result supports our discussion in Section 3: since BBC is only triggered when DynaMOSA compares tests regarding a line or branch coverage search objective, it does not have any negative impact on other search objectives, including the implicit exception coverage of the generated tests. We also analyzed some of the exceptions that are only thrown by the tests generated using BBC. The remainder of this section explains one of these instances.

Listing 6 An implicit exception in MATH-3 which is thrown significantly more often by tests generated by the search process utilizing BBC secondary objective

Listing 7 method linearCombination from Apache Commons MATH

Listing 6 shows an example of an implicit exception that is thrown significantly more often when using BBC. DynaMOSA managed to capture this exception in 9 out of 30 runs, while BBC 0.5 captured it in 23 out of 30 runs. This exception occurs in line 846 of method linearCombination (Listing 7) and can be triggered only in one specific case: when the input arrays (a and b) both contain exactly one element. If these two parameters do not have the same size, this method throws an explicit exception at line 838 (i.e., this line is formatted as throw new [...]). Since EvoSuite recognizes explicit exception throws in the CUT and converts them to explicit branches while generating the control flow graphs, the approach level and branch distance can guide the search process to cover the lines after 839 by prioritizing tests that pass two arrays of the same size to method linearCombination.

However, since this explicit branch is the only control-dependent branch for the target line (line 846), the search process does not receive any further guidance to cover the following lines (including the target line). Assume that a test T1 passes two arrays a and b of size 0, while another test T2 passes two non-empty arrays of the same size. T1 throws an ArrayIndexOutOfBoundsException at line 845, one line before the target line; this implicit branch is hidden from the approach level and branch distance heuristics, which rate T1 and T2 equally. With BBC, the search process can differentiate these two tests and generate tests that cover the following lines more often. With more tests covering the lines leading to the target line, the search process has a higher chance of executing the target line and thereby triggering the exception at this line.
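
The following simplified sketch (hypothetical code, not the actual Apache Commons Math source) mirrors this guidance gap: the explicit throw becomes a visible branch in the CFG, while the implicit ArrayIndexOutOfBoundsException for empty inputs stays hidden from the approach level and branch distance:

// Simplified stand-in for linearCombination: line numbers and exception types
// do not match the real Commons Math code.
class LinearCombinationSketch {
    static double linearCombination(double[] a, double[] b) {
        if (a.length != b.length) {
            // Explicit branch: EvoSuite turns this throw into an explicit
            // branch while building the control flow graph.
            throw new IllegalArgumentException("dimension mismatch");
        }
        // Implicit branch: if a and b are both empty, this access throws an
        // ArrayIndexOutOfBoundsException before the target line is reached.
        double prodHigh = a[0] * b[0];
        // ... target line: only reachable when both arrays are non-empty.
        return prodHigh;
    }
}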

Weak mutation score and real faults (RQ 1.3)

As for branch and output coverage, activating BBC slightly improves the weak mutation score of the generated tests (reported in Fig. 8a). BBC 0.4, 0.6, and 0.8 achieve the highest average mutation score with 74.6%, compared to 73.2% for the baseline (DynaMOSA). This improvement is also systematic across the different configurations of BBC according to the effect sizes reported in Fig. 8b. BBC 0.5 gives the best results with a large positive (Â12 > 0.5) effect size for 54 classes under test (against 0 large negative, Â12 < 0.5, effect sizes), followed by BBC 0.2 with 53 classes (against 0 classes), and BBC 0.4, 0.6, 0.7, and 0.9 with 51 classes each (against 0 classes).

Fig. 8 Weak mutation score of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square (\(\square \)) denotes the arithmetic mean, the bold line (—) is the median

Looking at the ranking reported in Fig. 9, BBC 0.1 to 1.0 have an average rank much smaller than the baseline. Those differences are larger than the critical distance CD = 1.375 determined by Nemenyi’s post-hoc procedure.

Fig. 9 Non-parametric multiple comparisons of the weak mutation score using Friedman’s test with Nemenyi’s post-hoc procedure

Moreover, we checked whether we could find any correlation between the weak mutation score and BBC usefulness (presented in RQ 0). This analysis shows a moderate correlation between these two metrics (Spearman’s ρ = 0.37 with a p-value < 0.3e−8). One reason for this correlation could be the strong correlation between the weak mutation score and branch coverage (Spearman’s ρ = 0.91 with a p-value < 0.3e−16). Thanks to the BBC secondary objective, the search-based test generation process can cover more lines and branches, thereby killing the mutants located in these newly covered lines.

Finally, we compare the fault revealing capabilities of the generated tests using Defects4J. Table 3 presents the results for the different configurations of BBC and the baseline (DynaMOSA). In general, the tests reveal between 25 and 28 faults at least once in 30 rounds of execution out of the 92 faults considered (the selection procedure is detailed in Section 4.1). For the faults that are revealed in at least one round, the average coverage frequency (over 30 rounds of execution) varies between 22.25% (for BBC 0.1 and 1.0) and 23.04% (for BBC 0.7). The table also reports the number of faults for which a configuration performed better (odds ratio above 1) or worse (odds ratio below 1) than the DynaMOSA baseline with a significance level of 0.01. The best configurations are BBC 0.4, 0.5, 0.6, 0.8, and 1.0 with 3 faults (against 0).

Table 3 Real fault coverage of the different configurations with the number of faults covered at least once in 30 runs (#) out of 92 faults, the average coverage frequency (\(\overline {freq.}\), σ), and the number of times a configuration performed better (> 1) or worse (< 1) than DynaMOSA with a significance level of 0.01

We manually analyzed the three faults that are captured significantly more often by BBC. In all of them, we identified potential implicit branches before the target line (i.e., the line in which the fault happens) that can prevent the classical approach level and branch distance from guiding the search process towards covering these failures.

For instance, Listing 8 presents the stack trace that reveals a fault in JFreeChart. When selecting the XYPlot class as class under test, BBC configurations can throw this exception significantly more often than tests generated by DynaMOSA. This stack trace has five frames pointing to a method in the target class (XYPlot): Lines 1, 4, 5, 6, and 7 in Listing 8. By analyzing the methods in these frames, we can see that the majority of them are simple one-line methods, except the first frame in Line 1 of Listing 8, which points to method getDataRange, containing about 100 lines of code.

As we can see in Listing 9, the target line, in which the NullPointerException occurs (Line 4493), is inside an if condition that starts at Line 4472. The target line is directly control-dependent on this condition. Hence, when a test fulfills the condition at Line 4472, the approach level and branch distance heuristics assume that the generated test will eventually cover the target line, and thereby these two heuristics do not provide any further guidance to the test generation search process. However, taking a closer look, we can see that even after entering the if condition, a test first needs to call the combine method (at Line 4476, 4479, 4485, or 4488) and also call either findDomainBounds (at Line 4476 or 4479) or findRangeBounds (at Line 4485 or 4488) before it can reach the target line. Each of these methods can throw explicit exceptions. Since these methods are not part of the class under test, the search process is unaware of those exceptions. Moreover, each of these methods calls multiple other methods that can also throw exceptions.

Listing 8 The fault in CHART-4 which is captured significantly more often by tests generated by the search process utilizing BBC secondary objective

Listing 9 method getDataRange from JFreeChart

BBC can guide the test generation search process to execute these lines without any exception and cover the target line. By covering the target line, the search process has the opportunity to generate a test that throws a NullPointerException at this line, thereby capturing the fault.

Branch coverage efficiency (RQ 1.4)

Figure 10a presents the evolution of the branch coverage over time using smoothed conditional means. Overall, BBC 0.5 tends to achieve a higher branch coverage. This is confirmed by the number of classes for which we observe a significant difference (with α = 0.01) in the coverage achieved, reported in Fig. 10b and grouped by effect size (Â12) magnitude. Counts above (resp. below) 0 denote the number of classes for which we observe a positive (resp. negative) effect. After three minutes, BBC 0.4 achieves a large (resp. medium) positive effect size for 34 (resp. 18) classes under test against 1 (resp. 0) classes with a large (resp. medium) negative effect size. Those numbers slightly decrease over time, with 27 (resp. 18) classes under test with a large (resp. medium) positive effect size after exhaustion of the ten-minute search budget, against 1 (resp. 0) classes with a large (resp. medium) negative effect size.

Fig. 10 Evolution of the branch coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC

Summary (RQ 1)

We see an improvement of the branch coverage of the generated tests when activating BBC as a secondary objective in DynaMOSA. This improvement in branch coverage also leads to an increase of the output and exception coverage, and of the diversity of runtime states (denoted by an increase of the weak mutation score). Among the different configurations, BBC 0.5 gives the best results, and those results remain stable over time. It also leads to the coverage of three additional faults in Defects4J without any loss compared to the baseline. Given our results, we recommend using BBC 0.5 as a secondary objective for unit test generation.

5.3 Search-based crash reproduction (RQ 2)

Crash reproduction effectiveness (RQ 2.1)

Figure 11 presents the crash reproduction ratio of the search processes guided by STDistance (Fig. 11a) and WeightedSum (Fig. 11b), with and without BBC as a secondary objective. This figure shows that, on average, the crash reproduction ratio of WeightedSum improves by 3.3% when using BBC. This improvement is higher for crash reproduction using STDistance: on average, the crash reproduction ratio achieved by STDistance + BBC is 9.2% higher than STDistance without BBC. A higher improvement for STDistance was expected, as this fitness function relies more on the approach level and branch distance heuristics to cover each of the frames in the given stack trace. Also, for both fitness functions, the lower quartile of the crash reproduction ratio improves when using BBC: by 19.1% for WeightedSum and 31.7% for STDistance.

Fig. 11 Crash reproduction ratio (out of 30 executions) of fitness functions with and without BBC. The square (\(\square \)) denotes the arithmetic mean and the bold line (—) is the median

Figure 12 depicts the number of crashes for which BBC has a significant impact on the effectiveness of crash reproduction guided by STDistance (Fig. 12a) and WeightedSum (Fig. 12b). BBC significantly improves the crash reproduction ratio for 10 and 4 crashes for the STDistance and WeightedSum fitness functions, respectively. Notably, the application of this secondary objective does not have any significant adverse effect on crash reproduction. Tables 4 and 5 present the odds ratios and p-values for the cases where BBC leads to a significant improvement in the crash reproduction ratios of WeightedSum and STDistance, respectively. As we can see in these tables, the odds ratio values in all cases are lower than or equal to 0.2, indicating the high impact of BBC. Finally, we observed that BBC helps each of STDistance and WeightedSum to reproduce 3 new crashes that could not be reproduced without this secondary objective.

Fig. 12 Pairwise comparison of the impact of BBC on each fitness function in terms of crash reproduction ratio with a statistical significance < 0.01

Table 4 Comparing the crash reproduction ratio between crash reproduction using WS and WS + BBC, for cases where one of the configurations has a significantly higher crash reproduction ratio (p-value < 0.01)
Table 5 Comparing the crash reproduction ratio between crash reproduction using STD and STD + BBC, for cases where one of the configurations has a significantly higher crash reproduction ratio (p-value < 0.01)

Crash reproduction efficiency (RQ 2.2)

Figure 13 illustrates the number of crashes for which BBC significantly affects the time consumed by the crash reproduction search process. As Fig. 13b shows, BBC significantly improves the speed of crash reproduction guided by WeightedSum for 54 crashes (43.5% of cases), while it does not reduce the reproduction efficiency of any crash.

Fig. 13 Pairwise comparison of the impact of BBC on each fitness function in terms of efficiency with a small, medium, and large effect size \(\textit {\^{A}}_{12} < 0.5\) and a statistical significance < 0.01

Similarly, Fig. 13a shows that BBC has an even higher positive impact on the efficiency of the search process guided by STDistance: it significantly reduces the time consumed by the search process for 56 crashes (45.1% of cases), while it has no adverse impact on the reproduction efficiency of any crash.

Figure 14 depicts the average improvements in efficiency and the effect sizes for crashes where the difference in the consumed budget with and without BBC is significant. According to the right-hand plot in Fig. 14a, BBC reduces the time consumed by the search process guided by STDistance by up to 98% (71.7% on average). Also, the left-hand plot indicates that the average effect size of the differences between STDistance and STDistance + BBC (calculated with Vargha-Delaney) is 0.102 (a value lower than 0.5 indicates that BBC improves efficiency). Figure 14b shows that the average improvement (right-hand plot) achieved by using BBC as the secondary objective of WeightedSum is 68.7%, and the average effect size (left-hand plot), in terms of crash reproduction efficiency, is 0.104.

Fig. 14 The effect size and the average improvement achieved by BBC on each of the fitness functions in cases where BBC makes a significant difference in terms of efficiency

Summary (RQ 2)

BBC improves the crash reproduction ratio for both the WeightedSum and STDistance fitness functions. This improvement is higher for STDistance, as this fitness function relies more on the approach level and branch distance. Moreover, BBC improves the efficiency of the search process with both crash reproduction fitness functions.

6 Discussion

6.1 BBC for unit test generation

Increase in program state and return value diversity

Using BBC as a secondary objective leads to a better branch coverage. Although small on average, the improvement is systematic, as demonstrated by the effect sizes. More interestingly, BBC also leads to a better output and implicit exception coverage. This is particularly interesting in a unit testing context because it allows capturing more diverse return values (including implicit exceptions) from the methods under test. We observe the same trends for weak mutation, denoting more diverse program states. Although the evaluation of the quality of the generated tests is outside the scope of this study, we believe that diverse return values and program states can have a positive impact on the quality of the generated assertions, which is one of the known limitations currently preventing a broader industrial adoption of search-based unit test generation (Almasi et al. 2017).

Adaptive secondary objectives

As explained in Section 3.3, applying BBC can be expensive (\(\mathcal {O}(N \times E \times \log V)\)) compared to classical secondary objectives (linear time). Therefore, BBC should be activated only when it can effectively contribute to deciding between two test cases with the same fitness value. As shown by our preliminary analysis, this is especially relevant in the context of unit test generation, where each branch should be covered, which can trigger a high number of BBC evaluations. In our implementation of BBC for unit testing (described in Section 3.3.2), we limit the number of activations of BBC based on the activation time of an objective (Sleep Time) and a user-defined probability (Usage Rate). This approach might, however, not be optimal. For instance, for classes under test with a high number of implicit branches, activating BBC sooner and more often might improve the search process. In our future work, we will explore how the secondary objective can be dynamically adapted during the search, for instance, based on the evolution of the fitness values of the different objectives in DynaMOSA.
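
As a rough illustration of this mechanism, the following hypothetical sketch (the identifiers SLEEP_TIME_MS, USAGE_RATE, and compareBasicBlockCoverage are ours for exposition and do not correspond to EvoSuite’s actual API) only triggers the expensive BBC comparison when the objective has been targeted long enough and a biased coin flip below the usage rate succeeds:

import java.util.concurrent.ThreadLocalRandom;

// Hypothetical gating of the BBC secondary objective.
class GatedBbcComparatorSketch {
    private static final long SLEEP_TIME_MS = 60_000;  // minimum age of the objective
    private static final double USAGE_RATE = 0.5;      // activation probability

    int compare(TestSketch t1, TestSketch t2, long objectiveAgeMs) {
        if (objectiveAgeMs >= SLEEP_TIME_MS
                && ThreadLocalRandom.current().nextDouble() < USAGE_RATE) {
            return compareBasicBlockCoverage(t1, t2);  // expensive graph-based comparison
        }
        return Integer.compare(t1.size(), t2.size());  // cheap fallback (e.g., test length)
    }

    private int compareBasicBlockCoverage(TestSketch t1, TestSketch t2) {
        // Comparison of the basic blocks covered on the paths towards the
        // target (omitted here).
        return 0;
    }
}

class TestSketch {
    int size() { return 0; }
}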

6.2 BBC for crash reproduction

Generally, using BBC as a secondary objective leads to a better crash reproduction ratio and a higher efficiency in search-based crash reproduction. This improvement is achieved thanks to the additional ability to guide the search process when facing implicit branches during the search. Combining BBC with STDistance shows a larger improvement than combining BBC with WeightedSum. This result was expected, since only one (out of three) components in WeightedSum is allocated to line coverage, and thereby most of this fitness function does not use the approach level and branch distance heuristics. In contrast, STDistance uses the approach level and branch distance to cover each of the frames in the given stack trace incrementally.
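
Schematically, and without restating the exact weights defined in previous work, WeightedSum can be seen as a weighted combination of three components, of which only the first relies on reaching the target line (and hence on the approach level and branch distance):

\(\mathit{WeightedSum}(t) = w_1 \cdot d_{line}(t) + w_2 \cdot d_{exception}(t) + w_3 \cdot d_{trace}(t)\)

where, roughly, \(d_{line}\) measures the distance to the target line, \(d_{exception}\) captures whether the target exception is thrown, and \(d_{trace}\) the similarity between the generated and the reported stack traces. Only ties originating from the first component can benefit from the extra guidance of BBC, which is consistent with the smaller improvement observed for WeightedSum.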

Our results show that BBC helps the crash reproduction process to reproduce new crashes. For instance, the crash that we used in this study (XWIKI-13377) can be reproduced only by STDistance + BBC.

6.3 BBC and testability transformations

In this study, we tried to evaluate TT in DynaMOSA. However, EvoSuite failed before starting the search process for all the different classes under test. After a deeper investigation, we found out that TT is not compatible with DynaMOSA, which is the default search algorithm in EvoSuite. Moreover, as discussed in Section 2.2, TT faces extra challenges as it requires additional bytecode instrumentation.

In theory, given the nature of TT and BBC, these two techniques can be applied simultaneously to the search process. Hence, these two approaches could complement each other to achieve a higher structural coverage and detect more faults. Studying the impact of using both TT and BBC on search-based test generation calls for further implementation effort and is therefore part of our future research agenda.

7 Threats to Validity

Internal validity

We cannot guarantee that our implementation of BBC in EvoSuite and Botsing is bug-free. However, we mitigated this threat by testing our implementations and manually examining some samples of the results. Moreover, following the guidelines of the related literature (Arcuri and Briand 2014), we executed each configuration 30 times to take the randomness of the search process into account.

External validity

We cannot ensure that our results are generalizable to all cases. However, for both of our experiments on unit test generation and crash reproduction, we have used two established benchmarks: JCrashPack (Soltani et al. 2020), a crash reproduction benchmark containing 124 hard-to-reproduce crashes provoked by real bugs in a variety of open-source applications, and Defects4J (Just et al. 2014), a collection of 835 bugs from real-world Java projects.

To increase the external validity while maintaining a good balance between the statistical power and the overall execution, analysis, and reporting time, we choose to consider only the ten most recent bugs from the 17 projects available in Defects4J. After filtering out classes that cannot be handled by EvoSuite, we ran our evaluation on 219 classes. Among those 219 classes, 44 come from different versions of the same projects. Although involved in different bugs, those classes might be similar and influence our results. To mitigate this threat, we performed a qualitative analysis to confirm the effect of BBC.

Construct validity

For unit test generation (RQ 1), we left the parameters of DynaMOSA to their default values used by EvoSuite. Those values are commonly used in the literature and it has been empirically shown that they give good results (Panichella et al. 2018a, 2018b; Arcuri and Fraser 2013; Fraser and Arcuri 2014). We cannot, however, guarantee that these default values are the best when used with BBC. Nevertheless, our results show that BBC can improve search-based unit test generation when using the default parameter values.

For search-based crash reproduction (RQ 2), we used BBC with two different fitness functions and left other parameters to their default values, used in previous studies (Soltani et al. 2018; Derakhshanfar et al. 2020). Those studies do not investigate the sensitivity of search-based crash reproduction to these values, and tuning these parameters should be undertaken as future work. However, as for unit test generation, our results show that BBC can improve search-based crash reproduction with the default parameter values.

Conclusion validity

We based our conclusions on standard statistical analyses for significance (Arcuri and Briand 2014) with α = 0.01. Effects of multiple comparisons are mitigated by adjusting p-values via Nemenyi’s post-hoc procedure (Japkowicz and Shah 2011; Panichella 2021). Furthermore, we complemented our quantitative analysis with qualitative investigations to confirm the observed effects.

Verifiability

Finally, we openly provide all our implementations: Botsing, an open-source crash reproduction tool, and our implementation of BBC in EvoSuite. Also, the data and the processing scripts used to present the results are available as two replication packages on Zenodo (Derakhshanfar and Devroey 2020, 2021).

8 Conclusion and Future Work

Approach level and branch distance are two well-known heuristics, widely used by search-based test generation approaches to guide the search process towards covering target statements and branches. These heuristics measure the distance of a generated test from covering the target using the coverage of control dependencies. However, these two heuristics do not consider implicit branches. For instance, if a test throws an exception during the execution of a non-branch statement, approach level and branch distance cannot guide the search process to tackle this exception. In this paper, we extended our previous work on Basic Block Coverage (BBC), a secondary objective addressing this issue, and complemented our previous study of BBC for search-based crash reproduction with an investigation of BBC for unit test generation.

Our results show that BBC improves the branch coverage of unit tests generated using DynaMOSA. Although small (\(\sim \)1%), this improvement in branch coverage is systematic and leads to an increase of the output and implicit runtime exception coverage, and of the diversity of runtime states. BBC also helps STDistance and WeightedSum to reproduce 6 and 1 new crashes, respectively. Finally, BBC significantly improves the efficiency for 26.6% and 13.7% of the crashes using STDistance and WeightedSum, respectively.

An important implication of our work for future research is that we need to investigate secondary search objectives that can be dynamically activated depending on the software under test. In this work, we applied the activation mechanism for secondary search objectives (BBC) based on user-provided (static) meta-parameters. We have seen indications that such a mechanism can both improve the search process and at the same time reduce the computational cost, yet it can be counter-productive in some cases. We envision that BBC and other secondary objectives would benefit from an adaptive activation, depending on the runtime behavior (e.g., if the number of implicit runtime exceptions increases) or structure (e.g., high coupling or deep inheritance hierarchy) of the classes under test.

In our future work, we will investigate the application of BBC for other search-based test generation techniques (such as testability transformations, and system and integration testing), as well as the implications of an increase of the diversity of program states in the generated unit tests (e.g., for assertions generation). We will also investigate how BBC can be dynamically activated using an adaptive secondary objectives approach to reduce the computational overload on the search process.