Basic block coverage for search-based unit testing and crash reproduction

Search-based techniques have been widely used for white-box test generation. Many of these approaches rely on the approach level and branch distance heuristics to guide the search process and generate test cases with high line and branch coverage. Despite the positive results achieved by these two heuristics, they only use the information related to the coverage of explicit branches (e.g., indicated by conditional and loop statements), but ignore potential implicit branchings within basic blocks of code. If such implicit branching happens at runtime (e.g., if an exception is thrown in a branchless method), the existing fitness functions cannot guide the search process. To address this issue, we introduce a new secondary objective, called Basic Block Coverage (BBC), which takes into account the coverage level of relevant basic blocks in the control flow graph. We evaluated the impact of BBC on search-based unit test generation (using the DynaMOSA algorithm) and search-based crash reproduction (using the STDistance and WeightedSum fitness functions). Our results show that for unit test generation, BBC improves the branch coverage of the generated tests. Although small (∼1.5%), this improvement in the branch coverage is systematic and leads to an increase of the output domain coverage and implicit runtime exception coverage, and of the diversity of runtime states. In terms of crash reproduction, BBC helps in reproducing 3 new crashes for each of the STDistance and WeightedSum fitness functions. BBC significantly decreases the time required to reproduce 43.5% and 45.1% of the crashes using STDistance and WeightedSum, respectively. For these crashes, BBC reduces the consumed time by 71.7% (for STDistance) and 68.7% (for WeightedSum) on average.


Introduction
Various search-based techniques have been introduced to automate different white-box test generation activities (e.g., unit testing [20,22], integration testing [13], or system-level testing [4]). Depending on the testing level, each of these approaches utilizes dedicated fitness functions to guide the search process and produce a test suite satisfying given criteria (e.g., line coverage, branch coverage, etc.).
Fitness functions typically rely on control flow graphs (CFGs) to represent the source code of the software under test [35]. Each node in a CFG is a basic block of code (i.e., maximal linear sequence of statements with a single entry and exit point without any internal branch), and each edge represents a possible execution flow between two blocks. Two well-known heuristics are usually combined to achieve high line and branch coverage: the approach level and the branch distance [35]. The former measures the distance between the execution path of the generated test and a target basic block (i.e., a basic block containing a statement to cover) in the CFG. The latter measures, using a set of rules, the distance between an execution and the coverage of a true or false branch of a particular predicate in a branching basic block of the CFG.
Both approach level and branch distance assume that only a limited number of basic blocks (i.e., control dependent basic blocks [1]) can change the execution path away from a target statement (e.g., if a target basic block is the true branch of a conditional statement). However, basic blocks are not atomic due to the presence of implicit branches [8] (i.e., branches occurring due to the exceptional behavior of instructions). As a consequence, any basic block between the entry point of the CFG and the target basic block can impact the execution of the target basic block. For instance, a generated test case may stop its execution in the middle of a basic block with a runtime exception thrown by one of the statements of that basic block. In these cases, the search process does not benefit from any further guidance from the approach level and branch distance.
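To make the notion of an implicit branch concrete, the following minimal Java sketch shows a branchless method whose single basic block can be left early by a NullPointerException. The class and method names are hypothetical and are not taken from any subject system; the list of reached lines merely stands in for the coverage instrumentation a test generator would use.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a branchless block that can still be
// exited early through an implicit (exceptional) branch.
class ImplicitBranchDemo {

    // Returns the "lines" of the block that were reached before the
    // block finished or an exception interrupted it.
    static List<Integer> executeBlock(String name, String value) {
        List<Integer> reached = new ArrayList<>();
        try {
            reached.add(1);
            int a = name.length();   // line 1: may throw NullPointerException
            reached.add(2);
            int b = value.length();  // line 2: may throw NullPointerException
            reached.add(3);
            int sum = a + b;         // line 3: last statement of the block
        } catch (NullPointerException e) {
            // Execution left the basic block before reaching its end.
        }
        return reached;
    }

    public static void main(String[] args) {
        System.out.println(executeBlock(null, "x")); // reaches line 1 only
        System.out.println(executeBlock("x", null)); // reaches lines 1 and 2
        System.out.println(executeBlock("x", "y"));  // reaches the whole block
    }
}
```

Since the block contains no predicate, the first two calls would be indistinguishable for the approach level and branch distance, even though the second one progresses further.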
Fraser and Arcuri [24] introduced testability transformation for unit testing, which instruments the code to guide the unit test generation search to cover implicit exceptions happening in the class under test. However, this approach does not guide the search process in scenarios where an implicit branch happens in another class called by the class under test. This is due to the extra cost added to the search process stemming from the calculation and monitoring of implicit branches in all the classes coupled to the class under test. For instance, the class under test may be heavily coupled with other classes in the project, so finding implicit branches in all of these classes can be expensive.
In contrast, other test case generation scenarios, like crash reproduction, aim to cover only a limited number of paths, and thereby we only need to analyse a limited number of basic blocks [10,38,47,52,55]. Current crash reproduction approaches rely on information about a reported crash (e.g., a stack trace, a core dump, etc.) to generate a crash reproducing test case. Among these approaches, search-based crash reproduction [47,52] takes as input a stack trace to guide the generation process. More specifically, the statements pointed to by the stack trace act as target statements for the approach level and branch distance. Hence, current search-based crash reproduction techniques suffer from a lack of guidance in cases where the involved basic blocks contain implicit branches (which is common when trying to reproduce a crash).
In our prior work, we introduced a novel secondary objective called Basic Block Coverage (BBC) to address the guidance problem in crash reproduction [17]. The secondary objective guides the search process to differentiate two generated tests with the same fitness values (here, same approach level and branch distance). This paper extends our prior work on BBC to the more general unit test case generation context. BBC helps the search process to compare two generated test cases with the same distance (according to approach level and branch distance) to determine which one is closer to the target statement. In this comparison, BBC analyzes the coverage level, achieved by each of these test cases, of the basic blocks in between the closest covered control dependent basic block and the target statement.
To assess the impact of BBC on search-based unit test generation, we implemented BBC in EvoSuite [20], the state-of-the-art tool for search-based unit test generation, and evaluated its performance against the classical DynaMOSA [43] for various activation probabilities of BBC (11 configurations in total). We applied these eleven configurations to 219 classes under test selected from the latest version of Defects4J v.2.0.0 [29], a collection of existing faults. We compare the performance in terms of effectiveness for branch coverage, weak mutation score, output coverage, and real fault detection capabilities.
Our results show that BBC improves the branch coverage of the generated tests when activating BBC as a secondary objective in DynaMOSA. Utilizing this secondary objective improves the average branch coverage achieved by DynaMOSA (74.5% average branch coverage with standard deviation 28%) to 76.1% with standard deviation 27.5%. Although the improvement in the average branch coverage is slight, it is systematic, as indicated by the statistical analysis performed in this study: for 59 target classes, BBC improves the branch coverage achieved by DynaMOSA significantly (p-value < 0.01) with a large effect size. This improvement in the branch coverage leads to an increase of coverages and scores achieved by tests generated by the unit test generation process in terms of output domain (i.e., the number of pre-defined partitions of the output values domain) coverage, implicit runtime exception coverage, and the diversity of runtime states (denoted by the weak mutation score). BBC increases the average output domain coverage of the generated tests from 54.2% (with standard deviation 26.6%) up to 55.5% (with standard deviation 26.2%). The improvement achieved by this secondary objective is statistically significant and has a large effect size in 57 classes under test. Moreover, BBC improves the average implicit runtime exception coverage when using DynaMOSA from 75.1% (with standard deviation 22.8%) up to 80.3% (with standard deviation 21%). Besides, this secondary objective significantly improves the implicit runtime exception coverage with large effect size in 67 classes. Also, BBC improves the weak mutation score achieved by the tests generated by DynaMOSA from 73.2% (with standard deviation 30.1%) up to 74.6% (with standard deviation 29.6%). Finally, our statistical analysis shows that activating BBC also significantly improves, with a large effect size, the fault detection rate for 3 real faults out of 92.
Similarly, to assess the impact of BBC on search-based crash reproduction, we re-implemented the existing STDistance [47] and WeightedSum [52] fitness functions and empirically compared their performance with and without using BBC (4 configurations in total). We applied these four crash reproduction configurations to 124 hard-to-reproduce crashes introduced in JCrashPack [50], a crash benchmark used by previous crash reproduction studies [16]. We compare the performance in terms of effectiveness in crash reproduction ratio (i.e., percentage of times that an approach can reproduce a crash) and efficiency (i.e., time required to reproduce a crash).
Our results show that BBC significantly improves the crash reproduction ratio over the 30 runs in our experiment for 10 and 4 crashes, respectively, compared to using STDistance and WeightedSum without any secondary objective. Also, BBC helps these two fitness functions to reproduce 3 (for STDistance) and 3 (for WeightedSum) crashes that could not be reproduced without the secondary objective. Besides, on average, BBC increases the crash reproduction ratio of STDistance and WeightedSum from 70.5% (with standard deviation 38.1%) to 79.7% (with standard deviation 37.3%) and from 74.8% (with standard deviation 38.1%) to 78.1% (with standard deviation 36.1%), respectively. Applying BBC also significantly reduces the time consumed for crash reproduction guided by STDistance and WeightedSum in 56 (45.1% of cases) and 54 (43.5% of cases) crashes, respectively. In cases where BBC has a significant impact on efficiency, this secondary objective improves the average efficiency of STDistance and WeightedSum by 71.7% (with standard deviation 36%) and 68.7% (with standard deviation 28.9%), respectively.
The remainder of this paper is organized as follows: Section 2 reports the background on CFG-based guidance. Section 3 describes our novel BBC secondary objective and how it can be used for search-based crash reproduction and search-based unit test generation. Section 4 describes our evaluation to assess the importance of implicit branches (RQ 0) and the impact of BBC on search-based unit test generation (RQ 1) and search-based crash reproduction (RQ 2). Section 5 presents our results on 219 classes under test selected from the latest version of Defects4J and 124 hard-to-reproduce crashes from JCrashPack. Sections 6 and 7 discuss our results and their implications for search-based test case generation, Section 8 discusses related work, and Section 9 concludes the paper.

Coverage distance heuristics
Many structural-based search-based test generation approaches mix the branch distance and approach level heuristics to achieve a high line and branch coverage [35]. These heuristics measure the distance between a test execution path and a specific statement or a specific branch in the software under test. For that, they rely on the coverage information of control dependent basic blocks, i.e., basic blocks that have at least one outgoing edge leading the execution path toward the target basic block (containing the targeted statement) and at least another outgoing edge leading the execution path away from the target basic block. As an example, Listing 1 shows the source code of the method fromMap from XWIKI, and Figure 1 contains the corresponding CFG. In this graph, the basic block 409 is control dependent on the basic block 407-408 because the execution of line 409 is dependent on the condition at line 408 (i.e., line 409 will be executed only if elements of array formvalues are String).
[Listing 1: Method fromMap from XWIKI version 8.1 [50]. Signature fragment: public BaseCollection fromMap(Map<...]
The approach level is the number of uncovered control dependent basic blocks for the target basic block between the closest covered control dependent basic block and the target basic block. The branch distance is calculated from the predicate of the closest covered control dependent basic block, based on a set of predefined rules. Assuming that the test t covers only lines 403 and 418, and our target line is 409, the approach level is 2 because two control dependent basic blocks (404-406 and 407-408) are not covered by t. The branch distance for the predicate in line 403 (the closest covered control dependency of node 409) is measured based on the rules from the established technique [35].
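The approach level computation in this example can be sketched as follows. This is a minimal illustration under simplifying assumptions: the control dependencies of the target are given as a single ordered chain, blocks are identified by the line ranges used in the text, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: approach level over a linear chain of control
// dependent blocks leading to the target block.
class ApproachLevelSketch {

    // dependencyChain lists the control dependent blocks of the target,
    // ordered from the method entry toward the target. The result is the
    // number of chain blocks located after the closest covered one.
    static int approachLevel(List<String> dependencyChain, Set<String> coveredBlocks) {
        int closestCovered = -1;
        for (int i = 0; i < dependencyChain.size(); i++) {
            if (coveredBlocks.contains(dependencyChain.get(i))) {
                closestCovered = i;
            }
        }
        return dependencyChain.size() - 1 - closestCovered;
    }

    public static void main(String[] args) {
        // Target block 409 depends on 403, 404-406, and 407-408.
        List<String> chain = Arrays.asList("403", "404-406", "407-408");
        // The test t covers only lines 403 and 418.
        Set<String> covered = new HashSet<>(Arrays.asList("403", "418"));
        System.out.println(approachLevel(chain, covered)); // prints 2
    }
}
```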
To the best of our knowledge, there is no related work studying the extra heuristics helping the combination of approach level and branch distance to improve the coverage. Most related to our work, Panichella et al. [43] and Rojas et al. [46] introduced two heuristics called infection distance and propagation distance, to improve the weak mutation score of two generated test cases. However, these heuristics do not help the search process to improve the general statement coverage (i.e., they are effective only after covering a mutated statement).
In this paper, we introduce a new secondary objective to improve the statement coverage achieved by fitness functions based on the approach level and branch distance, and analyze the impact of this secondary objective on search-based unit test generation and search-based crash reproduction.
DynaMOSA was introduced by Panichella et al. [43]. Their study [42], independently confirmed by Campos et al. [9], shows that DynaMOSA outperforms other unit test generation techniques in terms of structural coverage and mutation coverage. This approach is currently used as the default algorithm in EvoSuite, which is the state-of-the-art tool for search-based unit test generation.
DynaMOSA relies on the hierarchy of dependencies between the coverage targets (e.g., lines and branches) to perform a dynamic selection of the objectives during the search process. For instance, by applying DynaMOSA to generate tests for method fromMap (Listing 1), this algorithm first tries to cover targets that do not have any dependencies. So, first, it tries to generate test cases to cover nodes 403 and 418. After covering node 403, it tries to cover the node 404-406, which is control-dependent on the covered node. DynaMOSA continuously changes the search objectives up to the point that all of the targets are covered.
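The dynamic selection of objectives described above can be illustrated with a small sketch. It is not the DynaMOSA implementation: it assumes that each target has at most one control dependency (a simplification), and all names are hypothetical. A target becomes active once the target it depends on is covered, and it stops being active once it is covered itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of dependency-driven selection of active targets.
class DynamicTargetsSketch {

    // Maps each target of fromMap to the target it is control dependent
    // on (null for targets without dependencies), following the text.
    static Map<String, String> fromMapDependencies() {
        Map<String, String> parentOf = new HashMap<>();
        parentOf.put("403", null);
        parentOf.put("418", null);
        parentOf.put("404-406", "403");
        parentOf.put("407-408", "404-406");
        parentOf.put("409", "407-408");
        return parentOf;
    }

    // Active targets: uncovered targets whose dependency is already covered.
    static Set<String> activeTargets(Map<String, String> parentOf, Set<String> covered) {
        Set<String> active = new TreeSet<>();
        for (Map.Entry<String, String> entry : parentOf.entrySet()) {
            String parent = entry.getValue();
            boolean unlocked = parent == null || covered.contains(parent);
            if (unlocked && !covered.contains(entry.getKey())) {
                active.add(entry.getKey());
            }
        }
        return active;
    }

    public static void main(String[] args) {
        Set<String> covered = new HashSet<>();
        System.out.println(activeTargets(fromMapDependencies(), covered)); // [403, 418]
        covered.add("403");
        System.out.println(activeTargets(fromMapDependencies(), covered)); // [404-406, 418]
    }
}
```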
Since DynaMOSA uses the approach level and branch distance heuristics to guide the search process towards achieving high line, branch, and weak mutation coverage, BBC may help this technique to cover more targets. This study performs an in-depth experiment and analysis to see whether BBC can improve DynaMOSA.

Search-based Crash Reproduction
After a crash is reported, one of the essential steps of software debugging is to write a crash reproducing test case to make the crash observable to the developer and help them in identifying the root cause of the failure [56]. Later, this crash reproducing test can be integrated into the existing test suite to prevent future regressions. Despite the usefulness of a crash reproducing test, the process of writing this test can be labor-intensive and time-consuming [52]. Various techniques have been introduced to automate the reproduction of a crash [10,38,47,52,55], and search-based approaches (EvoCrash [52] and ReCore [47]) yielded the best results [52].
EvoCrash. This approach utilizes a single-objective genetic algorithm to generate a crash reproducing test from a given stack trace and a target frame (i.e., a frame in the stack trace whose class is used as the class under test). The crash reproducing test generated by EvoCrash throws the same stack trace as the given one up to the target frame. For example, by passing the stack trace in Listing 2 and target frame 3 to EvoCrash, it generates a test case reproducing the first three frames of this stack trace (i.e., thrown stack trace is identical from line 0 to 3).
EvoCrash uses a fitness function, called WeightedSum, to evaluate the candidate test cases. WeightedSum is the sum scalarization of three components: (i) the target line coverage ($d_s$), which measures the distance between the execution trace and the target line (i.e., the line number pointed to by the target frame) using approach level and branch distance; (ii) the exception type coverage ($d_e$), determining whether the type of the triggered exception is the same as the given one; and (iii) the stack trace similarity ($d_{tr}$), which indicates whether the stack trace triggered by the generated test contains all frames (from the most in-depth frame up to the target frame) in the given stack trace.
Definition 1 (WeightedSum [52]) For a given test case execution t, the WeightedSum (ws) is a sum scalarization of $d_s(t)$, $d_e(t)$, and $d_{tr}(t)$ in which $d_e(t)$ and $d_{tr}(t)$ are only considered when two constraints are satisfied: (i) exception type coverage is relevant only when we reach the target line and (ii) stack trace similarity is important only when we both reach the target line and throw the same type of exception.
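One way to read the two constraints is as a piecewise function. The form below is only a sketch of that case structure: the weights $w_s > w_e > 0$ are placeholders introduced here for illustration and do not claim to be the constants used in the original definition [52].

```latex
ws(t) =
\begin{cases}
  w_s \cdot d_s(t) & \text{if the target line is not reached} \\
  w_e \cdot d_e(t) & \text{if the target line is reached, but the exception type differs} \\
  d_{tr}(t)        & \text{if the target line is reached and the same exception type is thrown}
\end{cases}
```

Under this reading, a test that has not reached the target line is guided only by $d_s(t)$, so any loss of guidance in the approach level and branch distance affects the whole fitness value.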
As an example, when applying EvoCrash on the stack trace from Listing 2 with the target frame 3, WeightedSum first checks if the test cases generated by the search process reach the statement pointed to by the target frame (line 413 in class BaseClass in this case). Then, it checks if the generated test can throw a ClassCastException or not. Finally, after fulfilling the first two constraints, it checks the similarity of frames in the stack trace thrown by the generated test case against the given stack trace in Listing 2.
EvoCrash uses guided initialization, mutation, and single-point crossover operators to ensure that the target method (i.e., the method appearing in the target frame) is always called by the different tests during the evolution process.
According to a recent study, EvoCrash outperforms other non-search-based crash reproduction approaches in terms of effectiveness in crash reproduction and efficiency [52]. This study also shows the helpfulness of tests generated by EvoCrash for developers during debugging.
In this paper, we assess the impact of BBC as the secondary objective in the EvoCrash search process.
ReCore. This approach utilizes a genetic algorithm guided by a single fitness function, which has been defined according to the core dump and the stack trace produced by the system when the crash happened. To be more precise, this fitness function is a sum scalarization of three sub-functions: (i) TestStackTraceDistance, which guides the search process according to the given stack trace; (ii) ExceptionPenalty, which indicates whether the same type of exception as the given one is thrown or not (identical to ExceptionCoverage in EvoCrash); and (iii) StackDumpDistance, which guides the search process by the given core dump.
Definition 2 (TestStackTraceDistance [47]) For a given test case execution t, the TestStackTraceDistance (STD) is defined in terms of |R|, the number of frames in the given stack trace; lcp, the longest common prefix of frames between the given stack trace and the stack trace thrown by t; and StatementDistance(s). Concretely, |R| − lcp is the number of frames not covered by t. Moreover, StatementDistance(s) is calculated as the sum of the approach level and the normalized branch distance to reach the statement s, which is pointed to by the first (the utmost) uncovered frame by t: $\mathit{StatementDistance}(s) = \mathit{approachLevel}_s(t) + \mathit{branchDistance}_s(t)$.
Since using runtime data (such as core dumps) can cause significant overhead [10] and leads to privacy issues [38], the performance of ReCore in crash reproduction was not compared with EvoCrash in prior studies [52], even though two out of three fitness functions in ReCore use only the given stack trace to guide the search process. Hence, this paper only considers TestStackTraceDistance + ExceptionPenalty (called STDistance hereafter).
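The ingredients of this definition can be sketched in Java as follows. The sketch computes lcp, the number of uncovered frames |R| − lcp, and StatementDistance(s); it does not reproduce the full STD formula. The frame strings, the frame ordering (target frame first, as in the order in which the search covers the frames), and the normalization d / (d + 1) of the branch distance are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the building blocks of TestStackTraceDistance.
class StackTraceDistanceSketch {

    // lcp: number of leading frames shared by the given stack trace
    // and the stack trace thrown by the test.
    static int longestCommonPrefix(List<String> given, List<String> thrown) {
        int limit = Math.min(given.size(), thrown.size());
        int lcp = 0;
        while (lcp < limit && given.get(lcp).equals(thrown.get(lcp))) {
            lcp++;
        }
        return lcp;
    }

    // |R| - lcp: number of frames not covered by the test.
    static int uncoveredFrames(List<String> given, List<String> thrown) {
        return given.size() - longestCommonPrefix(given, thrown);
    }

    // Assumed normalization of the branch distance into [0, 1).
    static double normalize(double branchDistance) {
        return branchDistance / (branchDistance + 1.0);
    }

    // StatementDistance(s) = approach level + normalized branch distance.
    static double statementDistance(int approachLevel, double branchDistance) {
        return approachLevel + normalize(branchDistance);
    }

    public static void main(String[] args) {
        List<String> given = Arrays.asList(
                "BaseClass:413", "PropertyClass:615", "BaseStringProperty:45");
        List<String> thrown = Arrays.asList("BaseClass:413");
        System.out.println(uncoveredFrames(given, thrown)); // prints 2
        System.out.println(statementDistance(2, 1.0));      // prints 2.5
    }
}
```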
As an example, when applying ReCore with STDistance on the stack trace in Listing 2 with target frame 3, first, STDistance determines if the generated test covers the statement at frame 3 (line 413 in class BaseClass). Then, it checks the coverage of frame 2 (line 615 in class PropertyClass). After covering the first two frames by the generated test case, it checks the coverage of the statement pointed to by the deepest frame (line 45 in class BaseStringProperty). For measuring the coverage of each of these statements, STDistance uses the approach level and branch distance. After covering all of the frames, this fitness function checks if the generated test throws a ClassCastException in the deepest frame.
In this study, we perform an empirical evaluation to assess the performance of crash reproduction using STDistance with and without BBC as the secondary objective in terms of effectiveness in crash reproduction and efficiency.
3 Basic Block Coverage

Motivating Example
During the search process, the fitness of a test case is evaluated using a fitness function. These fitness functions are different according to the given test criteria. However, one of the main components of these fitness functions is the coverage of specific statements and branches. For instance, one of the main goals in unit test generation is achieving a high structural coverage (e.g., line and branch coverage). For this goal, the search process seeks to cover all of the statements and branches in the given CUT. Similarly, the fitness functions used in search-based crash reproduction (either WeightedSum or STDistance) require the coverage of specific statements pointed to by the given stack trace.
The distance of the test case from the target statement is calculated using the approach level and branch distance heuristics. As we have discussed in Section 2.1, the approach level and branch distance cannot guide the search process if the execution stops because of implicit branches in the middle of basic blocks (e.g., a thrown NullPointerException during the execution of a basic block). As a consequence, these fitness functions may return the same fitness value for two tests, although the tests do not cover the same statements in the block of code where the implicit branching happens.
For instance, assume that one of the objectives of a search process (either for unit test generation or crash reproduction) is covering line 413 in method fromMap (shown in Listing 1). This search process generates two test cases $T_1$ and $T_2$ for achieving this objective in a population of solutions. However, $T_1$ stops the execution at line 404 due to a NullPointerException thrown in method getName, and $T_2$ throws a NullPointerException at line 405 because it passes a null value input argument to map. Even though $T_2$ covers more lines, the combination of approach level and branch distance returns the same fitness value for both of these test cases: approach level is 2 (nodes 407-408 and 410), and branch distance cannot be helpful in this case as the last covered predicate does not change the execution path away from covering the target line and also the execution stops before covering the next predicate. This is because these two heuristics assume that each basic block is atomic, so covering line 404 is taken to mean that lines 405 and 406 are covered as well.

Secondary Objective
The goal of the Basic Block Coverage (BBC) secondary objective is to prioritize the test cases with the same fitness value (i.e., same approach level and branch distance) according to their coverage within the basic blocks between the closest covered control dependency and the target statement. At each iteration of the search algorithm, test cases with the same fitness value are compared with each other using BBC. Listing 3 presents the pseudo-code of the BBC calculation. Inputs of this algorithm are two test cases $T_1$ and $T_2$, which both have the same approach level and branch distance values (calculated either using crash reproduction or unit test generation fitness functions), as well as line number and method name of the target statement. This algorithm compares the coverage of basic blocks on the path between the last control dependent node executed by both of the given tests and the basic block that contains the target statement (called effective blocks hereafter). If $T_1$ and $T_2$ do not cover any control dependency of the target block, BBC uses the entry point of the CFG of the given method instead as the starting point of the effective blocks' path. If BBC determines there is no preference between these two test cases, it returns 0. Also, it returns a value < 0 if $T_1$ has higher coverage compared to $T_2$, and vice versa. A higher absolute value of the returned integer indicates a bigger distance between the given test cases.
In the first step, BBC detects the effective blocks that are fully covered by each given test case (i.e., the test covers all of the statements in the block) and saves them in two sets called $FCB_1$ and $FCB_2$ (lines 4 and 5 in Listing 3). Then, for each of the tests $T_1$ and $T_2$, it detects the closest semi-covered effective block (i.e., the closest basic block to the target statement where the test covers the first line but not the last line of the block) and stores them as $SCB_1$ and $SCB_2$, respectively (lines 6 and 7). The semi-covered blocks indicate the presence of implicit branches.
BBC can prioritize given tests in two scenarios: Scenario 1, both tests get stuck in the middle of the same basic block (i.e., they both have the same closest semi-covered basic block), or, Scenario 2, one of the tests throws an exception in an effective basic block while the other test fully covers this block.
Scenario 1. Line 9 in Listing 3 checks if the first scenario is true by checking two conditions. First, BBC checks if both tests have the same semi-covered basic block. Then, it examines if the fully covered basic blocks of one of the given tests are equal to or a subset of those of the other test. If the second condition is not fulfilled, it means that each of these tests has one covered block that the other one does not cover, and thereby they reach their semi-covered basic block from different paths. In this case, BBC cannot find the better test as we do not know which path can lead to covering the target statement. If these two conditions are fulfilled, BBC checks if one of the tests has a higher line coverage in the identified SCB (lines 10 to 13). If this is the case, BBC will return the number of lines in this block covered only by the winning test case. If the lines covered are the same for $T_1$ and $T_2$ (i.e., coveredLines1 and coveredLines2 have the same size), there is no difference between these two test cases and BBC returns value 0 (line 13).
Scenario 2. Line 14 in Listing 3 checks if the effective blocks covered by one test are a subset of the other one. This is true if all of the fully-covered blocks of one test are a subset of fully covered blocks of the other one. Also, the semi-covered block of this test must be among the fully-covered blocks of the test with more coverage (i.e., winner test). In this case, BBC returns the number of blocks that are only fully covered by the winner test case (line 15). If BBC determines $T_2$ wins over $T_1$, the returned value will be positive, and vice versa.
Finally, if each of the given tests has a unique covered block in the given method (i.e., the tests cover different paths in the method), BBC cannot determine the winner and returns 0 (lines 16 and 17) because we do not know which path leads to the target block. Even if $T_1$ and $T_2$ reach a particular basic block from different paths in the CFG and both throw exceptions in different lines, BBC returns 0 and does not select the one with the more coverage in the closest basic block as the winner. The rationale behind this behavior of BBC is to provide an equal chance for these two tests to evolve as we do not know which path covered by each of these tests has more potential to help the search process to get closer to the target line. If BBC always selects the test with more coverage in the nearest basic block, even if it covers another path, we are negatively impacting the diversity of the tests chosen for the next generation, thereby reducing the search process's exploration ability.
Example. When giving two tests with the same fitness value (calculated by the primary objective) $T_1$ and $T_2$ from our motivating example to BBC with target method fromMap and line number 413, this algorithm compares their fully and semi-covered blocks with each other. In this example, both $T_1$ and $T_2$ cover the same basic blocks: the fully covered block is 403 and the semi-covered block is 404-406. So, here the conditions in Scenario 1 are fulfilled. Hence, BBC checks the number of lines covered by $T_1$ and $T_2$ in block 404-406. Since $T_1$ stopped its execution at line 404, the number of lines covered by this test is 1. In contrast, $T_2$ managed to execute two lines (404 and 405). Hence, BBC returns size(coveredLines2) − size(coveredLines1) = 1. The positive return value indicates that $T_2$ is closer to the target statement, and therefore, it should have a higher chance of being selected for the next generation.
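The comparison described in Scenarios 1 and 2 can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the pseudo-code of Listing 3 or the EvoSuite implementation: the effective blocks are given as a list ordered from the starting point toward the target (so the closest semi-covered block is simply the last one in the list), every line in a block's range counts as a statement, a test is represented by its set of covered line numbers, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the BBC comparison of two tests.
class BbcSketch {

    // An effective block, identified by its first and last line.
    static final class Block {
        final int firstLine;
        final int lastLine;

        Block(int firstLine, int lastLine) {
            this.firstLine = firstLine;
            this.lastLine = lastLine;
        }
    }

    // Lines of the block that the test covers.
    static Set<Integer> coveredLines(Block block, Set<Integer> covered) {
        Set<Integer> result = new HashSet<>();
        for (int line = block.firstLine; line <= block.lastLine; line++) {
            if (covered.contains(line)) {
                result.add(line);
            }
        }
        return result;
    }

    // FCB: indices of the blocks in which every line is covered.
    static Set<Integer> fullyCovered(List<Block> blocks, Set<Integer> covered) {
        Set<Integer> result = new HashSet<>();
        for (int i = 0; i < blocks.size(); i++) {
            Block b = blocks.get(i);
            if (coveredLines(b, covered).size() == b.lastLine - b.firstLine + 1) {
                result.add(i);
            }
        }
        return result;
    }

    // SCB: index of the semi-covered block nearest to the target, or -1.
    static int closestSemiCovered(List<Block> blocks, Set<Integer> covered) {
        for (int i = blocks.size() - 1; i >= 0; i--) {
            Block b = blocks.get(i);
            if (covered.contains(b.firstLine) && !covered.contains(b.lastLine)) {
                return i;
            }
        }
        return -1;
    }

    // Positive: the second test wins; negative: the first test wins; 0: no preference.
    static int compare(List<Block> blocks, Set<Integer> covered1, Set<Integer> covered2) {
        Set<Integer> fcb1 = fullyCovered(blocks, covered1);
        Set<Integer> fcb2 = fullyCovered(blocks, covered2);
        int scb1 = closestSemiCovered(blocks, covered1);
        int scb2 = closestSemiCovered(blocks, covered2);
        boolean oneInTwo = fcb2.containsAll(fcb1);
        boolean twoInOne = fcb1.containsAll(fcb2);

        // Scenario 1: both tests are stuck in the same block.
        if (scb1 >= 0 && scb1 == scb2 && (oneInTwo || twoInOne)) {
            Block scb = blocks.get(scb1);
            return coveredLines(scb, covered2).size() - coveredLines(scb, covered1).size();
        }
        // Scenario 2: one test fully covers the block in which the other is stuck.
        if (oneInTwo && scb1 >= 0 && fcb2.contains(scb1)) {
            return fcb2.size() - fcb1.size();
        }
        if (twoInOne && scb2 >= 0 && fcb1.contains(scb2)) {
            return -(fcb1.size() - fcb2.size());
        }
        return 0; // different paths: no preference
    }

    // Effective blocks assumed for the fromMap example.
    static List<Block> exampleBlocks() {
        return Arrays.asList(new Block(403, 403), new Block(404, 406), new Block(407, 408));
    }

    static Set<Integer> lines(Integer... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // T1 stops at line 404, T2 stops at line 405.
        System.out.println(compare(exampleBlocks(), lines(403, 404), lines(403, 404, 405))); // prints 1
    }
}
```

With the two tests of the motivating example, the sketch follows Scenario 1 and returns 1, in line with the value given in the text. Swapping the arguments yields −1.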
Branchless Methods. BBC can also be helpful for branchless methods. These methods do not contain any branching statement (e.g., if conditions or for loops), and thereby theoretically, covering the first line in these methods leads to covering all of the other lines, as well. In other words, by ignoring the Entry and Exit nodes, CFGs of branchless methods contain only one node (i.e., basic block) without any edges. For instance, methods from frames 1 and 2 in Listing 2 are branchless. The absence of branches in these methods means that there are no control dependent nodes in them, and thereby approach level and branch distance cannot guide the search process in these cases if the generated tests throw implicit exceptions in the middle of these methods. However, in contrast with these two heuristics, BBC can guide the search process toward covering the most in-depth statement in these cases. As an example, if tests $T_1$ and $T_2$ both throw implicit exceptions in the middle of the only basic block ($b_0$) of branchless method m(), BBC enters Scenario 1 ($FCB_1 = FCB_2 = \emptyset$ and $SCB_1 = SCB_2 = \{b_0\}$) and examines if one of the tests has more lines covered in $b_0$.

Application of BBC
The time complexity of BBC is O(N × E × log V), where E and V are the numbers of edges and vertices of the CFG of the given method, respectively, and N is the number of semi-covered basic blocks calculated by the semiCoveredBlocks method at lines 6 and 7 of Listing 3. This complexity stems from the computation of the closest semi-covered basic blocks in Line 12 of Listing 3. In this procedure, BBC measures the shortest path between each semi-covered basic block and the target basic block (i.e., the block containing the given target line) using Dijkstra's shortest path algorithm, which has a time complexity of O(E × log V).
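The source of the N × E × log V factor can be seen in the following sketch, which runs one shortest path computation per semi-covered block. It assumes that basic blocks are numbered from 0, that the CFG is given as successor lists, and that every edge has weight 1; these choices and all names are illustrative assumptions, not details of the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: find the semi-covered block closest to the target.
class ClosestBlockSketch {

    // Dijkstra's algorithm with unit edge weights over successor lists.
    // Returns Integer.MAX_VALUE if the target is unreachable.
    static int shortestPath(List<List<Integer>> successors, int source, int target) {
        int[] dist = new int[successors.size()];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        PriorityQueue<int[]> queue =
                new PriorityQueue<int[]>((a, b) -> Integer.compare(a[1], b[1]));
        queue.add(new int[] {source, 0});
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            int node = head[0];
            if (head[1] > dist[node]) {
                continue; // stale queue entry
            }
            if (node == target) {
                return dist[node];
            }
            for (int next : successors.get(node)) {
                int candidate = dist[node] + 1;
                if (candidate < dist[next]) {
                    dist[next] = candidate;
                    queue.add(new int[] {next, candidate});
                }
            }
        }
        return Integer.MAX_VALUE;
    }

    // One Dijkstra run per semi-covered block; returns -1 if none reaches the target.
    static int closest(List<List<Integer>> successors, List<Integer> semiCovered, int target) {
        int best = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int block : semiCovered) {
            int distance = shortestPath(successors, block, target);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = block;
            }
        }
        return best;
    }

    // Small example CFG: 0 -> 1, 0 -> 4, 1 -> 2, 2 -> 3.
    static List<List<Integer>> exampleCfg() {
        List<List<Integer>> successors = new ArrayList<>();
        successors.add(Arrays.asList(1, 4));
        successors.add(Arrays.asList(2));
        successors.add(Arrays.asList(3));
        successors.add(new ArrayList<Integer>());
        successors.add(new ArrayList<Integer>());
        return successors;
    }

    public static void main(String[] args) {
        System.out.println(closest(exampleCfg(), Arrays.asList(1, 2), 3)); // prints 2
    }
}
```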
Given the complexity of BBC, applying this secondary objective for any generated tests with the same approach level and branch distance may negatively impact the search process's efficiency. In the following paragraphs, we discuss this potential negative impact on search-based crash reproduction and unit test generation.

Search-Based Crash Reproduction
The crash reproduction search process can be guided by either WeightedSum or STDistance. As discussed in Section 2.3, both of these fitness functions heavily rely on approach level and branch distance. Hence, BBC can be helpful in the crash reproduction search process. Since the goal of the crash reproduction search process is to cover a specific path in the control dependence graph, which is indicated by the given stack trace, we apply BBC without any limitation to every case that involves two test cases with the same (and nonzero) approach level and branch distance.

Search-Based Unit Test Generation
In contrast with crash reproduction, the unit test generation search process has multiple statements and branches to cover simultaneously. In DynaMOSA, each line or branch to cover is an objective of the search. Hence, BBC is applied as the secondary objective more often than in crash reproduction. Therefore, we should limit the number of times that BBC is applied in this algorithm. We introduce two parameters to enforce this limitation: Sleep Time and Usage Rate.
Sleep Time. When DynaMOSA adds a target to the active search objectives, the target stays active until the search process covers it. Some of the targets are easy to cover, so approach level and branch distance suffice to cover them without BBC. However, BBC can help in harder cases, in which approach level and branch distance cannot cover the targets within a certain time. Sleep Time makes sure that BBC is only applied to the hard-to-cover search objectives. If we set this parameter to t seconds, DynaMOSA uses the BBC secondary objective only for search objectives that have been active for more than t seconds.
Usage Rate. Like any other evolutionary algorithm, the unit test generation search process needs to maintain a balance between exploration and exploitation. The former refers to the diversity of the solutions (i.e., generated tests execute new paths in the code); the latter refers to searching for solutions in the neighborhood of the existing ones (i.e., the search process should generate tests similar to the existing ones). By applying BBC, we improve the exploitation ability of the search process. However, the over-application of BBC may negatively impact the exploration ability of the search process. Usage Rate makes sure that BBC does not upset this balance. A higher Usage Rate means that there is a higher chance of applying BBC during the search process. Assume we set p ∈ [0, 1] as our Usage Rate. Any time the search process generates two test cases with the same approach level and branch distance for a hard-to-cover target (i.e., a target that stays an active objective in DynaMOSA for more than Sleep Time), BBC is used with probability p.
Moreover, by default, EvoSuite has eight types of search objectives [46]: line coverage, which aims to cover the maximum number of lines in the given CUT; branch coverage, which aims to cover the maximum number of branches in the CUT; exception coverage, which aims to maximize the number of exceptions captured by the generated tests; weak mutation, which aims to generate tests that kill the maximum number of mutants (in weak mutation, a mutant is considered killed if executing one of the generated tests on the mutant leads to a different state compared to the execution on the given CUT); output coverage, which aims to generate tests that drive the most diverse outputs; method coverage, which aims to cover all of the methods in the given CUT; no-exception method coverage, which checks that each method in the CUT is called directly by one of the tests and that this invocation does not lead to any exception; and direct branch coverage, which makes sure that each branch in the public methods of the CUT is covered by a direct call from one of the generated tests.
Since BBC aims to help a search process that relies on the approach level and branch distance to cover lines and branches that the tests generated by DynaMOSA cannot execute, this secondary objective is only triggered when two tests have the same fitness value for a non-covered line coverage or branch coverage objective. Hence, BBC is not involved in the segments of the search process in which two tests get the same fitness value for other kinds of objectives, such as exception coverage. Therefore, even though BBC prioritizes tests that do not throw implicit exceptions, it is not triggered for objectives other than line coverage and branch coverage and thus does not have any negative impact on covering the other search objectives (e.g., exception coverage).
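The interplay of the two parameters can be summarized by the following Python sketch. It only illustrates the gating logic described above under simplifying assumptions; the function name should_apply_bbc and its arguments are hypothetical and do not correspond to EvoSuite identifiers.

```python
import random


def should_apply_bbc(active_seconds, sleep_time, usage_rate, rng):
    """Decide whether BBC breaks a tie between two tests for one objective.

    active_seconds: how long the objective has been active (hypothetical input).
    sleep_time: Sleep Time t in seconds.
    usage_rate: Usage Rate p in [0, 1].
    """
    if active_seconds <= sleep_time:
        return False  # objective is not yet considered hard to cover
    return rng.random() < usage_rate  # apply BBC with probability p


rng = random.Random(42)
easy = should_apply_bbc(10.0, 60.0, 1.0, rng)
hard_always = should_apply_bbc(90.0, 60.0, 1.0, rng)
hard_never = should_apply_bbc(90.0, 60.0, 0.0, rng)

# With p = 0.5, roughly half of the ties on a hard objective use BBC.
applied = sum(should_apply_bbc(90.0, 60.0, 0.5, rng) for _ in range(10000))
print(easy, hard_always, hard_never, applied)
```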

Empirical Evaluation
Before evaluating the impact of BBC, we want to assess its potential usefulness by answering the following research question:
RQ 0 How frequent are implicit branches in a search-based test case generation process?
This research question serves as a preliminary analysis before the full evaluation of the impact of BBC on search-based unit test generation and search-based crash reproduction. To answer it, we consider a special configuration of DynaMOSA, currently the best algorithm for unit test generation, in which the executions of the BBC algorithm described in Listing 3 are monitored. We choose DynaMOSA, a many-objective algorithm, because, unlike search-based crash reproduction, it targets each line and branch of a class under test independently, allowing us to collect more data about the execution of BBC for the different objectives.
To assess the impact of BBC on search-based unit test generation, we perform an empirical evaluation to answer the following research questions:
RQ 1 What is the impact of BBC on search-based unit test generation?
RQ 1.1 What is the impact of BBC on the structural coverage effectiveness of the unit tests?
RQ 1.2 What is the impact of BBC on the output and implicit exception coverage of the unit tests?
RQ 1.3 What is the impact of BBC on the fault finding capabilities of the unit tests?
RQ 1.4 What is the impact of BBC on the structural coverage efficiency of the unit tests?
In these RQs, we want to evaluate the effect of BBC on DynaMOSA. Like other algorithms, DynaMOSA relies on the approach level and branch distance to evaluate the progress of the search process. Previous research has shown that it outperforms other search-based and guided random approaches [9,19,30,37,42,43]. We compare 11 configurations (the DynaMOSA baseline and ten configurations of BBC) in terms of structural coverage effectiveness (RQ 1.1). Since a change in the structural coverage of a class might impact the data flow, we also study the output coverage (i.e., diversity of the values returned by the tested methods [3]) and the captured implicit exceptions produced by the different tests (RQ 1.2). Then, we look at the fault finding capabilities using weak mutation and real faults from the Defects4J collection (RQ 1.3). Finally, we study the structural coverage efficiency of BBC (RQ 1.4).
Similarly, for search-based crash reproduction, we answer the following research questions:
RQ 2 What is the impact of BBC on search-based crash reproduction?
RQ 2.1 What is the impact of BBC on the crash reproduction effectiveness?
RQ 2.2 What is the impact of BBC on the crash reproduction efficiency?
In these two RQs, we want to evaluate the effect of BBC on the existing fitness functions, namely STDistance and WeightedSum, from two perspectives: the crash reproduction ratio of the different configurations (RQ 2.1) and the time required to reproduce a crash (RQ 2.2). Sections 4.1 and 4.2 detail the experimental setup for the study on unit test generation (RQ 0 and RQ 1) and on crash reproduction (RQ 2), respectively.
4.1 Setup for search-based unit test generation (RQ 0 and RQ 1)

Implementation
We implemented BBC as a secondary objective (called BBCOVERAGE) in EvoSuite [20], the state-of-the-art tool for search-based unit test generation. As discussed in Section 3.3.2, since BBC impacts the exploration-exploitation trade-off and efficiency of the search process, we also defined two additional parameters for Sleep Time (BBC SLEEP with a default value of 60 seconds) and Usage Rate (BBC USAGE PERCENTAGE with a default probability of 0.5). Our implementation is openly available in our replication package on Zenodo [12].

Classes under test selection
We selected classes under test from the latest version of Defects4J (v.2.0.0) [29], a collection of reproducible failures coming from open source projects with the identification of the corresponding faulty classes. Defects4J has been used in other studies to assess the coverage and the effectiveness of unit-level test case generation [33,43,48], program repair [34,49], fault localization [7,45], and regression testing [32,39]. We selected the ten most recent bugs from the 17 available projects for a total of 225 faulty classes, used as classes under test in our evaluation. This offers a good balance between the number of repetitions (i.e., statistical power) of each configuration and the number of cases (i.e., generalization) [5].
Since EvoSuite may face inevitable challenges when generating tests for some particular classes [23,36,54], we performed a trial with default parameters on all of the classes to filter out the ones for which EvoSuite cannot generate any test, as recommended by related work [9,37,43]. We filtered out six classes according to the results of this trial. For three of these classes, EvoSuite could not finish the class instrumentation. For two others, DynaMOSA could not find any search objective. Finally, EvoSuite failed to generate tests for one class because of missing classes. After filtering out these classes, we performed our main experiment on the 219 remaining cases. Table 1 provides more information about the classes selected for the evaluation.

Parameter settings
To evaluate the impact of the BBC secondary objective on search-based unit test generation, we first need to set values for Sleep Time and Usage Rate (explained in Section 3.3.2). To find the optimum Sleep Time, we performed a pre-analysis on a subset of subjects. We randomly selected 45 classes (20% of our subjects) for this pre-analysis. We ran DynaMOSA on each of the sampled classes 30 times and collected the time required by the search process to cover each objective. The collected results indicate that DynaMOSA can cover more than 85% of the objectives in 60 seconds. For this reason, we set Sleep Time to 60 seconds for our experiments.
For our preliminary analysis (RQ 0), we enabled BBC (Usage Rate = 1.0) after 60 seconds (with an additional setting to record the execution results of BBC) to evaluate the number of implicit branches occurring during the search and the number of times BBC could help overcome those implicit branches. Furthermore, to compare different values of Usage Rate, we used ten different values of this parameter in our main experiment (RQ 1): Usage Rate ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
Hence, for the main experiment, we executed 12 configurations (DynaMOSA, the monitored BBC configuration used for RQ 0, and the ten BBC configurations used for RQ 1) on 219 classes for 30 rounds of execution with a search budget of 10 minutes. Also, we executed DynaMOSA on 45 classes with the same number of repetitions and search budget to find the optimum Sleep Time. In total, we ran 80,190 independent executions to answer RQ 0 and RQ 1. These executions took about 12 days overall.
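The selection of Sleep Time from the pre-analysis can be sketched as follows. The data below is purely hypothetical (the measured covering times are only available in the replication package), and the helper name fraction_covered_within is an assumption; the sketch only shows how a candidate threshold is checked against the collected covering times.

```python
def fraction_covered_within(cover_times, threshold):
    """Fraction of objectives covered within `threshold` seconds.

    cover_times: time in seconds at which each objective was covered,
    or None if it was never covered within the search budget.
    """
    covered = sum(1 for t in cover_times if t is not None and t <= threshold)
    return covered / len(cover_times)


# Hypothetical covering times for ten objectives of one run.
times = [1.5, 12.0, 48.0, 59.0, 75.0, 240.0, None, 30.0, 5.0, 20.0]
share = fraction_covered_within(times, 60.0)
print(share)
```

A threshold is a reasonable Sleep Time candidate when most objectives are already covered before it, so that BBC is reserved for the remaining hard-to-cover objectives.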
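The reported total can be checked with a short computation. The breakdown below assumes that "one plus ten configurations of BBC" denotes the monitored configuration of RQ 0 plus the ten Usage Rate values of RQ 1, in addition to the DynaMOSA baseline; under this reading, the numbers add up exactly.

```python
# Assumed breakdown: DynaMOSA baseline + RQ 0 monitored configuration + ten Usage Rate values.
configurations = 1 + 1 + 10
classes = 219
rounds = 30

main_runs = configurations * classes * rounds  # main experiment
pre_analysis_runs = 45 * rounds                # Sleep Time pre-analysis on 45 classes
total_runs = main_runs + pre_analysis_runs
print(main_runs, pre_analysis_runs, total_runs)  # 78840 1350 80190
```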

Data collection
To evaluate the potential impact of BBC (RQ 0), we collected for each line and branch objective: the number of times its fitness has been evaluated, the number of times BBC has been called, the number of times it has been activated (i.e., the call effectively led to an evaluation of BBC, line 13 or 15 in Listing 3), and the number of times it has been useful (i.e., the call to BBC has returned a non-zero value). When BBC is useful, it indicates that one or both of the tests throw an implicit exception in the middle of a basic block in the method of the search objective (i.e., line or branch coverage objective).
We compare BBC to DynaMOSA using branch coverage for RQ 1.1 and RQ 1.4 for 30 rounds of execution. Branch coverage provides an indication of the structural coverage by looking at the percentage of branches of the class under test covered by the executions of the test cases. We recorded the value of the branch coverage every ten seconds to see how it evolves over time and answer RQ 1.4.
For RQ 1.2, we consider output coverage and implicit exceptions. Output coverage [3] denotes the diversity of the outputs of the different methods of the class under test. It provides information about the data output coverage of the generated tests by looking at how many pre-defined abstract values (i.e., partitions of the output domain) are returned by the methods of the class under test. We used the method from Rojas et al. [46] available in EvoSuite to compute the output coverage. For instance, a method returning an integer value has to return negative, zero, and positive values (when the tests are executed) to satisfy the output coverage criterion. In addition to (expected) outputs, we consider implicit exceptions by looking at the number (e) of top-level methods in the class under test throwing an undeclared (i.e., runtime) exception implicitly (i.e., without any throw new instruction). For one execution, we compute the implicit exception coverage as the ratio between e and the highest value of e among all the executions of the different BBC configurations for that class. Since BBC addresses the challenge of handling implicit branches for search-based unit test generation, we expect it to impact both the output coverage and the number of methods throwing an implicit exception.
We rely on weak mutation and real faults to assess the fault finding capabilities of the generated tests (RQ 1.3). Weak mutation score [27,44] gives the percentage of mutants (i.e., artificially injected faults) for which at least one test triggers a different program state, compared to the original program, directly after the execution of the mutated statement. Weak mutation is a viable and cheaper alternative to strong mutation, which requires an additional propagation of the erroneous state to the output of the program [40]. For our evaluation, weak mutation allows us to assess the diversity of runtime states (which allows more faults to be caught) when using BBC. We use the default set of weak mutation operators available in EvoSuite [25]: delete call, delete field, insert unary operator, replace arithmetic operator, replace bitwise operator, replace comparison operator, replace constant, and replace variable.
Additionally, we use real faults from the Defects4J benchmark to compare the effective fault finding capabilities of tests generated using BBC. We executed all of the 11 configurations on the buggy versions of the software and then checked whether the tests generated by each configuration can throw the same exception as the bug-exposing stack traces indicated by Defects4J. The rationale behind running all of the configurations only on the buggy versions, and not the fixed versions, is to have a realistic scenario, in which developers are neither aware of the bug nor have access to the fixed version. In this scenario, an automated test generation tool can help developers if it generates tests that throw an exception revealing the bug. Since EvoSuite can detect assertion-based failures only when it is run on the fixed version [24], we limited our comparison for fault detection to the 92 faults that a non-assertion error can expose.
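The normalization used for the implicit exception coverage can be written down as a small Python sketch. The function name, the dictionary keys, and the handling of the case where no execution triggers any implicit exception are assumptions made for illustration.

```python
def implicit_exception_coverage(e_per_execution):
    """Map each execution's e to e / max(e) for one class under test.

    e_per_execution: {execution id: number of top-level methods throwing an
    implicit runtime exception}. Returning 0.0 everywhere when the maximum
    is 0 is an assumption; the text does not cover this corner case.
    """
    highest = max(e_per_execution.values())
    if highest == 0:
        return {run: 0.0 for run in e_per_execution}
    return {run: e / highest for run, e in e_per_execution.items()}


# Hypothetical values of e for three executions on the same class.
coverage = implicit_exception_coverage({"run-a": 3, "run-b": 4, "run-c": 2})
print(coverage)
```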

Data analysis
For each class under test, we use the Vargha-Delaney Â12 statistic [53], a non-parametric effect size measure, to examine the effect size of differences between using and not using BBC for branch, output, and implicit exception coverage, and weak mutation score (RQs 1.1-1.4). For a pair of factors (A, B), a value of Â12 > 0.5 indicates that A is more likely to achieve a higher coverage or mutation score, while a value of Â12 < 0.5 shows the opposite. Also, Â12 = 0.5 means that there is no difference between the factors. We used the standard thresholds [53] for interpreting the Â12 magnitude: 0.56 (small), 0.64 (medium), and 0.71 (large). To assess the significance of effect sizes (Â12), we apply the non-parametric Wilcoxon Rank Sum test, with α = 0.01 for the Type I error (H0: there is no difference between using and not using BBC for x on a class under test c, where x is the branch, output, or implicit exception coverage, or weak mutation score).
We also rank the different configurations of BBC, based on their coverage and weak mutation score, using Friedman's non-parametric test for repeated measurements with a significance level α = 0.05 [26] (RQs 1.1-1.3). This test is used to test the significance of the differences between groups (treatments) over the dependent variable (here, coverage and weak mutation score). We further complement the test for significance with Nemenyi's post-hoc procedure [28,41]: two configurations are significantly different if their corresponding average ranks differ by at least the given Critical Distance (CD).
Finally, since fault coverage (RQ 1.3) has a dichotomic distribution (i.e., a generated test exposes the fault or not), for each fault, we use the Odds Ratio (OR) to measure the impact of each BBC configuration on the real faults coverage. A value OR > 1 in a comparison between a pair of factors (A, B) indicates that the application of factor A increases the fault coverage, while OR < 1 indicates the opposite. Also, a value of OR = 1 indicates that both of the factors have the same performance. We apply Fisher's exact test, with α = 0.01 for the Type I error, to assess the significance of the results (H0: there is no difference between using and not using BBC in the reproduction ratio of the fault).
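As an illustration, the Â12 statistic and its interpretation can be computed as in the following Python sketch. The sample values are hypothetical, and treating the thresholds as inclusive and symmetric around 0.5 is an assumption.

```python
def a12(a, b):
    """Vargha-Delaney A12: probability that a value from `a` exceeds one from `b`,
    counting ties as one half."""
    greater = sum(1 for x in a for y in b if x > y)
    equal = sum(1 for x in a for y in b if x == y)
    return (greater + 0.5 * equal) / (len(a) * len(b))


def magnitude(value):
    """Interpret A12 with the thresholds 0.56, 0.64, and 0.71."""
    folded = max(value, 1.0 - value)  # treat A12 < 0.5 symmetrically
    if folded >= 0.71:
        return "large"
    if folded >= 0.64:
        return "medium"
    if folded >= 0.56:
        return "small"
    return "negligible"


# Hypothetical branch coverage values of four runs with and without BBC.
with_bbc = [0.80, 0.82, 0.79, 0.85]
without_bbc = [0.75, 0.80, 0.78, 0.74]
effect = a12(with_bbc, without_bbc)
print(effect, magnitude(effect))  # 0.90625 large
```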
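The following Python sketch shows how an odds ratio and a two-sided Fisher's exact test can be computed for one fault from the number of successful runs out of 30. It uses only the standard library. Adding 0.5 to every cell to avoid a division by zero (Haldane-Anscombe correction) is an assumption, as the text does not state how empty cells are handled, and the counts are hypothetical.

```python
from math import comb


def odds_ratio(success_a, success_b, runs=30):
    """Odds ratio of A versus B, with 0.5 added to each cell (assumed correction)."""
    a, b = success_a + 0.5, runs - success_a + 0.5
    c, d = success_b + 0.5, runs - success_b + 0.5
    return (a * d) / (b * c)


def fisher_exact_two_sided(success_a, success_b, runs=30):
    """Two-sided Fisher's exact test for a 2x2 table with `runs` trials per row."""
    total_success = success_a + success_b

    def prob(k):
        # Hypergeometric probability of k successes for A given the margins.
        return comb(runs, k) * comb(runs, total_success - k) / comb(2 * runs, total_success)

    observed = prob(success_a)
    low = max(0, total_success - runs)
    high = min(runs, total_success)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(k) for k in range(low, high + 1) if prob(k) <= observed * (1 + 1e-9))


# Hypothetical fault exposed in 20 of 30 runs with BBC and 5 of 30 runs without.
ratio = odds_ratio(20, 5)
p_value = fisher_exact_two_sided(20, 5)
print(ratio > 1, p_value < 0.01)
```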

4.2 Setup for search-based crash reproduction (RQ 2)

Implementation
Since ReCore and EvoCrash are not openly available, we implement BBC in Botsing [14], an extensible, well-tested, and open-source search-based crash reproduction framework that already implements the WeightedSum fitness function and the guided initialization, mutation, and crossover operators. We also implement STDistance (the ReCore fitness function), used as a baseline in this paper, in this tool. Botsing relies on EvoSuite for code instrumentation and test case generation by using evosuite-client as a dependency.

Crash selection
We select crashes from JCrashPack [50], a benchmark containing hard-to-reproduce Java crashes. We apply the two fitness functions with and without using BBC as a secondary objective to 124 crashes, which have also been used in a recent study [16]. These crashes stem from six open-source projects: JFreeChart, Commons-lang, Commons-math, Mockito, Joda-time, and XWiki. For each crash, we apply each configuration on each frame of the crash stack traces. We repeat each execution 30 times to take randomness into account, for a total of 114,120 independent executions. We run the evaluation on two servers with 40 CPU-cores, 128 GB memory, and 6 TB hard drive. In total, these executions took about 5 days.

Parameter settings
We run each search process with a five-minute time budget and set the population size to 50 individuals, as suggested by previous studies on search-based test generation [43]. Moreover, as recommended in prior studies on search-based crash reproduction [52], we use the guided mutation with a probability p_m = 1/n (n = length of the generated test case), and the guided crossover with a probability p_c = 0.8 to evolve test cases. We note that prior studies do not investigate the sensitivity of crash reproduction to these probabilities. Tuning these parameters should be undertaken as future work.

Data collection
To evaluate the crash reproduction ratio (i.e., the ratio of success in crash reproduction in 30 rounds of runs) of the different assessed configurations (RQ 2.1), we follow the same procedure as previous studies [16,51]: for each crash C, we detect the highest frame that can be reproduced by at least one of the configurations (r_max). We examine the crash reproduction ratio of each configuration for crash C targeting frame r_max.
To evaluate the efficiency of the different configurations (RQ 2.2), we analyze the time spent by each configuration on generating a crash-reproducing test case. We note that the extra pre-analysis and basic block coverage computation of BBC are included in the time spent. Since measuring efficiency is only possible for the reproduced crashes, we compare the efficiency of the algorithms on the crashes that are reproduced at least once by one of the algorithms. We assume that an algorithm reached the maximum allowed budget (5 minutes) when it failed to reproduce a crash.
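The two measures can be derived from the per-run results as in the following Python sketch. The encoding of a failed run as None and the function names are assumptions; the run times are hypothetical.

```python
BUDGET_SECONDS = 300  # five-minute search budget


def reproduction_ratio(times):
    """Share of runs that reproduced the crash; None marks a failed run."""
    return sum(t is not None for t in times) / len(times)


def consumed_times(times, budget=BUDGET_SECONDS):
    """Time consumed by each run, charging the full budget to failed runs."""
    return [budget if t is None else t for t in times]


# Hypothetical results of four runs targeting frame r_max of one crash.
runs = [12.0, None, 40.0, None]
ratio = reproduction_ratio(runs)
spent = consumed_times(runs)
average_time = sum(spent) / len(spent)
print(ratio, spent, average_time)
```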

Data analysis
As for real fault coverage (RQ 1.3), crash reproduction data (RQ 2.1) has a dichotomic distribution (i.e., an algorithm reproduces a crash C from its r_max or not). Hence, for each crash, we use the Odds Ratio (OR) to measure the impact of each algorithm on the crash reproduction ratio. We apply Fisher's exact test, with α = 0.01 for the Type I error, to assess the significance of the results (H0: there is no difference between using and not using BBC in the reproduction ratio of the crash).
For RQ 2.2, for each crash, we use the non-parametric Vargha-Delaney Â12 statistic [53] with the non-parametric Wilcoxon Rank Sum test to examine differences between using and not using BBC for efficiency (H 0 : there is no difference between using and not using BBC in the reproduction efficiency of the crash).

Replicability
We enable the replicability of our results by providing replication packages on Zenodo (https://zenodo.org) for RQ 0 and RQ 1 [12] and RQ 2 [11]. Those replication packages include the classes under test and crashes used for the evaluation, the evaluation infrastructure (including documentation and scripts to re-run

Potential impact of BBC (RQ 0)
Table 2 provides the general statistics of the preliminary analysis answering RQ 0 per project. The number of branch and line objectives ranges from 526 for Codec to 8,108 for JacksonCore. In total, the number of fitness evaluations per objective ranges between 1 and 1,143,620 with an average of 30,111.81 evaluations. BBC has been called between 1 and 1,681,329 times per objective with an average of 34,988.58 calls. It is interesting to note that, since the evaluation of an objective may require comparing multiple test cases, BBC can be called multiple times for each fitness evaluation. BBC has been effectively activated up to 1,365,526 (average of 9,472.140) times per objective, and has been useful up to 798,005 (average of 354) times per objective.
Figure 2 provides a summary of the usefulness of BBC. Each data point corresponds to the percentage of useful calls to BBC per fitness evaluation, measured for one objective and one execution out of 30. On average, BBC has been useful 2.5 times (σ = 3.17 times) per fitness evaluation, with a maximum of 4,0145 times for a single fitness evaluation (which happens when multiple test cases have to be compared).
Summary (RQ 0). Implicit branches are quite common. Our results show that, on average, BBC has been activated (i.e., the call to BBC effectively led to an evaluation) 9,472.140 times with a standard deviation σ = 40,567.40, denoting big variations of the activation among the different objectives. The usefulness rate per activation is 2.39% on average (σ = 12.09%), confirming that not all activations can effectively lead to a distinction between two test cases w.r.t. their partial coverage of basic blocks. Those results tend to confirm our design choice to parameterize the activation of BBC using an activation probability.

Search-based unit test generation (RQ 1)
We first discuss the results of applying BBC as a secondary objective for unit test generation using DynaMOSA. Contrary to crash reproduction, which seeks to cover only a small number of branches, unit test generation targets all the branches in a class under test.
Branch coverage effectiveness (RQ 1.1). Figure 3a reports the branch coverage of the different classes under test for all the 30 test suites for the different configurations of BBC. Generally, the average branch coverage slightly improves when activating BBC as a secondary objective, from 74.5% (σ = 28%) for DynaMOSA up to 76.1% (σ = 27.5%) for BBC 0.2, 0.4, 0.6, and 1.0. Although small, this improvement is systematic across all BBC configurations according to the effect sizes reported in Figure 3b. BBC 0.6 gives the best results with a large positive (Â12 > 0.5) effect size for 59 classes under test (against 0 large negative, Â12 < 0.5, effect size), followed by BBC 0.2 with 59 classes (against 1 class), and BBC 0.8 with 57 classes (against 1 class).
Figure 4 provides a graphical representation of the ranking (i.e., mean ranks with confidence interval) of the different BBC configurations. According to Friedman's test, the different treatments BBC 0.1 to 1.0 achieve significantly different branch coverage (p-values < 0.01) compared to DynaMOSA. Furthermore, the differences between the average ranks of BBC 0.1 to 1.0 and the average rank of the baseline are larger than the critical distance CD = 1.375 determined by Nemenyi's post-hoc procedure (denoted by red dots in Figure 4). This indicates that BBC 0.1 to 1.0 achieve a significantly higher branch coverage than DynaMOSA.
Fig. 4 Non-parametric multiple comparisons of the branch coverage using Friedman's test with Nemenyi's post-hoc procedure.
We analyzed the correlation between the effect sizes (Â12) of the best performing BBC configuration (according to Friedman's test with Nemenyi's post-hoc procedure) and BBC usefulness (presented in RQ 0). The result of this analysis indicates that there is a positive correlation between the number of times that BBC could be useful (i.e., select a winner between two given tests with the same approach level and branch distance) and the effect that this secondary objective has on branch coverage improvement (Spearman's ρ = 0.4 with a p-value < 0.6e−10). Hence, whenever BBC exposes that one generated test is closer to the target line than another test with the same approach level and branch distance (due to the occurrence of an implicit branch), there is a considerable chance that it helps the search-based test generation process to generate tests with higher branch coverage.
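For readers unfamiliar with the measure, Spearman's ρ is the Pearson correlation of the ranks of the two variables. The following Python sketch computes it with the standard library only; the data points are hypothetical and do not come from our evaluation.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning the average rank to ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-class data: useful BBC calls and A12 effect size.
useful_calls = [0, 5, 20, 100, 400]
effect_sizes = [0.50, 0.55, 0.52, 0.70, 0.90]
rho = spearman_rho(useful_calls, effect_sizes)
print(round(rho, 3))  # 0.9
```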
To confirm whether this observed correlation stems from the connection between the potential implicit branches in the middle of basic blocks and the improvement in branch coverage, we manually analyzed some cases in which applying BBC leads to a statistically significant improvement in the branch coverage achieved by the generated tests. In this manual analysis, we identified multiple potential implicit exceptions before the target lines and branches, which are only covered by tests generated using BBC as a secondary objective. For instance, for the class under test com.fasterxml.jackson.databind.node.TreeTraversingParser in JacksonDatabind-106, we see that tests generated by BBC configurations achieve a higher structural coverage than those generated by DynaMOSA. In the majority of runs, the tests generated by BBC managed to cover lines 6 to 11 in method nextToken() (Listing 4), while DynaMOSA does not succeed in covering these lines. By looking at method nodeCursor.iterateChildren() (Listing 5), which is called by nextToken() in line 6 of Listing 4, we see that this method may throw an IllegalStateException at lines 4 and 12. Since DynaMOSA does not have any information about the branches in classes other than the class under test, it cannot guide the search process to execute the method iterateChildren() without raising an exception.
Output coverage and implicit exception coverage (RQ 1.2). The improvement of branch coverage also leads to more output diversity, reported in Figure 5a: from 54.2% (σ = 26.6%) for DynaMOSA up to 55.5% (σ = 26.2%) for BBC 0.8. This improvement is also systematic across all BBC configurations according to the effect sizes reported in Figure 5b. BBC 0.6 gives the best results with a large positive (Â12 > 0.5) effect size for 57 classes under test (against 2 large negative, Â12 < 0.5, effect sizes), followed by BBC 0.1 and 0.5 with 54 classes (against 2 classes), and BBC 0.4 with 53 classes (against 2 classes). The two target classes with large negative effect sizes on the output coverage are the same for the different configurations of BBC: they are different versions of the class org.apache.commons.cli.HelpFormatter in Cli-31 and Cli-32. Interestingly, all BBC configurations achieve a statistically significant higher implicit runtime exception coverage (i.e., undeclared runtime exceptions not explicitly thrown by a throw new instruction) with a large effect size for the same class on the same buggy versions of Cli, indicating that, for this particular class, the loss of coverage of output values is compensated by a higher number of methods throwing implicit runtime exceptions.
This could be explained by the fact that BBC favors test cases with a higher coverage of basic blocks, even if they are not able to reach the return statements of the methods under test (e.g., if the values used by the test cause implicit runtime exceptions). There is, however, no general correlation between the output coverage and the implicit exception coverage (Spearman's ρ = −0.008 with a p-value < 0.001).
As for RQ 1.1, we evaluated the correlation between the improvement of BBC in terms of output coverage and BBC usefulness (presented in RQ 0). This analysis shows a positive correlation between these two metrics (Spearman's ρ = 0.3 with a p-value < 0.1e−5). As we explained, this observation stems from the correlation between the branch coverage and the output coverage achieved by each test: covering more lines and branches increases the chance of seeing more diverse outputs from the CUT. To support this hypothesis, we also checked whether there is a correlation between branch coverage and output coverage. Our analysis shows that branch coverage and output coverage are strongly correlated (Spearman's ρ = 0.6 with a p-value < 0.3e−16).
Figure 6a reports the implicit runtime exception coverage of the generated tests. Implicit exceptions are not declared in the method under test and are triggered when implicit branches are executed. Results show that the average exception coverage increases when using BBC as a secondary objective: from 75.1% (σ = 22.8%) when using DynaMOSA up to 80.3% for BBC 0.1 (σ = 21.2%) and 0.6 (σ = 21%). BBC 0.9 gives the best results with a large positive (Â12 > 0.5) effect size for 67 classes under test (against 8 large negative, Â12 < 0.5, effect sizes), followed by BBC 0.6 with 66 classes (against 8 classes), and BBC 0.1 with 64 classes (against 7 classes). The rankings in Figure 7 indicate that BBC 0.1 to 1.0 perform well, with an average rank much smaller than the baseline, both for output and exception coverage. The differences between the average ranks of these configurations and the average rank of the baseline are larger than the critical distance CD = 1.375 determined by Nemenyi's post-hoc procedure.
In contrast with branch coverage and output coverage, Spearman's test does not show any general correlation between BBC usefulness and implicit exception coverage (Spearman's ρ = 0.04 with a p-value = 0.5). This result supports our discussion in Section 3: since BBC is only triggered when DynaMOSA compares tests regarding a line or branch coverage search objective, it does not have any negative impact on other search objectives, including the implicit exception coverage.
Fig. 9 Non-parametric multiple comparisons of the weak mutation score using Friedman's test with Nemenyi's post-hoc procedure.
Weak mutation score and real faults (RQ 1.3). As for branch and output coverage, activating BBC slightly improves the weak mutation score of the generated tests (reported in Figure 8a). BBC 0.4, 0.6, and 0.8 achieve the highest average mutation score with 74.6% (σ = 29.6%), compared to 73.2% (σ = 30.1%) for the baseline (DynaMOSA). That improvement is also systematic across the different configurations of BBC according to the effect sizes reported in Figure 8b. BBC 0.5 gives the best results with a large positive (Â12 > 0.5) effect size for 54 classes under test (against 0 large negative, Â12 < 0.5, effect size), followed by BBC 0.2 with 53 classes (against 0 classes), and BBC 0.4, 0.6, 0.7, and 0.9 with 51 classes each (against 0 classes). Looking at the ranking reported in Figure 9, BBC 0.1 to 1.0 have an average rank much smaller than the baseline. Those differences are larger than the critical distance CD = 1.375 determined by Nemenyi's post-hoc procedure.
Moreover, we checked if we could find any correlation between the weak mutation score and BBC usefulness (presented in RQ 0). This analysis shows a moderate correlation between these two metrics (Spearman's ρ = 0.37 with a p-value < 0.3e−8). One reason for this correlation could be the strong correlation between weak mutation score and branch coverage (Spearman's ρ = 0.91 with a p-value < 0.3e−16). Thanks to the BBC secondary objective, the search-based test generation process can cover more lines and branches, thereby killing the mutants in these newly covered lines.
Table 3 Real faults coverage of the different configurations with the number of faults covered at least once in 30 runs (#) out of 92 faults, the average coverage frequency (freq., σ), and the number of times a configuration performed better (> 1) or worse (< 1) than DynaMOSA with a significance level of 0.01.
Finally, we compare the fault-revealing capabilities of the generated tests using Defects4J. Table 3 presents the results for the different configurations of BBC and the baseline (DynaMOSA). In general, the tests reveal between 25 and 28 faults at least once in 30 rounds of execution out of the 92 faults considered (the selection procedure is detailed in Section 4.1). For the faults that are revealed in at least one round, the average coverage frequency (for 30 rounds of execution) varies between 22.25% (for BBC 0.1 and 1.0) and 23.04% (for BBC 0.7). The table also reports the number of faults for which a configuration performed better (odds ratio above 1) or worse (odds ratio below 1) than the DynaMOSA baseline with a significance level of 0.01. The best configurations are BBC 0.4, 0.5, 0.6, 0.8, and 1.0 with 3 faults (against 0).
We manually analyzed the three faults that are captured significantly more often by BBC. In all of them, we identified potential implicit branches before the target line (i.e., the line in which the fault happens) that can prevent the classical approach level and branch distance heuristics from guiding the search process towards covering these failures. For instance, Listing 8 presents the stack trace that reveals a fault in JFreeChart.
When selecting the XYPlot class as class under test, tests generated by the BBC configurations throw this exception significantly more often than tests generated by DynaMOSA. This stack trace has five frames that point to a method in the target class (XYPlot): Lines 1, 4, 5, 6, and 7 in Listing 8. By analyzing the methods in these frames, we can see that the majority of them are simple one-line methods, except for the first frame in Line 1 of Listing 8, which points to the method getDataRange, which has about 100 lines of code.
As we can see in Listing 9, the target line, in which the NullPointerException occurs (Line 4493), is inside an if condition that starts at Line 4472. The target line is directly control-dependent on this condition. Hence, when a test fulfills the condition in Line 4472, the approach level and branch distance heuristics assume that the generated test will eventually cover the target line (Line 4493), and thereby these two heuristics do not provide any guidance for the test generation search process afterward. However, by taking a closer look, we can see that even after entering the if condition, a test first needs to call the combine method (in one of the Lines 4476, 4479, 4485, or 4488) and also call either findDomainBounds (in Lines 4476 or 4479) or findRangeBounds (in Lines 4485 or 4488) before it can reach the target line. Each of these methods can throw explicit exceptions. Since these methods are not part of the class under test, the search process is unaware of those exceptions. Also, each of these methods calls multiple methods that can also throw exceptions.
BBC can guide the test generation search process to execute these lines without any exception and cover the target line. By covering the target line, the search process has the opportunity to generate a test that throws a NullPointerException in this target line, and thereby captures the fault.
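To make the notion of an implicit branch concrete, the following minimal Java sketch mimics the situation of Listing 9 under simplifying assumptions. The class ImplicitBranchDemo and its methods bounds and executedSteps are hypothetical names introduced only for illustration; they are not part of JFreeChart, EvoSuite, or Botsing. The traced statements contain no conditional, yet execution can leave them early when a callee throws, so the last statement (the analogue of the target line) is only reached by tests that execute every preceding call without an exception.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of an implicit branch inside a single basic block. */
final class ImplicitBranchDemo {

    /** Callee outside the "class under test"; it may throw on bad input. */
    static int bounds(int[] data) {
        // Throws NullPointerException for null and
        // ArrayIndexOutOfBoundsException for an empty array.
        return data[data.length - 1] - data[0];
    }

    /**
     * Returns the labels of the statements that were executed.
     * The three traced statements have no explicit branch between them;
     * the label "target" plays the role of the target line.
     */
    static List<String> executedSteps(int[] data) {
        List<String> steps = new ArrayList<>();
        try {
            steps.add("enter");
            int range = bounds(data); // implicit branch: execution may exit here
            steps.add("combine:" + range);
            steps.add("target");
        } catch (RuntimeException e) {
            // Caught only so that the demo can report the partial coverage.
            steps.add("exception:" + e.getClass().getSimpleName());
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(executedSteps(new int[] {2, 5, 9}));
        System.out.println(executedSteps(null));
    }
}
```

Both inputs enter the block, so heuristics based only on control dependencies rate them equally; they differ only in how much of the block they execute, which is the kind of information a block-level secondary objective can use.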
Branch coverage efficiency (RQ 1.4). Figure 10a presents the tendency of the branch coverage over time using the smoothed conditional means. Overall, BBC 0.5 tends to achieve a higher branch coverage. This is confirmed by the number of classes for which we observe a significant difference (with α = 0.01) in the coverage achieved, reported in Figure 10b and grouped by effect size (Â12) magnitude. Counts above (resp. below) 0 denote the number of classes for which we observe a positive (resp. negative) effect. After three minutes, BBC 0.5 achieves a large (resp. medium) positive effect size for 34 (resp. 18) classes under test against 1 (resp. 0) with a large (resp. medium) negative effect size. Those numbers slightly decrease over time: after exhaustion of the ten-minute search budget, 27 (resp. 18) classes under test show a large (resp. medium) positive effect size, against 1 (resp. 0) classes with a large (resp. medium) negative effect size.
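For readers unfamiliar with the Â12 statistic used throughout this section, the following Java sketch computes the Vargha-Delaney effect size from two samples (e.g., the coverage values of 30 runs per configuration). The class and method names are hypothetical, and the magnitude thresholds (0.56, 0.64, and 0.71, mirrored around 0.5) are the commonly used ones; we assume, without confirmation from the text, that they match the grouping used in Figure 10b. Which configuration is passed as the first sample is also an assumption: here, a value above 0.5 means the first sample tends to be larger.

```java
/** Hypothetical helper computing the Vargha-Delaney A12 effect size. */
final class VarghaDelaney {

    /** Probability that a value from {@code first} exceeds one from {@code second}; ties count half. */
    static double a12(double[] first, double[] second) {
        if (first.length == 0 || second.length == 0) {
            throw new IllegalArgumentException("both samples must be non-empty");
        }
        double wins = 0.0;
        for (double x : first) {
            for (double y : second) {
                if (x > y) {
                    wins += 1.0;
                } else if (x == y) {
                    wins += 0.5;
                }
            }
        }
        return wins / ((double) first.length * second.length);
    }

    /** Maps A12 to a magnitude label using commonly cited thresholds (an assumption here). */
    static String magnitude(double a12) {
        double folded = Math.max(a12, 1.0 - a12); // treat both directions symmetrically
        if (folded >= 0.71) {
            return "large";
        } else if (folded >= 0.64) {
            return "medium";
        } else if (folded >= 0.56) {
            return "small";
        }
        return "negligible";
    }

    public static void main(String[] args) {
        double[] withBbc = {3, 4, 5};   // made-up values, for illustration only
        double[] baseline = {1, 2, 3};
        double effect = a12(withBbc, baseline);
        System.out.println(effect + " (" + magnitude(effect) + ")");
    }
}
```

For metrics where lower is better, such as the time consumed in RQ 2.2, the interpretation flips: a value below 0.5 for the first sample indicates an improvement.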
Summary (RQ 1). We see an improvement of the branch coverage of the generated tests when activating BBC as a secondary objective in DynaMOSA. This improvement in branch coverage also leads to an increase of the output and exception coverage, and of the diversity of runtime states (denoted by an increase of the weak mutation score). Among the different configurations, BBC 0.5 gives the best results and those results remain stable over time. It also leads to the coverage of three additional faults in Defects4J without any loss compared to the baseline. Given our results, we can recommend using BBC 0.5 as a secondary objective for unit test generation.

Search-based crash reproduction (RQ 2)
Crash reproduction effectiveness (RQ 2.1). Figure 11 presents the crash reproduction ratio of the search processes guided by STDistance (Figure 11a) and WeightedSum (Figure 11b), with and without BBC as a secondary objective. This figure shows that, on average, the crash reproduction ratio of WeightedSum improves by 3.3 percentage points when using BBC: the average crash reproduction ratio of WeightedSum is 74.8% (with standard deviation 38.1%), while the average crash reproduction ratio of WeightedSum + BBC increases to 78.1% (with standard deviation 36.1%). This improvement is higher for crash reproduction using STDistance. On average, the crash reproduction ratio achieved by STDistance + BBC is 9.2 percentage points higher than that of STDistance without BBC: STDistance achieves a 70.5% (with standard deviation 38.1%) average crash reproduction ratio, while the average crash reproduction ratio of STDistance + BBC is 79.7% (with standard deviation 37.3%). A higher improvement for STDistance was expected, as this fitness function relies more on the approach level and branch distance heuristics for covering each of the frames in the given stack trace. Also, for both fitness functions, the lower quartile of the crash reproduction ratio is improved by utilizing BBC. These improvements in crash reproduction ratio for WeightedSum and STDistance are 19.1% and 31.7%, respectively.
To make our observations in Figure 11 more robust, we performed an additional statistical analysis. Figure 12 depicts the number of crashes for which BBC has a significant impact on the effectiveness of crash reproduction guided by STDistance (Figure 12a) and WeightedSum (Figure 12b). BBC significantly improves the crash reproduction ratio in 10 and 4 crashes for the fitness functions STDistance and WeightedSum, respectively. Notably, the application of this secondary objective does not have any significant adverse effect on crash reproduction. Tables 4 and 5 present the odds ratio and p-value in cases where BBC leads to a significant improvement in the crash reproduction ratios of WeightedSum and STDistance, respectively. As we can see in these tables, the odds ratio values in all cases are lower than or equal to 0.2, indicating the high impact of BBC. Finally, we observed that BBC helps each of STDistance and WeightedSum to reproduce 3 new crashes that could not be reproduced without this secondary objective.
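As an aid for reading the odds ratios in Tables 4 and 5, the sketch below shows how an odds ratio can be derived from the number of successful reproductions of two configurations over the same number of runs. The class name, the method signature, and the input values are hypothetical. The 0.5 continuity correction applied when a cell is zero is a common convention that we assume here; the text does not state how such cases were handled. With the baseline passed first, values below 1 favor the second configuration, which matches the direction reported above.

```java
/** Hypothetical helper computing an odds ratio from reproduction counts. */
final class OddsRatio {

    /**
     * Odds of success of the first configuration divided by the odds of the second.
     * Both configurations are assumed to be executed {@code runs} times.
     */
    static double oddsRatio(int successesFirst, int successesSecond, int runs) {
        if (runs <= 0 || successesFirst < 0 || successesSecond < 0
                || successesFirst > runs || successesSecond > runs) {
            throw new IllegalArgumentException("counts must lie in [0, runs] with runs > 0");
        }
        double a = successesFirst;          // first: reproduced
        double b = runs - successesFirst;   // first: not reproduced
        double c = successesSecond;         // second: reproduced
        double d = runs - successesSecond;  // second: not reproduced
        if (a == 0 || b == 0 || c == 0 || d == 0) {
            // Assumed continuity correction to avoid division by zero.
            a += 0.5;
            b += 0.5;
            c += 0.5;
            d += 0.5;
        }
        return (a * d) / (b * c);
    }

    public static void main(String[] args) {
        // Made-up counts: baseline reproduces 5/30 runs, baseline + BBC reproduces 25/30.
        System.out.println(oddsRatio(5, 25, 30));
    }
}
```

In the made-up example, the baseline odds are 5/25 and the odds with the secondary objective are 25/5, giving a ratio of 0.04, well below the 0.2 bound mentioned above.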
Crash reproduction efficiency (RQ 2.2). Figure 13 illustrates the number of crashes for which BBC significantly affects the time consumed by the crash reproduction search process. As Figure 13b shows, BBC significantly improves the speed of crash reproduction guided by WeightedSum in 54 crashes (43.5% of cases), while it does not lose efficiency in the reproduction of any crash. Similarly, Figure 13a shows that BBC has a higher positive impact on the efficiency of the search process guided by STDistance. It significantly reduces the time consumed by the search process in 56 crashes (45.1% of cases), while it has no adverse impact on the reproduction efficiency of any crash.
Figure 14 depicts the average improvements in the efficiency and the effect sizes for crashes where the difference in the consumed budget with and without BBC was significant. According to the right-side plot in Figure 14a, BBC reduces the time consumed by the search process guided by STDistance by up to 98% (71.7% on average). Also, the left-side plot indicates that the average Vargha-Delaney effect size of the differences between STDistance and STDistance + BBC is 0.102 (a value lower than 0.5 indicates that BBC improved the efficiency). Figure 14b shows that the average improvement (right-side plot) achieved by using BBC as
[...] Rate). This approach might however not be optimal. For instance, for classes under test with a high number of implicit branches, activating BBC sooner and more often might improve the search process. In our future work, we will explore how the secondary objective can be dynamically adapted during the search, for instance, based on the evolution of the fitness values of the different objectives in DynaMOSA.

BBC for crash reproduction
Generally, using BBC as a secondary objective leads to a better crash reproduction ratio and higher efficiency in search-based crash reproduction. This improvement is achieved thanks to the additional ability to guide the search process when facing implicit branches during the search. Combining BBC with STDistance yields a larger improvement than combining BBC with WeightedSum. This result was expected, since only one (out of three) components in WeightedSum is allocated to line coverage, and thereby most parts of the fitness function do not use the approach level and branch distance heuristics. In contrast, STDistance uses the approach level and branch distance to cover each of the frames in the given stack trace incrementally.
Our results show that BBC helps the crash reproduction process to reproduce new crashes. For instance, the crash that we used in this study (XWIKI-13377) can be reproduced only by STDistance + BBC.

Threats to validity
Internal validity. We cannot guarantee that our implementation of BBC in EvoSuite and Botsing is bug-free. However, we mitigated this threat by testing our implementations and manually examining some samples of the results. Moreover, following the guidelines of the related literature [5], we executed each configuration 30 times to take the randomness of the search process into account.
External validity. We cannot ensure that our results are generalizable to all cases. However, for our experiments on unit test generation and crash reproduction, we used two previously established benchmarks: JCrashPack [50], a benchmark for crash reproduction containing 124 hard-to-reproduce crashes provoked by real bugs in a variety of open-source applications, and Defects4J [29], a collection of failures from real-world Java projects containing 835 bugs.
To increase the external validity while maintaining a good balance between the statistical power and the overall execution, analysis, and reporting time, we chose to consider only the ten most recent bugs from the 17 projects available in Defects4J. After filtering out classes that cannot be handled by EvoSuite, we ran our evaluation on 219 classes. Among those 219 classes, 44 come from different versions of the same projects. Although involved in different bugs, those classes might be similar and influence our results. To mitigate this threat, we performed a qualitative analysis to confirm the effect of BBC.
Construct validity. For unit test generation (RQ 1), we left the parameters of DynaMOSA at their default values used by EvoSuite. Those values are commonly used in the literature and it has been empirically shown that they give good results [6,23,42,43]. We cannot, however, guarantee that these default values are the best when used with BBC. Nevertheless, our results show that BBC can improve search-based unit test generation when using the default parameter values.
For search-based crash reproduction (RQ 2), we used BBC with two different fitness functions and left other parameters at their default values, used in previous studies [18,52]. Those studies do not investigate the sensitivity of search-based crash reproduction to these values, and tuning these parameters should be undertaken as future work. However, as for unit test generation, our results show that BBC can improve search-based crash reproduction with the default parameter values.

Conclusion validity. We based our conclusions on standard statistical analysis for significance [5] with α = 0.01. Effects of multiple comparisons are mitigated by adjusting p-values via Nemenyi's post-hoc procedure [28,41]. Furthermore, we complemented our quantitative analysis with qualitative investigations to confirm the observed effects.
Verifiability. Finally, we openly provide all our implementations: Botsing [14], as an open-source crash reproduction tool, and the implementation of BBC in EvoSuite [12]. Also, the data and the processing scripts used to present the results are available as two replication packages on Zenodo [11,12].
8 Related work

Handling implicit branches
Related to our approach, the Testability Transformations (TT) technique addresses the problem of implicit branches in unit test generation [24,31]. This strategy transforms the code to make implicit branches explicit by adding extra branches for error conditions, which brings more guidance for the approach level and branch distance heuristics. To transform the code of each class, TT needs extra bytecode instrumentation. Since instrumenting some classes can be difficult due to several known issues [21], instrumenting each class that is coupled with the class under test may fail. Also, if we limit the testability transformations to the class under test, the search process will not have any extra guidance when facing implicit branches in the other classes.
In this study, we tried to evaluate TT in DynaMOSA. However, EvoSuite failed before starting the search process for all the different classes under test. After a deeper investigation, we found out that TT is not compatible with DynaMOSA, which is the default search algorithm in EvoSuite. Moreover, TT faces extra challenges because it needs extra bytecode instrumentation. In theory, given the nature of TT and BBC, these two techniques can be applied simultaneously to the search process. Hence, these two approaches can complement each other to achieve high structural coverage and detect more faults. Studying the impact of using both TT and BBC on search-based test generation calls for further implementation effort and is therefore part of our future research agenda.

Search-based crash reproduction
Many previous papers have studied search-based crash reproduction approaches. Two of these papers introduced new fitness functions to guide the search process. EvoCrash [52] measures the distance of a generated test from a given crash, and Rößler et al. [47] have proposed an approach called ReCore to guide the crash reproduction search process using the given crash and core dump. We have described both these approaches with their corresponding fitness functions in Section 2.3; we consider them as baselines in our evaluation.
We have previously performed multiple studies on the search process introduced in EvoCrash. One of our recent studies evaluated the crash reproduction ability of EvoCrash against 200 real-world crashes [50]. We have also performed an extensive manual analysis of the EvoCrash execution results to identify the challenges in this search process, and carried out studies on other aspects of this search process to address some of the identified challenges. For instance, we have proposed an approach called Behavioral Model Seeding [16]. In this approach, the usages of objects in the source code of the software under test are transformed into transition systems, and these models are later used for generating more realistic solutions (i.e., tests) during the search process. Furthermore, in other studies [15,51], we rely on multi-objectivization techniques to improve the diversity of the population during the search.
Each of the aforementioned studies shows that the proposed approach can improve crash reproduction in its own way. All of these studies use the EvoCrash approach as a baseline. We also used this approach (i.e., WeightedSum) as a baseline, and our results are consistent with those of our prior studies [15,16] (i.e., the no seeding configuration in [16], and the Single configuration in [15]). However, it should be noted that these results can slightly differ from [51] and [50], as the experiments for these studies were performed using the EvoCrash tool. We previously re-implemented the EvoCrash approach (including the WeightedSum fitness function) in Botsing [14], a framework for search-based crash reproduction. Since Botsing is a well-tested and more mature tool compared to the early versions of EvoCrash, it can achieve more stable results.
In this study, we applied BBC only on WeightedSum. We have not considered other strategies introduced in our previous studies [15,16,50] because each of these strategies works independently, and thereby can be applied simultaneously on the search process. For instance, model seeding improves the test generation capability of the search process, while BBC focuses on improving the guidance that the secondary objectives can provide for the search process. Hence, both of them can be activated during the crash reproduction process. If we wanted to apply BBC for each of these strategies, we would have many configurations to assess and compare. This kind of analysis is out of the scope of this study, which only concentrates on BBC, and calls for further studies in our future work.
In addition, for the first time, we have also considered the STD fitness function as one of the baselines [47]. As we explained in Section 2.3, STD is part of a main fitness function in ReCore. This sub-function measures the distance of a generated test from covering a given crash. Since this study assumes that we only have the crash stack trace and no other information like core dumps, we only implemented STD as an independent fitness function in Botsing.

Conclusion and future work
Approach level and branch distance are two well-known heuristics, widely used by search-based test generation approaches to guide the search process towards covering target statements and branches. These heuristics measure the distance of a generated test from covering the target using the coverage of control dependencies. However, these two heuristics do not consider implicit branches. For instance, if a test throws an exception during the execution of a non-branch statement, approach level and branch distance cannot guide the search process to tackle this exception. In this paper, we extended our previous work on Basic Block Coverage (BBC), a secondary objective addressing this issue. We complemented our previous study of BBC for search-based crash reproduction with an investigation of BBC for unit test generation.
Our results show that BBC improves the branch coverage for unit tests generated using DynaMOSA. Although small (∼1.5%), this improvement in the branch coverage is systematic and leads to an increase of the output and implicit runtime exception coverage, and of the diversity of runtime states. BBC also helps each of STDistance and WeightedSum to reproduce 3 new crashes. Finally, BBC significantly improves the efficiency in 45.1% and 43.5% of the crashes using STDistance and WeightedSum, respectively.
An important implication of our work for future research is that we need to investigate secondary search objectives that can be dynamically activated depending on the software under test. In this work, we applied the activation mechanism for secondary search objectives (BBC) based on user-provided (static) meta-parameters. We have seen indications that such a mechanism can both improve the search process and at the same time reduce the computational cost, yet it can be counter-productive in some cases. We envision that BBC and other secondary objectives would benefit from an adaptive activation, depending on the runtime behavior (e.g., if the number of implicit runtime exceptions increases) or structure (e.g., high coupling or deep inheritance hierarchy) of the classes under test.
In our future work, we will investigate the application of BBC for other search-based test generation techniques (such as testability transformations, and system and integration testing), as well as the implications of an increase of the diversity of program states in the generated unit tests (e.g., for assertion generation). We will also investigate how BBC can be dynamically activated using an adaptive secondary objectives approach to reduce the computational overhead on the search process.

Listing 3 BBC secondary objective computation algorithm
input: test T1, test T2, String method, int line
output: int
begin
    FCB1 ← fullyCoveredBlocks(T1, method, line);
    FCB2 ← fullyCoveredBlocks(T2, method, line);
    SCB1 ← semiCoveredBlocks(T1, method, line);
    SCB2 ← semiCoveredBlocks(T2, method, line);

Fig. 2 Distribution of the usefulness of BBC activations per fitness evaluations. The usefulness is defined as the number of BBC evaluations returning a non-zero value divided by the number of activations. Grey points denote fitness evaluations without any BBC activation.

Fig. 3 Branch coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square denotes the arithmetic mean, the bold line (-) is the median.

Fig. 5 Output coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square denotes the arithmetic mean, the bold line (-) is the median.

Fig. 6 Exception coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square denotes the arithmetic mean, the bold line (-) is the median.

Fig. 7 Non-parametric multiple comparisons of the coverage using Friedman's test with Nemenyi's post-hoc procedure.

Fig. 8 Weak mutation score of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC. The square denotes the arithmetic mean, the bold line (-) is the median.

Listing 9 Method getDataRange from JFreeChart
public Range getDataRange(ValueAxis axis) {
    [...]
    // iterate through the datasets that map to the axis and get the union
    // of the ranges.
    Iterator iterator = mappedDatasets.iterator();
    while (iterator.hasNext()) {
        XYDataset d = (XYDataset) iterator.next();
        if (d != null) {
            XYItemRenderer r = getRendererForDataset(d);
            if (isDomainAxis) {
                if (r != null) {
                    result = Range.combine(result, r.findDomainBounds(d));
                } else {
                    result = Range.combine(result, DatasetUtilities.findDomainBounds(d));
                }
            } else {
                if (r != null) {
                    result = Range.combine(result, r.findRangeBounds(d));
                } else {
                    result = Range.combine(result, DatasetUtilities.findRangeBounds(d));
                }
            }
            Collection c = r.getAnnotations(); // target line

Fig. 10 Evolution of the branch coverage of the tests generated for the 219 classes under test (out of 30 executions) for different configurations of BBC.

Fig. 11 Crash reproduction ratio (out of 30 executions) of fitness functions with and without BBC. The square denotes the arithmetic mean and the bold line (-) is the median.

Fig. 12 Pairwise comparison of the impact of BBC on each fitness function in terms of crash reproduction ratio with a statistical significance < 0.01.

Fig. 13 Pairwise comparison of the impact of BBC on each fitness function in terms of efficiency with a small, medium, and large effect size Â12 < 0.5 and a statistical significance < 0.01.

Table 1 Classes under test used for the evaluation of BBC for unit testing (RQ 0 and RQ 1): number of classes under test (CUTs), number of non-commented source statements per class (NCSS), number of methods per class (Methods), weighted methods per class (WMC), and cyclomatic complexity per method (CCN).

Table 2 Statistics about the number of objectives (Obj.), fitness evaluations (Fitness eval.), calls to BBC evaluations (BBC calls), calls effectively leading to an evaluation of the BBC (BBC active), and evaluations returning a non-zero value (BBC useful).
The fault in CHART-4 which is captured significantly more often by tests generated by the search process utilizing BBC secondary objective.

Table 4 Comparing the crash reproduction ratio between crash reproduction using WS and WS + BBC, for cases where one of the configurations has a significantly higher crash reproduction ratio (p-value < 0.01).

Table 5 Comparing the crash reproduction ratio between crash reproduction using STD and STD + BBC, for cases where one of the configurations has a significantly higher crash reproduction ratio (p-value < 0.01).