1 Introduction

Verification is an integral part of software development processes [54, 67]. Next to code reviews and static code analysis, testing is a widely adopted quality-assurance technique. Since manually constructing test suites is laborious, automatic test-case generation techniques are used, where possible. Black-box testing uses, for example, model-based techniques, and white-box testing might be based on control-flow coverage. In this paper, we are interested in white-box techniques for structural coverage. Existing, automatic test-generation techniques in this area range from random testing [68, 104] and fuzzing [50, 95, 96], over search-based testing [99] to symbolic execution [43, 47, 93, 105] and reachability analyses [13, 25, 80, 81].

Reachability analyses perform well when they are applied to bug finding. This is supported by a recent case study that compared model checkers with test tools w.r.t. bug finding [32]. Furthermore, reachability analyses derive test suites that achieve high coverage, and several verification tools support test-case generation (e.g., Blast [13], PathFinder [123], CPAchecker [25]). At the first glance, performing a reachability check for each test goal seems too expensive. However, due to tremendous advances in software verification [11], in practice, those reachability analyses can be made pretty efficient. This motivated us to use reachability analyses for test-case generation.

It is well known, e.g., from the International Competition on Software Verification (SV-COMP) [11], that reachability analyses come with different strengths and weaknesses. For example, consider function foo in Listing 1. Let us assume that we want to generate test cases for all branches. Explicit state model checking [35, 89] cannot detect the infeasibility of the if-branch in line 5 because it cannot capture a concrete value for variable n. In contrast, the different branches of the while loop can easily be reached when explicitly tracking the concrete values of variable s. An

figure a

analysis based on predicate abstraction [71] must learn the predicates \(n<0,i\ge 0, s=1\) to detect the infeasibility of the if-branch in line 5. To reach the different branches in the while loop, predicates \(s=1\) and \(s=2\) are required. Learning these predicates might need expensive, counterexample-guided abstraction-refinement [53] steps. In CPAchecker, about half of the test-case generation time is spent on the refinement. For bounded model checking [41], it is easy to detect that the condition \({\texttt {i<n}}\) is not feasible, but it needs to unroll the while loop at least three times. Thus, the loop bound k must be increased multiple times, and each time bounded model checking is restarted. Symbolic execution [93] can detect the infeasibility of the if-branch in line 5. Moreover, like explicit-state model checking, symbolic execution easily covers the branches occurring in the while loop, but it may fail to terminate the exploration of the while loop.

The combination of approaches, which is applied in a wide area of computer science [49, 94, 108, 124], is a typical solution to overcome weaknesses of different approaches. Also, test and verification approaches already employ combinations. Existing approaches can be classified into parallel combinations [19, 42, 68, 82, 100, 116, 117, 119], sequential combinations [28, 33, 51, 57, 66, 83, 88], selective combinations [4, 15, 61, 65, 91, 121], nested combinations [6, 7, 64, 113], and interleaved combinations [8, 60, 73, 98, 120]. In this paper, we are particularly interested in interleaved combinations because we think they provide a good trade-off between the effort for their implementation and the efficiency of the combined approaches. Existing interleaved test-case generation techniques typically alternate a fixed set of approaches and prescribe which information is exchanged. We propose a new cooperative, verifier-based testing approach called CoVeriTest, which interleaves different reachability analyses and exchanges various types of analysis information between analyses. In contrast to existing interleaving approaches, CoVeriTest allows us to configure the analyses that will be combined and the level of cooperation, i.e., which information is exchanged.

CoVeriTest is inspired by abstraction-driven concolic testing [60], which interleaves concolic execution and predicate abstraction and informs the concolic execution about infeasible paths detected by the predicate analysis. In detail, CoVeriTest iteratively executes a configurable sequence of reachability analyses. In each iteration, the analyses are run sequentially and each analysis in the sequence is limited to its individual, but configurable time limit. Moreover, we can configure CoVeriTest to exchange different types of information gained during a reachability analysis, e.g., which paths are infeasible, have already been explored, or which abstraction level to use. We implemented CoVeriTest in the configurable software-analysis framework CPAchecker [29], which offers a large variety of reachability analyses, and use a large, well-established benchmark set to evaluate 126  CoVeriTest configurations. Furthermore, we compare CoVeriTest with two existing state-of-the-art test-generation tools. Our experiments confirm that the CoVeriTest approach is valuable for test generation.

Summing up, we make the following contributions:

  • We present the test-generation approach CoVeriTest that supports flexible, high-level interleavings of reachability analyses with information exchange.

  • We perform an extensive evaluation of CoVeriTest studying 126  different CoVeriTest cooperation setups and comparing CoVeriTest against two state-of-the-art test-generation tools.Footnote 1

  • CoVeriTest ’s open-source tool implementation and our experimental data are publicly available for others to reproduce our results (see Sect. 6).

This paper is an extended version of a paper presented at FASE 2019 [26]. To become self-contained, we added material about test-case generation from counterexamples and about condition construction. We enriched our explanations with examples. Furthermore, we added a new cooperation type transform-precision to CoVeriTest, which transfers information about the abstraction level from one analysis to another. Our experiments are extended to study the new cooperation type and a second combination of verifiers. All our experiments are run on the same version of the benchmark set. CoVeriTest participated in Test-Comp 2019 [84] and 2020 [85].

2 Foundations of test-case generation with reachability analyses

2.1 Programs

Following literature [22], we model a program by a control-flow automaton (CFA).

Definition 1

(Control-Flow Automaton) A control-flow automaton \(P=(L,\ell _0,G)\) consists of

  • a set L of program locations, which represent the program counter values,

  • an initial program location \(\ell _0\in L\), and

  • a set \(G\subseteq L\times Ops\times L\) of control-flow edges, where Ops denotes the set of all possible operations.

Fig. 1
figure 1

CFA for example program foo

Figure 1 shows the control-flow automaton for the function body of our example foo from Listing 1. The CFA contains an edge for every statement and two, dashed, blue edges for every condition in an if- or while-statement. The two edges represent the two possible evaluations of the respective condition.

For the semantics of a CFA, we assume a standard operational semantics, which we do not further detail.

2.2 Test goals

Generally, our approach should work for all elementary coverage criteria [81], i.e., criteria that can be expressed by a set of independent predicates on program paths (the test goals). For simplicity, we demonstrate our CoVeriTest approach on simple, structural coverage properties like branch coverage. In our program representation, structural coverage properties map to coverage of control-flow edges. Thus, we formally represent a set of test goals by a subset of a program’s control-flow edges.

Definition 2

(Test Goals) A set of test goals for a CFA \(P=(L, \ell _0, G)\) is a subset goals \(\subseteq G\) of control-flow edges.

Looking at program foo in Fig. 1, the test goals for branch coverage are the blue, dashed edges.

Our test-goal format is not a standard specification format for reachability analyses, which we want to apply for test generation. Specifications for reachability analyses typically encode when a property violation is reached, e.g., using an observer automaton [14]. Figure 2 shows how to turn a set of test goals into a common observer automaton specification. The observer automaton monitors the control-flow edges reached and traverses from the initial, safe state \(q_0\) to the violation state \(q_e\) when an edge from the set of test goals is explored.

Fig. 2
figure 2

Encoding test goals as specification violation

2.3 Generating tests from counterexamples

CoVeriTest employs reachability analyses to construct feasible counterexamples for test-goal specification violations. These counterexamples reflect executions that cover at least one test goal. Since counterexamples are not standard tests, they need to be transformed into tests. More concretely, we need to extract a test vector, i.e., a sequence of test inputs for parameters and external methods like random, scanf, __VERIFIER_nondet_int. We write our test vectors in the formatFootnote 2 used by the International Competition on Software Testing [12].

To compute a test vector from a feasible counterexample, we follow the approach of Blast [13], which we briefly sketch now:

  1. 1.

    Compute the strongest postcondition [62] for the counterexample. We use an encoding that applies skolemization to replace existential quantifiers and that works similar to static single assignment [1].

  2. 2.

    Get a satisfying assignment for the strongest postcondition formula, e.g., using an SMT solver.

  3. 3.

    In the order of the occurrence of all inputs, find out the corresponding variable for the input in the formula, look up its value in the satisfying assignment, and write the value to the test vector.

For example, consider the following counterexample:

$$\begin{aligned} \ell _0\xrightarrow {\texttt {s=1;}}\ell _1\xrightarrow {\texttt {i=0;}}\ell _2\xrightarrow {\texttt {!(n<0)}}\ell _7\xrightarrow {\texttt {i<n}}\ell _8\xrightarrow {\texttt {s==1}}\ell _9\end{aligned}$$

The strongest postcondition for this counterexample is

$$\begin{aligned} s_1=1\wedge i_1=0 \wedge \lnot (n_0<0) \wedge i_1<n_0 \wedge s_1=1 \end{aligned}$$

and a corresponding satisfying assignment is

$$\begin{aligned} n_0\mapsto 1 \quad \quad i_1\mapsto 0 \quad \quad s_1\mapsto 1. \end{aligned}$$

The only input in our example is the parameter n, which is referenced by \(n_0\) in the formula. The generated test looks as follows:

figure b

2.4 Abstract reachability graphs

CoVeriTest uses abstract reachability graphs (ARGs) [13] for cooperation between analyses: ARGs are the primary artifacts for exchanging information inside CoVeriTest. We also extract counterexamples, which we need for test generation, from this data structure. An ARG is constructed for a program \(P = (L,\ell _0,G)\) and tracks the work performed by a reachability analysis, that is, stores the abstract state space that has been explored so far and the frontier nodes (abstract states that still need to be explored). The representation of the explored abstract state space (abstract states and successor relation) depends on the respective reachability analysis. For example, a value analysis represents abstract states as value assignments, while a predicate-abstraction analysis represents abstract states as predicates. Additionally, the ARG keeps the information about the abstraction level of an analysis, e.g., tracked variables and considered predicates, respectively.

Definition 3

(Abstract Reachability Graph) An abstract reachability graph \(ARG=(N, \textit{succ}, \textit{root}, W, \pi )\) for a CFA \(P=(L, \ell _0, G)\) consists of

  • a set N of explored abstract states,

  • a relation succ \(\subseteq N \times G \times N\) that describes the already explored successor relations,

  • the abstract state \(\textit{root}\in N\),

  • a set \(W \subseteq N\) of frontier nodes, and

  • a precision \(\pi \) determining the abstraction level.

Next to the syntactical requirements mentioned in the above definition, an ARG must also fulfill the following semantic requirement: Each ARG node \(n\in N\) must either be contained in W or all its abstract successors must have been explored. We do not formalize this requirement since this would require a complete formalization of abstract reachability analyses.

Fig. 3
figure 3

ARG for example program foo

Figure 3 shows an ARG for our example foo. For simplicity, we labeled ARG edges by program operations instead of CFA edges. The reachability analysis constructing the ARG finished the exploration of the outer if-branch and stopped exploration when reaching a specification violation after executing statement s==1. The path in red shows the counterexample paths from Sect. 2.3. Furthermore, dotted states belong to the set of frontier nodes. All abstract states (nodes) contain information about the program counter (location information). Data values are abstracted by a set of predicates [71]. Here, we consider predicates \(s=1, i\ge 0,\) and \(n<0\). Furthermore, the test-goal specification from Fig. 2 is monitored. For the example, we assume that the set of test goals only contains the conditions of the if-statements in the while-loop, e.g., all other test goals for branch coverage have already been covered.

2.5 Condition

Conditions were first proposed as an information exchange format in conditional model checking [23], a cooperative verification approach. A condition describes which program execution paths have already been explored by an analysis. Analyses can use conditions to focus their exploration on unexplored paths. However, verifiers that do not understand conditions can safely ignore them.

A condition is commonly modeled as an automaton that accepts the verified program paths. Verified program paths are described using syntactical program paths in combinations with data assumptions on the syntactical paths. For this paper, we ignore assumptions in our formal definition of conditions.

Definition 4

A condition \(A=(Q,\delta ,q_0,F)\) consists of

  • a finite set of states Q,

  • a transition relation \(\delta \subseteq (Q\setminus F)\times 2^G\times Q\), where G denotes the set of control-flow edges of a program,

  • an initial state \(q_0\in Q\), and

  • a set \(F\subseteq Q\) of accepting states.

The specific condition \(A_\mathrm {none}=(\{q_0\},\emptyset , q_0,\emptyset )\) describes that no exploration has been done, i.e., the complete state space needs to be explored.

In CoVeriTest, we use conditions to focus test-case generation on paths that have not been explored by the previous analysis run. Conditions used in CoVeriTest do not impose any data restriction on syntactical paths.

Like in sequential conditional model checking [23], CoVeriTest will extract conditions from ARGs. Alg. 1 describes the extraction process. The idea is to copy incompletely explored ARG paths (i.e., paths ending in a frontier node \(f\in W\)) to the condition and to reduce the completely explored state-space parts to a single accepting state. In lines 2–6, the algorithm performs a backward search from the set W of incompletely explored nodes to detect all nodes on paths leading to frontier nodes. All these detected nodes become states in the condition. Line 7 builds the set of accepting states, which represent all non-extendable completely explored subgraphs of the ARG. Thus, we add one accepting state, whenever an ARG edge leads from a Q-state (state on an incompletely explored path) to a non-Q-state. Line 8 puts it all together and constructs a correct condition. The condition states are given by the union of the accepting states and the states detected in lines 2–6. The transition relation is a restriction of the ARG’s successor relation to all edges on incompletely explored paths.

figure c

Figure 4 shows the condition constructed by Alg. 1 from our example ARG shown in Fig. 3. For the sake of representation, we labeled the edges only by operations instead of complete CFA edges. Since reachability analyses typically construct ARGs that are unrollings of CFAs, conditions extracted from ARGs are sparse and can be efficiently represented by an adjacency list.

Fig. 4
figure 4

Condition constructed from ARG shown in Fig. 3

2.6 Test generation with reachability analyses

Having introduced the basics, we finally describe how to generate tests with a single reachability analysis. Algorithm 2 depicts the test-generation workflow. For test generation, the algorithm of course needs the program and the set of test goals. We also provide a time limit for test generation. To apply the algorithm in CoVeriTest, we need two additional inputs, an initial ARG and a condition. These two inputs provide the information gained in cooperation to the component test-generation algorithm. Basically, they guide the state-space exploration. Without cooperation one would use condition \(A_\mathrm {none}\) and an ARG consisting only of the root node (the beginning of the state-space exploration) and an empty precision.

figure d

First, Algorithm 2 initializes the data structures for the test suite and the set of covered goals. Additionally, it generates the initial specification as shown in Fig. 2. Then, it generates tests until all test goals are covered, the state space is explored completely (\({\texttt {arg}}.\text {W}=\emptyset \)), or the time limit is exceeded. Finally, it returns the generated test suite, the set of covered goals, and the last ARG built. The ARG is only returned to enable cooperation.

To generate tests, Algorithm 2 continues the exploration of the current ARG taking into account program prog, specification \(\varphi \), current ARG arg, (if understood) condition \(\psi \), and the remaining test-generation time. The exploration stops due to three reasons: (1) the state space is explored completely (\({\texttt {arg}}.\hbox {W}=\emptyset \)), (2) the time limit exceeded, or (3) a counterexample has been found.Footnote 3 Reasons (1) and (2) indicate that the test-generation process should be stopped. Only when reason (3) applies, a test is generated. The test is generated from a counterexample. First, the counterexample path, which is a path from the root to a violating state, needs to be extracted from the ARG. Since only the traversal of a test goal leads to a specification violation (see Fig. 2), the last edge on the path is a test goal. After path extraction, Alg. 2 generates a test from the counterexample path following the procedure explained earlier and adds the test to the test suite. To finish test generation, the covered test goal (last edge of the counterexample path) is added to the set of covered test goals and removed from the set of (open) test goals. Since the set of test goals might change during loop-body execution, finally the specification \(\varphi \) is updated.

Fig. 5
figure 5

Visualization of CoVeriTest ’s analysis cycle

3 CoVeriTest approach

The previous section introduced the basic concepts for test generation with a single component reachability analysis. In this section, we describe CoVeriTest.

3.1 CoVeriTest workflow

CoVeriTest combines different reachability analyses for test generation to accommodate for the different strengths and weaknesses of reachability analyses: certain test goals are harder to cover for one analysis than for another. To optimize the number of covered goals while keeping the instantiation effort of the combination simple, we decided to rotate analyses for test generation in cycles. In contrast to a sequential combination, analyses that get stuck trying to cover a particular goal may recover later. In advantage over parallel combination possibilities, we avoid to cover the same goals in parallel. In advantage over algorithm selection, we do not need to know in advance which analysis can cover which goal. Moreover, CoVeriTest supports cooperation among analyses, allowing them to exchange information about their exploration using ARGs.

Before we explain the CoVeriTest algorithm (Alg. 3), we provide an overall description of the workflow using the illustration in Fig. 5. While the standard test-generation in Alg. 2 gets an ARG and a condition, CoVeriTest gets a sequence of analysis configurations. An analysis configuration is a pair of an analysis and an individual time limit for that analysis. An analysis run is the execution of an analysis configuration. One analysis cycle is a sequence of n analysis runs, which is defined by a sequence of n analysis configurations. The analysis cycle is the core of CoVeriTest ’s alternating test-generation process, according to which ARG and condition are set up before running an analysis. As illustrated in Fig. 5, CoVeriTest cycles through a sequence of n analysis runs, each defined by an analysis configuration (analysis i, time limit i). Each analysis run gets initialized with information from the central data store, which contains the program, the remaining test goals, and a sequence of previously constructed ARGs. When an analysis terminates, it adds its constructed ARG in the central data store and adds the produced test cases to the central test suite.

Now consider Alg. 3, which more formally defines the overall CoVeriTest algorithm. The inputs prog, goals, and total_limit are similar to the inputs in Alg. 2. At the beginning (line 1), CoVeriTest initializes the test suite, the sequence of ARGs, and the index of the current analysis configuration. Then, it iterates over the analysis configurations. In each iteration, CoVeriTest extracts the current analysis configuration (pair of analysis and time limitFootnote 4), sets up and runs the respective analysis, and afterward registers the results of the analysis run in its data structures. The newly generated tests are added to the test-suite. Covered test goals are removed from the set of (open) test goals, and the sequence of ARGs is extended by the returned ARG. The iteration stops if all test goals are covered, the global time limit is exceeded, or the waitlist W of the current analysis is empty. At the end, CoVeriTest returns the generated test suite.

figure e

Now, let us look closer at the specifics of the CoVeriTest workflow. In line 9, CoVeriTest checks whether the last analysis run finished its exploration. If the analysis finished its exploration, then all reachable goals are detected and all remaining, uncovered goals are unreachable. Thus, the generated test suite is returned. Otherwise, the index of the next analysis configuration is set for the next iteration. The method coopAndInit sets up the analysis, i.e., it prepares the ARG and condition. During preparation, it may reuse information from previous ARGs and, thus, supports the cooperation between different iterations in CoVeriTest. The cooperation options are explained next when describing the method coopAndInit.

figure f

Method coopAndInit (Alg. 4) is responsible for the setup of the cooperation between analysis runs. CoVeriTest provides the following 4 cooperation options: reuse-arg, reuse-precision, use-condition, andtransform-precision. We distinguish between two types of cooperation, which mainly differ in the cooperation partner. First, we provide cooperation between different analysis runs of the same analysis (in different cycles): reuse-arg and reuse-precision. Currently, CoVeriTest allows to reuse the complete ARG or the precisionFootnote 5 of the previous run of an analysis. Second, we offer cooperation between different analyses: use-condition and transform-precision. On the one hand, CoVeriTest can use conditions to exclude those paths from future exploration that have been explored by another analysis. On the other hand, CoVeriTest can refine the abstraction level of a given analysis by adding information from the precision of other analyses to the precision of the given analysis.

Except for the reuse-arg option, all cooperation options can be combined arbitrarily. The option reuse-arg cannot be combined with the other three options due to the following reasons: (1) a combination with the reuse-precision option is always implied because the ARG already contains the precision. (2) A combination with the options for cooperation between different analyses is technically incompatible. To integrate information from other analyses in the exploration, currently the analysis needs to restart its exploration.

CoVeriTest uses ARGs to exchange information between analyses. Thus, callers of the method coopAndInit must transfer the sequence of already generated ARGs. To distinguish ARGs constructed by the same analysis from ARGs constructed by different analyses, Alg. 4 is also provided the number of different analysis configurations used by CoVeriTest.Footnote 6 Additionally, the program, which is needed for the setup, is passed to coopAndInit.

After having clarified the inputs, we now explain the behavior of Alg. 4. In line 1, it starts with the initialization of the ARG components and the condition. Thereby, it is pessimistic and assumes that no cooperation is enabled. The condition is set to the condition \(A_\mathrm {none}\), i.e., no path has been explored, and the precision is empty. Additionally, the root node describes all states pointing to the beginning of the program, i.e., the beginning of any new exploration. Lines 2–11 update precisions and conditions according to the configured cooperation options. Line 4 or line 12 returns the setup for the next analysis run. The ARG returned in line 12 only contains the root node, which needs to be explored. Hence, the next analysis run will restart the state-space exploration.

Lines 2–6 are responsible for cooperation of the same analysis. Such a cooperation is of course only possible when the analysis has been run before, i.e., CoVeriTest iterated over all analysis configurations at least once. For cooperation, we only consider the last ARG constructed by the analysis. Since the configuration is not changed during execution of CoVeriTest, the last ARG should contain all reuse information of the analysis’ previous ARGs. Lines 3 and 4 handle the reuse-arg option, which is incompatible with the remaining options. Thus, the last ARG of the analysis is looked up and returned together with the non-restricting condition \(\psi =A_\mathrm {none}\). Lines 5 and 6 handle the reuse-precision option. Like in reuse-arg, the last ARG of the analysis is extracted. However, line 6 only updates the precision to the precision of that last ARG.

Lines 7–11 show the setup of the cooperation between different analyses. Cooperation between different analyses becomes possible after the first analysis run, i.e., the first loop iteration in Alg. 3 finished. Currently, CoVeriTest only considers the last generated ARG. Lines 8 and 9 handle the use-condition option, which extracts a condition that describes the paths explored by the previous ARG. The extraction process has been described in the previous section. Lines 10 and 11 handle the transform-precision option. This option refines the abstraction level (i.e., the precision) of the next analysis with information obtained from the precision used in the last analysis run. The next paragraph describes how precisions can be transformed for different analyses.

3.2 Transformation of analysis precision

CoVeriTest transforms precisions to propagate knowledge about the abstraction level necessary for the reachability analyses from one analysis to another. The incentive of this information exchange is to avoid to rediscover the knowledge about the required abstraction level in expensive refinement steps. However, note that different analyses use different types of precisions. A precise transformation might not always be possible. Furthermore, there does not exist one transformation procedure for all types of precisions, but the procedure depends on the type of input and output precision.

If the format of the input and output precision is the same, the transform method will simply be the identity function. For all other pairs of input and target precision type, we need a dedicated transformation procedure. So far, CoVeriTest only supports one such dedicated transformation procedure that allows the transformation of a set of predicates into a set of tracked variables. The idea of this transformation procedure is that if a variable is used in a predicate, the variable will be somehow relevant for the reachability analysis. Thus, the procedure adds all variables occurring in a predicate to the returned target precision, formally \(\pi _V=\bigcup _{p\in \pi _P}\hbox {var} s(p)\).

For example, consider the following set of predicates \(\pi _P=\{i\ge 0, n<0, s=1\}\), a possible precision of the predicate analysis. Transforming the precision \(\pi _P\) into a set of tracked variables, e.g., a precision for the value analysis, results in \(\pi _V=\{i,n,s\}\).

3.3 Implementation

We integrated our CoVeriTest approach into the software-analysis framework CPAchecker [29]. This framework is highly configurable and provides a large number of different reachability analyses. Due to its support for conditional model checking, CPAchecker also contains an implementation for condition extraction. Furthermore, CPAchecker supports various export formats for counterexamples. Thus, the generation of tests from counterexamples was already available in CPAchecker. To integrate CoVeriTest in CPAchecker, we basically implemented Alg. 2 and an algorithm integrating Alg. 3 and Alg. 4. While we could have used CPAchecker ’s specification format (observer automata) to provide the test-goal specification to the analyses, it is technically quite difficult to adapt CPAchecker ’s observer automaton whenever the set of test goals changes. Thus, we implemented our own updatable observer component to monitor uncovered test goals. Our observer component is a direct implementation of automata like the one shown in Fig. 2. It is integrated as a configurable program analysis [22], CPAchecker ’s interface for an analysis (component). This way it can be easily composed with reachability analyses.

4 Experimental evaluation

We study CoVeriTest cooperation setups using a combination of either explicit-state model checking—named value analysis in the following (Val)—and predicate abstraction (Pred), or bounded model checking (BMC) and symbolic execution (SymExec). The detailed cooperation setups are described later in this section. As test goals, we use branches. Branch coverage is a commonly used coverage criterion that is supported by many test-generation tools and is easy to express as a set of test goals. We evaluate our experiments along the following research questions.

4.1 Research questions to evaluate

In the following, we list the research questions together with brief mentioning of the results that we obtained, which we later describe in more detail.

RQ 1. Time Limits for Val+Pred. We study cooperation setups using Val+Pred, and compare the coverage achieved by cooperation setups that use the same reuse type (i.e., the same configuration of the cooperation options), and thus, only differ in the time limits. Result: The combination Val+Pred performs best if more runtime is assigned to the stronger predicate analysis.

RQ 2. Reuse in Val+Pred. From all sets of cooperation setups using Val+Pred that differ only in the time limits, we select the best and compare these. Result: The combination Val+Pred achieves higher coverage if it reuses own information and does not use conditions for the predicate analysis.

RQ 3. Time Limits for BMC+SymExec. We study cooperation setups using BMC+SymExec, and compare the coverage achieved by cooperation setups that use the same reuse type, and thus, only differ in the time limits. Result: The combination BMC+SymExec performs well if switches between analyses are avoided.

RQ 4. Reuse in BMC+SymExec. From all sets of cooperation setups using BMC+SymExec that differ only in the time limits, we select the best and compare these. Result: The combination BMC+SymExec performs best if the BMC analysis is restricted by a condition.

RQ 5. Best combination. We compare the coverage results of the best cooperation setup of each verifier combination. Result: CoVeriTest performs best using combination Val+Pred.

RQ 6. Cooperation versus single analysis. For each of our analysis combinations, we compare the coverage achieved by the best cooperation setup using that analysis combination with the coverage achieved by one of the analyses of the combination alone. Result: Cooperative test-generation often performs better than a single analysis.

RQ 7. Cooperation versus parallel analyses. For each analysis combination, we compare the coverage achieved by the best cooperation setup for that analysis combination with the coverage achieved when running the analyses of the combination in parallel. Result: An interleaved combination often performs better than a parallel combination.

RQ 8. Cooperation versus other tools. We let the best cooperation setup construct test suites in the same environmental setup as in the International Competition on Software Testing (Test-Comp 2019).Footnote 7 Then, we compare the coverage of CoVeriTest, which is measured by the Test-Comp 2019 validator, with the coverage of the best two test-generation tools from Test-Comp 2019. Result: Cooperative test-generation with CoVeriTest complements existing test-generation tools.

4.2 Experimental setup

We now describe the setup of our experiments, i.e., the cooperation setups of CoVeriTest, the tools and the evaluation tasks, as well as the computing resources.

4.2.1 CoVeriTest cooperation setups

A CoVeriTest cooperation setup consists of (1) a sequence of analysis configurations (each a pair of analysis and time limit, see Sect. 3.1) and (2) one of 10 reuse types (in order to configure Alg. 4 with the respective cooperation options). We restrict our evaluation to the 4 program analyses Val, Pred, BMC, and SymExec.

Value analysis (Val). CPAchecker ’s value analysis [35] explicitly tracks the variable values of all variables in its current precision. Untracked variables are abstracted by any value. To determine which variables to track, the value analysis combines counterexample-guided abstraction refinement (CEGAR) [53] with path-prefix slicing [38] and refinement selection [37]. Value analysis can be efficient if only few variable values need to be tracked. If many different values are assigned to variables (e.g., for loop counters), then huge state spaces might be created.

Predicate analysis (Pred). CPAchecker ’s predicate analysis applies predicate abstraction with adjustable block encoding (ABE) [30]. ABE abstracts at dedicated locations (in our case loop heads) and computes the strongest postcondition at all remaining locations. The precision of the predicate analysis is a set of predicates that is determined with CEGAR [53], lazy refinement [77], and interpolation [74]. The predicate analysis is powerful and often handles loops easily. However, computing abstractions requires expensive SMT solver calls.

Bounded model checking (BMC). CPAchecker ’s bounded model checking iteratively unrolls the CFA up to a given loop bound k while simultaneously encoding the unrolled CFA and the property specification into a formula. To find counterexamples, the satisfiability of the generated formula is checked. CPAchecker ’s BMC formula encoding is based on the unified SMT-based approach for software verification [20]. Furthermore, BMC is enhanced with constant propagation applied to the unrolled CFA to rule out simple, infeasible paths. The loop bound k, which is the precision of BMC, is increased iteratively. BMC is very precise, but it may not terminate in case of unbounded loops and the satisfiability checks can become costly.

Symbolic execution (SymExec). We use \({{{\textsc {SymEx}}}}^+\) [31], an approach that combines symbolic execution [93] and CEGAR [53]. During CEGAR we apply path-prefix slicing [38] and refinement selection [37]. The CEGAR approach determines which variables to track symbolically and which path condition constraints to consider. Tracked variable values and constraints form the precision. Symbolic execution can be efficient if it only tracks few symbolic variables and constraints, but may struggle with loops or many symbolic variables and constraints.

As in our previous work [26], we combine value and predicate analysis. Additionally, we combine BMC and symbolic execution. Note that we neither combine value analysis and symbolic execution nor BMC and predicate analysis because the latter analyses can subsume the first analyses. Furthermore, we do not consider a combination of value analysis and BMC because BMC uses constant propagation already, which is a special case of value analysis. We also excluded a combination of symbolic execution and predicate analysis since it is similar to the combination of value and predicate analysis (predicate analysis tracks the symbolic values already).

To complete the analysis configurations, we need to specify the time limit for each analysis run. We are interested in two questions: (1) Are switches between analyses beneficial for the test coverage and (2) does the coverage benefit from a non-uniform distribution of the time resources, i.e., different analyses get different individual time limits? To this end, we select four time limits (10 s, 50 s, 100 s, 250 s) that are applied to both analyses and trigger switches often, sometimes, or rarely. Furthermore, we apply the two non-uniform time-limit pairs (20 s, 80 s) and (80 s, 20 s). Combining all 6 time-limit pairs with the two analysis combinations, we get 12 analysis configurations.

Fig. 6
figure 6

Reuse types specify the cooperation types for CoVeriTest experiments. We abbreviate the set of cooperation options for analysis \(A_i\) as \(co_i\). We assume that the cooperation setup is for two analyses \(A_i\) and \(A_j\)

In the following, we describe the reuse types that we use in our experiments to configure CoVeriTest ’s cooperation setups. A reuse type specifies the cooperation options for Alg. 4, which prepares the initial ARG (including the precision) and the condition for the next analysis run. The algorithm has access to ARGs from previous analysis runs. Thus, the reuse type defines the cooperation between analysis runs. For the CoVeriTest cooperation setup that we experiment with, we need to specify for both analyses how they use the information from previous ARGs. In our experiments, we use the 10 reuse types (2 of them are parametric) listed in Fig. 6.

The first seven reuse types are supported for all combinations of analyses; the last three types are only supported for Val+Pred, because the precisions of symbolic execution and BMC are not convertible. Additionally, reuse types \({\hbox {cond}}_{A_i}\) and \({\hbox {cond}}_{A_i}{+}\hbox {r}\) are parametric. Analysis \({A_i}\) can be instantiated with either analysis in the combination. Thus, we end up with 12 different reuse types for analysis combination Val+Pred and 9 for analysis combination BMC+SymExec. Combining the 12 analysis configurations (6 per analysis combination) with all compatible reuse-types, we obtain 126  cooperation setups (72 for Val+Pred, 54 for BMC+SymExec).

4.2.2 Tools

CoVeriTest is part of the software-analysis framework CPAchecker. For our experiments, we use version cpachecker-1.8-coveritest-sttt (revision 31 599) of CPAchecker. To compare CoVeriTest with state-of-the-art tools, we use the two best tools from Test-Comp 2019: VeriFuzz [50] and Klee [43]. We use their versions submitted to Test-Comp 2019. Klee uses symbolic execution. VeriFuzz is a test-generation tool based on the fuzzer AFL. To improve on AFL, VeriFuzz applies verification techniques to compute initial inputs and to set the parameters for AFL. For the comparison of CoVeriTest with VeriFuzz, and Klee, we used VeriFuzz ’s and Klee ’s resultsFootnote 8 from Test-Comp 2019 [12],Footnote 9 where the coverage of the test suites was measured using the test-suite validator TestCov [34] in version v1.2,Footnote 10 which is based on gcovFootnote 11 to measure branch coverage.

Note that we need to measure the branch coverage externally (using the original program) for this comparison because due to internal program transformations in CPAchecker, especially splitting of branch conditions, branches considered by CoVeriTest may differ from the actual program branches. Since all CoVeriTest cooperation setups are based on the same tool (CPAchecker), we do not need to measure branch coverage externally when comparing the CoVeriTest cooperation setups. Thus, we compare the number of generated tests when comparing CoVeriTest cooperation setups.

4.2.3 Programs and test goals

The test-generation tools CoVeriTest, Klee, and VeriFuzz generate tests for C programs, and they all participated in Test-Comp 2019. The structural coverage property considered in Test-Comp 2019 is branch coverage. Thus, we use the set of all branches as test goals. To compare these three test tools, we use all 1 720 programs of the Test-Comp 2019 benchmark setFootnote 12 that are used to generate tests for the branch-coverage property. For the comparison of the different CoVeriTest cooperation setups, we extend the benchmark set to all 7 644 programs considered in the software-verification competition SV-COMP 2019. Note that this is only possible because we do not need to execute tests to compare CoVeriTest cooperation setups.

4.2.4 Computing resources

We run our experiments on machines with 33 GB of memory and an Intel Xeon E3-1230 v5 CPU with 8 processing units and a frequency of 3.4 GHz. The underlying operating system is Ubuntu 18.04 with Linux kernel 4.15. We use the same resource limits as in Test-Comp 2019. Each test-generation run may use up to 8 processing units, 15 min of CPU time, and 15 GB of memory. Furthermore, the test-suite execution runs, which are required to compare against the other state-of-the-art test-generation tools Klee and VeriFuzz, are granted only 2 processing units and 7 GB of memory, but 3 h of CPU time. Measuring resources and enforcing limits is done by BenchExec [39].

4.2.5 Availability

All our experimental data are available onlineFootnote 13 [27].

Fig. 7
figure 7

Comparison of relative coverage (number of covered goals divided by maximal number of covered goals) achieved by 72 different CoVeriTest cooperation setups that use a combination Val+Pred. Figure 7a–l compares the relative coverage for cooperation setups using a fixed reuse type and different time limits. Figure 7m compares the best cooperation setup (which is always (20 s, 80 s)) of each reuse type

Fig. 8
figure 8

Comparison of relative coverage (number of covered goals divided by maximal number of covered goals) achieved by 54 different CoVeriTest cooperation setups that use a combination BMC+SymExec. Figure 8a–i compares the relative coverage for cooperation setups using a fixed reuse type and different time limits. Figure 8j compares the best cooperation setup of each reuse type

Fig. 9
figure 9

For both analysis combinations, Val+Pred (first two scatter plotters) and BMC+SymExec (last two scatter plots), we compare the coverage achieved by the respective best CoVeriTest cooperation setup with the coverage achieved when running the single analyses alone

4.3 Experimental results

RQ 1. Time Limits for Val+Pred: Assign More Time to Pred. First, we study CoVeriTest cooperation setups that interleave analyses Val and Pred, looking at the configuration of time limits. Next to the already fixed analysis combination, we also fix the reuse type and compare for each of the 12 reuse type all six CoVeriTest cooperation setups that only differ in their time limits. For comparison, we use relative coverage, which is relative to the highest number of covered goals instead of the total number of test goals. We select this measure because it allows a better comparison of the approaches, especially when many test goals are either unreachable or not covered by any of the approaches. To compute the relative coverage for a set of cooperation setups, one extracts per task and cooperation setup the achieved number of covered goals and divides it by the maximum number of covered goals extracted for that task. Figure 7 shows box plots for each reuse type. The box plots show the distribution of the relative coverage. The closer the bottom border of a box is to one, the higher is the coverage achieved. For all 12 reuse types, the last box plot has the bottom border closest to one. These box plots represent cooperation setups that use a time limit of 20 s for Val and 80 s for Pred in each round.

RQ 2. Reuse in Val+Pred: Use Own Information But No Condition for Pred. Up to now, we found out how to configure time limits for CoVeriTest with Val and Pred. Now, we look into the configuration of the reuse type. To this end, we fix the time limit to (20 s, 80 s), the time limit that performed best for each reuse type, and compare the relative coverage of the resulting 12 cooperation setups. Figure 7m shows box plots of the distribution of the relative coverage, which is relative to the highest number of covered goals. Since the highest number of covered goals depends on the compared cooperation setups, boxes in Fig. 7m can be significantly larger than in the respective figure of the reuse-type. The first five box plots in Fig. 7m show all cooperation setups that do not reuse own information. The sixth to eighth box plots show all cooperation setups that reuse own information, but in which Pred uses conditions. The ninth to twelfth box plots show those cooperation setups that reuse own information and do not use conditions for Pred. We observe that these last four box plots are smaller than the remaining box plots and their bottom borders are closer to one. Looking into our raw data, we found out that the best cooperation setup only reuses the ARG.

RQ 3. Time limits for BMC+SymExec: Switch Rarely. Next, we consider CoVeriTest cooperation setups that interleave BMC and SymExec. Again, we first look at the configuration of time limits. As before, we fix the reuse type and compare for each of the 9 reuse types all 6 CoVeriTest cooperation setups that only differ in their time limits. Figure 8 shows box plots for each reuse type. These box plots show the distribution of the relative coverage. For all 9 reuse types, the box plot that shows the time limit configuration (250 s, 250 s) has a bottom border close to one, but not necessarily the closest. Switching rarely is a good choice, but not necessarily the best. This is also supported when comparing BMC and SymExec for test generation in isolation (not in CoVeriTest). Both analyses achieve about the same coverage for one third of the tasks, for one third BMC performs better, and for another third SymExec is best. Nevertheless, when using the condition constructed from the ARG of the SymExec (reuse type \({\texttt {cond}}_s\), cond, \({\texttt {cond}}_s{+} {\texttt {r}}\), and cond+r) assigning more time to BMC than to SymExec is typically significantly better. A possible explanation is that the condition generated by SymExec might prevent BMC from merging at join points, which makes BMC inefficient. In contrast, if using cooperation option reuse-precision or reuse-arg, it is best to assign more time to SymExec than to BMC. The reason might be that BMC reuses only the loop bound k while SymExec reuses much more information, namely which variables and constraints to track.

RQ 4. Reuse in BMC+SymExec: Use Conditions from SymExec. So far, we learned how to configure time limits for different reuse types of BMC and SymExec. Next, we want to find out how to configure the reuse type. For each reuse type, we select from the six available cooperation setups the one that performed best. Again, we use the relative coverage for comparison, which depends on the compared cooperation setups. Therefore, the relative coverage of a cooperation setup varies when computed for different sets of cooperation setups. Figure 8j shows the box plots of the distributions of the relative coverage. Only the last four box plots show cooperation setups that use conditions constructed from the ARGs of SymExec. Since the last four, especially the last three, boxes are smaller than the first five boxes and their bottom borders are closer to one, we conclude that the last four cooperation setups achieve higher coverage. The best cooperation setup (\({\texttt {cond}}_s{+} {\texttt {r}}\)) reuses own information and restricts BMC to paths not yet explored by SymExec.

RQ 5. Best Combination: Val+Pred is Best. Our goal is to find out which of our analysis pairs performs best. To this end, we compare the coverage, i.e., number of covered goals divided by number of total goals, achieved by the best CoVeriTest cooperation setup of each analysis combination that we evaluated. Figure 10 shows a scatter plot that compares the coverage achieved by the best cooperation setup for BMC+SymExec (x-axis) with the coverage achieved by the best cooperation setup for Val+Pred. Note that we excluded those programs from the scatter plots, for which we miss the number of covered goals for at least one test generator, e.g., due to timeout of the analysis. Looking at the scatter plot, we observe that more data points are in the upper left half, i.e., the CoVeriTest cooperation setup interleaving Val+Pred often performs better. Indeed, the combination Val+Pred achieves a higher coverage for about 40% of the programs, and both combinations achieve the same coverage for another 40 % of the programs.

Fig. 10
figure 10

Comparison of the coverage achieved by the best CoVeriTest cooperation setup for the two analysis combinations used by CoVeriTest

RQ 6. Cooperation Versus Single Analysis: Combination Better than Single Analysis. To find out whether CoVeriTest benefits from interleaving, we take the best CoVeriTest cooperation setup for each analysis combination (Val+Pred and BMC+SymExec) and compare it against the analyses of the combination. Each single analysis is granted the same time limit for test generation as the CoVeriTest cooperation setup. Figure 9 shows four scatter plots. Each scatter plot compares the coverage achieved by the respective best CoVeriTest cooperation setup (x-axis) with the coverage achieved by one of the CoVeriTest analyses alone. Again, we excluded those programs from the scatter plots, for which we miss the number of covered goals for at least one test generator. Looking at the scatter plots, we see that in the last three scatter plots most of the points are in the lower right half. Thus, the CoVeriTest cooperation setup often achieves a higher coverage than the respective single analysis. The second scatter plot, which compares CoVeriTest using Val+Pred with the predicate analysis, is more diverse. About 53 % of the points are on the diagonal, i.e., both test generators achieved the same coverage. The predicate analysis achieves higher coverage for about 21 % of the programs (upper left half), while CoVeriTest performs better for 26 % of the programs. CoVeriTest is especially beneficial for programs that only need few variable values to trigger the branches, like ssh programs or programs from the product-lines subcategory.

RQ 7. Cooperation versus Parallel Analyses: Better Interleave than Parallelize. For both analysis combinations, Fig. 11 compares the coverage of the respective best CoVeriTest cooperation setup (x-axis) with the coverage of a test generator running a parallel combination of the same analyses.Footnote 14 As before, the scatter plots in Fig. 11 do not contain data points for which we miss the coverage value for one of the test generators. Looking at the scatter plots, we observe that many points (58 % and 64 %, respectively) are on the diagonal, i.e., the two test generators achieved the same coverage. Furthermore, CoVeriTest performs better in about 33 % (lower right half of the left scatter plot) and 20 % of the programs. In total, CoVeriTest achieves more often a better coverage than the parallel test generator and, thus, should be preferred over the parallel test generator. This is no surprise because parallelization evenly distributes the CPU time among the analyses, while we learned from previous experiments (e.g., RQ 1) that CoVeriTest cooperation setups often perform better with uneven runtime distribution.

Fig. 11
figure 11

Comparison of the coverage achieved by CoVeriTest ’s best cooperation setup using a combination of a combination Val+Pred (left) and BMC+SymExec (right) with the coverage achieved by the parallel combination of the respective analyses

Fig. 12
figure 12

Comparison of the branch coverage achieved by CoVeriTest (best cooperation setup) with the branch coverage achieved by the existing state-of-the-art test generation tools VeriFuzz (left) and Klee (right)

RQ 8. Cooperation versus Other Tools: CoVeriTests Cooperation Complements Well We compare the best CoVeriTest cooperation setup with the two best tools of Test-Comp 2019 [12]: VeriFuzz and Klee. As already mentioned, we compare the branch coverage achieved by the respective tools, which was measured by the test-suite validator TestCov [34]. Figure 12 shows two scatter plots. Each scatter plot compares the branch coverage achieved by CoVeriTest with the branch coverage achieved by one of the other tools. Points in the lower right half indicate that CoVeriTest achieved higher coverage. Both scatter plots contain points in both halves.

In concrete numbers, CoVeriTest achieves higher coverage than VeriFuzz or Klee for about 15 % of the programs (257 and 246 programs, respectively). In contrast, VeriFuzz and Klee achieve higher coverage for about 77 % of the programs (1298 and 1022 programs, respectively). Thus, there exist programs for which CoVeriTest performs better and vice versa. For example, CoVeriTest is often better on tasks of the subcategory sequentialized. However, CoVeriTest has problems with tasks from categories array and ECA. We already know from verification that CPAchecker sometimes lacks refinement support for tasks with arrays. The problem with the ECA tasks is that these tasks contain a loop in which the feasibility of execution paths heavily depends on specific variable values, that these variables are initialized with concrete values, and that the values of these variables are changed in each loop iteration. Thus, it may take several loop iterations to reach certain branches. While we know from verification that the value analysis performs better on ECA tasks than the predicate analysis, which has to apply CEGAR to learn about the specific variable values, the CoVeriTest cooperation setup used for comparison grants the predicate analysis more time. Summing up, CoVeriTest is not always best, but it is also not dominated—CoVeriTest complements the existing approaches.

4.4 Threats to validity

Our results might not generalize for several reasons. We evaluated CoVeriTest on programs of a diverse and well-established benchmark set, which consists not only of verification tasks from real-world applications but also contains generated programs. CPAchecker ’s analyses are well trained on this benchmark set, and CoVeriTest may take an advantage of these benchmark programs. However, the training is performed w.r.t. a different property: reachability of a function call instead of reachability of branches (the test goals). Furthermore, our CoVeriTest cooperation setups use two combinations of verifiers. We already observed that some conclusions hold for one combination, but not for the other. Our results might not carry over if using CoVeriTest with a different set of verifiers.

The coverage results might be imprecise. The comparison of CoVeriTest with VeriFuzz and Klee relies on the coverage values reported by the test-suite validator TestCov [34]. Due to bugs, TestCov might report wrong coverage numbers. However, TestCov was used in Test-Comp 2019 and in other research projects, and thus, we trust it. Furthermore, it is based on the well-established coverage-measurement tool gcov. Therefore, severe bugs are unlikely. For the remaining comparisons, we relied on the number of covered goals reported by CoVeriTest. While in principle invalid counterexamples could be used to cover test goals, we assume this is unlikely. The analyses used by CoVeriTest are executed in the SV-COMP configuration of CPAchecker or use a CEGAR approach. In both cases, they try hard to avoid reporting false results. Another problem is that whenever CPAchecker does not output statistics (due to timeout, out of memory, etc.), we use the last number of covered goals reported in the log. However, this might be an underapproximation of the number of covered goals. All these problems do not occur in the comparison of CoVeriTest with existing test-generation tools. Thus, this comparison still supports the value of CoVeriTest.

5 Related work

Different reachability analyses take turns in CoVeriTest to generate tests for C programs. To enable cooperation among analyses, CoVeriTest reuses different types of information from ARGs constructed by previous analysis runs.

5.1 Testing with verifiers

There is a survey on test-case generation with model checkers [63], but most approaches discussed in the survey rely on formal models instead of programs. However, also some model checkers for software support test generation. Blast [13] applies predicate abstraction to generate a test for each program location that can be reached with a state fulfilling a target predicate p. For test generation, Blast uses predicate abstraction. FShell [79,80,81] and CPA/Tiger [25] generate tests for a coverage criterion specified in the FShell query language (FQL) [81]. Both transform the FQL specification into a set of test-goal automata and check for each automaton whether its final state can be reached. FShell uses CBMC [55] to answer those reachability queries, and CPA/Tiger uses CPAchecker ’s predicate abstraction. PathFinder [123] can generate tests with explicit-state model checking or symbolic execution. Generally, test-case generation with symbolic execution [93] has received lots of interest [44, 105]. Moreover, conditional testing [33] proposes a template construction that builds an automatic test-case generator from an arbitrary verifier that can produce violation witnesses [17]—a standard exchange format for counterexamples, which is supported by many verifiers. Basically, the template combines the verifier with a transformer [18] that generates tests from witnesses.

5.2 Combined approaches for testing and verification

Combinations used for testing and verification can be classified into parallel, sequential, selective, nested, or interleaved combinations.

Parallel combinations. Portfolio combination approaches [72, 78, 82, 102] run different, independent configurations in parallel. A second class of approaches [19, 24, 56, 68, 69] runs different approaches in parallel while letting them interoperate, e.g., exchange information. A third class splits the search space [21, 23, 52, 100, 116, 117, 119], e.g., program executions, among different workers. Workers often apply the same approach, but to different parts of the search space.

Sequential combinations. Also, sequential approaches may split the search space between different approaches [23, 28, 33, 51, 59, 88]. Typically, the subsequent approach is restricted to the search space not considered by the previous approaches. Like CoVeriTest, conditional model checking [23, 28] uses a condition to restrict the search space. The condition is constructed from an ARG when a verifier gives up. Conditional testing [33] and CoVeriTest exchange information about covered test goals. Conditional testing uses reducers and extractors to exchange this information between arbitrary test tools. Evacon [83] combines symbolic execution and search-based testing and transfers the generated tests from one approach to the other. Further sequential combinations testify the result of the previous approach [48, 57, 66, 97, 107].

Selective combinations. Selective combination approaches [4, 15, 58, 61, 65, 91, 109, 121] perform algorithm selection [108]. They use certain features of a verification or test task to choose the best approach for the particular task.

Nested combinations. Nested combinations use another approach as one component of the main approach. CBST [113] uses symbolic execution to compute the initial population for search-based testing. EvoSuite [64] uses concolic execution to compute some of the new individuals in search-based testing. EvoSE [7] uses concolic execution, and some others [6] apply symbolic execution during fitness computation of individuals in search-based testing. VeriFuzz [50] applies verification techniques to compute initial inputs and to set the parameters for the fuzzer AFL.

Interleaved combinations. Interleaved combinations alternate different approaches. For example, the verifiers UFO [2] and SMASH [70] alternate underapproximation with overapproximation, while SYNERGY [73], DASH [8] and others [128] alternate test generation and proof construction to (dis)prove a property. Klee [43] alternates different exploration strategies. Hybrid concolic testing [98] interleaves random testing and symbolic execution. When random testing does not make progress, symbolic execution is started from the current state. Symbolic execution stops as soon as it covers a new goal and provides the input for covering the goal to random testing. Similarly, Driller [120] and Badger [103] alternate fuzzing with concolic execution. However, they exchange inputs when changing from one analysis to the other. Alternating different approaches [92, 125] can also augment test suites. Abstraction-driven concolic testing [60] interleaves concolic execution and predicate analysis. It inspired us to work on CoVeriTest, which is designed as a generalization of abstraction-driven concolic testing, in order to explore more such combinations.

Conditional testing. The concept of conditional testing [33] explains how testing approaches can be combined such that approaches with different strengths can contribute to the test suite, and thus, increase the coverage. From a conceptual viewpoint, CoVeriTest is an instance of conditional testing: CoVeriTest and conditional testing maintain a set of test goals to book-keep what work is done already and what is still left to do (Fig. 3 [33] explains the passing of test-goal sets for conditional testers). CoVeriTest ’s sequence of analysis runs and analysis cycles (Fig. 5 and Alg. 3) can be expressed as sequential tester (Fig. 8 [33]) and cyclic tester (Fig. 9 [33]), respectively. The standard Alg. 2 for generating test cases using reachability analyses can be expressed as a combination of a verifier-based tester and a cyclic conditional tester, as described in Figs. 13 and 14 [33]. On top of the above features, CoVeriTest supports cooperation setups in which not only test-goal sets but also ARGs, condition automata, and abstraction precisions are exchanged between different analyses (ARGs can be huge in size).

5.3 Reusing information from state-space exploration

Information from state-space explorations has been reused in different context like, e.g., validation of verification results or incremental verification.

Validating results. Validation approaches use information provided by the verification to check the verification result. Many verifiers [11] construct verification witnesses [16, 17] from the explored state space. To check correctness results, several proof-carrying code approaches provide (partial) state-space information [3, 10, 86, 110], transform the state space into verification conditions [9, 46, 75, 114], or transform the program into an easier verifiable program [87].

Incremental verification. The goal of incremental verification is to use information from a previous verification to reverify a program after the program or property changed. Some approaches [106, 126, 127] use the state-space information to skip the verification of unmodified program parts. Other approaches reuse the solutions of constraint or SAT proofs [5, 90, 101, 122]. Precision reuse [36] and trace-abstraction reuse [111] reuse information on the abstraction level. Other types of approaches [25, 40, 45, 76, 112, 115, 118] adapt the explored state space to the change. Extreme model checking [76] and CPA/Tiger [25] adapt ARGs. Extreme model checking [76] reuses still valid ARG parts and reexplores invalid ARG subgraphs. CPA/Tiger [25] transforms an ARG that was constructed for one test goal such that it fits to a new test goal. Lazy abstraction refinement [77] adapts an ARG to continue exploration after abstraction refinement. CoVeriTest continues the exploration of the ARG, but does not need to adapt it. Furthermore, it integrates the idea of precision reuse and some of the analyses in CoVeriTest apply lazy abstraction refinement.

6 Conclusion

Software quality assurance is an important aspect in software development. Testing is a standard means for quality assurance, but state-of-the-art techniques have difficulties in covering sophisticated branching conditions [32]. Analyses that are designed to check reachability properties are well suited for this task because they only need to check the reachability of such a branching condition and generate a test if the branch condition is reachable. Nevertheless, for each technique there exist reachability queries (i.e., branch conditions) on which the technique is inefficient or fails in practice. To overcome this limitation, we propose CoVeriTest, which interleaves different reachability analyses to generate tests. In our experiments, we study various CoVeriTest cooperation setups that differ in the used analyses, the time limits of the analyses, and the information exchanged between analysis runs. CoVeriTest works best when interleaving value and predicate analysis, letting them resume their exploration, restricting the information exchange between them to covered test goals, and assigning the more mature predicate analysis a larger time limit. Furthermore, a comparison of CoVeriTest with (a) the analyses used by CoVeriTest and (b) state-of-the-art test-generation tools show the benefits of our CoVeriTest approach.

Future Work. Currently, not all reuse options are always available. Precision transformation is only available from predicate to value analysis. It is promising to develop such transformations for other combinations as well. Furthermore, not all options can be freely combined. It would be interesting to investigate how to automatically detect unavailable options. One question is, e.g., how to adapt the ARG to a new condition.

Another future direction focuses on a better understanding of CoVeriTest, i.e., when and in which cooperation setup to use CoVeriTest. Therefore, one could study the influence of program and analysis characteristics on the performance of CoVeriTest.