Introduction

Software test case generation is well-recognized as one of the necessary prerequisites for improving software quality in practice [1]. Bugs may exist in statements guarded by a specific condition such as a loop. For example, existing studies show that there are at least 24 loop bugs in the ArduPilot system [2], 331 loop bugs in Eclipse [3], 137 loop bugs in ArgoUML [4], and 235 loop bugs in MegaMek [5]. Figure 1 shows a code snippet of the loop program egcd-ll_valuebound1.c in the nla-digbench-scaling category of verification tasks from the competition on software verification 2021 (SV-COMP’21) [6]. This program has test oracles [7] expressed as assertions. A bug is considered to be exposed when the negation of an asserted condition is triggered. Intuitively, the goal of an automated test case generator is to find an input that triggers the negated condition of an assertion (i.e., exposes a bug). Unfortunately, this is a difficult task, especially when bugs can only be triggered after an interleaving execution among the branches. Furthermore, the number of loop paths grows exponentially as the iteration count increases.

Three categories of techniques are mainly employed to generate test cases in loops, namely loop summarization, search-based heuristics, and bounded iteration [8]. Loop summarization [9,10,11] infers the relationship between loop variables and iteration counts as a set of formulas. Constraint solvers are then applied to those formulas to reach a test goal [8]. However, the general applicability of loop summarization faces some challenges. For instance, loop summarization is not suitable for loops that contain non-induction variables [12]. Another problem of this technique is that it suffers from high analysis cost [8]. Thus, it is challenging to extend it to loop programs with complex interleaving relationships. Search-based heuristics optimize test cases to reach coverage targets [8, 13]. In those approaches [8, 13], a predefined test goal inside or outside the loop body is initially identified. A fitness function is employed to guide the optimized test cases toward an optimal solution. Those approaches [8, 13] typically explore a few paths leading to the predefined test goals to avoid exploring all possible interleaving patterns (i.e., paths) in the loop body. However, bugs may be contained in any potential loop path. Thus, it is necessary to explore all possible interleaving patterns. In bounded iteration, an iteration limit is set to prevent iterating the loop multiple times without reaching any new coverage targets [14,15,16,17,18,19]. Some researchers found that combining search-based heuristics with bounded iteration is well suited for generating test cases in loops [8].

Fig. 1 Example of a loop program

A large number of existing works [20,21,22] use bounded iteration techniques with either symbolic execution or search-based methods to generate test cases for covering the targets (e.g., paths) in loops. Symbolic execution leverages constraint solvers to generate test cases which can systematically explore program paths. In contrast to symbolic execution, search-based methods use evolutionary algorithms, such as genetic algorithms (GAs) [23, 24], to generate test cases covering multi-path loops. Search-based methods require a numerical representation of the coverage target (called fitness function) to guide the search toward the coverage target. A plethora of research effort in evolutionary algorithms has been devoted to satisfying structural coverage criteria such as statement, branch and path coverage of a bounded depth. Some researchers attempt to address automated test case generation based on path coverage (ATCG-PC) in the programs with loops [21, 22, 25,26,27]. It is noted that the existing works [21, 22, 25,26,27,28,29] either neglect the interleaving of paths in loops or are only tailored to address ATCG-PC in unnested loops. Thus, it is a difficult task to address loops with complex interleaving relationships among the paths, such as nested loops. In this paper, we focus on finding bugs in both nested and unnested multi-path loops based on path coverage by exploring all possible paths.

Notably, the aforementioned works [20, 28, 29] implement either single-target (i.e., one-path-at-a-time) [25,26,27,28] or multi-target approaches [21, 29] to address ATCG-PC. In single-target approaches, evolutionary algorithms are applied to optimize a single coverage target (e.g., a loop path) at a time. However, recent studies [30, 31] have shown that there are certain drawbacks to applying single-target approaches for test case generation. First, some coverage targets may be infeasible. Second, some targets are difficult to cover. Thus, single-target approaches may consume a significant amount of the allotted time on those targets [30, 31]. In multi-target approaches, evolutionary algorithms are applied to optimize all coverage targets simultaneously. Some recent works [32, 33] have shown that multi-target approaches are more effective and efficient than single-target approaches in programs with many coverage targets. In loop programs, the number of paths can often reach hundreds of thousands. As a result, many-objective evolutionary algorithms are a better choice for ATCG-PC in multi-path loops.

Since multi-path loops contain a significant number of paths, exploring all possible paths to improve bug finding may not be feasible within a limited search time budget. The large number of objectives to optimize results in an enormous amount of time-consuming fitness evaluations and comparisons of individuals in terms of their objective values. Ideally, in a many-objective evolutionary algorithm, an individual (i.e., test case) is evaluated on more than one path. In the context of finding bugs in multi-path loops based on path coverage, we find that many paths are similar. These latent similarities can be exploited to find solutions quickly within the given search budget.

In this paper, we propose a many-objective test case generation framework called generalized differential evolution 3 based on knowledge transfer (KT-GDE3) to improve bug finding capability in multi-path loops. KT-GDE3 is designed to explore many paths to find bugs in multi-path loops. Our strategy employs a knowledge transfer scheme that leverages the latent similarities among groups of loop paths to enhance the optimization process. Specifically, similar loop paths are organized into groups based on the leveraged common path prefix information, while related groups are assigned to the same neighbor set. Based on the created neighbor sets, our approach samples test cases that cover previous loop paths in groups from the same neighbor set. The archived test cases in the same neighbor set are incorporated into the initial population to find solutions for loop paths of another group in this set. We assume that the knowledge transfer scheme is essential for enhancing the optimization process of many-objective algorithms such as generalized differential evolution 3 (GDE3) to find bugs in multi-path loops.

To evaluate the effectiveness of KT-GDE3 in triggering bugs in multi-path loops, we select the loops and reach-safety benchmarks from the competition on software verification 2016 and 2021 (SV-COMP’16 and SV-COMP’21) [6, 34], respectively. The experimental results show that KT-GDE3 has higher bug finding capability in multi-path loops than existing state-of-the-art test case generation algorithms (DynaMOSA [31], WTS [30], MISA [35] and DE-SS [27]). The experimental results also indicate that the proposed knowledge transfer scheme can significantly improve the bug finding capability of algorithms such as GDE3 [36] in multi-path loops.

The main contributions of our work are summarized as follows:

  1. We propose a group-based fitness evaluation specifically designed for many-objective search in the context of generating test cases to cover bug-inducing paths. The approach we employ capitalizes on the similarities among loop paths to effectively organize them into groups. By grouping similar paths, redundant fitness evaluations can be minimized, leading to enhanced efficiency.

  2. We propose a many-objective test case generation framework tailored to find bugs in multi-path loops. The proposed framework transfers knowledge from previously covered loop paths to cover the remaining loop paths. The knowledge transfer scheme can enhance the bug finding capability of multi-objective algorithms in both nested and unnested loops.

Basic concepts and related works

In this section, we present an overview of the basic definitions related to multi-path loops, the categories of bugs in multi-path loops, and the related works.

Fig. 2 Example of an unnested loop and its CFG

Fig. 3 Example of a nested loop and its CFG

Basic definitions

  • Definition 1 (Control flow graph) A control flow graph (CFG) of a loop is a tuple G = (V, E, \(V_{pre}\), \(V_{h}\), \(V_{e}\)), where (V, E) is a finite directed graph; V is a set of nodes, and each node in V is a code basic block. \(E \subseteq V\times V\) is the set of directed edges connecting the nodes. \(V_{pre} \subset V\) is the pre-header from which the loop enters into the entry node. \(V_{h}\subset V\) is a set of loop header nodes. \(V_{e}\subset V\) is a set of exit nodes.

  • Each edge \(e = (v, v') \in E\) represents a transfer of control flow from node \(v \in V\) to \(v' \in V\). Intuitively, node \(v'\) can be executed after node v.

  • \(v \in V\) dominates node \(v' \in V\) if every execution path from the pre-header \(V_{pre}\) to node \(v'\) passes through v.

  • \(v \in V_{h}\) is a loop header node if there is an edge \((v', v) \in E\) and v dominates \(v'\).

  • \(v \in V_{e}\) is an exit block if there is an edge \((v', v) \in E\) where \(v'\) is in the loop and v is not in the loop.

Figures 2 and 3 show the instances of unnested and nested loops, respectively. The pre-header nodes are highlighted in blue. The loop header nodes and exit nodes in the two instances are highlighted in red and green, respectively.
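
The dominance, loop-header, and exit-node conditions of Definition 1 can be checked mechanically. The following is a minimal sketch in Python, assuming a small hypothetical CFG (the node names `pre`, `h`, `a`, `b`, and `x` are illustrative and are not taken from Figs. 2 or 3) and the standard iterative data-flow computation of dominator sets; the function names are likewise hypothetical.

```python
def dominators(nodes, edges, entry):
    """Iteratively compute dom[v]: the set of nodes that dominate v."""
    preds = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
    # Start from the full node set and shrink until a fixed point is reached.
    dom = {v: set(nodes) for v in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for v in nodes:
            if v == entry:
                continue
            incoming = [dom[p] for p in preds[v]]
            new = {v} | (set.intersection(*incoming) if incoming else set())
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom


def loop_headers(edges, dom):
    """v is a loop header if an edge (v', v) exists and v dominates v'."""
    return {v for (u, v) in edges if v in dom[u]}


def exit_nodes(edges, loop_body):
    """v is an exit node if an edge leads from a node inside the loop to v outside it."""
    return {v for (u, v) in edges if u in loop_body and v not in loop_body}


# Hypothetical CFG: pre -> h; h -> a | b | x; a -> h; b -> h (x lies outside the loop).
NODES = ["pre", "h", "a", "b", "x"]
EDGES = [("pre", "h"), ("h", "a"), ("h", "b"), ("a", "h"), ("b", "h"), ("h", "x")]

DOM = dominators(NODES, EDGES, "pre")
HEADERS = loop_headers(EDGES, DOM)
EXITS = exit_nodes(EDGES, loop_body={"h", "a", "b"})
```

In this toy graph, the back edges from `a` and `b` to `h` make `h` the only loop header, and `x` is the only exit node.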

  • Definition 2 (Simple route) Given a CFG G = \((V, E, V_{pre}, V_{h}, V_{e})\) of a loop, a simple route denoted by \(\sigma \) is a sequence of edges \(v_{0}\overset{e_{0}}{\rightarrow }v_{1}\overset{e_{1}}{\rightarrow }\ldots \overset{e_{n-1}}{\rightarrow }v_{n}\). The node \(v_{0} \in V_{h} \cup V_{pre}\) is the head of \(\sigma \); when \(v_{0} \in V_{h}\), it is the loop header node which dominates all nodes in the CFG. Each \(v_{i} \notin V_{h} \cup V_{e}\) (\(1 \le i < n\)) is a middle node of \(\sigma \). \(v_{n} \in V_{h} \cup V_{e}\) is the tail of \(\sigma \).

Intuitively, a loop can have multiple simple routes. We denote the set of possible simple routes in a CFG as \(S = \{\sigma _{1},\ldots ,\sigma _{m}\}\). A simple route is a head-to-head route if its head and tail are the same (i.e., head of \(\sigma \) = tail of \(\sigma \)). Otherwise, it is a head-to-exit route. For example, the CFG of the nested loop program in Fig. 3b has the following head-to-head routes.

  • \(\sigma _{1}= b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j < n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j >= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\)

  • \(\sigma _{2}= b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]<= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j] <= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j >= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\)

  • \(\sigma _{3}= b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]<= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j < n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j >= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\)

  • \(\sigma _{4} = b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j < n}{\longrightarrow }f\overset{t[j]<= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j >= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\)

The head-to-exit route in the nested loop is \(\sigma _{5} = b\overset{m >= k}{\longrightarrow }d\).

  • Definition 3 (Loop path) Given a CFG G = \((V, E, V_{pre}, V_{h}, V_{e})\), a loop path is a sequence of edges \(v_{0}\overset{e_{0}}{\rightarrow }v_{1}\overset{e_{1}}{\rightarrow }\ldots \overset{e_{n-1}}{\rightarrow }v_{n}\) representing one possible way to execute the loop for a bounded number of iterations. \(v_{0} \in V_{h} \cup V_{pre}\) is the head of the loop path; when \(v_{0} \in V_{h}\), it is the loop header node which dominates all nodes in the CFG. \(v_{n} \in V_{h} \cup V_{e}\) is the tail.

Conceptually, a loop path consists of an interleaving of head-to-head routes and head-to-exit routes. Head-to-head routes can appear multiple times in a loop path, while a head-to-exit route can appear only once. For example, let the iteration bound of the nested loop in Fig. 3a be equal to 2. One possible head-to-head route is \(\sigma _{1}= b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j < n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow }h\overset{T_{4}}{\longrightarrow }e\overset{j >= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\). Another possible head-to-head route is \(\sigma _{2} = b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]<= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j] <= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j >= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\). The interleaving pattern \(\sigma _{1}\sigma _{2}\sigma _{1}\), followed by the head-to-exit route, represents a possible loop path of the nested program. This loop path is the following sequence of edges:
\(b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j>= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]<= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]<= r}{\longrightarrow } i\overset{T_{5}}{\longrightarrow }e\overset{j>= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\overset{m< k}{\longrightarrow }c\overset{T_{2}}{\longrightarrow }e\overset{j< n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j < n}{\longrightarrow }f\overset{t[j]> r}{\longrightarrow } h\overset{T_{4}}{\longrightarrow }e\overset{j>= n}{\longrightarrow }g\overset{T_{3}}{\longrightarrow }b\overset{m >= k}{\longrightarrow }d\).
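
To make the interleaving concrete, the sketch below enumerates loop paths as sequences of head-to-head routes terminated by a single head-to-exit route. It is an illustration only: routes are represented by plain string labels standing for the routes listed above, path feasibility is ignored, and the function name `enumerate_loop_paths` and the bound `max_iterations` are hypothetical.

```python
from itertools import product


def enumerate_loop_paths(head_to_head, head_to_exit, max_iterations):
    """Yield every loop path with at most max_iterations head-to-head routes.

    A loop path is a tuple of head-to-head routes (possibly repeated)
    terminated by exactly one head-to-exit route. Feasibility is not checked.
    """
    for k in range(max_iterations + 1):
        for body in product(head_to_head, repeat=k):
            for exit_route in head_to_exit:
                yield body + (exit_route,)


# Labels are placeholders for the head-to-head routes and the head-to-exit route.
H2H = ["s1", "s2", "s3", "s4"]
H2E = ["s_exit"]
PATHS = list(enumerate_loop_paths(H2H, H2E, max_iterations=3))
```

With four head-to-head routes and at most three iterations, the enumeration already contains 1 + 4 + 16 + 64 = 85 candidate loop paths, among them the pattern `("s1", "s2", "s1", "s_exit")` discussed above.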

Let \(L = \{\pi _{1},\ldots ,\pi _{m}\}\) be the set of loop paths in the loop program. The loop path \(\pi _{j}\), where \(0 < j \le m\), is the j-th loop path. The length of \(\pi _{j}\), denoted as \(|\pi _{j}|\), is the number of simple routes in \(\pi _{j}\). \(X = \{x_{1},\ldots , x_{n}\}\) is a set of candidate test cases. The loop path traversed by test case x is denoted as p(x). Similarly, the length of the loop path traversed by a test case is denoted as |p(x)|. Let \(F(\pi _{i},\pi _{j})\) be the function that calculates the number of consecutive matching simple routes between \(\pi _{i}\) and \(\pi _{j}\). If \(\dfrac{|F(p(x),\pi _{j})|}{|\pi _{j}|} = 1\), test case x covers loop path \(\pi _{j}\).
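
The coverage condition can be sketched as follows. The text does not fix how matching routes are counted, so this sketch assumes one possible reading: F is the length of the longest common prefix of the two route sequences. The function names and the string route labels are hypothetical.

```python
def matching_prefix_length(p_x, pi_j):
    """F(p(x), pi_j), read here as the number of leading simple routes on which both paths agree."""
    count = 0
    for a, b in zip(p_x, pi_j):
        if a != b:
            break
        count += 1
    return count


def covers(p_x, pi_j):
    """Test case x covers pi_j when F(p(x), pi_j) / |pi_j| equals 1."""
    return matching_prefix_length(p_x, pi_j) / len(pi_j) == 1


TARGET = ["s1", "s2", "s1", "s_exit"]   # target loop path pi_j
HIT = ["s1", "s2", "s1", "s_exit"]      # p(x) of a covering test case
MISS = ["s1", "s1", "s1", "s_exit"]     # p(x) that diverges at the second route
```

Here `covers(HIT, TARGET)` holds, while `MISS` agrees with the target on the first route only and therefore does not cover it.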

Table 1 Three examples of loop bug categories

Loop bug categories

Bugs in multi-path loops can be classified into three categories, i.e., bugs inside the loop body, bugs outside the loop body, and bugs existing both inside and outside the loop body. Table 1 shows examples for the three categories of loop bugs. In the following, we discuss the interleaving execution orders that trigger bugs in the loop programs.

Type 1: Bugs inside the loop body

This kind of bug is shown on Line 10 of Prog_1. The bug on Line 10 (i.e., z/j) results in a division-by-zero bug after an interleaving execution among the branches. Intuitively, this bug can be detected if j is 0 on Line 10. One possible way to reveal this bug is by generating a test case that executes the loop in the following interleaving order: Line 5 \({\rightarrow }\) Line 14 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 14 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 8 \({\rightarrow }\) Line 9 \({\rightarrow }\) Line 10 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 8 \({\rightarrow }\) Line 9 \({\rightarrow }\) Line 10.

Type 2: Bugs outside the loop body

This bug can be witnessed on Line 15 in Prog_2. In particular, we assume that the bug is revealed when \(m == -7\). We discover two possible interleaving execution orders that can trigger the assert statement on Line 15. The first possible execution order to trigger the bug is: Line 5 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 8 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 15. Another execution order is: Line 5 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 11 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 8 \({\rightarrow }\) Line 5 \({\rightarrow }\) Line 15.

Type 3: Bugs inside and outside the loop body

A loop program may contain bugs that can be triggered inside and outside its loop body. For example, Prog_3 in Table 1 has two assert statements on Line 5 and Line 15, respectively. We assume that the bug inside the loop can be witnessed when the negated condition of the assert statement on Line 5 is triggered. Similarly, another bug can be revealed when the negated condition of the assert statement on Line 15 is triggered. The bug inside the loop body can be triggered after the following execution order: Line 4 \({\rightarrow }\) Line 14 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 14 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 14 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 6 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 6 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 6 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 13 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 5. The bug outside the loop can be triggered after the following execution order: Line 4 \({\rightarrow }\) Line 6 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 6 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 6 \({\rightarrow }\) Line 7 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 12 \({\rightarrow }\) Line 13 \({\rightarrow }\) Line 4 \({\rightarrow }\) Line 15.

Fig. 4 Example of a loop and a subset of its loop paths

Related work

Applying test case generation techniques to find software bugs has attracted much research attention in the field of software engineering. In this section, we discuss the main techniques for tackling test case generation in loops according to the survey study by Xiao et al. [8]. Besides, we present some existing approaches for addressing ATCG-PC and corresponding bug detection methods.

Loop summarization

Loop summarization uses symbolic analysis to capture the relationship between loop variables and iteration counts as a set of formulas, which are then used to address test case generation in loops [9, 12, 37]. Constraint solvers are applied to solve these formulas in order to reach the desired test goals. However, loop summarization suffers from high analysis costs [8], which limits its scalability to loops with complex interleaving. Moreover, it is not applicable to loops with non-induction variables [12].

Search-based heuristics

Search-based heuristics use fitness values to guide the algorithm toward paths which are more likely to reach a particular test goal such as a branch outside the loop body [38, 39]. For example, Xie et al. [9] designed a technique that uses a fitness function to measure how close an explored path is to a test goal after the loop body. However, those mentioned approaches [9, 38, 39] do not guarantee exploration of the different loop path possibilities because they focus on covering a predefined test goal.

Bounded iteration

Bounded iteration limits the number of loop iterations or the input range, thus reducing the search space [19, 40, 41]. This technique is applicable when a bound is suitable to prevent iterating the loop many times without reaching any new coverage targets [8, 42]. In addition, if a coverage target can only be reached after more iterations than the predefined bound, the bound can be increased [8]. A majority of approaches implementing bounded iteration [19, 38, 41, 43] apply symbolic execution [44] to guide the exploration of loop paths. Williams et al. [43] proposed a path coverage tool (i.e., PathCrawler) for conducting unit testing of C programs. The problem of covering all loop paths in a loop is reduced to a k-path criterion with the aim of covering loop paths from k-bounded loop iterations. However, the tool only returns a subset of all the possible loop paths as output. Huster et al. [19] proposed a symbolic execution-based approach to achieve path coverage on an unwound loop body. In this work [19], only iteration orders that affect each other were combined into equivalence classes to reduce the complexity of covering all the possible loop path variations.

Path coverage techniques

We present approaches that apply meta-heuristic search techniques to address ATCG-PC. In particular, we discuss some existing single-target and multi-target approaches. Besides, we present some test case generation tools applying symbolic execution to achieve path coverage.

Single-target approaches. Lin et al. [28] proposed an approach that extends the Hamming distance to calculate the fitness of individuals. However, no experiments were conducted on subjects with loops to evaluate this approach. Huang et al. [26] proposed a relationship matrix to empower the DE algorithm to improve path coverage in fog computing programs. In their work, the test case generation problem is formulated as a single-objective optimization problem. It is noted that the loop is treated as a branch condition (i.e., if-condition). The exploration of possible loop paths in loop programs is not considered in their work. Bouchachia et al. [45] improved GA-based test case generation by incorporating immune operators. Mala et al. [46] proposed a test case generation technique using an artificial bee colony optimization approach that combines global and local search strategies to improve the efficiency of reaching near-global optimal solutions. The aforementioned works [26, 28, 45, 46] apply single-target approaches to test case generation. However, more recent empirical studies [21, 30, 32, 33, 47] have shown that multi-target approaches help to reach a higher coverage rate than single-target strategies for test case generation problems with a large number of coverage targets. Notably, the loop benchmarks in this paper contain a large number of loop paths.

Multi-target approaches. Ahmed and Hermadi [29] proposed a multi-target approach to cover many target paths simultaneously. Their empirical analysis included some loop programs. In their work, a standard GA was applied to simultaneously optimize the objectives. Gong et al. [20] proposed a grouping-based approach to generate test cases for many-path coverage. This work reduces the complexity of considering all coverage targets at the same time by evolving sub-populations to simultaneously optimize the groups (i.e., subsets of paths). A small number of loop paths, out of all the enumerated loop paths, are randomly selected as the coverage targets. However, some randomly selected paths may be infeasible. Bidgoli et al. [22] recently proposed an approach for prime path coverage instead of complete path coverage. The authors employed a multi-target strategy that applies ant colony optimization to cover prime paths simultaneously. However, prime path coverage does not capture all the possible distinct execution orders of the loop. The aforementioned works [20, 22, 29] use single-objective optimization algorithms to solve an intrinsically multi-objective problem [31]. However, evidence from two recent empirical studies by Campos et al. [33] and Panichella et al. [32] suggests that employing many-objective algorithms helps to achieve higher coverage than single-objective algorithms in programs with a large number of coverage targets.

Symbolic execution approaches. There are symbolic execution tools, such as PEX [38], KLEE [48], CUTE [49], and DART [50] which generate test cases to explore all feasible execution paths.

Bug detection approaches

There are many tools for detecting bugs, such as CTrigger [51], Ample [52], spectrum-based fault localization (SBFL) [53], and BugPre [54]. Specifically, CTrigger [51] combines trace analysis and execution perturbation to locate bugs. Ample [52] is an Eclipse plug-in for identifying faulty classes in Java programs. SBFL uses the program elements exercised by preset test cases to locate the parts of a program that are likely to be buggy, while BugPre [54] conducts defect prediction on changed software modules. However, suitable test cases are required to improve the performance of those tools.

Our work

Different from previous approaches that employ search-based test case generation techniques targeting path coverage [20, 22, 28, 29], this work can perform path exploration in both nested and unnested multi-path loops to trigger bugs. KT-GDE3 is a novel search-based test case generation technique based on path coverage for exploring paths in nested loops. In addition, KT-GDE3 addresses ATCG-PC in multi-path loops using many-objective search. To achieve this goal, we propose a many-objective search-based framework with a knowledge transfer scheme to enhance the optimization process for target loop paths. The framework is designed to focus the search on optimizing groups of loop paths, one after another in a sequential manner. Knowledge transfer is realized by initializing the population for optimizing a new target group with test cases that previously covered loop paths in similar groups.
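
The population seeding step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the KT-GDE3 implementation: the archive layout (a mapping from group identifiers to covering test cases), the function `seed_population`, and the random generator of fresh individuals are all hypothetical.

```python
import random


def seed_population(archive, neighbor_groups, pop_size, new_individual, rng):
    """Build an initial population for a new target group.

    archive maps a group id to test cases that already covered paths in that group.
    Archived test cases from the neighboring groups are reused first; the rest of
    the population is filled with freshly generated individuals.
    """
    transferred = [tc for g in neighbor_groups for tc in archive.get(g, [])]
    rng.shuffle(transferred)
    population = transferred[:pop_size]
    while len(population) < pop_size:
        population.append(new_individual(rng))
    return population


RNG = random.Random(0)
# Hypothetical archive: test cases are tuples of two integer inputs.
ARCHIVE = {"g1": [(3, 7), (4, 9)], "g2": [(5, 1)]}
POP = seed_population(
    ARCHIVE,
    ["g1", "g2"],
    pop_size=5,
    new_individual=lambda r: (r.randint(0, 10), r.randint(0, 10)),
    rng=RNG,
)
```

In this sketch the three archived test cases occupy the first slots of the population and the remaining two slots are filled randomly, so the search for the new group starts from inputs that already follow similar loop paths.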

Problem formulation

This section presents the many-objective formulation of the test case generation for multi-path loops. In the first part, we introduce the computation method of the loop path distance, highlighting how to quantify the distance between the execution trace of a test case and a loop path. In the second part, the many-objective formulation based on loop path coverage will be presented.

Computing the loop path distance

As discussed in the “Loop bug categories” section, bugs may exist inside or outside the loop. Intuitively, many bugs in multi-path loops can be revealed by generating test cases with a high loop path coverage rate. Let \(B = \{b_{1},\ldots ,b_{m}\}\) be a set of bugs in a loop program. A bug b is reachable if it can be triggered along at least one executable loop path. Intuitively, to trigger b, the goal of loop path coverage is to find a test case that triggers b along the bug-inducing loop path. To find test cases covering potential bug-inducing paths, the search-based method is guided by a fitness function. The fitness function quantifies the distance between the execution trace of a test case and a loop path. In this work, we quantify the distance of the execution trace of a test case from a loop path as follows.

Figure 4a is an example of a loop program. A corresponding subset of its loop paths is shown in Fig. 4b. The simple routes in this loop are \(\sigma _{1} = 3\overset{i< k}{\longrightarrow }5\overset{c < 20}{\longrightarrow }3\), \(\sigma _{2} = 3\overset{i < k}{\longrightarrow }7\overset{c == j*z}{\longrightarrow }3\) and \(\sigma _{3} = 3\overset{i < k}{\longrightarrow }9\overset{c >= 20 \ \&\&\ c != j*z}{\longrightarrow }3\). Let the loop path \(p(x) = \sigma _{1}\sigma _{2}\sigma _{3}\sigma _{3}\sigma _{1}\sigma _{1}\) be the path traversed by test case x. The third loop path, denoted by \(\pi _{j} = \sigma _{1}\sigma _{1}\sigma _{3}\sigma _{3}\sigma _{1}\sigma _{1}\), is the target loop path. The simple routes in \(\pi _{j}\) and p(x) are compared, and the branch distance is evaluated at each simple route of p(x) that differs from \(\pi _{j}\). The branch distance is evaluated based on the variable values at the conditional expression where the execution diverges from the target node of interest (i.e., the unmatched simple route). This distance is calculated by a defined set of formulas for the various predicate types [55]. For example, the unmatched simple route in \(\pi _{j}\) is the simple route \(\sigma _{1}\) appearing at the second position in \(\pi _{j}\). \(\sigma _{2}\) is the simple route traversed at the corresponding position in p(x). The simple route \(\sigma _{1}\) is the sequence of edges \(3\overset{i< k}{\longrightarrow }5\overset{c < 20 }{\longrightarrow }3\). The simple route \(\sigma _{2}\) is the sequence of edges \(3\overset{i < k}{\longrightarrow }7\overset{c == j*z}{\longrightarrow }3\). The execution diverges from simple route \(\sigma _{1}\) at the conditional expression \(c < 20\) on Line 5. The branch distance at the true branch in simple route \(\sigma _{1}\) (i.e., Line 5 of Fig. 4a) is \(branch\_distance = |c-20| + K\), where K is a constant added when the alternate branch is triggered. The branch distance is normalized to a value in [0, 1] by using the normalization function \(norm(branch\_distance) = branch\_distance \mathbin {/} (1 + branch\_distance)\) [56]. In this example, the simple route at the second position is the only unmatched simple route between \(\pi _{j}\) and p(x). The loop path distance, which indicates the distance from loop path \(\pi _{j}\) to p(x), is the sum of the normalized branch distances over all unmatched simple routes in the loop path \(\pi _{j}\).
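
The computation above can be traced with a short sketch. The runtime value of c at the point of divergence and the constant K are not given in the text, so the values c = 24 and K = 1 below are assumptions chosen only for illustration. The function names are hypothetical, and returning 0 when the predicate already holds is the usual convention for branch distances rather than something stated in the text.

```python
K = 1.0  # assumed constant; the text only states that K is added when the alternate branch is taken


def branch_distance_less_than(lhs, rhs, k=K):
    """Distance to making `lhs < rhs` true: 0 if it already holds, else |lhs - rhs| + k."""
    return 0.0 if lhs < rhs else abs(lhs - rhs) + k


def normalize(d):
    """Normalization function d / (1 + d) for a non-negative distance d."""
    return d / (1.0 + d)


def loop_path_distance(target, traversed, raw_distances):
    """Sum the normalized branch distances over positions where the simple routes differ.

    raw_distances[i] is the raw branch distance observed at position i and is
    only read where target[i] != traversed[i].
    """
    return sum(
        normalize(raw_distances[i])
        for i, (t, p) in enumerate(zip(target, traversed))
        if t != p
    )


PI_J = ["s1", "s1", "s3", "s3", "s1", "s1"]  # target loop path from the example
P_X = ["s1", "s2", "s3", "s3", "s1", "s1"]   # loop path traversed by test case x
C_AT_DIVERGENCE = 24                         # hypothetical runtime value of c
RAW = {1: branch_distance_less_than(C_AT_DIVERGENCE, 20)}  # index 1 = second position
DISTANCE = loop_path_distance(PI_J, P_X, RAW)
```

Under these assumed values the raw branch distance is |24 - 20| + 1 = 5, its normalized value is 5/6, and since only the second position is unmatched, the loop path distance is also 5/6.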

Many-objective formulation for loop path coverage

Test case generation can be formulated as an optimization problem, where the goal is to maximize coverage of structural targets following an adequacy testing criterion [57]. Some researchers model the test case generation problem by using a single-objective formulation [26, 28, 45, 46]. In this formulation, a test case is evaluated against only one coverage target at a time and thus is assigned a single fitness value. Conversely, other studies employ multi-target strategies to generate test cases where each test case is evaluated against many coverage targets simultaneously. One of the multi-target strategies is the whole test suite approach (WTS) [30], where each individual is a set of test cases (i.e., a test suite). The fitness value of the test suite is the sum of all branch distances from the coverage targets (e.g., branches) in the program under test. Another multi-target strategy involves addressing the test case generation problem in a many-objective fashion [31].

In a many-objective optimization formulation [58], each test case is assigned a fitness vector, and each fitness value in the vector represents the distance between the execution trace of the test case and a specific coverage target. Recent large-scale studies have theoretically and empirically shown that approaches using multi-target strategies are superior to approaches based on single-target strategies in programs with a large number of targets [32, 33]. Furthermore, ATCG-PC with multi-path loops usually involves hundreds of targets. This evidence motivates our choice of using many-objective search to address ATCG-PC with multi-path loops.

Intuitively, a generated individual is unlikely to be high performing across all loop paths (i.e., objectives). Hence, an individual can ideally be evaluated for only selected paths on which it is most likely to perform well. The intuition is that when some paths are related, there may exist some individuals which perform well on these paths. As a result, useful knowledge acquired during the process of optimizing some loop paths may assist in finding solutions to cover other similar loop paths. Accordingly, this many-objective optimization problem aims at optimizing a subset of loop paths at a time. Specifically, the set of loop paths L is partitioned into groups denoted as the set \(G = \{g_{1},\ldots ,g_{\xi }\}\), where all the groups are sub-problems to be optimized in a sequential manner. A group \(g_{target} \in G\) is randomly selected at a time as the group of interest to be optimized until all the assigned search budget is consumed or all loop paths are covered. More precisely, we consider the following many-objective optimization formulation for a group of loop paths at a time.

Let \(g_{target} =\{\pi _{1},\ldots ,\pi _{m}\}\) be the target group of loop paths, where \(\pi _{j}\) (\(1 \le j \le m\)) is the j-th loop path. The objective is to find a subset of non-dominated test cases in \(X=\{x_{1},\ldots ,x_{n}\}\) that minimizes the fitness values for all target loop paths in \(g_{target}\), i.e.,

$$\begin{aligned} {\left\{ \begin{array}{ll} \text {min} f_{1}(x) = d(\pi _{1}, x) \\ \vdots \\ \text {min} f_{m}(x) = d(\pi _{m}, x)\\ \end{array}\right. } \end{aligned}$$
(1)

where \(d(\pi _{j}, x)\) denotes the loop path distance from \(\pi _j\) to the path traversed by test case x. Vector \(\langle f_{1}, \ldots , f_{m} \rangle \) is the fitness vector of test case x. A test case \(x_{a}\) dominates another test case \(x_{b}\) (i.e., \(x_{a} \prec x_{b}\)) iff \(\forall j\in \{1,\ldots ,m\}:~f_{j}(x_{a}) \le f_{j}(x_{b})\) and \(\exists j\in \{1,\ldots ,m\}\) such that \(f_{j}(x_{a}) < f_{j}(x_{b})\).
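The dominance relation above can be sketched directly on fitness vectors. This is a minimal illustration, assuming fitness vectors are plain tuples of loop path distances to be minimized; the helper names are hypothetical.

```python
def dominates(fa, fb) -> bool:
    """True iff fa is no worse than fb everywhere and strictly better somewhere."""
    if len(fa) != len(fb):
        raise ValueError("fitness vectors must have the same length")
    no_worse = all(a <= b for a, b in zip(fa, fb))
    strictly_better = any(a < b for a, b in zip(fa, fb))
    return no_worse and strictly_better


def non_dominated(vectors):
    """Return the vectors that are not dominated by any other vector."""
    return [v for v in vectors
            if not any(dominates(other, v) for other in vectors)]


# Three hypothetical fitness vectors over two target loop paths.
population = [(0.0, 0.5), (0.2, 0.5), (0.3, 0.1)]
front = non_dominated(population)
print(front)  # (0.2, 0.5) is dominated by (0.0, 0.5)
```

Note that two identical fitness vectors do not dominate each other, because the strict inequality fails for every objective.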

Knowledge transfer scheme

In this section, we present the design of the knowledge transfer scheme. Specifically, we organize similar loop paths into groups based on the leveraged common path prefix information, and assign related groups into the same neighbor set. In addition, each group is associated with an archive to store test cases previously covering its paths.

The knowledge transfer scheme exploits the latent similarities of loop paths to enhance the optimization process. As discussed in “Many-objective formulation for loop path coverage” section, the defined many-objective optimization problem aims at optimizing a subset of loop paths (i.e., a group) at a time. The knowledge transfer scheme enhances the optimization process in two ways. First, by grouping similar paths, redundant fitness evaluations are minimized. For example, Fig. 5 shows a group of loop paths with a common path prefix. We assume that these loop paths are a subset of the loop paths for the loop program in Fig. 4a. Since all paths in this group share a common path prefix, the fitness computation for that path section is done only once per test case, and all the paths share the resulting value for the initial path section. The only additional computational effort incurred for each path is the evaluation of its remaining path section. Reducing redundant evaluations in this way benefits the optimization process when the search budget is limited.

Second, the knowledge transfer scheme retrieves solutions of previously covered paths from the archives and includes them in the population to assist in finding solutions for similar paths in the same neighbor set. We assume that the distance of those test cases to the potential solutions of the similar paths is close. The incorporated test cases speed up the optimization process by providing initial starting points to guide the search toward promising regions of the search space. This reuse of archived solutions minimizes redundant exploration and accelerates convergence toward solutions to the remaining paths.

In the following subsection, we explain how to organize similar loop paths into groups and similar groups into neighbor sets.

Fig. 5 A group of paths with common path prefix \(\sigma _{1}\) \(\sigma _{2}\) \(\sigma _{1}\)

Grouping similar paths

Our strategy of grouping loop paths imposes a criterion that assigns loop paths with a common path prefix to the same group. A common path prefix refers to a sequence of simple routes, starting from the first simple route, that is the same among a group of paths. The procedure of determining the common path prefix is based on the longest path among the group of paths. Algorithm 1 is the procedure of determining the common path prefix and organizing loop paths into groups. First, the longest loop path \(\pi _{j}\) of length \(|\pi _{j}|\) is chosen from the loop path bundle L. Then, all loop paths that share the same first \(\lceil |\pi _{j}|\mathbin {/}2\rceil \) simple routes are assigned to the same group.

The rationale behind this strategy (Algorithm 1) is based on the observation that many loop paths share the same path prefix. We assume that a test case that covers a loop path can assist in finding solutions to cover other loop paths which share a common path prefix. Consider again the group of loop paths in Fig. 5, which we assume to be a subset of the loop paths for the program in Fig. 4a. Let the loop path \(\pi _1 = \sigma _{1}~\sigma _{2}~\sigma _{1}~\sigma _{3}~\sigma _{1}~\sigma _{2}\) in Fig. 5 be the loop path traversed by a test case x. We assume that the loop paths \(\pi _1\) to \(\pi _7\) belong to a group with the common prefix \(\sigma _{1}~\sigma _{2}~\sigma _{1}\). Intuitively, the test case x traverses the same simple routes at the first to third positions of all the loop paths in this group.

Algorithm 1 GROUP-PATH

To take this insight into account, we assume that a test case attempting to cover a loop path in a group of interest may accidentally cover a loop path in another group. This is known as serendipitous coverage [30]. To exploit serendipitous coverage, each group is associated with an archive to save the test cases that cover any of its loop paths. Subsequently, when a group is selected as the group of interest, test cases are retrieved from its archive and included in the initial population for covering the remaining loop paths. We assume that this strategy enhances the optimization process by prioritizing individuals for evaluation only on loop paths that they are most likely to cover.

Algorithm 1 shows the procedure of organizing loop paths into groups based on the shared prefix. More formally, to establish these groups, we define an equivalence relation \(\sim \) which divides the set L.

  • Definition 4 Given a set of loop paths \(L = \{\pi _{1},\ldots ,\pi _{m}\}\), an equivalence relation \(\sim \), which divides L, is defined such that loop paths with a common path prefix belong to the same group. Intuitively, if loop paths \(\pi _{i}, \pi _{j} \in L\) share a common path prefix, \(\pi _{i}\) and \(\pi _{j}\) belong to the same group, where \(i,j \in [1,m] ~{and}~ i\ne j\).

Conceptually, from Definition 4, a partition of the set L is a collection of non-empty disjoint subsets (i.e., groups of loop paths). We denote the set of the disjoint subsets of L as \(G = \{g_{1},\ldots ,g_{\xi }\}\), where \(\xi \) is the total number of groups. The groups in G form a partition of L such that \(\bigcup ^{\xi }_{i=1} g_{i} = L\) and \(g_{i} \cap g_{j} = \emptyset \) if \(g_{i} \ne g_{j}\), where \(1 \le i,j \le \xi \). We assume that each group \(g_{i}\) is associated with an archive \(g^{archive}_{i}\), which is the set of test cases that cover its loop paths.

After organizing loop paths into groups, each group \(g_{i}\) is uniquely identified by an id denoted as \(id_{g_i}\), which corresponds to the common path prefix of all its loop paths. For example, the id of the group in Fig. 5 is its common path prefix \(\sigma _{1}\sigma _{2}\sigma _{1}\).
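The grouping criterion can be sketched as follows. This is not a reproduction of Algorithm 1 (which appears only as a figure); it is a minimal sketch assuming that loop paths are tuples of simple-route indices, that the prefix length is derived from the longest path in the bundle, and that a path shorter than the prefix length is keyed by its whole sequence. The function name and the example paths other than \(\pi _1\) are hypothetical.

```python
from collections import defaultdict
from math import ceil


def group_paths(loop_paths):
    """Partition loop paths by their first ceil(|longest| / 2) simple routes.

    Returns a dict mapping each group id (the common prefix) to its paths.
    """
    longest = max(len(path) for path in loop_paths)
    prefix_len = ceil(longest / 2)
    groups = defaultdict(list)
    for path in loop_paths:
        group_id = tuple(path[:prefix_len])
        groups[group_id].append(path)
    return dict(groups)


# Simple routes are written as integers: 1 = sigma_1, 2 = sigma_2, 3 = sigma_3.
bundle = [
    (1, 2, 1, 3, 1, 2),  # pi_1 from the text
    (1, 2, 1, 1, 1, 1),  # hypothetical path with the same prefix
    (1, 2, 3, 1, 3, 3),  # belongs to the group with id (1, 2, 3)
    (1, 2, 2, 1, 2, 2),  # belongs to the group with id (1, 2, 2)
]
groups = group_paths(bundle)
for group_id, members in groups.items():
    print(group_id, len(members))
```

Because every path is assigned to exactly one prefix key, the resulting groups are disjoint and their union is the original bundle, which matches the partition property stated above.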

Algorithm 2 GENERATE-NEIGHBOR-SET

Organizing similar groups into neighbor sets

In this subsection, we discuss how to establish neighbor sets among similar groups. Algorithm 2 shows the procedure of organizing groups into neighbor sets. The corresponding assumption is that test cases which cover loop paths in the different groups (either serendipitously or as desired targets) can provide useful knowledge for covering similar loop paths in another group. More precisely, those related groups will be assigned to the same neighbor set.

  • Definition 5 (Neighbor set) A neighbor set is a collection of groups whose loop paths have the same initial simple route.

For example, let us consider a loop path \(\sigma _{1}\sigma _{2}\sigma _{3}\sigma _{1}\sigma _{3}\sigma _{3}\) and a common path prefix extracted from it. We assume that this loop path belongs to a group whose id equals \(\sigma _{1}\sigma _{2}\sigma _{3}\). The id represents the common path prefix of a group, and each group id is different from the ids of the other groups. This group can be assigned to the same neighbor set as the group of paths in Fig. 5 with the common path prefix \(\sigma _{1}\sigma _{2}\sigma _{1}\). Let us consider another loop path \(\sigma _{1}\sigma _{2}\sigma _{2}\sigma _{1}\sigma _{2}\sigma _{2}\) in a group whose id equals \(\sigma _{1}\sigma _{2}\sigma _{2}\). The groups with path prefixes \(\sigma _{1}\sigma _{2}\sigma _{2}\), \(\sigma _{1}\sigma _{2}\sigma _{3}\) and \(\sigma _{1}\sigma _{2}\sigma _{1}\) can all be assigned to the same neighbor set, since their loop paths start with the same simple route \(\sigma _{1}\).
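The assignment of groups to neighbor sets in this example can be sketched as a second grouping step keyed on the initial simple route. This is a minimal illustration of Definition 5 rather than a reproduction of Algorithm 2; the function name and the fourth group id are hypothetical.

```python
from collections import defaultdict


def generate_neighbor_sets(group_ids):
    """Collect group ids (prefix tuples) by their initial simple route."""
    neighbor_sets = defaultdict(list)
    for group_id in group_ids:
        initial_route = group_id[0]
        neighbor_sets[initial_route].append(group_id)
    return dict(neighbor_sets)


# Group ids from the example (1 = sigma_1, 2 = sigma_2, 3 = sigma_3),
# plus one hypothetical group whose paths start with sigma_2.
group_ids = [(1, 2, 1), (1, 2, 3), (1, 2, 2), (2, 1, 1)]
neighbor_sets = generate_neighbor_sets(group_ids)
print(neighbor_sets[1])  # the three groups from the example
print(neighbor_sets[2])  # the hypothetical fourth group
```

The key of each neighbor set (the initial simple route) serves as its id, in line with the identification scheme described for neighbor sets.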

Fig. 6 Reuse of archived test cases

Based on the aforementioned observation, we assume that the test case covering a loop path in a neighbor set will cover the initial simple route of all loop paths from the same neighbor set. To exploit this information, each neighbor set is associated with an archive, which is the set that contains all archives of loop paths in the corresponding neighbor set. Subsequently, when a new group of interest belongs to this set, test cases are retrieved from these archives and injected into the initial population to find test cases to cover the loop paths in the group.

More formally, to establish the neighbor sets, we define an equivalence relation \(\sim \) which partitions G into neighbor sets.

  • Definition 6 Given a set of groups \(G = \{g_{1},\ldots ,g_{\xi }\}\) that form a partition of L, an equivalence relation \(\sim \) on G is defined such that groups whose loop paths have the same initial simple route are assigned to the same neighbor set.

Conceptually, from Definition 6, G can be divided into non-empty disjoint subsets (i.e., neighbor sets in this context) based on the equivalence relation. We denote \(\Psi \) as the set of all neighbor sets, \( \Psi = \{\psi _{1},\ldots , \psi _{w}\}\), where w is the total number of neighbor sets. The neighbor sets \(\psi _{1},\ldots ,\psi _{w}\) form a partition of G such that \(\bigcup ^{w}_{i=1} \psi _{i} = G\) and \(\psi _{i} \cap \psi _{j} = \emptyset \) if \(\psi _{i} \ne \psi _{j}\), where \(1 \le i,j \le w\). We assume that the neighbor set \(\psi _{j}\) is associated with a neighbor set archive shared by all groups in \(\psi _{j}\), where \(1\le j \le w\). The neighbor set archive of \(\psi _{j}\) is denoted by \(\psi ^{archive}_{j} =\{g^{archive}_{1},\ldots ,g^{archive}_{n}\}\), where \(g^{archive}_{i} \in \psi ^{archive}_{j} \) is the archive of group \(g_{i} \in \psi _{j}\), \(1 \le i \le n\), and \(1 \le j \le w\).

In addition, each neighbor set is uniquely identified by an id which corresponds to the initial simple route of all loop paths. Figure 6 presents an illustrative example of the concept behind neighbor sets. We assume that the new group of interest Group i belongs to the neighbor set in Fig. 6. To cover the remaining paths in Group i, test cases are retrieved from the neighbor set archive and injected into the initial population. Based on the aforementioned ideas, we present the definition of the knowledge transfer scheme as follows.

  • Definition 7 (Knowledge transfer scheme) Given an archive of test cases \(\psi ^{archive}_{j}\) (\(1 \le j \le w\)) previously covering loop paths in groups of a neighbor set and a new group of interest belonging to this set (i.e., \(g_{target} \in \psi _{j}\)), knowledge transfer is a scheme that enhances the optimization process for loop paths in \(g_{target}\) by reusing archived test cases in \(\psi ^{archive}_{j}\).

In the next section, we present a discussion on how the knowledge transfer scheme is incorporated into the proposed many-objective test case generation framework.

Many-objective test case generation framework

This section presents the many-objective test case generation framework tailored to find bugs of bounded depth in loops. We refer to this framework as knowledge transfer based generalized differential evolution 3 (KT-GDE3). We assume that KT-GDE3 has a high probability of triggering bugs in multi-path loops. Although only a subset of loop paths is selected as the objectives to be optimized at a time, KT-GDE3 is designed to exploit serendipitous coverage of loop paths in all groups. The key aspect of KT-GDE3 is that solutions retrieved for previously covered paths can be used to cover the loop paths in another similar group. The pseudo code of KT-GDE3 is presented in Algorithm 3. KT-GDE3 is based on the generalized differential evolution 3 (GDE3) [36], a multi-objective derivative of differential evolution (DE) version DE/rand/1/bin [59, 60]. We highlight in bold the modifications relative to the original GDE3 algorithm.

Algorithm 3 KT-GDE3

Algorithm 4 CORNER-SORTING

As shown in Algorithm 3, KT-GDE3 starts with a randomly generated initial population (i.e., test cases) of cardinality N (Line 10 of Algorithm 3). Each randomly generated test case is executed against the program under test with a k-bounded loop. The loop path \(\pi _{temp}\), which matches the loop path p(x), is found (Lines 12–15 of Algorithm 3). Next, KT-GDE3 selects a target loop path group of interest (Line 17 of Algorithm 3). A group is selected as the target if it contains the lowest number of uncovered loop paths among all groups. In the beginning of KT-GDE3, no knowledge transfer is performed because the archives are initially empty (Lines 2–5 of Algorithm 3).

The initial population of test cases is evaluated against uncovered loop paths in \(g_{target}\) (Lines 20–21 of Algorithm 3). Next, KT-GDE3 uses MUTATION-AND-CROSSOVER operations to generate new offspring test cases (Line 23 of Algorithm 3). The steps in Lines 12–15 of Algorithm 3 are then repeated to record the coverage status of the offspring test cases. In the next step, KT-GDE3 calculates the fitness values of the offspring test cases for uncovered loop paths in \(g_{target}\) (Line 26 of Algorithm 3).

The selection procedure then selects candidate test cases (Lines 27–28 of Algorithm 3). Specifically, the WEAK-DOMINANCE-RELATION function (Line 27 of Algorithm 3) determines the non-dominated test case between a parent test case and its offspring test case. More formally, the WEAK-DOMINANCE-RELATION function defines the weak dominance relation \(\preceq \) between a parent test case and its offspring test case such that a test case u weakly dominates a test case v (i.e., \(u \preceq v\)) iff \(\forall j\in \{1,\ldots ,m\}:f_{j}(u) \le f_{j}(v)\). In that case, the test case u is added to the new population Pop (Line 27 of Algorithm 3). If neither test case dominates the other, both test cases are added to the new population Pop. Unlike GDE3, which applies the traditional non-dominated sorting to rank the new population Pop into non-dominated fronts, KT-GDE3 uses the CORNER-SORTING function (Line 28 of Algorithm 3) [61].
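The parent-versus-offspring comparison can be sketched as below. The sketch assumes that the offspring is tested first, so that it replaces the parent when it weakly dominates it (the usual GDE3 convention); whether Algorithm 3 uses the same order is not stated in the text, and the function names are hypothetical.

```python
def weakly_dominates(fu, fv) -> bool:
    """True iff fu is no worse than fv on every objective."""
    return all(u <= v for u, v in zip(fu, fv))


def select_survivors(parent, offspring, fitness):
    """Return the test cases kept for the new population Pop.

    fitness: callable mapping a test case to its fitness vector.
    Assumption: the offspring is checked first (GDE3 convention).
    """
    f_parent, f_offspring = fitness(parent), fitness(offspring)
    if weakly_dominates(f_offspring, f_parent):
        return [offspring]
    if weakly_dominates(f_parent, f_offspring):
        return [parent]
    return [parent, offspring]  # incomparable: keep both


# Hypothetical fitness vectors over two target loop paths.
table = {"p": (0.4, 0.2), "o1": (0.3, 0.2), "o2": (0.1, 0.6), "o3": (0.5, 0.2)}
print(select_survivors("p", "o1", table.get))  # offspring replaces parent
print(select_survivors("p", "o2", table.get))  # incomparable, both kept
print(select_survivors("p", "o3", table.get))  # parent kept
```

Keeping both test cases in the incomparable case is what lets Pop temporarily grow beyond N, which is why a subsequent ranking and truncation step is needed.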

CORNER-SORTING is applied to save objective comparisons when obtaining the non-dominated test cases (Algorithm 4). In traditional non-dominated sorting, m objective comparisons are normally required to compare two candidate solutions (i.e., test cases). In contrast, for a population of size N, CORNER-SORTING requires only \(N-1\) objective comparisons for a single objective, so the total number of objective comparisons is fewer than \(m(N-1)\). The function CORNER-SORTING ranks the new population Pop into non-dominated fronts iteratively using a preference procedure. More formally, a test case x is preferred to a test case y for a loop path \(\pi _{j} \in g_{target}\) iff \(f_{j}(x) < f_{j}(y)\). Consequently, the test case x is assigned to the first non-dominated front \(\mathbb {F}_{0}\), which is a subset of \(\mathbb {F}\) (Line 28 of Algorithm 3). The remaining test cases in Pop are ranked using the same procedure that assigns test cases to \(\mathbb {F}_{0}\). After performing the CORNER-SORTING routine, the algorithm implements the CROWDING-DISTANCE procedure using the subvector dominance assignment [62]. The more diverse candidate test cases in the same front have a higher chance of being selected for the offspring population (Lines 31–35 of Algorithm 3).

Algorithm 5 REUSE-TEST-CASES

The evolution process of KT-GDE3 focuses on covering \(g_{target}\) until the budget allotted to the group (i.e., the local budget) is consumed or all of its loop paths are covered. The operation UPDATE-ARCHIVES (Line 40 of Algorithm 3) saves the test cases that covered previously uncovered loop paths of a group in that group’s archive. In the next iteration, a new group of interest is selected (Line 16 of Algorithm 3) to be optimized. REUSE-TEST-CASES (Line 19 of Algorithm 3) finds suitable archived test cases to serve as the initial population for covering target loop paths, based on the designed knowledge transfer scheme (see “Knowledge transfer scheme” section). Specifically, Algorithm 5 follows a seeding strategy which prioritizes the test cases to include in the initial population for covering target loop paths. First, REUSE-TEST-CASES uses the id \(id_{g_{target}}\) to find the j-th neighbor set \(\psi _j\) to which \(g_{target}\) belongs (Line 3 of Algorithm 5). It then obtains the neighbor set archive associated with \(\psi _j\) (Line 4 of Algorithm 5). Test cases are added from the archive of the new group of interest \(g^{archive}_{target}\) to the new population H as long as the original population size is not exceeded (Lines 5–9 of Algorithm 5). If the original population size has not yet been reached, test cases are randomly added to H from the archives of groups in the same neighbor set as \(g_{target}\) (Lines 10–17 of Algorithm 5). If the number of test cases added from the archives is still less than the initial population size, test cases from the current population are added to the new population until its size equals N (Lines 18–22 of Algorithm 5). Finally, the new population H replaces the current population \(P_r\) as the initial population for covering target loop paths in \(g_{target}\) (Line 23 of Algorithm 5).
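The three-stage seeding order described above can be sketched as follows. This is an illustrative approximation of Algorithm 5 (which appears only as a figure), assuming that test cases are hashable values, that archives are plain lists, and that duplicates are skipped; all parameter names are hypothetical.

```python
import random


def reuse_test_cases(target_archive, neighbor_archives, current_population,
                     population_size, rng=random):
    """Build a seeded initial population H of at most `population_size`.

    Priority: (1) archive of the target group, (2) random picks from the
    archives of the other groups in the same neighbor set, (3) members of
    the current population.
    """
    seeded = []

    def add(candidates):
        for test_case in candidates:
            if len(seeded) >= population_size:
                return
            if test_case not in seeded:  # assumption: skip duplicates
                seeded.append(test_case)

    add(target_archive)
    pool = [t for archive in neighbor_archives for t in archive]
    rng.shuffle(pool)  # random selection from the neighbor archives
    add(pool)
    add(current_population)
    return seeded


# Hypothetical test cases, each an input tuple for the function under test.
target_archive = [(3, 7), (4, 9)]
neighbor_archives = [[(5, 1)], [(6, 2), (3, 7)]]
current_population = [(0, 0), (1, 1), (2, 2), (8, 8)]
H = reuse_test_cases(target_archive, neighbor_archives, current_population,
                     population_size=6, rng=random.Random(0))
print(H)
```

In this usage, the two archived test cases of the target group come first, the two distinct neighbor-archive test cases follow in random order, and the remaining slots are filled from the current population.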

After searching for test cases on all groups, KT-GDE3 adds test cases from all the archives to the set X (Line 42 of Algorithm 3). In summary, KT-GDE3 iteratively selects one group of loop paths to be optimized at a time. The key aspect of the proposed knowledge transfer scheme is to provide an initial starting point that is closer to the solutions of the remaining uncovered paths, thereby guiding the search toward promising regions of the search space.

The computational complexity of KT-GDE3 contains the following parts:

  1. Fitness evaluation. In the fitness evaluation step, each test case needs to be evaluated against all loop paths in the current target group (i.e., \(g_{target}\)) to calculate its fitness vector. The complexity depends on the number of test cases, N, and the number of loop paths in \(g_{target}\), which we denote by m. For each test case, all loop paths in \(g_{target}\) are evaluated, resulting in a complexity of O(\(N * m\)).

  2. Corner sort procedure. Corner sort is performed on test cases in the intermediate population, Pop. Corner sort performs \(|Pop|-1\) objective comparisons for each objective. Since there are m objectives, the total number of objective comparisons required by CORNER-SORTING is \((|Pop|-1)* m\). The computational complexity of CORNER-SORTING is therefore \(O((|Pop|-1)*m)\).

  3. Reusing test cases from the archives as the initial population for covering target loop paths. KT-GDE3 retrieves test cases from the archives until the original size of the population, N, is reached. The complexity of this step depends on the size of the population and the number of test cases available in the archives. Let us assume the number of test cases available in the archives is K. The complexity of this step can be approximated as O(min(K, N)).

  4. Crowding distance. For each test case, the distances to its neighboring test cases are calculated. Assuming that there are N candidate test cases and m objectives, the total number of pairwise comparisons for all objectives is \(m * N * (N-1)\). The total number of distances to be calculated is thus proportional to the square of the population size, N. Hence, the complexity of the crowding distance procedure is O(\(m * N^{2}\)).

The total complexity is \(O(N * m) + O((|Pop|-1)*m) + O(\min (K, N)) + O(m * N^{2})\). The term that grows fastest with the input size is \(O(m * N^{2})\). Therefore, the overall complexity is \(O(m * N^{2})\).

Note that the design of the knowledge transfer scheme is done offline before being incorporated into the many-objective test case generation framework. Thus, the complexity involved in creating groups and neighbor sets is excluded from the complexity analysis of the proposed many-objective test case generation framework (i.e., KT-GDE3).

In this paper, we assume that augmenting many-objective search with the proposed knowledge transfer scheme enhances the optimization process for exploring bug-inducing paths, thereby improving the bug finding capability in multi-path loops. The next section describes an experiment conducted to support this argument.

Experiment setting

In this paper, we conduct an experiment to assess the performance of KT-GDE3. Two research questions guide this experiment:

  • RQ1: How does KT-GDE3 perform compared to the state-of-the-art algorithms in bug finding capability among multi-path loops? This research question investigates to what extent KT-GDE3 is able to achieve higher bug coverage in comparison with state-of-the-art algorithms (DynaMOSA [31], WTS [30], MISA [35] and DE-SS [27]).

  • RQ2 (internal assessment): How does the knowledge transfer scheme affect the bug finding capability of KT-GDE3 in multi-path loops? This research question aims at assessing the internal function of our approach. KT-GDE3 incorporates a knowledge transfer scheme that reuses archived test cases to traverse a new group of paths. In particular, we investigate whether the knowledge transfer scheme of our approach is essential to attain higher bug coverage rates.

Table 2 Details of the benchmarks

Baseline comparison

To answer the research questions, we compared the performance of KT-GDE3 (see Footnote 2) with the state-of-the-art algorithms DynaMOSA [31], WTS [30], MISA [35] and DE-SS [27]. DynaMOSA and WTS are state-of-the-art algorithms recognized in search-based software testing. Those two algorithms employ a multi-target approach to address the test case generation problem. MISA and DE-SS are state-of-the-art algorithms that generate path coverage test cases. Hence, we consider these algorithms suitable baselines for comparison with KT-GDE3. A detailed description of the techniques is presented as follows.

  1. To answer the first research question (RQ1), we considered DynaMOSA, WTS, MISA and DE-SS as the baselines for comparison. DynaMOSA [31] addresses the test case generation problem in a many-objective fashion by dynamically focusing the search on a subset of test targets (e.g., branches) at a time based on a control dependency hierarchy. DynaMOSA is built on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [63]. WTS [30] is a multi-target approach based on the genetic algorithm that generates whole test suites, where each individual in WTS is a set of test cases. DynaMOSA and WTS are publicly available in EVOSUITE [64]. In addition, we considered the recently proposed state-of-the-art algorithms MISA [35] and DE-SS [27], which generate path coverage test cases. MISA is a single-target approach that finds the equivalent mapping subspaces using a test-case-path relationship matrix and generates test cases covering all possible paths by searching in the found subspaces. DE-SS is a search-based algorithm with a scatter search strategy, which generates path coverage test cases by searching in the selected local space according to the distribution of paths.

  2. To answer the second research question (RQ2), we implemented a derivative of KT-GDE3 named KT-GDE3-unarchived. The knowledge transfer scheme is not employed in KT-GDE3-unarchived.

Case study subjects

A key factor in studying test case generation approaches is the selection of benchmark programs. The loops and reach-safety benchmarks are selected from the competitions on software verification 2016 and 2021 (SV-COMP’16 and SV-COMP’21) [6, 34], respectively. The loops in those benchmarks depict real-world complex loop programs that commonly surface in software engineering practice. The loop programs in the loops and reach-safety benchmarks are all written in C. The benchmarks contain programs having primitive input parameters and non-primitive input parameters. The programs in these benchmarks are small but contain a non-trivially large number of paths. We discarded loop programs without interleaving relationships among paths. In total, 30 loop programs with complex interleaving are selected from the following categories of verification tasks (see Footnote 3): loop-acceleration, loop-invgen, loop-lit, loops, loops-zilu and nla-digbench-scaling. The programs have bugs inside or outside the loop body. A bug is considered to be witnessed when the negated condition of the assert statements is triggered. For each program, an if branch replaces the assertion statement. Figure 7 shows a code snippet of program egcd-ll_unwindbound2.c under the nla-digbench-scaling category. A bug is considered to be triggered when the if branch on Line 22 is executed.

Fig. 7 The code of egcd-ll_unwindbound2.c

Additional information for each program is also included in Table 2. Specifically, Table 2 reports the characteristics of the selected loop benchmarks.

  • Function: The function under test with loops.

  • Type: The type of loops in the function under test. Some of the functions under test have both nested and unnested loops.

  • Bug classification: The category of loop bugs (see “Loop bug categories” section), indicating the location of bugs in the function under test.

  • #Bugs: The number of bugs in the function under test with loops.

  • #k-bound: The number of k-bounded loop iterations within which the bugs can be triggered.

Parameter setting

The performance of the evaluated algorithms is influenced by various parameters. In the experiments, we used the following parameter settings in our implementation of KT-GDE3 and its variant without the knowledge transfer scheme (KT-GDE3-unarchived). We adopted the mentioned population size, crossover and factor parameter values since they have been shown to give reasonably acceptable results in existing related literature [26, 27].

  • Population size: The population size is set to 50.

  • Crossover: The crossover probability is set to 0.2.

  • Factor Parameter: The factor parameter is set to 0.5.

  • Maximum search time: The maximum search time is set to 5 min. Each target group is assigned a portion of the maximum search time which is calculated as:\(~~~~ {local\_budget} = \dfrac{\text {Maximum search time}}{\#\text {of Groups}}\), where \({local\_budget}\) is the portion of the maximum search time allotted to a group, and #of Groups is the number of groups of loop paths in a loop program.
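As a small worked example of the budget split in the last item, the following sketch divides the maximum search time evenly among the groups. The number of groups used here is hypothetical; in practice it depends on the loop program.

```python
def local_budget(max_search_time_s: float, n_groups: int) -> float:
    """Portion of the maximum search time allotted to each group."""
    if n_groups <= 0:
        raise ValueError("a loop program must have at least one group")
    return max_search_time_s / n_groups


max_search_time_s = 5 * 60   # 5 min, as in the parameter setting
n_groups = 12                # hypothetical number of groups
budget = local_budget(max_search_time_s, n_groups)
print(budget)  # 25.0 seconds per group
```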

We used the default crossover, mutation and tournament selection settings in EVOSUITE for DynaMOSA and WTS. In addition, we used the default population size, crossover and mutation parameter settings employed in MISA [35] and DE-SS [27]. The maximum search time for all experiments is set to 5 min [31].

Experimental procedure

We run KT-GDE3, DynaMOSA, WTS, MISA, DE-SS and KT-GDE3-unarchived on each loop program and collect the corresponding bug coverage rate. We set a maximum search time of 5 min [31]. Hence, the search process stops when the maximum time allotted for the search is consumed. Notice that in KT-GDE3, the search process on a group has its corresponding \({local\_budget}\). Due to the intrinsic randomness of search algorithms, different results can be produced in different runs. Therefore, each experiment is repeated 30 times. Thus, a total of 6 (algorithms)\(\times \)30 (programs)\(\times \)30 (repetitions)=5400 experiments are performed.

To answer RQ1 and RQ2, we calculate the percentage of covered bugs as:

$$\begin{aligned} \text {Bug Coverage (\%)} = \dfrac{\#\text {Number of loop bugs found}}{\#\text {total number of loop bugs}}\times 100\%. \end{aligned}$$

We also conduct statistical analysis of the experimental results. Statistical significance is measured by using the non-parametric Wilcoxon test [65] with a p-value threshold of 0.05. This is done to check whether the difference in coverage achieved by any two approaches under comparison is statistically significant. In addition, we use the Vargha-Delaney (Â\(_{12}\)) statistic [66] to measure the effect size. The Vargha-Delaney (Â\(_{12}\)) statistic also categorizes the obtained effect size value into four different magnitude levels (negligible, small, medium, and large).

Experimental results

This section presents the experimental results to answer the research questions in “Experiment setting” section.

Table 3 Bug coverage, standard deviation, effect size, and statistical significance achieved by KT-GDE3, DynaMOSA, and WTS
Table 4 Bug coverage, standard deviation, effect size, and statistical significance achieved by KT-GDE3, MISA, and DE-SS

RQ1: How does KT-GDE3 perform compared to the state-of-the-art algorithms in bug finding capability among multi-path loops?

Table 3 summarizes the average bug coverage rate, standard deviation, and the corresponding statistical analysis results achieved by KT-GDE3, DynaMOSA, and WTS for each loop program. In addition, Table 4 summarizes the average bug coverage rate, standard deviation, and the corresponding statistical analysis results achieved by KT-GDE3, MISA, and DE-SS for each loop program.

To better understand Tables 3 and 4, the indicators used in the experiments are presented as follows.

  • Bug Coverage (%) (Standard Deviation \(\sigma \)): The average bug coverage and standard deviation value achieved for KT-GDE3, DynaMOSA, WTS, MISA and DE-SS in each loop program over 30 independent runs.

  • Â\(_{12}\) Statistics (Magnitude) (p-value): The effect size, the magnitude of the difference, and p-value for KT-GDE3 in comparison with DynaMOSA, WTS, MISA and DE-SS.

  • Mean over programs: The mean of the average bug coverage over all the programs.

  • vs: It stands for versus. It indicates the comparison of KT-GDE3 against the alternative approaches (i.e., DynaMOSA, WTS, MISA and DE-SS) based on the Vargha-Delaney (Â\(_{12}\)) statistic.

  • \(+\)/\(=\)/\(-\): These signs indicate the number of loop programs where KT-GDE3 performs better than, equivalently to, and worse than DynaMOSA, WTS, MISA and DE-SS, according to the Wilcoxon test with the p-value threshold of 0.05.
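The \(+\)/\(=\)/\(-\) indicator can be derived from per-program statistics. The following is a sketch under two assumptions that are not stated in the text: a difference counts as significant when the p-value is strictly below the threshold, and the direction of a significant difference is read from Â\(_{12}\) (above 0.5 favors KT-GDE3). The input pairs are hypothetical.

```python
from typing import Iterable, Tuple


def tally(results: Iterable[Tuple[float, float]],
          alpha: float = 0.05) -> Tuple[int, int, int]:
    """Count programs where the first approach is better / equivalent / worse.

    Each item is (p_value, a12_value) for one program.
    """
    better = equal = worse = 0
    for p_value, a12_value in results:
        if p_value >= alpha or a12_value == 0.5:
            equal += 1    # no statistically significant difference
        elif a12_value > 0.5:
            better += 1   # significant, in favor of the first approach
        else:
            worse += 1    # significant, in favor of the second approach
    return better, equal, worse


# Hypothetical (p-value, A12) pairs for three programs.
print(tally([(0.001, 0.95), (0.40, 0.52), (0.03, 0.20)]))   # (1, 1, 1)
```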

In Table 3, the loop programs where KT-GDE3 achieved higher average bug coverage than DynaMOSA and WTS are highlighted in gray. KT-GDE3 outperforms DynaMOSA and WTS in the majority of the compared programs. Overall, KT-GDE3 achieved the highest average bug coverage rate (81.30%), while DynaMOSA and WTS achieved lower values (68.30% and 66.97%, respectively).

Besides the mean bug coverage and standard deviation, we report the effect size values from the Vargha-Delaney (Â\(_{12}\)) statistic and the p-values obtained from the Wilcoxon test. To better understand the performance of KT-GDE3 versus DynaMOSA and WTS, we focus on the indicators “Â\(_{12}\) Statistics (Magnitude) (p-value)” and “\(+\)/\(=\)/\(-\)”. As shown in Table 3, KT-GDE3 achieves significantly higher bug coverage than DynaMOSA in six programs. Among these six programs, the magnitude of the difference is large in five and small in one. In the remaining 24 programs, no statistically significant difference is observed, and the magnitude of the difference is negligible in all of them. Regarding the comparison between KT-GDE3 and WTS, KT-GDE3 achieves significantly higher bug coverage than WTS in seven programs. Among these seven programs, the magnitude of the difference is large in six and small in one. No statistically significant difference is observed in the remaining 23 programs.

DynaMOSA and WTS timed out without triggering any bug in all 30 independent runs for three programs: diamond_2-2.c, vogal-1.c, and divbin_unwindbound5.c. Figure 8 depicts a code snippet of the loop program diamond_2-2.c. The program has one assert statement on Line 25. A bug is witnessed when the negated condition of the assert statement is triggered (i.e., (x % 2) != (y % 2)). DynaMOSA and WTS do not trigger the negated condition in any of the 30 independent runs, whereas KT-GDE3 finds the bug in all the runs. For example, the input y = 89221 triggers the negated condition of the assert statement.

To provide more insight into the distribution of the bug coverage scores, Fig. 9 shows that KT-GDE3 leads to larger coverage scores than DynaMOSA and WTS over 30 independent runs for the 30 studied subjects. As shown in Fig. 9, the median, the lower quartile and the lower bound of the bug coverage rate of KT-GDE3 are all much larger than the corresponding values of DynaMOSA and WTS. As a result, we conclude that KT-GDE3 outperforms DynaMOSA and WTS in bug finding capability on the compared loop programs.

Fig. 8 The code of diamond_2-2.c

Fig. 9 Comparison of average coverage achieved by KT-GDE3, DynaMOSA, and WTS over 30 independent runs of the 30 studied loop programs

In Table 4, the loop programs where KT-GDE3 achieved higher average bug coverage than MISA and DE-SS are highlighted in gray. KT-GDE3 outperforms MISA and DE-SS in the majority of the programs. Overall, KT-GDE3 achieved the highest average bug coverage rate (81.30%), while MISA and DE-SS achieved lower values (51.76% and 48.17%, respectively). MISA and DE-SS timed out without triggering any bug in all 30 independent runs for the programs diamond_2-2.c, vogal-1.c, string-2.c, dijkastra-u_unwindbound10.c and mannadiv_unwindbound10.c.

Table 4 also shows the effect size values from the Vargha-Delaney (Â\(_{12}\)) statistic and the p-values obtained from the Wilcoxon test. To better understand the performance of KT-GDE3 versus MISA and DE-SS, we use the indicators “Â\(_{12}\) Statistics (Magnitude) (p-value)” and “\(+\)/\(=\)/\(-\)”. KT-GDE3 achieves significantly higher bug coverage than MISA in 16 programs. Among these 16 programs, the magnitude of the difference is large in 14, small in one and negligible in one. In the remaining 14 programs, no statistically significant difference is observed, and the magnitude of the difference is negligible in all of them. Regarding the comparison between KT-GDE3 and DE-SS, KT-GDE3 achieves significantly higher bug coverage than DE-SS in 17 programs. In those programs, the magnitude of the difference is large in 16 and small in one. No statistically significant difference is observed in the remaining 13 programs, and the magnitude of the difference is negligible in all of them.

To provide more insight into the distribution of the bug coverage scores, Fig. 10 shows that KT-GDE3 leads to larger coverage scores than MISA and DE-SS. As shown in Fig. 10, the median, the lower quartile and the lower bound of the bug coverage rate of KT-GDE3 are all much larger than the corresponding values of MISA and DE-SS. As a result, we conclude that KT-GDE3 outperforms MISA and DE-SS in bug finding capability on the compared loop programs.

Fig. 10 Comparison of average coverage achieved by KT-GDE3, MISA and DE-SS over 30 independent runs of the 30 studied loop programs

Table 5 Bug Coverage, Standard Deviation, Effect size, and Statistical Significance achieved by KT-GDE3 and KT-GDE3-unarchived

RQ2: How does the knowledge transfer scheme affect the bug finding capability of KT-GDE3 in multi-path loops?

Table 5 summarizes the average bug coverage rate, the corresponding standard deviation, and the statistical analysis results for KT-GDE3 and KT-GDE3-unarchived on the 30 benchmark loop programs. To better understand the results, the indicators used in Table 5 are presented as follows.

  • Bug Coverage (%) (Standard Deviation \(\sigma \)): The average bug coverage and standard deviation value achieved for KT-GDE3 and KT-GDE3-unarchived in each loop program over 30 independent runs.

  • Â\(_{12}\) Statistics (Magnitude) (p-value): The effect size, the magnitude of the difference, and p-value for KT-GDE3 in comparison with KT-GDE3-unarchived.

  • Mean over programs: The mean of the average bug coverage over all the programs.

  • vs: It stands for versus. It indicates the comparison between KT-GDE3 and KT-GDE3-unarchived based on Vargha-Delaney (Â\(_{12}\)) statistical test.

  • \(+\)/\(=\)/\(-\): These signs indicate the number of programs where KT-GDE3 performs better than, equivalently to, and worse than KT-GDE3-unarchived, according to the Wilcoxon test with the p-value threshold of 0.05.

We highlight in gray the experimental results where KT-GDE3 outperforms KT-GDE3-unarchived. As shown in Table 5, KT-GDE3 achieves significantly higher bug coverage values than KT-GDE3-unarchived in five benchmarks. We observed that in the majority of the programs, the standard deviation of the bug coverage rate achieved by KT-GDE3 is zero, which indicates that the performance of KT-GDE3 is stable across runs. Furthermore, KT-GDE3 has a higher overall mean bug coverage value (81.30%) than KT-GDE3-unarchived (58.63%). This indicates that the knowledge transfer scheme is essential for improving the optimization process of KT-GDE3 during bug finding.

To better understand the performance of KT-GDE3 versus KT-GDE3-unarchived, we focus on the indicators “Â\(_{12}\) Statistics (Magnitude) (p-value)” and “\(+\)/\(=\)/\(-\)”. As shown in Table 5, KT-GDE3 has a significantly higher average bug coverage value than KT-GDE3-unarchived in eight programs. The magnitude of the difference is large in all eight. In the remaining 22 programs, no statistically significant difference is observed; the magnitude of the difference is negligible in 21 programs and small in one. In addition, KT-GDE3-unarchived timed out without triggering any bug in all 30 independent runs for six programs: string-1.c, veris.c_OpenSER_cases1_stripFullBoth_arr.c, vogal-1.c, vogal-2.c, invert_string-3.c and divbin_unwindbound5.c.

Figure 11 depicts an overview of the coverage scores achieved by KT-GDE3 and KT-GDE3-unarchived. It shows that KT-GDE3 achieves larger coverage scores than its variant KT-GDE3-unarchived.

Fig. 11 Comparison of average coverage achieved by KT-GDE3 and KT-GDE3-unarchived over 30 independent runs of the 30 studied loop programs

Conclusions and future work

We have presented a test case generation approach to explore bug-inducing paths in multi-path loop programs with a limited search time budget. To fulfil this task, we built a framework that incorporates a knowledge transfer scheme into a many-objective test case generation algorithm to cover the loop paths. In particular, KT-GDE3 organizes similar loop paths into groups and neighbor sets. The test cases covering loop paths in similar groups are retrieved for covering the loop paths from the same neighbor set. The proposed method was applied to 30 loop and reach-safety benchmarks from the competitions on software verification 2016 and 2021 (SV-COMP’16 and SV-COMP’21). The experimental results show that KT-GDE3 has a higher bug finding capability in multi-path loops with complex interleaving relationships than four existing state-of-the-art algorithms (DynaMOSA, WTS, MISA and DE-SS) and its variant KT-GDE3-unarchived.

The experimental results support the following conclusions. First, KT-GDE3 outperforms the compared state-of-the-art test case generation algorithms and its variant in bug finding for multi-path loops. Second, the proposed knowledge transfer scheme can improve the optimization process of many-objective algorithms during bug finding in multi-path loops. In the knowledge transfer scheme, the archives serve as a repository of test cases that covered paths earlier in the search, which are reintroduced into the population during the optimization process. This use of the archives enhances the exploitation phase of multi-objective algorithms such as GDE3, leveraging the knowledge gained from past solutions to guide the search for uncovered paths. In addition, sampling test cases that previously covered paths in the same neighbor set provides initial starting points that reduce redundant exploration and accelerate convergence toward solutions that traverse the remaining paths.
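The archive-based seeding idea can be illustrated with a short sketch. This is not the KT-GDE3 implementation: the path identifiers, the integer encoding of test cases, the input range and the function name `seed_population` are all hypothetical, and the sketch only shows how archived test cases for paths in the same neighbor set might be reused as starting points before the population is filled with random individuals.

```python
import random
from typing import Dict, List, Sequence

TestCase = List[int]  # hypothetical encoding: one integer per program input


def seed_population(archive: Dict[str, TestCase],
                    neighbor_set: Sequence[str],
                    population_size: int,
                    n_inputs: int,
                    rng: random.Random,
                    low: int = 0,
                    high: int = 100_000) -> List[TestCase]:
    """Build an initial population for a not-yet-covered path.

    Archived test cases that covered paths in the same neighbor set are
    reused first; the remaining slots are filled with random test cases.
    """
    donors = [archive[path] for path in neighbor_set if path in archive]
    # Reuse at most population_size archived test cases (copied, not aliased).
    chosen = rng.sample(donors, min(len(donors), population_size))
    population = [list(test_case) for test_case in chosen]
    while len(population) < population_size:
        population.append([rng.randint(low, high) for _ in range(n_inputs)])
    return population


# Hypothetical archive: path identifier -> test case that covered it.
archive = {"p1": [3, 4], "p2": [7, 8], "p9": [0, 1]}
population = seed_population(archive, ["p1", "p2", "p3"], 5, 2, random.Random(0))
print(len(population))   # 5
```

Here the two archived test cases for `p1` and `p2` enter the population, while the test case for `p9`, which lies outside the neighbor set, is not reused.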

Considering the results reported in this paper, there are potential directions for future work. The presence of infeasible paths in multi-path loops presents a major obstacle to bug finding. We plan to extend our approach to handle infeasible paths in order to minimize redundant consumption of the search budget.