
1 Introduction

Localizing system faults has always been one of the most time-consuming and expensive debugging tasks. Given a buggy program, fault localization (FL) involves identifying locations in the program that could cause a faulty behaviour (bug).

Given a faulty program and a test suite with failing test cases, current formula-based fault localization (FBFL) methods encode the localization problem into several optimization problems to identify a minimal set of faulty statements (diagnoses) within a program. Typically, these methods find a minimal diagnosis considering each failing test case individually rather than simultaneously with all failing test cases. Moreover, these FBFL methods enumerate all Minimal Correction Subsets (MCSes) [22] to cover all diagnoses.

For instance, BugAssist [17, 18], a prominent FBFL tool, implements a ranking mechanism for bug locations. For each failing test, BugAssist enumerates all diagnoses of a Maximum Satisfiability (MaxSAT) formula corresponding to bug locations. Subsequently, BugAssist ranks diagnoses based on their frequency of appearance across the failing tests. Other FBFL tools, like SNIPER [21], also enumerate all diagnoses for each failing test. However, SNIPER’s set of diagnoses is obtained by taking the Cartesian product of the diagnoses gathered for each failing test. As a result, while FBFL methods can determine minimal diagnoses per failing test, BugAssist cannot guarantee a minimal diagnosis considering all failing tests, and SNIPER may enumerate a significant number of redundant diagnoses that are not minimal [16]. These limitations may pose challenges for programs with multiple faulty statements, as shown in Example 1.

Table 1. Test-suite.
Table 2. Number of diagnoses (faulty statements) generated by BugAssist [17] and SNIPER [21] per test.

Example 1 (Motivation)

Consider the program presented in Listing 1.1, which aims to determine the maximum among three given numbers. However, based on the test suite shown in Table 1, the program is faulty, as its output differs from the expected output. The minimal set of faulty lines in this program is {5, 8, 11}, as all three if-conditions are incorrect according to the test suite. Fixing only a proper subset of these lines would be insufficient to repair the program. One possible fix is to replace all these conditions with the suggested fixes in lines {6, 9, 12}.

In a typical FBFL approach, the minimal set of statements identified as faulty might include, for example, lines 4 and 5. Removing the scanf statement and an if-statement would allow an FBFL tool to assign any value to the input variables in order to always produce the expected output. However, an approach that prioritizes identifying faulty statements within the program’s logic before evaluating issues in the input/output statements (such as scanf and printf) might identify lines {5, 8, 11} as the faulty statements. When applying BugAssist’s and SNIPER’s approaches to the program in Listing 1.1 with the described optimization criterion, using the inputs/outputs detailed in Table 1 as the specification, distinct sets of faults are identified for each failing test. Table 2 presents the diagnosis (set of faulty lines) produced by each tool, along with the number of diagnoses enumerated for each failing test case and the total number of unique diagnoses after aggregating the diagnoses from all tests, using each tool’s respective method.

In the case of BugAssist, diagnoses are prioritized based on their occurrence frequency. Consequently, BugAssist yields 32 unique diagnoses and selects {4, 13} since this diagnosis is identified in every failing test. In contrast, SNIPER computes the Cartesian product of all diagnoses, resulting in 1297 unique diagnoses. Note that BugAssist’s diagnoses may not adequately identify all faulty program statements. Conversely, SNIPER’s diagnosis {5, 8, 11} is minimal, but SNIPER enumerates an additional 1296 diagnoses alongside it. Hence, existing FBFL methods do not ensure a minimal diagnosis across all failing tests (e.g., BugAssist) or may produce an overwhelming number of redundant sets of diagnoses (e.g., SNIPER), especially for programs with multiple faults.
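The two aggregation strategies discussed in Example 1 can be contrasted with a minimal sketch. The per-test diagnoses below are made up for illustration and are not the values of Table 2; the function names `rank_by_frequency`, `cartesian_aggregate`, and `subset_minimal` are hypothetical and only approximate what BugAssist and SNIPER do.

```python
from collections import Counter
from itertools import product

# Hypothetical per-test diagnoses (sets of line numbers); NOT the data of Table 2.
per_test = [
    [frozenset({4}), frozenset({5})],    # failing test 0
    [frozenset({4}), frozenset({8})],    # failing test 1
    [frozenset({4}), frozenset({11})],   # failing test 2
]

def rank_by_frequency(per_test):
    """BugAssist-style voting: count in how many tests each diagnosis appears."""
    votes = Counter(d for diagnoses in per_test for d in set(diagnoses))
    return votes.most_common()

def cartesian_aggregate(per_test):
    """SNIPER-style aggregation: union one diagnosis per test, for every combination."""
    return {frozenset().union(*combo) for combo in product(*per_test)}

def subset_minimal(diagnoses):
    """Keep only the diagnoses that have no proper subset in the collection."""
    return {d for d in diagnoses if not any(other < d for other in diagnoses)}

ranking = rank_by_frequency(per_test)
aggregated = cartesian_aggregate(per_test)
minimal = subset_minimal(aggregated)
```

On this toy input the voting scheme ranks {4} first, while the Cartesian product yields eight aggregated sets of which only two are subset-minimal. This mirrors, at a much smaller scale, the redundancy discussed above.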

This paper tackles this challenge by formulating the FL problem as a single optimization problem in Sect. 3. We leverage MaxSAT and the theory of Model-Based Diagnosis (MBD), integrating all failing test cases simultaneously. This approach allows us to generate only minimal diagnoses to identify all faulty program components within a C program. Furthermore, we have implemented the MBD problem with multiple test cases in CFaults, a fault localization tool for ANSI-C programs, presented in Sect. 4. CFaults begins by unrolling and instrumentalizing C programs at the code level, ensuring independence from the bounded model checker. Next, CFaults utilizes CBMC [5], a well-known bounded model checker for C, to generate a trace formula of the program. Finally, CFaults encodes the problem into MaxSAT to identify the minimal set of diagnoses corresponding to the buggy statements.

Experimental results presented in Sect. 5 on two benchmarks of C programs, TCAS [10] (industrial), and C-Pack-IPAs [29] (programming exercises), show that CFaults effectively detects minimal sets of diagnoses. In contrast, SNIPER and BugAssist either generate an overwhelming number of redundant diagnoses or fail to produce a minimal set required to fix each program.

To summarize, the contributions of this work are: (1) we tackle the fault localization problem in C programs using a Model-Based Diagnosis (MBD) approach considering multiple failing test cases, and formulating it as a unified optimization problem; (2) we implement this MBD approach in a publicly available tool called CFaults [30] that unrolls and instrumentalizes C programs at the code level, making it independent of the bounded model checker used; (3) CFaults allows refinement of localized faults to pinpoint the bug’s location more precisely; (4) we evaluate CFaults on two sets of C programs (TCAS and C-Pack-IPAs), showing that CFaults is fast and only produces subset-minimal diagnoses, unlike other state-of-the-art formula-based fault localization tools.

2 Preliminaries

This section provides definitions and notations that are used throughout the paper. We start by presenting basic definitions of propositional logic and programs and then address standard model-based diagnosis (MBD) definitions.

The Boolean Satisfiability (SAT) problem is the decision problem for propositional logic [3]. A propositional formula in Conjunctive Normal Form (CNF) is a conjunction of clauses where each clause is a disjunction of literals. A literal is a propositional variable \(x_i\) or its negation \(\lnot x_i\). Given a CNF formula \(\phi \), the SAT problem corresponds to deciding if there is an assignment to the variables in \(\phi \) such that \(\phi \) is satisfied, or proving that no such assignment exists. When applicable, set notation will be used for formulas and clauses. A formula can be represented as a set of clauses (meaning its conjunction) and a clause as a set of literals (meaning its disjunction).

The Maximum Satisfiability (MaxSAT) problem is an optimization version of the SAT problem. Given a CNF formula \(\phi \), the goal is to find an assignment that maximizes the number of satisfied clauses in \(\phi \). In partial MaxSAT, \(\phi \) is split into hard clauses (\(\phi _h\)) and soft clauses (\(\phi _s\)). Given a formula \(\phi = (\phi _h, \phi _s)\), the goal is to find an assignment that satisfies all hard clauses in \(\phi _h\) while minimizing the number of unsatisfied soft clauses in \(\phi _s\). Moreover, in the weighted version of the partial MaxSAT problem, each soft clause is assigned a weight, and the goal is to find an assignment that satisfies all hard clauses and minimizes the sum of the weights of the unsatisfied soft clauses. Let \(\phi = (\phi _h, \phi _s)\) be a partial MaxSAT formula. A Minimal Correction Subset (MCS) \(\mu \) of \(\phi \) is a subset \(\mu \subseteq \phi _s\) where \(\phi _h \cup (\phi _s \setminus \mu )\) is satisfiable and, for all \(c \in \mu \), \(\phi _h \cup (\phi _s \setminus \mu ) \cup \lbrace c \rbrace \) is unsatisfiable. The dual concept of an MCS is a Minimal Unsatisfiable Subset (MUS) [16, 22].
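The MCS definition can be made concrete with a brute-force sketch on a toy partial MaxSAT instance. The instance, the DIMACS-style integer encoding of literals, and the helper names `satisfiable` and `enumerate_mcses` are assumptions made for this example; practical tools rely on SAT/MaxSAT solvers rather than exhaustive enumeration.

```python
from itertools import combinations, product

def satisfiable(clauses, n_vars):
    """Brute-force SAT check; a clause is a list of non-zero ints (DIMACS style)."""
    for bits in product([False, True], repeat=n_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

def enumerate_mcses(hard, soft, n_vars):
    """Return all MCSes of (hard, soft) as sets of soft-clause indices."""
    mcses = []
    for k in range(len(soft) + 1):               # smallest candidates first
        for mu in map(set, combinations(range(len(soft)), k)):
            if any(m <= mu for m in mcses):
                continue                         # contains an MCS, so not minimal
            kept = [c for i, c in enumerate(soft) if i not in mu]
            if satisfiable(hard + kept, n_vars):
                mcses.append(mu)
    return mcses

# Toy instance (assumed): hard = {(not x1 or not x2)}, soft = {x1, x2, (x1 or x2)}.
hard = [[-1, -2]]
soft = [[1], [2], [1, 2]]
mcses = enumerate_mcses(hard, soft, n_vars=2)
```

Here dropping either unit clause restores satisfiability, so the MCSes are the singletons containing soft clause 0 or soft clause 1. Dropping only the third soft clause does not help, because x1 and x2 still conflict with the hard clause.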

Programs. A program is considered sequential, comprising standard statements such as assignments, conditionals, loops, and function calls, each adhering to their conventional semantics in C. A program is deemed to contain a bug when an assertion violation occurs during its execution with input I. Conversely, if no assertion violation occurs, the program is considered correct for input I. In cases where a bug is detected for input I, it is possible to define an error trace, representing the sequence of statements executed by program P on input I.

A Trace Formula (TF) is a propositional formula that is SAT iff there exists an execution of the program that terminates with a violation of an assert statement while satisfying all assume statements. For further information on TFs, interested readers are referred to [5, 8].

Model-Based Diagnosis (MBD). The following definitions are commonly used in the MBD theory [16, 24, 34]. A system description \(\mathcal {P}\) is composed of a set of components \(\mathcal {C} = \{c_1, \ldots , c_n\}\). Each component in \(\mathcal {C}\) can be declared healthy or unhealthy. For each component \(c \in \mathcal {C}\), \(h(c) = 0\) if c is unhealthy, otherwise, \(h(c) = 1\). As in prior works [16, 25], \(\mathcal {P}\) is described by a CNF formula, where \(\mathcal {F}_c\) denotes the encoding of component c:

$$\begin{aligned} \mathcal {P} \triangleq \bigwedge \nolimits _{c \in \mathcal {C}} { ( \lnot h(c) \vee \mathcal {F}_c )} \end{aligned}$$
(1)

Observations represent deviations from the expected system behaviour. An observation, denoted as o, is a finite set of first-order sentences [16, 34], which is assumed to be encodable in CNF as a set of unit clauses. In this work, the failing test cases represent the set of observations.

A system \(\mathcal {P}\) is considered faulty if there exists an inconsistency with a given observation o when all components are declared healthy. The problem of model-based diagnosis (MBD) aims to identify a set of components which, if declared unhealthy, restore consistency. This problem is represented by the 3-tuple \(\langle \mathcal {P}, \mathcal {C}, o\rangle \), and can be encoded as a CNF formula:

$$\begin{aligned} \mathcal {P} \wedge o \wedge \bigwedge \nolimits _{c \in \mathcal {C}} { h(c) } \vDash \bot \end{aligned}$$
(2)

For a given MBD problem \(\langle \mathcal {P}, \mathcal {C}, o\rangle \), a set of system components \(\varDelta \subseteq \mathcal {C}\) is a diagnosis iff:

$$\begin{aligned} \mathcal {P} \wedge o \wedge \bigwedge \nolimits _{c \in \mathcal {C} \setminus \varDelta } { h(c) } \wedge \bigwedge \nolimits _{c \in \varDelta } { \lnot h(c) } \nvDash \bot \end{aligned}$$
(3)

A diagnosis \(\varDelta \) is minimal iff no subset of \(\varDelta \), \(\varDelta ' \subsetneq \varDelta \), is a diagnosis, and \(\varDelta \) is of minimal cardinality if there is no other diagnosis \(\varDelta '' \subseteq \mathcal {C}\) with \(|\varDelta ''| < |\varDelta |\).

A diagnosis is redundant if it is not subset-minimal [16].
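As an illustration of the definitions above, the following sketch enumerates diagnoses by brute force for a toy three-component system. The system (a = x, b = x, out = a OR b), the component names c1 to c3, and the helper functions are assumptions made for this example only.

```python
from itertools import combinations, product

# Toy system (assumed for illustration): a = x, b = x, out = a OR b.
COMPONENTS = {
    "c1": lambda v: v["a"] == v["x"],
    "c2": lambda v: v["b"] == v["x"],
    "c3": lambda v: v["out"] == (v["a"] or v["b"]),
}
VARIABLES = ["x", "a", "b", "out"]

def consistent(unhealthy, observation):
    """Eq. (3): is P and o satisfiable when exactly `unhealthy` has h(c) = 0?"""
    for values in product([False, True], repeat=len(VARIABLES)):
        v = dict(zip(VARIABLES, values))
        if any(v[name] != val for name, val in observation.items()):
            continue
        # (not h(c) or F_c): only healthy components must behave as specified.
        if all(f(v) for c, f in COMPONENTS.items() if c not in unhealthy):
            return True
    return False

def all_diagnoses(observation):
    """Every subset of components satisfying Eq. (3), smallest first."""
    names = sorted(COMPONENTS)
    return [set(d) for k in range(len(names) + 1)
            for d in combinations(names, k) if consistent(set(d), observation)]

obs = {"x": True, "out": False}   # observed: out = 0 although x = 1
diagnoses = all_diagnoses(obs)
minimal = [d for d in diagnoses if not any(o < d for o in diagnoses)]
min_card = [d for d in minimal if len(d) == min(map(len, minimal))]
```

In this toy system {c3} and {c1, c2} are both subset-minimal diagnoses, but only {c3} has minimal cardinality; every other diagnosis, such as {c1, c3}, is redundant.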

To encode the Model-Based Diagnosis problem with one observation with partial MaxSAT, the set of clauses that encode \(\mathcal {P}\) (1) represents the set of hard clauses. The soft clauses consist of unit clauses that aim to maximize the set of healthy components, i.e., \(\bigwedge _{c \in \mathcal {C}} { h(c) }\) [24, 36]. This MaxSAT encoding of MBD enables enumerating minimum cardinality diagnoses and subset-minimal diagnoses, considering a single observation. Furthermore, a minimal diagnosis is a minimal correction subset (MCS) of the MaxSAT formula. Given an inconsistent formula that encodes the MBD problem (2), a minimal diagnosis \(\varDelta \) satisfies (3), thereby making \(\varDelta \) an MCS of the MaxSAT formula. BugAssist [18], SNIPER [21], and other model-based diagnosis (MBD) tools for fault localization in circuits [16, 24, 36] encode the localization problem with partial MaxSAT.

More recently, the MaxSAT encoding for MBD [16] has been generalized to multiple inconsistent observations. Let \(\mathcal {O} = \{o_1, \dots , o_m\}\) be a set of observations. Each observation is associated with a replica \(\mathcal {P}_i\) of the system \(\mathcal {P}\). The system remains unchanged given different observations, where the components are replicated for each observation, but the healthy variables are shared. For a given observation \(o_i\), a diagnosis is given by the following:

$$\begin{aligned} \mathcal {P}_i \wedge o_i \wedge \bigwedge \nolimits _{c \in \mathcal {C} \setminus \varDelta } { h(c) } \wedge \bigwedge \nolimits _{c \in \varDelta } { \lnot h(c) } \nvDash \bot \end{aligned}$$
(4)

The goal is to find a minimal diagnosis \(\varDelta \subseteq \mathcal {C}\), i.e., a minimal set of components that, when deactivated, makes the system consistent with all observations \(\mathcal {O} = \{o_1, \dots , o_m\}\). Moreover, when considering multiple observations, an aggregated diagnosis is a subset of components that includes one possible diagnosis for each given observation.

3 Model-Based Diagnosis with Multiple Test Cases

This paper encodes the fault localization problem as a Model-Based Diagnosis with multiple observations using a single optimization problem. We simultaneously integrate all failing test cases (observations) in a single MaxSAT formula. This approach allows us to generate only minimal diagnoses capable of identifying all faulty components within the system, in our case, a C program.

Given m observations, \(\mathcal {O} = \{o_1, \dots , o_m\}\), a distinct replica of the system, denoted as \(\mathcal {P}_i\), is required for each observation \(o_i\). The hard clauses, \(\phi _h\), in our MaxSAT formulation correspond to each observation’s encoding (\(o_i\)) and m system replicas, one for each observation, \(\mathcal {P}_i\). Hence, \(\phi _h = \bigwedge _{o_i \in \mathcal {O}} {( \mathcal {P}_i \wedge o_i )}\). Additionally, we aim to maximize the set of healthy components. Therefore, the soft clauses are formulated as: \(\phi _s = \bigwedge _{c \in \mathcal {C} } { h(c) }\). Thus, given the MaxSAT solution of \((\phi _h, \phi _s)\), its complement, i.e., the set of unhealthy components (\(h(c)=0\)), corresponds to a subset-minimal aggregated diagnosis. This diagnosis is a subset-minimal set of components that, when declared unhealthy (deactivated), make the system consistent with all observations, as follows:

$$\begin{aligned} \bigwedge \nolimits _{o_i \in \mathcal {O}} {( \mathcal {P}_i \wedge o_i )} \wedge \bigwedge \nolimits _{c \in \mathcal {C} \setminus \varDelta } { h(c) } \wedge \bigwedge \nolimits _{c \in \varDelta } { \lnot h(c) } \nvDash \bot \end{aligned}$$
(5)

We assume that the system remains unchanged given different observations, where the components are replicated for each observation, but the healthy variables are shared. This is necessary because we analyze all observations jointly, which can affect the component’s behaviour. In our work, the observations consist of a test suite containing failing test cases.
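The joint encoding of Eq. (5) can be sketched by brute force on a toy system with two observations. The system (a = x, b = y, out = a OR b), both observations, and the helper names are assumptions made for this example. Because the replicas share only the health variables, once the health values are fixed Eq. (5) holds exactly when every replica satisfies Eq. (4), which is what the sketch checks.

```python
from itertools import combinations, product

# Toy system (assumed for illustration): a = x, b = y, out = a OR b.
COMPONENTS = {
    "c1": lambda v: v["a"] == v["x"],
    "c2": lambda v: v["b"] == v["y"],
    "c3": lambda v: v["out"] == (v["a"] or v["b"]),
}
VARIABLES = ["x", "y", "a", "b", "out"]

def replica_consistent(unhealthy, observation):
    """Eq. (4): one replica P_i with observation o_i under shared health values."""
    for values in product([False, True], repeat=len(VARIABLES)):
        v = dict(zip(VARIABLES, values))
        if any(v[name] != val for name, val in observation.items()):
            continue
        if all(f(v) for c, f in COMPONENTS.items() if c not in unhealthy):
            return True
    return False

def joint_minimal_diagnoses(observations):
    """Subset-minimal sets satisfying Eq. (5), found by increasing cardinality."""
    names = sorted(COMPONENTS)
    found = []
    for k in range(len(names) + 1):
        for d in map(set, combinations(names, k)):
            if any(m <= d for m in found):
                continue                      # contains a smaller diagnosis
            if all(replica_consistent(d, o) for o in observations):
                found.append(d)
    return found

o1 = {"x": True, "y": False, "out": False}
o2 = {"x": False, "y": True, "out": False}
joint = joint_minimal_diagnoses([o1, o2])
```

In this example {c1} explains o1 but not o2, so it is not a joint diagnosis. The first set returned, {c3}, has minimum cardinality and corresponds to what a MaxSAT solver would report as an optimum with unit weights; {c1, c2} is the other subset-minimal joint diagnosis.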

The HSD [16] algorithm was proposed to localize single faults in circuits given multiple observations. The HSD algorithm is based on hitting set dualization (HSD). For each observation \(o_i\), this algorithm computes minimal unsatisfiable subsets (MUSes) of the MaxSAT formula encoded by (4). Next, the HSD algorithm computes a minimum hitting set \(\mathcal {H}\) on the MUSes, and checks if \(\mathcal {H}\) makes the system consistent with each observation individually. Hence, to compute all subset-minimal aggregated diagnoses of a faulty system \(\mathcal {P}\), the algorithm performs at least m oracle calls for each minimum hitting set computed, where m is the number of observations. Each oracle call uses a different system replica (4).
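The hitting-set step mentioned above can be illustrated in isolation. The sketch below is not the HSD algorithm of [16]; it only shows, by brute force, what a minimum hitting set over a collection of MUSes is. The MUSes listed are hypothetical and `minimum_hitting_set` is an illustrative helper.

```python
from itertools import combinations

def minimum_hitting_set(sets):
    """Smallest set of elements that intersects every set in `sets` (brute force)."""
    universe = sorted(set().union(*sets)) if sets else []
    for k in range(len(universe) + 1):
        for cand in combinations(universe, k):
            if all(set(cand) & s for s in sets):
                return set(cand)
    return None   # some set is empty, so nothing can hit it

# Hypothetical MUSes, expressed over the components whose health they involve.
muses = [{"c1", "c3"}, {"c2", "c3"}]
hs = minimum_hitting_set(muses)
```

Each candidate hitting set produced this way would still have to be checked against every observation, which is where the m oracle calls per hitting set come from.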

Our approach encodes the problem into a single MaxSAT formula, while HSD [16] divides the problem into m MaxSAT formulas, one for each observation. Additionally, for each minimum hitting set computed in HSD, m oracle calls are needed to check if a diagnosis is consistent with all observations. However, in our case, we just need to perform a single MaxSAT call that returns a minimal diagnosis, which is, by definition, consistent with all observations since all observations are encoded into the formula. Furthermore, the HSD algorithm was solely evaluated using single faults in circuits given multiple observations, and it was not implemented to work with programs. A potential drawback of our approach is that our MaxSAT formula grows with the number of observations. This could result in a large formula and affect the performance of the MaxSAT solver. However, this scenario was not observed in our experimental results (see Sect. 5).

4 CFaults: MBD with Multiple Observations for C

CFaults is a new model-based diagnosis (MBD) tool for fault localization in C programs with multiple test cases. Unlike previous works, CFaults uses the approach proposed in Sect. 3, and C programs are relaxed at the code level, enabling users to leverage other bounded model checkers effectively. Figure 1 provides an overview of CFaults consisting of six main steps: program unrolling, program instrumentalization, bounded model checking (CBMC), encoding to MaxSAT, an Oracle (MaxSAT solver), and a refinement step. Hence, CFaults formulates the MBD problem with multiple test cases as the 3-tuple \(\langle \mathcal {P}, \mathcal {C}, \mathcal {O} \rangle \), where the observations \(\mathcal {O}\) consist of failing test cases (inputs and assertions), the components \(\mathcal {C}\) represent the set of program statements, and the system description \(\mathcal {P}\) is a trace formula of the unrolled and instrumentalized program. The program is instrumented at the code level with relaxation variables corresponding to our healthy variables.

Fig. 1. Overview of CFaults.

Program Unrolling. CFaults starts the unrolling process by expanding the faulty program using the set of failed tests from the test suite. In this context, an unrolled program signifies the original program expanded m times (m program scopes), where m denotes the number of failed test cases. An unrolled program encodes the execution of all failing tests within the program, along with their corresponding inputs and specifications (assertions).

The unrolling process encompasses three primary steps. Initially, CFaults generates fresh variables and functions for each of the m program scopes, ensuring each scope possesses unique variables and functions. Subsequently, CFaults establishes variables representing the inputs and outputs for each program scope corresponding to the failing tests. Input operations, such as scanf, undergo translation into read accesses to arrays corresponding to the inputs, while output operations, such as printf, are replaced by write operations into arrays representing the program’s output. Every exit point of the program (e.g., a return statement in the main function) is replaced with a goto statement directing the program flow to the next failing test’s scope. Lastly, at the end of the unrolled program, CFaults embeds an assertion capturing all the specifications of the failing tests. Consequently, the unrolled program encapsulates the execution of all failing tests within a single program.

Listing 1.2 exhibits a program segment generated through the unrolling process applied to Listing 1.1. CFaults establishes global variables to represent the inputs and outputs of each failing test (lines 1–3, Listing 1.2). For the sake of simplicity, the depicted listing illustrates solely the initial scope corresponding to test 0 from the test suite outlined in Table 1. Distinct variables are introduced for each failing test. Furthermore, the scanf function call is substituted with input array operations (lines 8–10), while the printf calls are replaced with CFaults’ print functions, akin to sprintf functions, which direct output to a buffer. Lastly, the unrolled program concludes with an assertion representing the disjunction of the negation of all failing test assertions. For instance, suppose there are m failing tests, where \(A_i\) denotes the assertion of test \(t_i\). In this scenario, CFaults injects the following assertion into the program: \(\lnot A_1 \vee \dots \vee \lnot A_m\).

Program Instrumentalization. After integrating all possible executions and assertions from failing tests during the unrolling step, CFaults proceeds to instrumentalize the unrolled C program by introducing relaxation variables for each program component (statement/instruction). Each relaxation variable activates (or deactivates) the program component being relaxed when assigned true (or false, respectively). CFaults ensures that there are no conflicts between the names of the relaxation variables and the names of the program’s original variables. For this step, CFaults must be given the maximum number of iterations to which the program’s loops should be unwound.

The relaxation process introduces relaxation variables that deactivate or activate program components. This process involves four distinct relaxation rules for: (1) conditions of if-statements, (2) expression lists (e.g., an expression list executed at the beginning of a for-loop), (3) loop conditions, and (4) other program statements.

Example 2

Listing 1.3 shows a code snippet that sums all the numbers between 1 and n. Listing 1.4 depicts the same program statements after undergoing relaxation by CFaults. For the sake of simplicity, the names of all relaxation variables and offsets were simplified.


In more detail, the rule for relaxing a general program statement is to wrap the statement in an if-statement whose condition is a relaxation variable. For example, consider lines 5 and 6 of the program in Listing 1.3. These lines are relaxed by CFaults using relaxation variables _rv1 and _rv2, respectively, appearing as lines 11 and 12 of Listing 1.4.

Furthermore, when relaxing if-statements, the statements inside the then and else blocks adhere to the previously explained relaxation rule. However, the conditions of if-statements are relaxed using a ternary operator, as shown in line 14 of Listing 1.4. Note that if the relaxation variable is assigned true, then the original if condition is evaluated. Otherwise, a different relaxation variable (e.g., _ev4 in Listing 1.4) determines whether the program execution enters the then-block or the else-block (if one exists). These relaxation variables (else’s relaxation variables) are local to each failing test scope and enable different tests to determine whether to enter the then or else-block.

When handling expression lists, CFaults adopts a comparable strategy to that of generic program statements, enclosing each expression within a ternary operator instead of an if-statement. If the program component is deactivated, the expression is replaced by 1. For example, the initialization of variable i in line 11 of Listing 1.3 is relaxed into the ternary operation in line 17 of Listing 1.4.

Lastly, the relaxation variables of statements inside a loop are Boolean vectors, where each entry relaxes the loop’s statements for a given iteration. The maximum number of iterations of the loops is defined by the CFaults user. CFaults follows a similar approach for inner loops, creating arrays of arrays. Thus, for simple program statements within a loop, CFaults encapsulates them with if-statements, with the relaxation variables indexed to the iteration number. Line 20 of Listing 1.4 illustrates a relaxed statement inside a loop. The loop’s condition is relaxed by implication of the relaxation variable, as demonstrated in line 18 of Listing 1.4. Furthermore, each loop has its own offsets to index relaxation variables. These offsets are initialized just before the loop and incremented at the end of each iteration (e.g., line 19 in Listing 1.4).
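The four relaxation rules can be mimicked in a small executable sketch. The code below is a Python analogue written for illustration and is not a reproduction of Listing 1.4: CFaults operates on C source, and the names `relaxed_sum`, `all_active`, and the dictionary keys are hypothetical. Reading "relaxed by implication" as "relaxation variable implies the original condition" is an assumption, as is modelling uninitialised C locals with the value 0 (a bounded model checker would treat them as nondeterministic). The loop increment is left unrelaxed for brevity.

```python
def all_active(max_iter=8):
    """Relaxation values with every program component activated."""
    return {"s_init": True, "if_cond": True, "i_init": True,
            "loop_cond": [True] * max_iter, "body": [True] * max_iter}

def relaxed_sum(n, rv, ev, max_iter=8):
    """Relaxed analogue of: s = 0; if (n > 0) for (i = 1; i <= n; i++) s = s + i;"""
    s = i = 0                                   # stand-ins for uninitialised locals
    if rv["s_init"]:                            # rule 4: statement wrapped in an if
        s = 0
    # rule 1: ternary on the condition; `ev` picks the branch when it is relaxed
    cond = (n > 0) if rv["if_cond"] else ev["if_cond"]
    if cond:
        _ = (i := 1) if rv["i_init"] else 1     # rule 2: expression-list entry
        off = 0                                 # per-loop offset into the vectors
        # rule 3: loop condition relaxed by implication, under an unwinding bound
        while off < max_iter and ((not rv["loop_cond"][off]) or i <= n):
            if rv["body"][off]:                 # per-iteration relaxation variable
                s = s + i
            i += 1
            off += 1                            # offset incremented every iteration
    return s

ev = {"if_cond": False}
full = relaxed_sum(3, all_active(), ev)

skip_third = all_active()
skip_third["body"][2] = False                   # deactivate the body in iteration 3
partial = relaxed_sum(3, skip_third, ev)

no_cond = all_active()
no_cond["if_cond"] = False                      # relaxed condition, ev says "else"
skipped = relaxed_sum(3, no_cond, ev)
```

With all components active the function computes 1 + 2 + 3. Deactivating the body for one iteration, or relaxing the if-condition, changes the result in the way the corresponding rule describes.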

When handling auxiliary functions, CFaults declares the relaxation variables needed in the main scope of the program and passes these variables as parameters. Hence, CFaults ensures that the same variables are used throughout the auxiliary functions’ calls.

Listing 1.5 depicts the program resulting from the instrumentalization process of Listing 1.2 performed by CFaults. The same program components (statements/instructions) across different failing test scopes are assigned the same relaxation variable declared in the main scope. Consequently, if a relaxation variable is set to 0, the corresponding program component is deactivated across all test executions. Additionally, the relaxation variables are left uninitialized, allowing CFaults to determine the minimal number of faulty components requiring deactivation. Note that relaxation variables are not declared as global variables but as local variables within the main scope. This is to prevent the C compiler from automatically initializing all these variables to 0.

CBMC. After unrolling and instrumentalizing the C program, CFaults invokes CBMC, a bounded model checker for C [5]. CBMC initially transforms the unrolled and relaxed program into Static Single Assignment (SSA) form, an intermediate representation ensuring that variables are assigned values only once and are defined before use [9]. SSA achieves this by converting existing variables into multiple versions, each uniquely representing an assignment. Next, CBMC translates the SSA representation into a CNF formula, which represents the trace formula of the program. During the CNF formula generation, CBMC negates the program’s assertion (\(\lnot (\lnot A_1 \vee \dots \vee \lnot A_m)\)) to compute a counter-example. Moreover, the CNF formula, \(\phi \), encodes each failing test’s input (\(I_i\)), assertion (\(A_i\)), and all execution paths of the unrolled and relaxed incorrect program encoded by the trace formula (P), i.e., \(\phi = (I_1\ \wedge \ \dots \ \wedge \ I_m)\ \wedge \ P\ \wedge \ (A_1 \wedge \dots \wedge A_m)\). Thus, if \(\phi \) is SAT, an assignment exists that activates or deactivates each relaxation variable and makes all failing test assertions true. Hence, each satisfying assignment is a diagnosis of the C program, considering all failing tests.
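The step from the injected assertion to the conjunction of assertions in \(\phi \) is an application of De Morgan’s law, spelled out here for convenience:

$$\begin{aligned} \lnot (\lnot A_1 \vee \dots \vee \lnot A_m) \;\equiv \; \lnot \lnot A_1 \wedge \dots \wedge \lnot \lnot A_m \;\equiv \; A_1 \wedge \dots \wedge A_m \end{aligned}$$

A counter-example to the injected assertion is therefore an execution of the unrolled program in which the assertion of every failing test holds.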


MaxSAT Encoder. Let \(\phi \) denote the CNF formula generated by CBMC in the previous step. Next, CFaults generates a weighted partial MaxSAT formula \((\mathcal {H}, \mathcal {S})\) to maximize the satisfaction of relaxation variables in the program, aiming to minimize the necessary code alterations. The set of hard clauses is defined by CBMC’s CNF formula (i.e., \(\mathcal {H} = \phi \)), while the soft clauses consist of unit clauses representing relaxation variables used to instrument the C program, expressed as \(\mathcal {S} = \bigwedge _{c \in \mathcal {C}} { ({rv}_c) }\). Additionally, we assign a hierarchical weight to each relaxation variable based on the height of its sub-AST (Abstract Syntax Tree). For instance, in the case of an if-statement without an else-block, the relaxation variable for its condition will be assigned a weight equal to the sum of the weights of the relaxation variables within the then-block. Furthermore, to prioritize the identification of faulty statements within the program’s logic over evaluating issues in the input/output, these statements (such as scanf and printf) are assigned a significantly higher cost compared to other program statements. Moreover, due to the use of hierarchical weights in the relaxation variables, CFaults enumerates all MaxSAT solutions to identify all subset-minimal diagnoses since there can be more than one MaxSAT solution (with the same cost) that differ in the number of relaxed program statements.
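The hierarchical weighting can be illustrated on a tiny hand-built AST. The sketch below is an assumption-laden illustration and not CFaults’ actual scheme: the `Stmt` class, the unit weight for leaf statements, the constant `IO_WEIGHT`, and the variable names are all hypothetical, and only the direct children of a then-block are summed.

```python
from dataclasses import dataclass, field

@dataclass
class Stmt:
    name: str                          # relaxation variable name (hypothetical)
    kind: str = "stmt"                 # "stmt", "if" (its condition) or "io"
    then: list = field(default_factory=list)

IO_WEIGHT = 1000                       # assumed "significantly higher" I/O cost

def assign_weights(stmts, weights=None):
    """Return {relaxation variable: weight}; leaf statements weigh 1 (assumed)."""
    weights = {} if weights is None else weights
    for s in stmts:
        if s.kind == "io":
            weights[s.name] = IO_WEIGHT
        elif s.kind == "if":
            assign_weights(s.then, weights)
            # Condition weight = sum of the weights inside the then-block.
            weights[s.name] = sum(weights[c.name] for c in s.then)
        else:
            weights[s.name] = 1
    return weights

def cost(diagnosis, weights):
    """Weighted MaxSAT cost: total weight of the deactivated relaxation variables."""
    return sum(weights[v] for v in diagnosis)

program = [
    Stmt("rv_scanf", "io"),
    Stmt("rv_if", "if", then=[Stmt("rv_a"), Stmt("rv_b")]),
    Stmt("rv_printf", "io"),
]
weights = assign_weights(program)
```

In this toy program, deactivating the if-condition costs as much as deactivating both statements of its then-block. This shows how two solutions with the same cost can relax a different number of statements, which is why enumerating all optimal solutions is needed to recover every subset-minimal diagnosis.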

Oracle. CFaults invokes a MaxSAT solver to determine the program’s minimal set of faulty statements, aligning with the principles of Model-Based Diagnosis (MBD) theory. By consolidating all failing tests into a unified, unrolled, and instrumentalized program, the MaxSAT solution identifies the minimum subset of statements requiring removal to fulfil the assertions of all failing tests.

Refinement. The standard Model-Based Diagnosis (MBD) theory focuses on faulty components (program statements) whose removal can rectify the system (program’s assertions). However, addressing program faults in software may necessitate introducing, relocating, or replacing statements. Hence, CFaults incorporates a refinement step that introduces nondeterminism into the program, enabling the Oracle to simulate actions such as introducing, relocating, or replacing existing program statements. During the first iteration of CFaults, the refinement step is invoked to introduce nondeterminism, with the aim of minimizing the number of faulty statements. This step can improve fault localization by conducting a more detailed analysis of previously identified faulty statements. For example, in the scenario outlined in Example 1, refining line 5 into

figure h

enables CFaults to determine that only the left part of the binary operation (\(\texttt {f < s}\)) is faulty, while the right part remains unaffected. This fine-grained approach allows for more precise detection of program faults. When the refinement step is triggered, CFaults instrumentalizes the program again, introducing nondeterminism exclusively to the statements previously identified as faulty during the initial Oracle call. Through this process, CFaults aims to reduce the set of faulty program components by executing them or assigning them to nondeterministic functions. All remaining program components are executed, meaning their relaxation variables are activated during this step.

5 Experimental Results

All experiments were conducted on a computer with Intel(R) Xeon(R) Silver 4210R CPUs @ 2.40 GHz running Linux Debian 10.2, using a memory limit of 32 GB and a timeout of 3600 s for each program. CFaults has been evaluated using two distinct benchmarks of C programs: TCAS [10] and C-Pack-IPAs [27]. TCAS is a well-known program benchmark extensively used in the fault localization literature [18, 21]. This benchmark comprises a C program from Siemens and 41 versions with intentionally introduced faults, with known positions and types of these faults. Conversely, C-Pack-IPAs is a set of student programs collected during an introductory programming course. For this evaluation, we used the first lab class of C-Pack-IPAs, which consists of ten programming assignments, comprising 486 faulty programs and 799 correct implementations. C-Pack-IPAs has proven successful in evaluating various works across program analysis [32], program transformation [28], and clustering [31].

CFaults uses pycparser [33] for unrolling and instrumentalizing C programs. Additionally, CBMC version 5.11 is used to encode C programs into CNF formulas. Furthermore, since the source code of BugAssist and SNIPER is either unavailable or no longer maintained (resulting in compilation and linking issues), prototypes of their algorithms were implemented. It is worth noting that the original version of SNIPER could only analyze programs that utilized a subset of ANSI-C, lacked support for loops and recursion, and could only partially handle global variables, arrays, and pointers. In this work, both SNIPER and BugAssist handle ANSI-C programs, as their algorithms are built on top of CFaults’ unroller and instrumentalizer modules. For the MaxSAT oracle, RC2Stratified [15] from the PySAT toolkit [14] (v. 0.1.7.dev19) was used.

Furthermore, all three FBFL algorithms evaluated (CFaults, BugAssist, and SNIPER) only report diagnoses that are consistent with (5): every proposed diagnosis is validated by CBMC once the algorithm provides it. This validation mainly serves to check the diagnoses generated by BugAssist, which may produce diagnoses that are not consistent with all failing test cases. In contrast, CFaults’ MaxSAT solution is, by definition, consistent with all observations, and SNIPER’s aggregation method (Cartesian product) produces only valid diagnoses, although they may not always be subset-minimal. For BugAssist, we iterate through all computed diagnoses, ranked by BugAssist’s voting score, until we find one that is consistent with all observations, i.e., conforms to (5).
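The selection procedure described for BugAssist can be sketched as follows. This is a simplified illustration, not the evaluated prototype: the per-test diagnoses in PER_TEST are hypothetical, the function names are ours, and the CBMC validation against (5) is replaced by an assumed check that a diagnosis contains at least one complete per-test diagnosis for every failing test.

```python
from collections import Counter

# Hypothetical diagnoses (sets of faulty line numbers), one list per failing test.
PER_TEST = [
    [{5}, {8, 11}],
    [{5}, {8}],
    [{8, 11}, {5, 12}],
]


def rank_by_votes(per_test):
    """Order diagnoses by how many failing tests report them (most votes first)."""
    votes = Counter(frozenset(d) for diags in per_test for d in diags)
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return [set(d) for d, _ in ranked]


def is_consistent(diagnosis, per_test):
    """Assumed stand-in for the CBMC check: the diagnosis must cover every test."""
    return all(any(d <= diagnosis for d in diags) for diags in per_test)


def first_consistent(per_test):
    """Walk the ranking and return the first diagnosis consistent with all tests."""
    for diagnosis in rank_by_votes(per_test):
        if is_consistent(diagnosis, per_test):
            return diagnosis
    return None
```

With this data, the top-voted diagnosis {5} is rejected because it does not explain the third test, and the walk stops at {8, 11}. The example shows why a high voting score alone does not guarantee consistency with all observations.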

Table 3. BugAssist, SNIPER and CFaults fault localization results.
Fig. 2. Comparison between BugAssist’s, SNIPER’s and CFaults’ diagnoses.

Table 3 provides an overview of the results obtained using SNIPER, BugAssist, and CFaults on the two benchmarks of C programs. The TCAS program comprises approximately 180 lines of code and has a maximum of 131 failing tests for each program. This leads SNIPER to reach the memory limit of 32 GB for almost 83% of the programs when aggregating the sets of MCSes computed for each failing test. Additionally, a higher rate of timeouts is observed for SNIPER and BugAssist than for CFaults. Figures 2a and 2b depict cactus plots that present the CPU time spent on fault localization in each program (y-axis) versus the number of programs with all faults successfully localized (x-axis) using BugAssist, SNIPER, and CFaults (with and without refinement) on TCAS and C-Pack-IPAs, respectively. Notably, CFaults is generally faster than BugAssist and SNIPER on both benchmarks. In Fig. 2a, SNIPER’s poor performance is due to its memory-out rate on TCAS.

In TCAS, CFaults, whether invoking the refinement step or not, identifies faults in the entire dataset. However, in C-Pack-IPAs, CFaults localizes faults in one additional program when the refinement step is not called. Even if the refinement step reaches the time limit, CFaults still possesses a subset-minimal diagnosis from the preceding step that has not undergone refinement. The refinement step slightly slows down CFaults, as shown in Fig. 2a and 2b. Nonetheless, Fig. 2c illustrates a scatter plot comparing the optimum costs (MaxSAT solution’s cost) achieved by CFaults with and without calling the refinement step on C-Pack-IPAs. Each point on this plot represents a faulty program, where the x-value (resp. y-value) represents the optimum cost of CFaults’ diagnosis with refinement (resp. without refinement). If a point lies above the diagonal, the non-refined diagnosis has a higher cost than the refined diagnosis for the same program. Therefore, while the refinement step may marginally slow down CFaults, it enables CFaults to identify smaller, lower-cost diagnoses in approximately 16% of C-Pack-IPAs’ programs. No such improvement was observed on the TCAS dataset: each program contains at most two faults, and the refinement step did not yield better diagnoses there.

Additionally, Fig. 2d illustrates a scatter plot comparing the diagnoses’ costs achieved by CFaults (x-axis) against BugAssist (y-axis) on C-Pack-IPAs. BugAssist fails to provide an optimal diagnosis in almost 6% of cases. In the TCAS benchmark, although BugAssist manages to localize faults in all programs, it yields a non-optimal diagnosis in 10% of the programs. Furthermore, Fig. 2e depicts a scatter plot comparing the number of diagnoses generated by CFaults (x-axis) against SNIPER (y-axis). While CFaults needs to enumerate all MaxSAT solutions due to the weighted MaxSAT formula, it is evident that SNIPER generates significantly more diagnoses than CFaults. This discrepancy suggests that SNIPER does not account for redundant diagnoses being computed. The number of such redundant diagnoses is much larger than the number of subset-minimal diagnoses generated by CFaults. Figure 2e illustrates that in some instances, SNIPER may enumerate up to 100K diagnoses, whereas CFaults generates fewer than 10.
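The source of these redundant diagnoses can be illustrated with a small sketch of Cartesian-product aggregation. The per-test diagnoses below are hypothetical and the function names are ours; the code is only meant to show how combining one diagnosis per failing test yields valid but non-minimal sets, not to reproduce SNIPER itself.

```python
from itertools import product

# Hypothetical diagnoses (sets of faulty line numbers), one list per failing test.
PER_TEST = [
    [{5}, {8, 11}],
    [{5}, {8}],
    [{8, 11}, {5, 12}],
]


def cartesian_aggregate(per_test):
    """Union one diagnosis per failing test, for every combination (deduplicated)."""
    return {frozenset().union(*combo) for combo in product(*per_test)}


def subset_minimal(diagnoses):
    """Keep only diagnoses that have no proper subset among the others."""
    return {d for d in diagnoses if not any(other < d for other in diagnoses)}


aggregated = cartesian_aggregate(PER_TEST)
minimal = subset_minimal(aggregated)
```

Here the product has 2 × 2 × 2 = 8 combinations, which collapse to 5 distinct diagnoses, of which only {5, 12} and {8, 11} are subset-minimal. Since the number of combinations is the product of the per-test counts, it grows quickly with the number of failing tests, which is consistent with the memory exhaustion reported for SNIPER on TCAS.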

As a validation step for our implementation, we analyzed all three fault localization methods on the collection of 799 correct programs in C-Pack-IPAs. This was done to ensure that all methods yielded zero faults for all correct implementations of each programming exercise. Moreover, we conducted a comparison between CFaults and the HSD algorithm [16] (see Sect. 3) on the ISCAS85 dataset [13], which is a widely studied collection of single-fault circuits. It is worth noting that HSD’s implementation currently only supports fault localization in circuits. We encountered no performance issues during this comparison, and both approaches successfully localized all faults within each circuit.

6 Related Work

Fault localization (FL) techniques typically fall into two main families: spectrum-based (SBFL) and formula-based (FBFL). SBFL methods [1, 2, 26, 38, 39, 40] estimate the likelihood of a statement being faulty based on test coverage information from both passing and failing test executions. While SBFL techniques are generally fast, they may lack precision, as not all identified statements are likely to be the cause of failures [23, 35]. In contrast, FBFL approaches [11, 12, 17, 18, 19, 20, 21, 41, 42] are considered exact. FBFL methods encode the fault localization problem into several optimization problems aimed at identifying the minimum number of faulty statements within a program. Typically, these methods perform a MaxSAT call for each failing test, allowing them to individually identify a minimal set of faults for each failing test case rather than simultaneously addressing all failing test cases. Program slicing [35, 37, 43] has also emerged as a technique for localizing faults within programs. A more syntactic FBFL approach [35] is to use program slicing to enumerate all minimal sets of repairs for a given faulty program. Another method for identifying the causes of faulty program behaviour involves analyzing the differences between versions of the software [43]. Refinement has a long-standing tradition in verification, particularly for refining abstractions of reachable states [4, 6, 7]. Our form of refinement is different: it enables us to pinpoint the user’s faults more precisely, at the sub-expression level.

7 Conclusion

This paper introduces a novel formula-based fault localization technique for C programs capable of addressing any number of faults. Leveraging Model-Based Diagnosis (MBD) with multiple observations, CFaults consolidates all failing test cases into a unified MaxSAT formula, ensuring consistency in the fault localization process. Experimental evaluations on TCAS and C-Pack-IPAs show that CFaults is faster than other FBFL approaches like BugAssist and SNIPER. Furthermore, CFaults only generates minimal diagnoses of faulty statements, while other methods tend to produce redundant diagnoses.