1 Introduction

Machine learning (ML) is a rapidly growing field with a wide range of applications, including safety-critical, high-risk systems in the fields of health care [18], aviation [38] and autonomous driving [12]. Despite their success, ML models, and especially deep neural networks (DNNs), remain “black-boxes” — they are incomprehensible to humans and are prone to unexpected behaviour and errors. This issue can result in major catastrophes [13, 73], and also in poor decision-making due to brittleness or bias [7, 24].

In order to render DNNs more comprehensible to humans, researchers have been working on explainable AI (XAI), where we seek to construct models for explaining and interpreting the decisions of DNNs [50, 55,56,57]. Work to date has focused on heuristic approaches, which provide explanations, but do not provide guarantees about the correctness or succinctness of these explanations [14, 32, 44]. Although these approaches are an important step, their limitations might result in skewed results, possibly failing to meet the regulatory guidelines of institutions and organizations such as the European Union, the US government, and the OECD [51]. Thus, producing DNN explanations that are provably accurate remains of utmost importance.

More recently, the formal verification community has proposed approaches for providing formal and rigorous explanations for DNN decision making [27, 31, 51, 59]. Many of these approaches rely on the recent and rapid developments in DNN verification [1, 8, 9, 39]. These approaches typically produce an abductive explanation (also known as a prime implicant, or PI-explanation) [31, 58, 59]: a minimum subset of input features, which by themselves already determine the classification produced by the DNN, regardless of any other input features. These explanations afford formal guarantees, and can be computed via DNN verification [31].

Abductive explanations are highly useful, but there are two major difficulties in computing them. First, there is the issue of scalability: computing locally minimal explanations might require a polynomial number of costly invocations of the underlying DNN verifier, and computing a globally minimal explanation is even more challenging [10, 31, 48]. The second difficulty is that users may sometimes prefer “high-level” explanations, not based solely on input features, as these may be easier to grasp and interpret compared to “low-level”, complex, feature-based explanations.

To tackle the first difficulty, we propose here new approaches for more efficiently producing verification-based abductive explanations. More concretely, we propose a method for provably approximating minimum explanations, allowing stakeholders to use slightly larger explanations that can be discovered much more quickly. To accomplish this, we leverage the recently discovered dual relationship between explanations and contrastive examples [30]; and also take advantage of the sensitivity of DNNs to small adversarial perturbations [64], to compute both lower and upper bounds for the minimum explanation. In addition, we propose novel heuristics for significantly expediting the underlying verification process.

In addressing the second difficulty, i.e. the interpretability limitations of “low-level” explanations, we propose to construct explanations in terms of bundles, which are sets of related features. We empirically show that using our method to produce bundle explanations can significantly improve the interpretability of the results, and even the scalability of the approach, while still maintaining the soundness of the resulting explanations.

To summarize, our contributions include the following: (i) We are the first to suggest a method that formally produces sound and minimal abductive explanations that provably approximate the global-minimum explanation. (ii) Our three suggested novel heuristics expedite the search for minimal abductive explanations, significantly outperforming the state of the art. (iii) We suggest a novel approach for using bundles to efficiently produce sound and provable explanations that are more interpretable and succinct.

For evaluation purposes, we implemented our approach as a proof-of-concept tool. Although our method can be applied to any ML model, we focused here on DNNs, where the verification process is known to be NP-complete [39], and the scalable generation of explanations is known to be challenging [31, 58]. We used our tool to test the approach on DNNs trained for digit and clothing classification, and also compared it to state-of-the-art approaches [31, 32]. Our results indicate that our approach was successful in quickly producing meaningful explanations, often running 40% faster than existing tools. We believe that these promising results showcase the potential of this line of work.

The rest of the paper is organized as follows. Sec. 2 contains background on DNNs and their verification, as well as on formal, minimal explanations. Sec. 3 covers the main method for calculating approximations of minimum explanations, and Sec. 4 covers methods for improving the efficiency of calculating these approximations. Sec. 5 covers the use of bundles in constructing “high-level”, provable explanations. Next, we present our evaluation in Sec. 6. Related work is covered in Sec. 7, and we conclude in Sec. 8.

2 Background

DNNs. A deep neural network (DNN) [46] is a directed graph composed of layers of nodes, commonly called neurons. In feed-forward NNs the data flows from the first (input) layer, through intermediate (hidden) layers, and onto an output layer. A DNN’s output is calculated by assigning values to its input neurons, and then iteratively calculating the values of neurons in subsequent layers. In the case of classification, which is the focus of this paper, each output neuron corresponds to a specific class, and the output neuron with the highest value corresponds to the class the input is classified to.

Fig. 1. A simple DNN.

Fig. 1 depicts a simple, feed-forward DNN. The input layer includes three neurons, followed by a weighted sum layer, which calculates an affine transformation of values from the input layer. Given the input \(V_1=[1,1,1]^T\), the second layer computes the values \(V_2=[6,9,11]^T\). Next comes a ReLU layer, which computes the function \(\text {ReLU} (x)=\max (0,x)\) for each neuron in the preceding layer, resulting in \(V_3=[6,9,11]^T\). The final (output) layer then computes an affine transformation, resulting in \(V_4=[15,-4]^T\). This indicates that input \(V_1=[1,1,1]^T\) is classified as the category corresponding to the first output neuron, which is assigned the greater value.

DNN Verification. A DNN verification query is a tuple \(\langle P, N, Q\rangle \), where N is a DNN that maps an input vector x to an output vector \(y=N(x)\), P is a predicate on x, and Q is a predicate on y. A DNN verifier needs to decide whether there exists an input \(x_0\) that satisfies \(P(x_0) \wedge Q(N(x_0))\) (the SAT case) or not (the UNSAT case). Typically, P and Q are expressed in the logic of real arithmetic [49]. The DNN verification problem is known to be NP-Complete [39].
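As a toy illustration (and not an actual verifier, which would reason symbolically over real-valued inputs), a query \(\langle P, N, Q\rangle \) over a small, discrete input space could be decided by exhaustive enumeration:

```python
from itertools import product

def toy_verify(P, N, Q, domains):
    """Decide <P, N, Q> by brute force: return a satisfying input, or None for UNSAT."""
    for x in product(*domains):          # enumerate every point of the (tiny) input space
        if P(x) and Q(N(x)):
            return x                     # SAT: a witness exists
    return None                          # UNSAT: no input satisfies P(x) and Q(N(x))
```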

Formal Explanations. We focus here on explanations for classification problems, where a model is trained to predict a label for each given input. A classification problem is a tuple \(\langle F, D, K, N\rangle \) where (i) \(F=\{1,...,m\}\) denotes the features; (ii) \(D=\{D_1,D_2,...,D_m\}\) denotes the domains of each of the features, i.e. the possible values that each feature can take. The entire feature (input) space is hence \(\mathbb {F}={D_1 \times D_2 \times ...\times D_m}\); (iii) \(K=\{c_1,c_2,...,c_n\}\) is a set of classes, i.e. the possible labels; and (iv) \(N:\mathbb {F}\rightarrow K\) is a (non-constant) classification function (in our case, a neural network). A classification instance is a pair (v, c), where \(v\in \mathbb {F}\), \(c\in K\), and \(c=N(v)\). In other words, v is mapped by the neural network N to class c.

Looking at (v, c), we often wish to know why v was classified as c. Informally, an explanation is a subset of features \(E\subseteq F\), such that assigning these features to the values assigned to them in v already determines that the input will be classified as c, regardless of the remaining features \(F\setminus E\). In other words, even if the values that are not in the explanation are changed arbitrarily, the classification remains the same. More formally, given input \(v=(v_1,...,v_m)\in \mathbb {F}\) with the classification \(N(v)=c\), an explanation (sometimes referred to as an abductive explanation, or an AXP) is a subset of the features \(E\subseteq F\), such that:

$$\begin{aligned} \forall (x\in \mathbb {F}).\quad [\bigwedge _{i\in E}(x_{i}=v_{i})\rightarrow (N(x)=c)] \end{aligned}$$
(1)

We continue with the running example from Fig. 1. For simplicity, we assume that each input neuron can only be assigned the values 0 or 1. It can be observed that for input \(V_1=[1,1,1]^T\), the set \(\{ v_1^1, v_1^2 \}\) is an explanation; indeed, once the first two entries in \(V_1\) are set to 1, the classification remains the same for any value of the third entry (see Fig. 2). We can prove this by encoding a verification query \(\langle P,N,Q\rangle = \langle E=v,N,Q_{\lnot c}\rangle \), where E is the candidate explanation, \(E=v\) means that we restrict the features in E to their values in v, and \(Q_{\lnot c}\) asserts that the classification is not c. An UNSAT result for this query indicates that E is an explanation for the instance (v, c).
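Under the simplifying assumption above that each input neuron takes values in \(\{0,1\}\), the check of Eq. 1 can likewise be illustrated by brute force (the sketch below is illustrative only; N stands in for the network of Fig. 1, whose weights are omitted here):

```python
from itertools import product

def is_explanation(N, domains, v, c, E):
    """Check Eq. (1) exhaustively: fixing the features in E to their values in v,
    can any assignment to the remaining features change the classification?"""
    free = [i for i in range(len(v)) if i not in E]
    for values in product(*(domains[i] for i in free)):
        x = list(v)
        for i, val in zip(free, values):
            x[i] = val                   # perturb only the features outside E
        if N(x) != c:
            return False                 # counterexample found: E is not an explanation
    return True                          # classification is invariant: E is an explanation
```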

Fig. 2. \(\{ v_1^1, v_1^2 \}\) is an explanation for input \(V_1=[1,1,1]^T\).

Clearly, the set of all features constitutes a trivial explanation. However, we are interested in smaller explanation subsets, which can provide useful information regarding the decision of the classifier. More precisely, we search for minimal explanations and minimum explanations. A subset \(E\subseteq F\) is a minimal explanation (also referred to as a local-minimal explanation, or a subset-minimal explanation) of instance (v, c) if it is an explanation that ceases to be an explanation if even a single feature is removed from it:

$$\begin{aligned} \begin{aligned}&(\forall (x\in \mathbb {F}).[\bigwedge _{i\in E}(x_{i}=v_{i})\rightarrow (N(x)=c)]) \wedge \\&(\forall (j\in E).[\exists (y\in \mathbb {F}).[\bigwedge _{i\in E\setminus \{j\}}(y_{i}=v_{i})\wedge (N(y)\ne c)]]) \end{aligned} \end{aligned}$$
(2)

Fig. 3 demonstrates that \(\{ v_1^1, v_1^2 \}\) is a minimal explanation in our running example: removing either of its features allows misclassification.

Fig. 3. \(\{ v_1^1, v_1^2 \}\) is a minimal explanation for input \(V_1=[1,1,1]^T\).

A minimum explanation (sometimes referred to as a cardinality-minimal explanation or a PI-explanation) is defined as a minimal explanation of minimum size; i.e., if E is a minimum explanation, then there does not exist a minimal explanation \(E' \ne E\) such that \(|E'|<|E|\). Fig. 4 demonstrates that \(\{ v_1^3 \} \) is a minimum explanation for our running example.

Fig. 4. \(\{ v_1^3 \} \) is a minimum explanation for input \(V_1=[1,1,1]^T\).

Contrastive Example. A subset of features \(C\subseteq F\) is called a contrastive example or a contrastive explanation (CXP) if altering the features in C is sufficient to cause the misclassification of a given classification instance (v, c):

$$\begin{aligned} \exists (x\in \mathbb {F}).[\wedge _{i\in F\setminus C}(x_{i}=v_{i})\wedge (N(x)\ne c)] \end{aligned}$$
(3)
Fig. 5. \(\{ v_1^2, v_1^3 \}\) is a contrastive example for \(V_1=[1,1,1]^T\).

A contrastive example for our running example is shown in Fig. 5. Notice that the question of whether a set is a contrastive example can be encoded into a verification query \(\langle P,N,Q\rangle = \langle (F\setminus C)=v,N,Q_{\lnot c}\rangle \), where a SAT result indicates that C is a contrastive example. As with explanations, smaller contrastive examples are more valuable than large ones. One useful notion is that of a contrastive singleton: a contrastive example of size one. A contrastive singleton could represent a specific pixel in an image, the alteration of which could result in misclassification. Such singletons are leveraged in “one-pixel attacks” [64] (see Fig. 16 in the appendix of the full version of this paper [11]). Contrastive singletons have the following important property:

Lemma 1

Every contrastive singleton is contained in all explanations.

The proof appears in Sec. A of the appendix of the full version of this paper [11]. Lemma 1 implies that each contrastive singleton is contained in all minimal/minimum explanations.

We also consider the notion of a contrastive pair, which is a contrastive example of size 2. Clearly, for any pair of features (u, v) where u or v is a contrastive singleton, (u, v) is a contrastive pair; however, when we next refer to contrastive pairs, we consider only pairs that do not contain any contrastive singletons. Likewise, for every \(k>2\), we can consider contrastive examples of size k, excluding any that contain contrastive examples of sizes \(1,\ldots ,k-1\) as subsets.

We state the following lemma, whose proof also appears in Sec. A of the appendix of the full version of this paper [11]:

Lemma 2

All explanations contain at least one element of every contrastive pair.

The lemma can be generalized to any \(k>2\), and can be used to show that the minimum hitting set (MHS) of all contrastive examples is exactly the minimum explanation [29, 54] (see Sec. B of the appendix of the full version of this paper [11]). Further, it implies a duality between contrastive examples and explanations [30, 34]: a minimal hitting set of all contrastive examples constitutes a minimal explanation, and a minimal hitting set of all explanations constitutes a minimal contrastive example.
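To make the duality concrete, the following toy, exponential-time sketch computes a minimum hitting set of a given collection of contrastive examples; by the property above, the result is a minimum explanation. (This is for illustration only; Sec. 3 describes how our approach avoids enumerating all contrastive examples.)

```python
from itertools import combinations

def minimum_hitting_set(cxps, features):
    """Smallest set of features that intersects every contrastive example in cxps."""
    for k in range(len(features) + 1):                 # try candidate sizes in increasing order
        for candidate in combinations(features, k):
            if all(set(candidate) & set(cxp) for cxp in cxps):
                return set(candidate)                  # first hit: a minimum hitting set
    return set(features)
```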

3 Provable Approximations for Minimal Explanations

State-of-the-art approaches for finding minimum explanations exploit the MHS duality between explanations and contrastive examples [31]. The idea is to iteratively compute contrastive examples, and then use their MHS as an under-approximation of the minimum explanation. Finding this MHS is an NP-complete problem, which becomes difficult in practice as the number of contrastive examples increases [20]; and although the MHS can be approximated using maximum satisfiability (MaxSAT) or mixed-integer linear programming (MILP) solvers [26, 47], existing approaches tackle simpler ML models, such as decision trees [33, 36], and face scalability limitations when applied to DNNs [31, 58]. Further, enumerating all contrastive examples may in itself take exponential time. Finally, recall that DNN verification is an NP-complete problem [39]; thus, dispatching a verification query to identify each explanation or contrastive example is also very slow when the feature space is large. Finding minimal explanations may be easier [31], but may converge to larger and less meaningful explanations, while still requiring a linear number of calls to the underlying verifier. Our approach, described next, seeks to mitigate these difficulties.

Our overall approach is described in Algorithm 1. It is comprised of two separate threads, intended to be run in parallel. The upper bounding thread (\(T_{\text {UB}}\)) is responsible for computing a minimal explanation. It starts with the entire feature space, and then gradually reduces it, until converging to a minimal explanation. The size of the presently smallest explanation is regarded as an upper bound (UB) on the size of the minimum explanation. Symmetrically, the lower bounding thread (\(T_{\text {LB}}\)) attempts to construct small contrastive sets, used for computing a lower bound (LB) on the size of the minimum explanation. Together, these two bounds allow us to compute the approximation ratio between the minimal explanation that we have discovered and the minimum explanation. For instance, given a minimal explanation of size 7 and a lower bound of 5, we can deduce that our explanation is at most \(7/5=1.4\) times larger than the minimum. The two threads share global variables that indicate the set of contrastive singletons (Singletons), the set of contrastive pairs (Pairs), the upper and lower bounds (UB, LB), and the set of features that were determined not to participate in the explanation and are “free” to be set to any value (Free). The output of our algorithm is a minimal explanation (F\(\setminus \)Free), and the approximation ratio (UB/LB). We next discuss each of the two threads in detail.

Algorithm 1 (pseudocode).

The Upper Bounding Thread (\(T_\text {UB} \)). This thread, whose pseudocode appears in Algorithm 2, follows the framework proposed by Ignatiev et al. [31]: it seeks a minimal explanation by starting with the entire feature space, and then iteratively attempting to remove individual features. If removing a feature allows misclassification, we keep it as part of the explanation; otherwise, we remove it and continue. This process issues a single verification query for each feature, until converging to a minimal explanation (lines 2–8). Although this naïve search is guaranteed to converge to a minimal explanation, it need not converge to a minimum explanation; and so we apply a more sophisticated ordering scheme, similar to the one proposed by [32], where we use a heuristic model to assign importance weights to the input features. We then check the “least important” input features first, since freeing them has a lower chance of causing a misclassification, and they are consequently more likely to be successfully removed. We then continue iterating over the features in ascending order of importance, hopefully producing small explanations.

Algorithm 2 (pseudocode).
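A minimal, illustrative sketch of this search loop follows (not our actual implementation); allows_misclassification(free_set) is assumed to wrap a verification query that frees exactly the features in free_set and fixes the rest to their values in v, and by_importance is a heuristic ranking such as the one obtained from LIME:

```python
def minimal_explanation(features, by_importance, allows_misclassification):
    """Greedy sketch of T_UB: try to free features in ascending order of importance."""
    free = set()
    for f in sorted(features, key=by_importance):     # least important features first
        if allows_misclassification(free | {f}):
            continue                                   # SAT: f must remain in the explanation
        free.add(f)                                    # UNSAT: f can safely be freed
    return set(features) - free                        # a minimal explanation
```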

The Lower Bounding Thread (\(T_{\text {LB}}\)). The pseudocode for the lower bounding thread (\(T_{\text {LB}}\)) appears in Algorithm 3. In lines 1–6, the thread searches for contrastive singletons. Neural networks were shown to be very sensitive to adversarial attacks [25] — slight input perturbations that cause misclassification (e.g., the aforementioned one-pixel attack [64]) — and this suggests that contrastive sets, and in particular contrastive singletons, exist in many cases. We observe that identifying contrastive singletons is computationally cheap: by encoding Eq. 3 as a verification query, once for each feature, we can discover all singletons; and in these queries all features but one are fixed, which empirically allows verifiers to dispatch them quickly.

Algorithm 3 (pseudocode).

The rest of \(T_{\text {LB}}\) (lines 9–13) performs a similar process, but with contrastive pairs (which do not contain contrastive singletons as one of their features). We use verification queries to identify all such pairs, and then attempt to find their MHS. We observe that finding the MHS of all contrastive pairs is the 2-MHS problem, which is equivalent to the minimum vertex cover problem (see Sec. B of the appendix of the full version of this paper [11]). Since this is an easier problem than the general MHS problem, solving it with MaxSAT or MILP often converges quickly. In addition, minimum vertex cover admits a linear-time greedy 2-approximation algorithm, which can be used for obtaining a lower bound when the feature space is large.

More formally, \(T_{\text {LB}}\) performs an efficient computation of the following bound:

$$\begin{aligned} |\textit{Singletons}| \;+\; |\text {MVC}(\textit{Pairs})| \;\le \; |\text {MHS}(\textit{Cxps})| \;=\; |E_M| \end{aligned}$$
(4)

where MVC denotes the minimum vertex cover of the graph whose edges are the contrastive pairs, Cxps denotes the set of all contrastive examples, MHS denotes their minimum hitting set, and \(E_M\) is the minimum explanation.
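The following is an illustrative sketch of this lower-bound computation (not our actual implementation); is_sat(free_set) is assumed to wrap a verification query that frees exactly the features in free_set, and the minimum vertex cover of the pairs is lower-bounded by the size of a greedily constructed maximal matching, as in the classic 2-approximation argument:

```python
from itertools import combinations

def lower_bound(features, is_sat):
    """Sketch of T_LB: |Singletons| plus a matching-based lower bound on MVC(Pairs)."""
    singletons = {f for f in features if is_sat({f})}
    rest = [f for f in features if f not in singletons]
    pairs = [(a, b) for a, b in combinations(rest, 2) if is_sat({a, b})]
    matched, matching_size = set(), 0
    for a, b in pairs:                     # greedy maximal matching
        if a not in matched and b not in matched:
            matched.update((a, b))
            matching_size += 1
    # every explanation contains all singletons, plus at least one feature
    # from each matched (pairwise-disjoint) contrastive pair
    return len(singletons) + matching_size
```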

It is worth mentioning that this approach can be extended to use contrastive examples of larger sizes (\(k=3,4,\ldots \)), as specified in Sec. C of the appendix of the full version of this paper [11]. The fact that small contrastive examples, such as singletons, exist even in large, state-of-the-art DNNs with high-dimensional inputs [21, 64] suggests that useful approximations can be obtained for large DNNs as well. In our experiments, we observed that using only singletons and pairs affords good approximations, without incurring overly expensive computations by the underlying verifier.

4 Finding Minimal Explanations Efficiently

Algorithm 1 is the backbone of our approach, but it suffers from limited scalability, particularly in \(T_{\text {UB}}\). As the execution of \(T_{\text {UB}}\) progresses, and as additional features are “freed”, the quickly growing search space slows down the underlying verifier. Here we propose three different methods for expediting this process, by reducing the number of verification queries required.

Method 1: Using Information from \(T_{\text {LB}}\). We suggest leveraging the contrastive examples found by \(T_{\text {LB}}\) to expedite \(T_{\text {UB}}\). The process is described in Algorithm 4. In line 3, \(T_{\text {LB}}\) is queried for the current set of contrastive singletons, which we know must be part of any minimal explanation. These are subtracted from the RemainingFeatures set (the features left for \(T_{\text {UB}}\) to query), and consequently will not be added to the Free set — i.e., they are marked as part of the current explanation. In addition, for any contrastive pair (a, b) found by \(T_{\text {LB}}\), either a or b must appear in any minimal explanation; and so, our algorithm skips checking the case where both a and b are removed from F (line 8). (The method could also be extended to contrastive sets of greater cardinality.)

Algorithm 4 (pseudocode).
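The following illustrative sketch (with hypothetical helper names) captures the skipping criterion: a feature is committed to the explanation without a verification query whenever it is a known contrastive singleton, or whenever freeing it together with the already-free features would free both elements of a known contrastive pair, in which case the query is guaranteed to return SAT:

```python
def keep_without_query(f, free, singletons, pairs):
    """True if T_UB may keep feature f in the explanation without calling the verifier."""
    if f in singletons:
        return True                                       # Lemma 1: f is in every explanation
    return any({a, b} <= free | {f} for a, b in pairs)    # the pair's witness makes the query SAT
```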

Method 2: Binary Search. Sorting the features being considered in ascending order of importance can have a significant effect on the size of the explanation found by Algorithm 2. Intuitively, a “perfect” heuristic model would assign the greatest weights to all features in the minimum explanation, and so traversing features in ascending order would first discover all the features that can be removed (UNSAT verification queries), followed by all the features that belong in the explanation (SAT queries). In this case, a sequential traversal of the features in ascending order is quite wasteful, and it is much better to perform a binary search to find the point where the answer flips from UNSAT to SAT.

Of course, in practice, the heuristic models are not perfect, leading to potential cases with multiple “flips” from SAT to UNSAT, and vice versa. Still, if the heuristic is good in practice (which is often the case; see Sec. 6), these flips are rare. Thus, we propose to perform multiple binary searches, each time identifying one SAT query (i.e., a feature added to the explanation). Observe that each time we hit an UNSAT query, this indicates that all the queries for features with lower priorities would also yield UNSAT — because if “freeing” multiple features cannot change the classification, freeing fewer features certainly cannot. Consequently, we are guaranteed to find the first SAT query in each iteration, and soundness is maintained. This process is described in Algorithm 6 and in Fig. 14 in the appendix of the full version of this paper [11].
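An illustrative sketch of the repeated binary search follows (not our exact pseudocode); features is assumed to be sorted in ascending order of importance, and allows_misclassification(free_set) wraps the usual verification query. Each iteration locates the first index whose query flips to SAT, frees everything before it in bulk, and adds the located feature to the explanation:

```python
def minimal_explanation_binary(features, allows_misclassification):
    free, explanation = set(), set()
    remaining = list(features)                        # ascending order of importance
    while remaining:
        if not allows_misclassification(free | set(remaining)):
            free |= set(remaining)                    # even freeing all of them is UNSAT
            break
        lo, hi = 0, len(remaining) - 1                # find the first SAT index
        while lo < hi:
            mid = (lo + hi) // 2
            if allows_misclassification(free | set(remaining[:mid + 1])):
                hi = mid
            else:
                lo = mid + 1
        free |= set(remaining[:lo])                   # implied UNSAT: freed in bulk
        explanation.add(remaining[lo])                # this feature must be kept
        remaining = remaining[lo + 1:]
    return explanation
```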

Method 3: Local-Singleton Search. Let N be a DNN, and let x be an input point whose classification we seek to explain. As part of Algorithm 2, \(T_{\text {UB}}\) iteratively “frees” certain input features, allowing them to take arbitrary values, as it continues to search for features that must be included in the explanation. The increasing number of free features enlarges the search space that the underlying verifier must traverse, thus slowing down verification. We propose to leverage the hypothesis that misclassified input points near x tend to be clustered; and so, it is beneficial to fix the free features to “bad” values, as opposed to letting them take arbitrary values. We speculate that this will allow the verifier to discover satisfying assignments much more quickly.

This enhancement is shown in Algorithm 5. Given a set Free of features that were previously freed, we fix their values according to some satisfying assignment previously discovered. Thus, the verification of any new feature that we consider is similar to the case of searching for contrastive singletons, which, as we already know, is fairly fast. See Fig. 15 in the appendix of the full version of this paper [11] for an illustration. The process can be improved further by fixing the freed features to small neighborhoods of the previously discovered satisfying assignment (instead of its exact values), to allow some flexibility while still keeping the query’s search space small.

Algorithm 5 (pseudocode).
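The following sketch illustrates the modified query (the helper verify and its arguments are hypothetical): the previously freed features are pinned to the values of an earlier satisfying assignment, witness, so that only the new feature f is left unconstrained. A SAT answer shows that f belongs in the explanation; an UNSAT answer is inconclusive, and would require falling back to the unrestricted query.

```python
def local_singleton_query(f, free, v, witness, verify):
    """Check feature f while fixing the freed features to a known 'bad' assignment."""
    fixed = {i: (witness[i] if i in free else v[i])
             for i in range(len(v)) if i != f}
    return verify(fixed=fixed, free={f})   # SAT here implies SAT for the unrestricted query
```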

5 Minimal Bundle Explanations

Fig. 6. Partitioning of the input’s features into bundles.

So far, we presented methods for generating explanations within a given approximation ratio of the minimum explanation (Sec. 3), and for expediting the computation of these explanations (Sec. 4) — in order to improve the scalability of our explanation generation mechanism. Next, we seek to tackle the second challenge from Sec. 1, namely that these explanations may be too low-level for many users. To address this challenge, we focus on bundles, a topic well covered in the ML [63] and heuristic XAI [50, 55] literature (in computer-vision tasks, bundles are commonly known as “super-pixels”). Intuitively, bundles are a partitioning of the features into disjoint sets (an illustration appears in Fig. 6). The idea, which we later validate empirically, is that providing explanations in terms of bundles is often easier for humans to comprehend. As an added bonus, using bundles also curtails the search space that the verifier must traverse, expediting the process even further.
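As a toy illustration of such a partitioning (the experiments in Sec. 6 instead use quickshift superpixels), the 784 MNIST pixels could simply be grouped into disjoint \(4\times 4\) patches:

```python
def grid_bundles(height=28, width=28, patch=4):
    """Partition the pixel indices {0, ..., height*width - 1} into disjoint square bundles."""
    bundles = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            bundles.append([(r + dr) * width + (c + dc)
                            for dr in range(patch) for dc in range(patch)])
    return bundles
```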

Given a feature space \(F=\{1,...,m\}\), a bundle b is a subset \(b\subseteq F\). When dealing with the set of all bundles \(B=\{b_{1},b_{2},...,b_{n}\}\), we require that they form a partitioning of F, namely \(\bigcup _{i=1}^{n} b_{i}=F\) and \(b_{i}\cap b_{j}=\emptyset \) for all \(i\ne j\). We define a bundle explanation \(E_B\) for a classification instance (v, c) as a subset of bundles, \(E_B\subseteq B\), such that:

$$\begin{aligned} \forall (x\in \mathbb {F}).[\wedge _{i\in \cup E_B}(x_{i}=v_{i})\rightarrow (N(x)=c)] \end{aligned}$$
(5)

The following theorem then connects bundle explanations and explicit, non-bundle explanations:

Theorem 1

The union of features in a bundle explanation is an explanation.

The proof directly follows from Eqs. 1 and 5. We note that this definition of bundles implies that features that are not part of the bundle explanation (i.e. features contained in “free” bundles) are “free” to be set to any possible value. Another possible alternative for defining bundles could be to allow features in “free” bundles to only change in the same, coordinated manner. We focus here on the former definition, and leave the alternative definition for future work.

Many of the aforementioned results and definitions for explanations can be extended to bundle explanations. In a similar manner to Eq. 5, we can define the notions of minimal and minimum bundle explanations, contrastive bundle singletons, and contrastive bundle pairs (see Sec. D of the appendix of the full version of this paper [11]). Lemmas 1 and 2 can be extended to bundle explanations in a straightforward manner. It then follows that all bundle explanations contain all contrastive singleton bundles, and that all bundle explanations contain at least one bundle of any contrastive bundle pair.

Our method from Secs. 3 and 4 can similarly be applied to bundles rather than to individual features, in which case \(T_{\text {UB}}\) calculates a minimal bundle explanation, rather than a minimal explanation. Regarding the aforementioned approximation ratio, we discuss and evaluate two different methods for obtaining it. The first, natural approach is to apply our techniques from Sec. 3 to bundle explanations, thus obtaining a provable approximation of the minimum bundle explanation. The upper bound is trivially derived from the size of the bundle explanation found by \(T_{\text {UB}}\), whereas the lower bound calculation requires assigning a cost to each bundle, representing the number of features it contains. This is done via the known notion of minimum hitting sets of bundles (MHSB) [6], using a minimum weighted vertex cover to approximate over the contrastive bundle pairs. This method, which is almost identical to the one described in Sec. 3, is formalized in Sec. D of the appendix of the full version of this paper [11].

The second approach is to calculate an approximation ratio with respect to a regular, non-bundle minimum explanation. By Theorem 1, the minimal bundle explanation found by \(T_{\text {UB}}\) yields an upper bound on the size of the minimum non-bundle explanation (namely, the number of features in its union). For computing a lower bound, we can analyze contrastive bundle examples, extract from them contrastive non-bundle examples, and then, using the duality property, compute an MHS of these contrastive examples and derive lower bounds on the size of the minimum explanation. We formalize techniques for performing this calculation in Sec. E of the appendix of the full version of this paper [11].

6 Evaluation

Implementation and Setup. For evaluation purposes, we created a proof-of-concept implementation of our approach as a Python framework. Currently, the framework uses the Marabou verification engine [41] as a backend, although other engines may be used. Marabou is a Simplex-based DNN verification framework that is sound and complete [5, 39,40,41, 68, 69], and which includes support for proof production [35], abstraction [15, 16, 52, 60, 67, 72], and optimization [62]; it has been used in various settings, such as ensemble selection [3], simplification [22, 43], repair [23, 53], and verification of reinforcement-learning based systems [2, 4, 17]. For sorting features by their relevance, we used the popular XAI method LIME [55], although, again, other heuristics could be used. The MVC was calculated using the classic greedy 2-approximation algorithm. All experiments reported were conducted on x86-64 GNU/Linux-based machines, using a single core of an Intel(R) Xeon(R) Gold 6130 CPU running at 2.10 GHz, with a 1-hour timeout.

Benchmarks. As benchmarks, we used DNNs trained over the MNIST dataset for handwritten digit recognition [45]. These networks classify \(28\times 28\) grayscale images into the digits \(0,\ldots ,9\). Additionally, we used DNNs trained over the Fashion-MNIST dataset [71], which classify \(28\times 28\) grayscale images into 10 clothing categories (“Dress”, “Coat”, etc.) For each of these datasets we trained a DNN with the following architecture: (i) an input layer (which corresponds to the image) of size 784; (ii) a fully connected hidden layer with 30 neurons; (iii) another fully connected hidden layer, with 10 neurons; and (iv) a final, softmax layer with 10 neurons, corresponding to the 10 possible output classes.

The accuracy of the MNIST DNN was 96.6%, whereas that of the Fashion-MNIST DNN was 87.6%. (We note that we configured LIME to ignore the external border pixels of each input, as these are not part of the actual image.)

In selecting the classification instances to be explained for these networks, we targeted input points where the network was not confident — i.e., where the winning label did not win by a large margin. The motivation for this choice is that explanations are most useful and relevant in cases where the network’s decision is unclear, which is reflected in lower confidence scores. Additionally, explanations of instances with lower confidence tend to be larger, facilitating the process of extensive experimentation. We thus selected the 100 inputs from the MNIST and the Fashion-MNIST datasets where the networks demonstrated the lowest confidence scores — i.e., where the difference between the winning output score and the runner-up class score was minimal.

Experiments. Our first goal was to compare our approach to that of Ignatiev et al. [31], which is the current state of the art in verification-based explainability of DNNs. Other approaches consider other ML types, such as decision trees [33, 36], or focus on alternative definitions for abductive explanations [42, 70] and are thus not comparable. Because the implementation used in [31] is unavailable, we implemented their approach, using Marabou as the underlying verifier for a fair comparison. In addition, we used the same heuristic model, LIME, for sorting the input features’ relevance. Fig. 7 depicts a comparison of the two approaches, over the MNIST benchmarks. The Fashion-MNIST results were similar, but since the Fashion-MNIST network had lower accuracy it tended to produce larger explanations with lower run-times, resulting in less meaningful evaluations (due to space limitations, these results appear in Fig. 12 in the appendix of the full version of this paper [11]). We compared the approaches according to two criteria: the portion of input features whose participation in the explanation was verified, over time (part (a) of Fig. 7), and the average size of the presently obtained explanation over time, also presented as a fraction of the total number of input features (part (b)). The results indicate that our method significantly improves over the state of the art, verifying the participation of 40.4% additional features, on average, and producing explanations that are 9.7% smaller, on average, at the end of the 1-hour time limit. Furthermore, our method timed out on 10% fewer benchmarks. We regard this as compelling evidence of the potential of our approach to produce more efficient verification-based XAI.

Fig. 7. Our full and ablation-based results, compared to the state of the art for finding minimal explanations on the MNIST dataset.

We also looked into comparing our approach to heuristic, non-verification-based approaches, such as LIME itself; but these comparisons did not prove to be meaningful, as the heuristic approaches typically solved benchmarks very quickly, but very often produced incorrect explanations. This matches the findings reported in previous work [14, 32].

Next, we set out to evaluate the contribution of each of the components implemented within our framework to overall performance, using an ablation study. Specifically, we ran our framework with each of the components mentioned in Sec. 4 turned off in turn: (i) information exchange between \(T_{\text {UB}}\) and \(T_{\text {LB}}\); (ii) the binary search in \(T_{\text {UB}}\); and (iii) the local-singleton search. The results on the MNIST benchmarks appear in Fig. 7; see Fig. 12 in the appendix of the full version of this paper [11] for the Fashion-MNIST results. Our experiments revealed that each of the methods mentioned in Sec. 4 had a favorable impact on both the average portion of features verified and the average size of the discovered explanation, over time. Fig. 7a indicates that the local-singleton search method, used for efficiently proving that features are bound to be included in the explanation, was the most significant in reducing the number of features remaining to be verified, thus substantially increasing the portion of verified features. Moreover, Fig. 7b indicates that the binary search method, which is used for grouping UNSAT queries and proving the exclusion of features from the explanation, was the most significant for obtaining smaller explanations more quickly.

Fig. 8. Average approximation of minimum explanation over time.

Our second goal was to evaluate the quality of the minimum explanation approximation of our method (using the lower/upper bounds) over time. Results are averaged over all benchmarks of the MNIST dataset and are presented in Fig. 8 (similar results on Fashion-MNIST appear in Fig. 13 in the appendix of the full version of this paper [11]). The upper bound represents the average size of the explanation discovered by \(T_{\text {UB}}\) over time, whereas the lower bound represents the average lower bound discovered by \(T_{\text {LB}}\) over time. It can be seen that initially, there is a steep increase in the size of the lower bound, as \(T_{\text {LB}}\) discovered many contrastive singletons. Later, as we begin iterating over contrastive pairs, the verification queries take longer to solve, and progress becomes slower. The average approximation ratio achieved after an hour was 1.61 for MNIST and 1.19 for Fashion-MNIST.

For our third experiment, we set out to assess the improvements afforded by bundles. We repeated the aforementioned experiments, this time using sets of features representing bundles instead of the features themselves. The segmentation into bundles was performed using the quickshift method [65], with LIME again used for assigning relevance to each bundle [55]. We approximate the sizes of the bundle explanations in terms of both the minimum bundle explanation and the minimum (non-bundle) explanation (as described in Sec. 5 and in Sec. E of the appendix of the full version of this paper [11]). The bundle configuration showed drastic efficiency improvements, with none of the experiments timing out within the 1-hour time limit, thus improving the portion of timeouts on the MNIST dataset by 84%. The efficiency improvement was obtained at the expense of explanation size: the approximation ratios with respect to the minimum (non-bundle) explanation degraded by 352% for MNIST and by 39% for Fashion-MNIST. Nevertheless, when the approximation is calculated in terms of the minimum bundle explanation, the degradation was only 12% for MNIST and 8% for Fashion-MNIST (results are summarized in Table 1 in the appendix of the full version of this paper [11]). For a visual evaluation, we performed the same set of experiments for both the bundle and non-bundle implementations, using instances with high confidence rates to obtain smaller explanations that could be more easily interpreted. A sample of these results is presented in Fig. 9. Empirically, we observe that the bundle-produced explanations are less complex and more comprehensible.
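For illustration, the following sketch shows how such a superpixel segmentation might be produced using scikit-image's quickshift implementation; the parameter values shown are merely illustrative:

```python
import numpy as np
from skimage.segmentation import quickshift

def quickshift_bundles(gray, kernel_size=3, max_dist=6, ratio=0.5):
    """Segment a grayscale image (values in [0, 1]) and return one bundle per superpixel."""
    rgb = np.dstack([gray, gray, gray])               # quickshift expects an RGB image
    labels = quickshift(rgb, kernel_size=kernel_size, max_dist=max_dist, ratio=ratio)
    bundles = {}
    for idx, label in enumerate(labels.flatten()):
        bundles.setdefault(label, []).append(idx)     # pixel indices, grouped by segment
    return list(bundles.values())
```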

Fig. 9. Minimal explanations and bundle explanations found by our method on the Fashion-MNIST dataset. White pixels are not part of the explanation.

Overall, we regard our results as compelling evidence that verification-based XAI can soundly produce meaningful explanations, and that our improvements can indeed significantly improve its runtime.

7 Related Work

Our work is another step in the ongoing quest for formal explainability of DNNs, using verification [19, 27, 31, 58]. Related approaches have applied enumeration of contrastive examples [30, 31], which is also an ingredient of our approach. Other approaches focus on producing abductive explanations around an epsilon environment [42, 70]. Similar work has been carried out for decision sets [33], lists [28] and trees [36], where the problem appears to be simpler to solve [36]. Our work here tackles DNNs, which are known to be more difficult to verify [39].

Prior work has also sought to produce approximate explanations, e.g., by using \(\delta \)-relevant sets [37, 66]. This line of work has focused on probabilistic methods for generating explanations, which jeopardizes soundness. There has also been extensive work in heuristic XAI [50, 55, 56, 61], but here, too, the produced explanations are not guaranteed to be correct.

8 Conclusion

Although DNNs are becoming crucial components of safety-critical systems, they remain “black-boxes”, and cannot be interpreted by humans. Our work seeks to mitigate this concern, by providing formally correct explanations for the choices that a DNN makes. Since discovering the minimum explanations is difficult, we focus on approximate explanations, and suggest multiple techniques for expediting our approach — thus significantly improving over the current state of the art. In addition, we propose to use bundles to efficiently produce more meaningful explanations. Moving forward, we plan to leverage lightweight DNN verification techniques for improving the scalability of our approach [49], as well as extend it to support additional DNN architectures.