1 Introduction

Deep Neural Networks (DNNs) [33] have demonstrated human-level capabilities in several intractable machine learning tasks including image classification [10], natural language processing [56] and speech recognition [19]. These impressive achievements raised the expectations for deploying DNNs in real-world applications, especially in safety-critical domains. Early-stage applications include air traffic control [25], medical diagnostics [34] and autonomous vehicles [5]. The responsibilities of DNNs in these applications vary from carrying out well-defined tasks (e.g., detecting abnormal network activity [11]) to controlling the entire behaviour system (e.g., end-to-end learning in autonomous vehicles [5]).

Despite the anticipated benefits of a widespread adoption of DNNs, their deployment in safety-critical systems must be characterized by a high degree of dependability. In these domains, deviations from the expected behaviour or correct operation can endanger human lives or cause significant financial loss. Arguably, DNN-based systems should be granted permission for use in the public domain only after exhibiting high levels of trustworthiness [6].

Software testing is the de facto instrument for analysing and evaluating the quality of a software system [24]. Testing enables, on the one hand, reducing risk by proactively finding and eliminating problems (bugs) and, on the other hand, providing evidence, through the testing results, that the system actually achieves the required levels of safety. Research contributions and advice on best practices for testing conventional software systems are plentiful; [63], for instance, provides a comprehensive review of the state-of-the-art testing approaches.

Nevertheless, there are significant challenges in applying traditional software testing techniques to assess the quality of DNN-based software [54]. Most importantly, the weak correlation between the behaviour of a DNN and the software used for its implementation means that the behaviour of the DNN cannot be explicitly encoded in the control flow structures of the software [51]. Furthermore, DNNs have very complex architectures, typically comprising thousands or millions of parameters, making it difficult, if not impossible, to determine an individual parameter's contribution to achieving a task. Likewise, since the behaviour of a DNN is heavily influenced by the data used during training, collecting enough data to exercise all potential DNN behaviour under all possible scenarios is a very challenging task. Hence, there is a need for systematic and effective testing frameworks for evaluating the quality of DNN-based software [6].

Recent research in the DNN testing area introduces novel white-box and black-box techniques for testing DNNs [20, 28, 36, 37, 48, 54, 55]. Some techniques transform valid training data into adversarial through mutation-based heuristics [65], apply symbolic execution [15], combinatorial [37] or concolic testing [55], while others propose new DNN-specific coverage criteria, e.g., neuron coverage [48] and its variants [35] or MC/DC-inspired criteria [52]. We review related work in Section 6. These recent advances provide evidence that, while traditional software testing techniques are not directly applicable to testing DNNs, the sophisticated concepts and principles behind these techniques, if adapted appropriately, could be useful to the machine learning domain. Nevertheless, none of the proposed techniques uses fault localization [4, 47, 63], which can identify parts of a system that are most responsible for incorrect behaviour.

In this paper, we introduce DeepFault, the first fault localization-based whitebox testing approach for DNNs. The objectives of DeepFault are twofold: (i) the identification of suspicious neurons, i.e., neurons likely to be more responsible for incorrect DNN behaviour; and (ii) the synthesis of new inputs, from correctly classified inputs, that exercise the identified suspicious neurons. Similar to conventional fault localization, which receives as input a faulty program and outputs a ranked list of suspicious code locations where the program may be defective [63], DeepFault analyzes the behaviour of the neurons of a DNN after training to establish their hit spectrum and identifies suspicious neurons by employing suspiciousness measures. DeepFault then employs a suspiciousness-guided algorithm that synthesizes new inputs achieving high activation values for the suspicious neurons by modifying correctly classified inputs. Our empirical evaluation on the popular publicly available datasets MNIST [32] and CIFAR-10 [1] provides evidence that DeepFault can identify neurons which can be held responsible for insufficient network performance. DeepFault can also synthesize new inputs that closely resemble the original inputs, are highly adversarial, and increase the activation values of the identified suspicious neurons. To the best of our knowledge, DeepFault is the first research attempt to introduce fault localization for DNNs in order to identify suspicious neurons and synthesize new, likely adversarial, inputs.

Overall, the main contributions of this paper are:

  • The DeepFault approach for whitebox testing of DNNs driven by fault localization;

  • An algorithm for identifying suspicious neurons that adapts suspiciousness measures from the domain of spectrum-based fault localization;

  • A suspiciousness-guided algorithm to synthesize inputs that achieve high activation values of potentially suspicious neurons;

  • A comprehensive evaluation of DeepFault on two public datasets (MNIST and CIFAR-10) demonstrating its feasibility and effectiveness.

The remainder of the paper is structured as follows. Section 2 briefly presents DNNs and fault localization in traditional software testing. Section 3 introduces DeepFault and Section 4 presents its open-source implementation. Section 5 describes the experimental setup, research questions and evaluation carried out. Sections 6 and 7 discuss related work and conclude the paper, respectively.

Fig. 1. A four-layer fully-connected DNN that receives inputs from vehicle sensors (camera, LiDAR, infrared) and outputs a decision for speed, steering angle and brake.

2 Background

2.1 Deep Neural Networks

We consider Deep Learning software systems in which one or more system modules is controlled by DNNs [13]. A typical feed-forward DNN comprises multiple interconnected neurons organised into several layers: the input layer, the output layer and at least one hidden layer (Fig. 1). Each DNN layer comprises a sequence of neurons. A neuron denotes a computing unit that applies a nonlinear activation function to its inputs and transmits the result to neurons in the successive layer. Commonly used activation functions are sigmoid, hyperbolic tangent, ReLU (Rectified Linear Unit) and leaky ReLU [13]. Except for the input layer, every neuron is connected to neurons in the successive layer with weights, i.e., edges, whose values signify the strength of a connection between neuron pairs. Once the DNN architecture is defined, i.e., the number of layers, neurons per layer and activation functions, the DNN undergoes a training process using a large amount of labelled training data to find weight values that minimise a cost function.

In general, a DNN can be considered as a parametric multidimensional function that consumes input data (e.g., raw image pixels) in its input layer, extracts features, i.e., semantic concepts, by performing a series of nonlinear transformations in its hidden layers, and, finally, produces a decision that matches the effect of these computations in its output layer.
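To make this computation concrete, the following sketch (ours, not the paper's implementation; all names are illustrative) performs a fully connected feed-forward pass with leaky ReLU hidden layers and a softmax output layer:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: identity for positive inputs, small slope alpha for negatives
    return np.where(x > 0, x, alpha * x)

def forward(x, weights, biases):
    """Forward pass of a fully connected feed-forward DNN.

    `weights`/`biases` hold one (matrix, vector) pair per layer; hidden
    layers apply leaky ReLU and the output layer a softmax over classes.
    """
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = leaky_relu(a @ W + b)
    logits = a @ weights[-1] + biases[-1]
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

Training then searches for the weight values that minimise a cost function over the labelled data; the sketch only covers the inference direction.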

2.2 Software Fault Localization

Fault localization (FL) is a white-box testing technique that focuses on identifying source code elements (e.g., statements, declarations) that are more likely to contain faults. The general FL process [63] for traditional software uses as inputs a program P, corresponding to the system under test, and a test suite T, and employs an FL technique to test P against T and establish the subsets of passed and failed tests. Using these sets and information regarding the program elements \(p \in P\), the FL technique extracts fault localization data, which is then employed by an FL measure to establish the “suspiciousness” of each program element p. Spectrum-based FL, the most studied class of FL techniques, uses program traces (called program spectra) of successful and failed test executions to establish for each program element p the tuple \((e_s,e_f,n_s,n_f)\). Members \(e_s\) and \(e_f\) (\(n_s\) and \(n_f\)) represent the number of times the corresponding program element has been (has not been) executed by tests, with success and fail, respectively. A spectrum-based FL measure consumes this list of tuples and ranks the program elements in decreasing order of suspiciousness, enabling software engineers to inspect program elements and find faults effectively. For a comprehensive survey of state-of-the-art FL techniques, see [63].

3 DeepFault

In this section, we introduce our DeepFault whitebox approach, which enables the systematic testing of DNNs by identifying and localizing highly erroneous neurons across a DNN. Given a pre-trained DNN, DeepFault, whose workflow is shown in Fig. 2, performs a series of analysis, identification and synthesis steps to identify highly erroneous DNN neurons and synthesize new inputs that exercise them. We describe the DeepFault steps in Sections 3.1, 3.2 and 3.3.

We use the following notations to describe DeepFault. Let \(\mathcal {N}\) be a DNN with l layers. Each layer \(L_i, 1\le i \le l\), consists of \(s_i\) neurons and the total number of neurons in \(\mathcal {N}\) is given by \(s = \sum _{i=1}^l s_i\). Let also \(n_{i,j}\) be the j-th neuron in the i-th layer. When the context is clear, we use \(n \in \mathcal {N}\) to denote any neuron which is part of the DNN \(\mathcal {N}\) irrespective of its layer. Likewise, we use \(\mathcal {N}_H\) to denote the neurons which belong to the hidden layers of \(\mathcal {N}\), i.e., \(\mathcal {N}_H = \{n_{i,j} \,|\, 1< i < l, 1\le j \le s_i\}\). We use \(\mathcal {T}\) to denote the set of test inputs from the input domain of \(\mathcal {N}\), \(t \in \mathcal {T}\) to denote a concrete input, and \(u \in t\) for an element of t. Finally, we use the function \(\phi (t, n)\) to signify the output of the activation function of neuron \(n \in \mathcal {N}\).

3.1 Neuron Spectrum Analysis

The first step of DeepFault involves the analysis of neurons within a DNN to establish suitable neuron-based attributes that will drive the detection and localization of faulty neurons. As highlighted in recent research [18, 48], the adoption of whitebox testing techniques provides additional useful insights regarding internal neuron activity and network behaviour. These insights cannot be easily extracted through black-box DNN testing, i.e., assessing the performance of a DNN considering only the decisions made given a set of test inputs \(\mathcal {T}\).

Fig. 2. DeepFault workflow.

DeepFault initiates the identification of suspicious neurons by establishing attributes that capture a neuron’s execution pattern. These attributes are defined as follows. Attributes \(attr_n^{\text {as}}\) and \(attr_n^{\text {af}}\) signify the number of times neuron n was active (i.e., the result of the activation function \(\phi (t, n)\) was above the predefined threshold) and the network made a successful or failed decision, respectively. Similarly, attributes \(attr_n^{\text {ns}}\) and \(attr_n^{\text {nf}}\) cover the case in which neuron n is not active. DeepFault analyses the behaviour of neurons in the DNN hidden layers, under a specific test set \(\mathcal {T}\), to assemble a Hit Spectrum (HS) for each neuron, i.e., a tuple describing its dynamic behaviour. We define formally the HS as follows.

Definition 1

Given a DNN \(\mathcal {N}\) and a test set \(\mathcal {T}\), we say that for any neuron \(n \in \mathcal {N}_H\) its hit spectrum is given by the tuple \(HS_n = (attr_n^\text {as}, attr_n^\text {af},\) \(attr_n^\text {ns}, attr_n^\text {nf})\).

Note that the sum of the elements of each neuron's HS should be equal to the size of \(\mathcal {T}\).

Clearly, the interpretation of a hit spectrum (cf. Definition 1) is meaningful only for neurons in the hidden layers of a DNN. Since neurons within the input layer \(L_1\) correspond to elements from the input domain (e.g., pixels from an image captured by a camera in Fig. 1), we consider them to be “correct-by-construction”. Hence, these neurons cannot be credited or held responsible for a successful or failed decision made by the network. Furthermore, input neurons are always active and thus, one way or another, propagate their values to neurons in the following layer. Likewise, neurons within the output layer \(L_l\) simply aggregate values from neurons in the penultimate layer \(L_{l-1}\), multiplied by the corresponding weights, and thus have limited influence on the overall network behaviour and, accordingly, on the decision making.
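The hit-spectrum bookkeeping of Definition 1 can be sketched as follows (an illustrative reconstruction, not the tool's actual code; the data layout and names are our assumptions):

```python
def hit_spectra(activations, correct, threshold=0.0):
    """Assemble a hit spectrum (attr_as, attr_af, attr_ns, attr_nf) per
    hidden neuron.

    `activations[t][j]` holds phi(t, n_j) for test input t and hidden
    neuron n_j; `correct[t]` says whether the DNN classified input t
    correctly. A neuron is "active" when its activation exceeds the
    predefined threshold.
    """
    num_neurons = len(activations[0])
    spectra = [[0, 0, 0, 0] for _ in range(num_neurons)]
    for acts, ok in zip(activations, correct):
        for j, a in enumerate(acts):
            active = a > threshold
            # index: 0 = active+success, 1 = active+fail,
            #        2 = inactive+success, 3 = inactive+fail
            idx = (0 if active else 2) + (0 if ok else 1)
            spectra[j][idx] += 1
    return [tuple(s) for s in spectra]
```

By construction, the four counters of each neuron sum to the number of test inputs, matching the note above.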

3.2 Suspicious Neurons Identification

During this step, DeepFault consumes the set of hit spectrums derived from the DNN analysis and identifies suspicious neurons, i.e., neurons likely to have made significant contributions to inadequate DNN performance (low accuracy/high loss). To achieve this identification, DeepFault employs a spectrum-based suspiciousness measure which computes a suspiciousness score per neuron using spectrum-related information. Neurons with the highest suspiciousness scores are more likely to have been trained unsatisfactorily and, hence, to contribute more to incorrect DNN decisions. This indicates that the weights of these neurons need further calibration [13]. We define neuron suspiciousness as follows.

Table 1. Suspiciousness measures used in DeepFault

Definition 2

Given a neuron \(n \in \mathcal {N}_H\) with \(\textit{HS}_n\) being its hit spectrum, a neuron’s spectrum-based suspiciousness is given by the function \(\textsc {Susp}_n : HS_n \rightarrow \mathbb {R}\).

Intuitively, a suspiciousness measure facilitates the derivation of correlations between a neuron’s behaviour given a test set \(\mathcal {T}\) and the failure pattern of \(\mathcal {T}\) as determined by the overall network behaviour. Neurons whose behaviour pattern is close to the failure pattern of \(\mathcal {T}\) are more likely to operate unreliably, and consequently, they should be assigned higher suspiciousness. Likewise, neurons whose behaviour pattern is dissimilar to the failure pattern of \(\mathcal {T}\) are considered more trustworthy and their suspiciousness values should be low.

In this paper, we instantiate DeepFault with three different suspiciousness measures, i.e., Tarantula [23], Ochiai [42] and D* [62] whose algebraic formulae are shown in Table 1. The general principle underlying these suspiciousness measures is that the more often a neuron is activated by test inputs for which the DNN made an incorrect decision, and the less often the neuron is activated by test inputs for which the DNN made a correct decision, the more suspicious the neuron is. These suspiciousness measures have been adapted from the domain of fault localization in software engineering [63] in which they have achieved competitive results in automated software debugging by isolating the root causes of software failures while reducing human input. To the best of our knowledge, DeepFault is the first approach that proposes to incorporate these suspiciousness measures into the DNN domain for the identification of defective neurons.
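Written over a neuron's hit spectrum \((attr_n^{as}, attr_n^{af}, attr_n^{ns}, attr_n^{nf})\), the three measures can be sketched as below. The formulas follow the standard fault-localization definitions of Tarantula, Ochiai and D*; the zero-denominator guards and the default D* exponent are our conventions, not necessarily the paper's:

```python
import math

def tarantula(a_s, a_f, n_s, n_f):
    # Ratio of the neuron's failure rate to its combined failure+pass rate
    fail_rate = a_f / (a_f + n_f) if a_f + n_f else 0.0
    pass_rate = a_s / (a_s + n_s) if a_s + n_s else 0.0
    total = fail_rate + pass_rate
    return fail_rate / total if total else 0.0

def ochiai(a_s, a_f, n_s, n_f):
    # Geometric-mean style correlation between activity and failure
    denom = math.sqrt((a_f + n_f) * (a_f + a_s))
    return a_f / denom if denom else 0.0

def d_star(a_s, a_f, n_s, n_f, star=3):
    # D*: the exponent `star` is a parameter; 2 and 3 are common choices
    denom = a_s + n_f
    return a_f ** star / denom if denom else float("inf")
```

All three reward neurons that are active on failing runs and penalise activity on passing runs, which is exactly the principle stated above.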

Algorithm 1. Identification of the k most suspicious neurons.

The use of suspiciousness measures in DNNs targets the identification of a set of defective neurons rather than the diagnosis of an isolated defective neuron. Since the output of a DNN decision task is typically based on the aggregated effects of its neurons (computation units), with each neuron making its own contribution to the whole computation procedure [13], identifying a single point of failure (i.e., a single defective neuron) has limited value. Thus, after establishing the suspiciousness of neurons in the hidden layers of a DNN, the neurons are ordered in decreasing order of suspiciousness and the \(k, 1 \le k \le s\), most probably defective (i.e., “undertrained”) neurons are selected. Algorithm 1 presents the high-level steps for identifying and selecting the k most suspicious neurons. When multiple neurons achieve the same suspiciousness score, DeepFault resolves ties by prioritising neurons that belong to deeper hidden layers (i.e., closer to the output layer). The rationale for this decision lies in the fact that neurons in deeper layers are able to learn more meaningful representations of the input space [69].
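The selection and tie-breaking step can be sketched as follows (an illustrative reconstruction of the last step of Algorithm 1; the data layout is our assumption):

```python
def top_k_suspicious(scores, k):
    """Select the k most suspicious hidden neurons.

    `scores` maps (layer_index, neuron_index) -> suspiciousness score.
    Ties in score are broken by preferring deeper layers (larger layer
    index), i.e., neurons closer to the output layer.
    """
    ranked = sorted(scores.items(),
                    key=lambda kv: (kv[1], kv[0][0]),  # score, then depth
                    reverse=True)
    return [neuron for neuron, _ in ranked[:k]]
```

Sorting on the (score, layer) pair implements the tie-breaking rule in a single pass: equal scores fall back to comparing layer indices.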

3.3 Suspiciousness-Guided Input Synthesis

DeepFault uses the selected k most suspicious neurons (cf. Section 3.2) to synthesize inputs that exercise these neurons and could be adversarial (see Section 5). The premise underlying the synthesis is that increasing the activation values of suspicious neurons will cause the propagation of degenerate information, computed by these neurons, across the network, thus, shifting the decision boundaries in the output layer. To achieve this, DeepFault applies targeted modification of test inputs from the test set \(\mathcal {T}\) for which the DNN made correct decisions (e.g., for a classification task, the DNN determined correctly their ground truth classes) aiming to steer the DNN decision to a different region (see Fig. 2).

Algorithm 2 shows the high-level process for synthesising new inputs based on the identified suspicious neurons. The synthesis task is underpinned by a gradient ascent algorithm that aims at determining the extent to which a correctly classified input should be modified to increase the activation values of the suspicious neurons. For any test input \(t \in T_s\) correctly classified by the DNN, we extract the value of each suspicious neuron and its gradient in lines 6 and 7, respectively. Then, by iterating over each input dimension \(u \in t\), we determine the gradient value \(u_{gradient}\) by which u will be perturbed (lines 11–12). The value of \(u_{gradient}\) is based on the mean gradient of u across the suspicious neurons, controlled by the function GradientConstraints. This function uses a test-set-specific step parameter and a distance parameter d to facilitate the synthesis of realistic test inputs that are sufficiently close, according to the \(L_\infty \)-norm, to the original inputs. We demonstrate later in the evaluation of DeepFault (cf. Table 4) that these parameters enable the synthesis of inputs similar to the original. The function DomainConstraints applies domain-specific constraints, thus ensuring that the changes made to u through gradient ascent result in realistic and physically reproducible test inputs, as in [48]. For instance, a domain-specific constraint for an image classification dataset involves bounding the pixel values of synthesized images to be within a certain range (e.g., 0–1 for the MNIST dataset [32]). Finally, we append the updated u to construct a new test input \(t'\) (line 13).
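A minimal sketch of the per-input update is shown below. It takes the mean gradient across the suspicious neurons as given (in the real tool it is computed from the DNN), applies the step, and then enforces the distance and domain constraints; names and defaults are illustrative, not the tool's API:

```python
import numpy as np

def synthesize(t, mean_grad, step=1.0, d=0.1, lo=0.0, hi=1.0):
    """One perturbation pass of the suspiciousness-guided synthesis.

    `t` is a correctly classified input (e.g., a flattened image) and
    `mean_grad` its mean gradient across the k suspicious neurons. The
    update nudges each dimension along the gradient, keeps the result
    within L_inf distance `d` of `t` (distance constraint), and clips it
    to the valid pixel range [lo, hi] (domain constraint).
    """
    t_new = t + step * mean_grad
    t_new = np.clip(t_new, t - d, t + d)   # GradientConstraints (L_inf ball)
    t_new = np.clip(t_new, lo, hi)         # DomainConstraints (pixel range)
    return t_new
```

Because both constraints are element-wise clips, the synthesized input can never leave the \(L_\infty\) ball of radius d around the original nor the valid input domain.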

As we experimentally show in Section 5, the suspiciousness measures used by DeepFault can synthesize adversarial inputs that cause the DNN to misclassify previously correctly classified inputs. Thus, the identified suspicious neurons can be attributed a degree of responsibility for the inadequate network performance, meaning that their weights have not been optimised. This reduces the DNN's ability to generalise and operate correctly on data outside its training set.

4 Implementation

To ease the evaluation and adoption of the DeepFault approach (cf. Fig. 2), we have implemented a prototype tool on top of the open-source machine learning framework Keras (v2.2.2) [9] with Tensorflow (v1.10.1) backend [2]. The full experimental results summarised in the following section are available on the DeepFault project page at

5 Evaluation

5.1 Experimental Setup

We evaluate DeepFault on two popular publicly available datasets. MNIST [32] is a handwritten digit dataset with 60,000 training samples and 10,000 testing samples; each input is a 28 \(\times \) 28 pixel image with a class label from 0 to 9. CIFAR-10 [1] is an image dataset with 50,000 training samples and 10,000 testing samples; each input is a 32 \(\times \) 32 image in ten different classes (e.g., dog, bird, car).

For each dataset, we study three DNNs that have been used in previous research [1, 60] (Table 2). All DNNs have different architectures and numbers of trainable parameters. For MNIST, we use fully connected neural networks (dense) and for CIFAR-10 we use convolutional neural networks with max-pooling and dropout layers, trained to achieve at least 95% and 70% accuracy on the provided test sets, respectively. The column ‘Architecture’ shows the number of fully connected hidden layers and the number of neurons per layer. Each DNN uses leaky ReLU [38] as its activation function \((\alpha \,=\,0.01)\), which has been shown to achieve competitive accuracy results [67].

We instantiate DeepFault using the suspiciousness measures Tarantula [23], Ochiai [42] and D* [62] (Table 1). We analyse the effectiveness of DeepFault instances using different numbers of suspicious neurons, i.e., \(k\in \{1,2,3,5,10\}\) and \(k \in \{10, 20, 30, 40, 50\}\) for the MNIST and CIFAR models, respectively. We also ran preliminary experiments for each model from Table 2 to tune the hyper-parameters of Algorithm 2 and facilitate replication of our findings. Since gradient values are model and input specific, the perturbation magnitude should reflect these values and reinforce their impact. We determined empirically that \(step=1\) and \(step=10\), for the MNIST and CIFAR models respectively, are good values that enable our algorithm to perturb inputs. We also set the maximum allowed distance d to be at most \(10\%\) (\(L_\infty \)) with regard to the range of each input dimension (maximum pixel value). As shown in Table 4, the synthesized inputs are very similar to the original inputs and are rarely constrained by d. Studying other step and d values is part of our future work. All experiments were run on an Ubuntu server with 16 GB memory and an Intel Xeon E5-2698 2.20 GHz CPU.

Table 2. Details of MNIST and CIFAR-10 DNNs used in the evaluation.

5.2 Research Questions

Our experimental evaluation aims to answer the following research questions.

  • RQ1 (Validation): Can DeepFault find suspicious neurons effectively? If suspicious neurons do exist, suspiciousness measures used by DeepFault should comfortably outperform a random suspiciousness selection strategy.

  • RQ2 (Comparison): How do DeepFault instances using different suspiciousness measures compare against each other? Since DeepFault can work with multiple suspiciousness measures, we examined the results produced by DeepFault instances using Tarantula [23], Ochiai [42] and D* [62].

  • RQ3 (Suspiciousness Distribution): How are suspicious neurons found by DeepFault distributed across a DNN? With this research question, we analyse the distribution of suspicious neurons in hidden DNN layers using different suspiciousness measures.

  • RQ4 (Similarity): How realistic are inputs synthesized by DeepFault? We analysed the distance between synthesized and original inputs to examine the extent to which DeepFault synthesizes realistic inputs.

  • RQ5 (Increased Activations): Do synthesized inputs increase activation values of suspicious neurons? We assess whether the suspiciousness-guided input synthesis algorithm produces inputs that reinforce the influence of suspicious neurons across a DNN.

  • RQ6 (Performance): How efficiently can DeepFault synthesize new inputs? We analysed the time consumed by DeepFault to synthesize new inputs and the effect of suspiciousness measures used in DeepFault instances.

5.3 Results and Discussion

RQ1 (Validation). We apply the DeepFault workflow to the DNNs from Table 2. To this end, we instantiate DeepFault with a suspiciousness measure, analyse a pre-trained DNN given the dataset's test set \(\mathcal {T}\), identify the k neurons with the highest suspiciousness scores and synthesize new inputs, from correctly classified inputs, that exercise these suspicious neurons. Then, we measure the prediction performance of the DNN on the synthesized inputs using the standard performance metrics: cross-entropy loss, i.e., the divergence between the output and target distributions, and accuracy, i.e., the percentage of correctly classified inputs over all given inputs. Note that the DNN analysis is done per class, since the activation patterns of inputs from the same class are similar to each other [69].

Table 3 shows the average loss and accuracy for inputs synthesized by DeepFault instances using Tarantula (T), Ochiai (O), D\(^*\) (D) and a random selection strategy (R) for different numbers of suspicious neurons k on the MNIST (top) and CIFAR-10 (bottom) models from Table 2. Each cell value in Table 3, except for random R, is averaged over 100 synthesized inputs (10 per class). For R, we collected 500 synthesized inputs (50 per class) over five independent runs, thus reducing the risk that our findings may have been obtained by chance.

As expected (see Table 3), DeepFault using any suspiciousness measure (T, O, D) obtained considerably lower prediction performance than R on the MNIST models. The suspiciousness measures T and O are also effective on the CIFAR-10 models, whereas the performance of D and R is similar. These results show that the identified k neurons are actually suspicious and, hence, that their weights are insufficiently trained. We also have sufficient evidence that increasing the activation values of suspicious neurons, by slightly perturbing inputs that have been classified correctly by the DNN, can transform these inputs into adversarial ones.

We applied the non-parametric Mann-Whitney statistical test with a 95% confidence level [61] to check for statistically significant performance differences between the various DeepFault instances and random. We confirmed the significant difference between T-R and O-R (p-value < 0.05) for all MNIST and CIFAR-10 models and for all k values. We also confirmed the interesting observation that a significant difference between D-R exists only for the MNIST models (all k values). We plan to investigate this observation further in future work.
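The core of the Mann-Whitney test is its U statistic, which counts how often a value from one sample exceeds a value from the other. A minimal pure-Python sketch is shown below (ours, for illustration only; a real analysis would use a library routine such as `scipy.stats.mannwhitneyu`, which also reports the p-value):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two samples.

    Counts, over all pairs, how often a value from x exceeds one from y;
    ties contribute 0.5. Larger U means x tends to dominate y.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

A useful sanity check is that the two directed statistics always sum to the number of pairs, i.e., \(U_{xy} + U_{yx} = |x|\cdot|y|\).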

Another interesting observation from Table 3 is the small performance difference between DeepFault instances for different k values. We investigated this further by analyzing the activation values of the next \(k'\) most suspicious neurons according to the suspiciousness order given by Algorithm 1. For instance, if \(k=2\) we analysed the activation values of the next \(k'\in \{3,5,10\}\) most suspicious neurons. We observed that the synthesized inputs frequently increase the activation values of the \(k'\) neurons whose suspiciousness scores are also high, in addition to increasing the values of the top k suspicious neurons.

Considering these results, we have empirical evidence for the existence of suspicious neurons which can be responsible for inadequate DNN performance. Also, we confirmed that DeepFault instances using sophisticated suspiciousness measures significantly outperform a random strategy for most of the studied DNN models (except for the D-R case on the CIFAR models; see RQ3).

Table 3. Accuracy and loss of inputs synthesized by DeepFault on the MNIST (top) and CIFAR-10 (bottom) datasets. The best results per suspiciousness measure are shown in bold. (k: number of suspicious neurons; T: Tarantula; O: Ochiai; D: D*; R: Random)

RQ2 (Comparison). We compared DeepFault instances using different suspiciousness measures and carried out pairwise comparisons using the Mann-Whitney test to check for significant differences between T, O and D\(^*\). We show the results of these comparisons on the project's webpage. Ochiai achieves better results on the MNIST_1 and MNIST_3 models for various k values. This result suggests that the suspicious neurons reported by Ochiai are more responsible for insufficient DNN performance. D\(^*\) performs competitively on MNIST_1 and MNIST_3 for \(k \in \{3,5,10\}\), but its performance on the CIFAR-10 models is significantly inferior to that of Tarantula and Ochiai. The best performing suspiciousness measure on the CIFAR models for most k values is, by a considerable margin, Tarantula.

These findings show that multiple suspiciousness measures could be used for instantiating DeepFault with competitive performance. We also have evidence that DeepFault using D\(^*\) is ineffective for some complex networks (e.g., CIFAR-10), but there is insufficient evidence for the best performing DeepFault instance. Our findings conform to the latest research on software fault localization which claims that there is no single best spectrum-based suspiciousness measure [47].

RQ3 (Suspiciousness Distribution). We analysed the distribution of suspicious neurons identified by DeepFault instances across the hidden DNN layers. Figure 3 shows the distribution of suspicious neurons on MNIST_3 and CIFAR_3 models with \(k=10\) and \(k=50\), respectively. Considering MNIST_3, the majority of suspicious neurons are located at the deeper hidden layers (Dense 4-Dense 8) irrespective of the suspiciousness measure used by DeepFault. This observation holds for the other MNIST models and k values. On CIFAR_3, however, we can clearly see variation in the distributions across the suspiciousness measures. In fact, D\(^*\) suggests that most of the suspicious neurons belong to initial hidden layers which is in contrast with Tarantula’s recommendations. As reported in RQ2, the inputs synthesized by DeepFault using Tarantula achieved the best results on CIFAR models, thus showing that the identified neurons are actually suspicious. This difference in the distribution of suspicious neurons explains the inferior inputs synthesized by D\(^*\) on CIFAR models (Table 3).

Another interesting finding concerns the relation between the suspicious neurons distribution and the “adversarialness” of synthesized inputs. When suspicious neurons belong to deeper hidden layers, the likelihood of the synthesized input being adversarial increases (cf. Table 3 and Fig. 3). This finding is explained by the fact that initial hidden layers transform input features (e.g., pixel values) into abstract features, while deeper hidden layers extract more semantically meaningful features and, thus, have higher influence in the final decision [13].

Fig. 3. Suspicious neurons distribution on the MNIST_3 (left) and CIFAR_3 (right) models.

RQ4 (Similarity). We examined the distance between the original, correctly classified, inputs and those synthesized by DeepFault to establish DeepFault's ability to synthesize realistic inputs. Table 4 (left) shows the distance between original and synthesized inputs for various distance metrics (\(L_1\) Manhattan, \(L_2\) Euclidean, \(L_\infty \) Chebyshev) and for different k values (number of suspicious neurons). The distance values, averaged over inputs synthesized using the DeepFault suspiciousness measures (T, O and D\(^*\)), demonstrate that the degree of perturbation is similar irrespective of k for the MNIST models, whereas for the CIFAR models the distance decreases as k increases. Given that a MNIST input consists of 784 pixels, with each pixel taking values in [0, 1], the average perturbation per input is less than \(5.28\%\) of the total possible perturbation (\(L_1\) distance). Similarly, for a CIFAR input that comprises 3072 pixels, with each pixel taking values in \(\{0,1,...,255\}\), the average perturbation per input is less than \(0.03\%\) of the total possible perturbation (\(L_1\) distance). Thus, for both datasets, the difference between the synthesized inputs and their original versions is very small. We qualitatively support our findings by showing in Fig. 4 the synthesized images and their originals for an example set of inputs from the MNIST and CIFAR-10 datasets.

We also compare the distances between original and synthesized inputs across the suspiciousness measures (Table 4 right). The distances for inputs synthesized by DeepFault instances using T, O or D\(^*\) are very close to those of the random selection strategy (\(L_1\) distance). Considering these results, we conclude that DeepFault is effective in synthesizing highly adversarial inputs (cf. Table 3) that closely resemble their original counterparts.

Table 4. Distance between synthesized and original inputs. The values shown represent minimal perturbation to the original inputs (\(<5\%\) for MNIST and \(<1\%\) for CIFAR-10).
Fig. 4. Synthesized images (top) and their originals (bottom). For each dataset, suspicious neurons are found using (from left to right) Tarantula, Ochiai, D\(^*\) and Random.

Table 5. Effectiveness of suspiciousness-guided input synthesis algorithm to increase activations values of suspicious neurons.

RQ5 (Increasing Activations). We studied the activation values of suspicious neurons identified by DeepFault to examine whether the synthesized inputs increase the values of these neurons. The gradients of suspicious neurons used in our suspiciousness-guided input synthesis algorithm might be conflicting, in which case a global increase in all suspicious neurons' values is not feasible. This can occur if some neurons' gradients are negative, indicating a decrease in an input feature's value, whereas other gradients are positive and require increasing the value of the same feature. Table 5 shows the percentage of suspicious neurons k, averaged over all suspiciousness measures for all considered MNIST and CIFAR-10 models from Table 2, whose values were increased by the inputs synthesized by DeepFault. For MNIST models, DeepFault synthesized inputs that increase the suspicious neurons' values with a success rate of at least 97\(\%\) for \(k\in \{1,2,3,5\}\), while the average effectiveness for CIFAR models is 90%. These results show the effectiveness of our suspiciousness-guided input synthesis algorithm in generating inputs that increase the activation values of suspicious neurons (see Table 5).
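The gradient-conflict situation described above can be detected directly from the per-neuron input gradients. The sketch below (a hypothetical helper, not part of the DeepFault implementation) flags input features whose gradients disagree in sign across the suspicious neurons, so that increasing the feature raises some activations while lowering others:

```python
import numpy as np

def conflicting_gradient_features(grads):
    """Given per-neuron gradients w.r.t. the input (rows = suspicious
    neurons, columns = input features), report which features receive
    conflicting signs across the suspicious neurons."""
    pos = (grads > 0).any(axis=0)  # some neuron wants the feature raised
    neg = (grads < 0).any(axis=0)  # some neuron wants the feature lowered
    return pos & neg               # True where the gradients conflict

# Two suspicious neurons over three input features: feature 1 conflicts.
grads = np.array([[0.5, -0.2, 0.0],
                  [0.3,  0.4, 0.0]])
print(conflicting_gradient_features(grads).tolist())  # [False, True, False]
```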

RQ6 (Performance). We measured the performance of Algorithm 2 when synthesizing new inputs. The average time required to synthesize a single input for MNIST and CIFAR models is 1 s and 24.3 s, respectively. The performance of the algorithm depends on the number of suspicious neurons (k), the distribution of these neurons over the DNN and the DNN's architecture. For CIFAR models, for instance, the execution time per input ranges between 3 s (\(k=10\)) and 48 s (\(k=50\)). We also confirmed empirically that synthesizing an input takes more time when the suspicious neurons are in deeper hidden layers.

5.4 Threats to Validity

Construct validity threats might be due to the adopted experimental methodology including the selected datasets and DNN models. To mitigate this threat, we used widely studied public datasets (MNIST [32] and CIFAR-10 [1]), and applied DeepFault to multiple DNN models of different architectures with competitive prediction accuracies (cf. Table 2). Also, we mitigate threats related to the identification of suspicious neurons (Algorithm 1) by adapting suspiciousness measures from the fault localization domain in software engineering [63].
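The adapted suspiciousness measures have simple closed forms over a neuron's hit spectrum (ef/ep: neuron active in failing/passing tests; nf/np: neuron inactive in failing/passing tests). A minimal sketch of Tarantula, Ochiai and D\(^*\) (with the common choice \(*=2\)); the zero-denominator guards are our own convention, not prescribed by the original measures:

```python
import math

def tarantula(ef, nf, ep, np_):
    """Tarantula: ratio of the failing-activation rate to the sum of
    the failing and passing activation rates."""
    fail = ef / (ef + nf) if ef + nf else 0.0
    pas = ep / (ep + np_) if ep + np_ else 0.0
    return fail / (fail + pas) if fail + pas else 0.0

def ochiai(ef, nf, ep, np_):
    """Ochiai: ef normalized by the geometric mean of the failing
    total and the activation total."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def dstar(ef, nf, ep, np_, star=2):
    """D*: ef^star over the evidence against suspiciousness (ep + nf)."""
    denom = ep + nf
    return (ef ** star) / denom if denom else float('inf')

# A neuron active in 4 of 5 failing tests and 2 of 5 passing tests:
print(round(tarantula(4, 1, 2, 3), 3),
      round(ochiai(4, 1, 2, 3), 3),
      round(dstar(4, 1, 2, 3), 3))  # 0.667 0.73 5.333
```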

Internal validity threats might occur when establishing the ability of DeepFault to synthesize new inputs that exercise the identified suspicious neurons. To mitigate this threat, we used various distance metrics to confirm that the synthesized inputs are close to the original inputs and similar to the inputs synthesized by a random strategy. Another threat could be that the suspiciousness measures employed by DeepFault accidentally outperform the random strategy. To mitigate this threat, we reported the results of the random strategy over five independent runs per experiment. Also, we ensured that the distribution of the randomly selected suspicious neurons resembles the distribution of neurons identified by DeepFault suspiciousness measures. We also used the non-parametric Mann-Whitney statistical test to check for a significant difference in the performance of DeepFault instances and random with a 95% confidence level.
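The Mann-Whitney check reduces to computing the U statistic over the two score samples and comparing it against a tabulated critical value. A minimal sketch with illustrative scores (ours, not the paper's data), using the standard two-sided critical value \(U = 2\) for two samples of size five at the 95% confidence level:

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic (ties counted as 0.5; minimal sketch
    without the tie correction of a full implementation)."""
    u_a = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return min(u_a, len(a) * len(b) - u_a)

# Hypothetical per-run scores for a DeepFault instance and random.
deepfault_scores = [0.91, 0.88, 0.93, 0.90, 0.89]
random_scores    = [0.71, 0.75, 0.69, 0.73, 0.70]

u = mann_whitney_u(deepfault_scores, random_scores)
# For n1 = n2 = 5, reject the null hypothesis of equal distributions
# at the two-sided 0.05 level when U <= 2.
print(u, u <= 2)  # 0.0 True
```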

External validity threats might exist if DeepFault cannot access the internal DNN structure to assemble the hit spectrums of neurons and establish their suspiciousness. We limit this threat by developing DeepFault using the open-source frameworks Keras and Tensorflow which enable whitebox DNN analysis. We also examined various spectrum-based suspiciousness measures, but other measures can be investigated [63]. We further reduce the risk that DeepFault might be difficult to use in practice by validating it against several DNN instances trained on two widely-used datasets. However, more experiments are needed to assess the applicability of DeepFault in domains and networks with characteristics different from those used in our evaluation (e.g., LSTM and Capsule networks [50]).

6 Related Work

DNN Testing and Verification. The inability of blackbox DNN testing to provide insights about the internal neuron activity and enable identification of corner-case inputs that expose unexpected network behaviour [14], urged researchers to leverage whitebox testing techniques from software engineering [28, 35, 43, 48, 54]. DeepXplore [48] uses a differential algorithm to generate inputs that increase neuron coverage. DeepGauge [35] introduces multi-granularity coverage criteria for effective test synthesis. Other research proposes testing criteria and techniques inspired by metamorphic testing [58], combinatorial testing [37], mutation testing [36], MC/DC [54], symbolic execution [15] and concolic testing [55].

Formal DNN verification aims at providing guarantees for trustworthy DNN operation [20]. Abstraction refinement is used in [49] to verify safety properties of small neural networks with sigmoid activation functions, while AI\(^2\) [12] employs abstract interpretation to verify similar properties. Reluplex [26] is an SMT-based approach that verifies safety and robustness of DNNs with ReLUs, and DeepSafe [16] uses Reluplex to identify safe regions in the input space. DLV [60] can verify local DNN robustness given a set of user-defined manipulations.

DeepFault adopts spectrum-based fault localization techniques to systematically identify suspicious neurons and uses these neurons to synthesize new inputs, which is mostly orthogonal to existing research on DNN testing and verification.

Adversarial Deep Learning. Recent studies have shown that DNNs are vulnerable to adversarial examples [57] and proposed search algorithms [8, 40, 41, 44], based on gradient descent or optimisation techniques, for generating adversarial inputs that have a minimal difference to their original versions and force the DNN to exhibit erroneous behaviour. These types of adversarial examples have been shown to exist in the physical world too [29]. The identification of, and protection against, these adversarial attacks is another active area of research [45, 59]. DeepFault is similar to these approaches in that it uses the identified suspicious neurons to synthesize perturbed inputs which, as we have demonstrated in Section 5, are adversarial. Extending DeepFault to support the synthesis of adversarial inputs using these adversarial search algorithms is part of our future work.
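As one concrete instance of such gradient-based search, the Fast Gradient Sign Method perturbs an input by a fixed step in the sign direction of the loss gradient. A minimal NumPy sketch, assuming a precomputed gradient of the loss with respect to the input (the gradient values here are illustrative):

```python
import numpy as np

def fgsm_perturb(x, grad_loss_x, epsilon=0.1):
    """Fast Gradient Sign Method sketch: step the input by epsilon in
    the sign direction of the loss gradient, clipped to the valid
    pixel range [0, 1]."""
    return np.clip(x + epsilon * np.sign(grad_loss_x), 0.0, 1.0)

x = np.array([0.50, 0.99])
grad = np.array([1.0, -2.0])  # hypothetical loss gradient w.r.t. x
print(np.round(fgsm_perturb(x, grad), 2).tolist())  # [0.6, 0.89]
```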

Fault Localization in Traditional Software. Fault localization is widely studied in many software engineering areas including software debugging [46], program repair [17] and failure reproduction [21, 22]. The research focus in fault localization is the development of identification methods and suspiciousness measures that isolate the root causes of software failures with reduced engineering effort [47]. The most notable fault localization methods are spectrum-based [3, 23, 30, 31, 62], slice-based [64] and model-based [39]. Threats to the value of empirical evaluations of spectrum-based fault localization are studied in [53], while the theoretical analyses in [66, 68] set a formal foundation about desirable formal properties that suspiciousness measures should have. We refer interested readers to a recent comprehensive survey on fault localization [63].

7 Conclusion

The potential deployment of DNNs in safety-critical applications introduces unacceptable risks. To reduce these risks to acceptable levels, DNNs should be tested thoroughly. We contribute to this effort by introducing DeepFault, the first fault localization-based whitebox testing approach for DNNs. DeepFault analyzes pre-trained DNNs, given a specific test set, to establish the hit spectrum of each neuron, identifies suspicious neurons by employing suspiciousness measures and synthesizes new inputs that increase the activation values of the suspicious neurons. Our empirical evaluation on the widely-used MNIST and CIFAR-10 datasets shows that DeepFault can identify neurons which can be held responsible for inadequate performance. DeepFault can also synthesize new inputs that closely resemble the original inputs, are highly adversarial and exercise the identified suspicious neurons. In future work, we plan to evaluate DeepFault on other DNNs and datasets, to improve the suspiciousness-guided synthesis algorithm and to extend the synthesis of adversarial inputs [44]. We will also explore techniques to repair the identified suspicious neurons, thus enabling reasoning about the safety of DNNs and supporting safety case generation [7, 27].