Detecting failure modes in image reconstructions with interval neural network uncertainty

Purpose The quantitative detection of failure modes is important for making deep neural networks reliable and usable at scale. We consider three examples for common failure modes in image reconstruction and demonstrate the potential of uncertainty quantification as a fine-grained alarm system. Methods We propose a deterministic, modular and lightweight approach called Interval Neural Network (INN) that produces fast and easy to interpret uncertainty scores for deep neural networks. Importantly, INNs can be constructed post hoc for already trained prediction networks. We compare it against state-of-the-art baseline methods (MCDrop, ProbOut). Results We demonstrate on controlled, synthetic inverse problems the capacity of INNs to capture uncertainty due to noise as well as directional error information. On a real-world inverse problem with human CT scans, we can show that INNs produce uncertainty scores which improve the detection of all considered failure modes compared to the baseline methods. Conclusion Interval Neural Networks offer a promising tool to expose weaknesses of deep image reconstruction models and ultimately make them more reliable. The fact that they can be applied post hoc to equip already trained deep neural network models with uncertainty scores makes them particularly interesting for deployment.


Introduction
The reconstruction of unknown signals from indirect measurements plays an important role in many applications, including medical imaging [2,14]. Typically, such tasks are modeled as finite-dimensional linear inverse problems

$$y = Ax + \eta, \tag{1}$$

where $x \in \mathbb{R}^n$ is the signal of interest, $A \in \mathbb{R}^{m \times n}$ denotes the forward operator representing a physical measurement process, and $\eta \in \mathbb{R}^m$ models noise in the measurements. Important examples include magnetic resonance imaging and computed tomography, where $A$ is a subsampled discrete Fourier or Radon transform, respectively. Solving the inverse problem (1) requires computing an approximate reconstruction of $x$ from the observed measurements $y$. Classical reconstruction methods, e.g., based on sparse regularization models, constitute the state of the art for solving (1) in many cases and are backed by theoretical guarantees [8]. Recently, data-driven deep learning methods have been gaining increasing attention and have repeatedly outperformed traditional solvers in terms of empirical reconstruction performance or speed; see, for example, [2].
Despite the advantages, the use of deep learning methods in sensitive applications such as clinical diagnosis is still a concern [23], due to questions regarding the reliability and robustness of the obtained reconstructions when compared to traditional approaches [1,13]. What is more, erroneous artifacts in the reconstructed signals can be hard to detect as they tend to "blend in" well with the rest of the signal.
Various approaches for incorporating uncertainty quantification (UQ) into deep learning have been proposed to address these issues [10,16,18,22]. However, as we demonstrate, existing UQ approaches come with limitations regarding their capacity to detect failure modes or their post hoc applicability to trained deep learning models.
In this work, we consider a straightforward approach to solving (1) by employing a neural network to post-process a standard model-based inversion as in [14]. This reconstruction is given by

$$x_{\mathrm{rec}} = \Phi(A^{\dagger}(y)),$$

where $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is a neural network trained to minimize the loss $\|x - \Phi(A^{\dagger}(y))\|_2^2$ and $A^{\dagger} : \mathbb{R}^m \to \mathbb{R}^n$ denotes the non-learned model-based inversion (e.g., the filtered backprojection in the case of Radon measurements). We will denote $z = A^{\dagger}(y)$ in the following. Given $y$ or $z$, a UQ method is supposed to extend the predicted reconstruction $\Phi(z)$ by a component-wise uncertainty score $u(z)$ that provides additional information regarding the reliability of the reconstruction. Therefore, $u(z)$ should be correlated with the component-wise error $|x - \Phi(z)|$. We evaluate this for three different failure modes [7] that can arise during inference (see "Experiment B (i): general prediction error detection" section to "Experiment B (iii): Atypical Artifact Detection" section for more details): (i) errors caused solely by the ill-posedness of (1), which is mostly determined by the strength of measurement noise and the amount of undersampling, (ii) errors caused by adversarial perturbations to the network inputs, and (iii) errors caused by atypical artifacts that have not been seen during training.
Our main contributions can be summarized as follows: We present a deterministic, modular and fast UQ method for deep neural networks (DNNs), called Interval Neural Networks (INNs). We evaluate INNs for the detection of the three different image reconstruction failure modes and demonstrate that they provide improved results compared to two existing UQ methods.

Related work
Whereas a number of methods from classical statistical learning theory, such as Gaussian processes and approximations thereof [6,19], come with built-in uncertainty estimates, DNNs have been limited in this regard. A surge of efforts to treat neural networks from a variational perspective [3,16] started to change that. In addition, there exist strands of research in deep learning explicitly occupied with the detection of failure modes caused by adversarial and out-of-distribution (OoD) inputs. These include Maximum Mean Discrepancy, Kernel Density Estimation and other tools (see [5]), the Minimum Covariance Determinant method [26], and Support Vector Data Description [28], among others. We refer to [27] for a comprehensive overview. The detection of adversarial and OoD inputs in these works is typically done in the classification setting. We emphasize that image-to-image regression is a fundamentally different task: While classification is inherently discontinuous, image reconstruction addresses a problem that allows for stable solution methods in many cases, e.g., by sparse regularization. Furthermore, we are not interested in a crude, outright rejection of data points in the input space but rather seek to obtain fine-grained information about erroneous artifacts in the output space. More closely related to our goal are Monte Carlo dropout (MCDrop) [10] and direct variance estimation (ProbOut) [12], where epistemic and aleatoric uncertainty quantification was considered for segmentation and depth estimation tasks. Hence, we include these approaches as baseline comparison methods, see "Baseline UQ methods" section.

Methods
Popular existing UQ frameworks for DNNs place parametric densities, most commonly Gaussian densities, over the DNN parameters or predictions. Instead of using specific parametrized densities, our INN method relies on bounding distributions using intervals. This results in a flexible and modular method that can be applied post hoc to a given DNN Φ that has already been trained. A schematic illustration is provided in Fig. 1: The INN is formed by wrapping additional weight and bias intervals around the weights and biases of the underlying prediction DNN. This allows us to equip the DNN Φ with uncertainty capabilities without the need to modify Φ itself. After training the INN, we obtain prediction intervals that are guaranteed to contain the original prediction of the underlying network and are easy to interpret. They provide exact upper and lower bounds for the range of possible values that the DNN prediction may take when slightly modifying the network parameters within the prescribed weight and bias intervals. Previously, the capacity of neural networks with interval weights and biases was evaluated for fitting interval-valued functions [11]. In contrast to [11], our targets $x_i$ are neither interval-valued nor univariate, leading to a different loss function which allows us to equip trained neural networks with uncertainty capabilities post hoc. For a direct comparison, see (3) in "Training Interval Neural Networks" section and Equation (18) in [11]. Further, [17,30] explored neural networks implementing interval arithmetic for robust classification. However, in their setting, the focus is purely on representing the inputs or outputs as intervals but not the weights and biases. In contrast, our proposed INNs determine interval bounds for all network parameters with the goal of providing uncertainty scores for the predictions of an underlying DNN.

Arithmetic of Interval Neural Networks
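We will now give a description of those INN mechanisms that deviate from standard DNNs. The forward propagation of a single input $z$ through a DNN is replaced by the forward propagation of a component-wise interval-valued input $[\underline{z}, \overline{z}]$ through the INN. This can be expressed similarly to standard feed-forward neural networks but using interval arithmetic instead. For interval-valued weight matrices $[\underline{W}, \overline{W}]$ and bias vectors $[\underline{b}, \overline{b}]$, the propagation through the $\ell$-th network layer can be expressed as

$$[\underline{z}, \overline{z}]^{(\ell+1)} = \varrho\big([\underline{W}, \overline{W}]^{(\ell)} [\underline{z}, \overline{z}]^{(\ell)} + [\underline{b}, \overline{b}]^{(\ell)}\big), \tag{2}$$

where $\varrho$ denotes a monotonically nondecreasing activation function. For nonnegative $[\underline{z}, \overline{z}]^{(\ell)}$, for example when using a nonnegative activation function such as the ReLU in the previous layer, we can explicitly rewrite (2) as

$$\overline{z}^{(\ell+1)} = \varrho\big(\min\{\overline{W}^{(\ell)}, 0\}\, \underline{z}^{(\ell)} + \max\{\overline{W}^{(\ell)}, 0\}\, \overline{z}^{(\ell)} + \overline{b}^{(\ell)}\big),$$
$$\underline{z}^{(\ell+1)} = \varrho\big(\max\{\underline{W}^{(\ell)}, 0\}\, \underline{z}^{(\ell)} + \min\{\underline{W}^{(\ell)}, 0\}\, \overline{z}^{(\ell)} + \underline{b}^{(\ell)}\big),$$

where the maximum and minimum are computed componentwise. Similarly, for point intervals $\underline{z}^{(\ell)} = \overline{z}^{(\ell)} =: z^{(\ell)}$, for example, as inputs to the first network layer, we can rewrite (2) as

$$\overline{z}^{(\ell+1)} = \varrho\big(\overline{W}^{(\ell)} \max\{z^{(\ell)}, 0\} + \underline{W}^{(\ell)} \min\{z^{(\ell)}, 0\} + \overline{b}^{(\ell)}\big),$$
$$\underline{z}^{(\ell+1)} = \varrho\big(\underline{W}^{(\ell)} \max\{z^{(\ell)}, 0\} + \overline{W}^{(\ell)} \min\{z^{(\ell)}, 0\} + \underline{b}^{(\ell)}\big),$$

regardless of whether $z^{(\ell)}$ is nonnegative or not. Optimizing the INN parameters requires obtaining the gradients of these operations. This can be achieved using automatic differentiation (backpropagation) in the same way as for standard neural networks.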
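The interval propagation rules described in this section can be illustrated with a minimal NumPy sketch for fully connected layers. The function names (`point_layer`, `interval_layer`) and the choice of ReLU as activation are assumptions made for illustration; this is not taken from the published implementation.

```python
import numpy as np

def relu(v):
    # Monotonically nondecreasing, nonnegative activation.
    return np.maximum(v, 0.0)

def point_layer(z, W_lo, W_hi, b_lo, b_hi, act=relu):
    """Propagate a point interval [z, z] (entries of any sign) through one layer."""
    z_pos, z_neg = np.maximum(z, 0.0), np.minimum(z, 0.0)
    hi = W_hi @ z_pos + W_lo @ z_neg + b_hi  # largest reachable pre-activation
    lo = W_lo @ z_pos + W_hi @ z_neg + b_lo  # smallest reachable pre-activation
    return act(lo), act(hi)

def interval_layer(z_lo, z_hi, W_lo, W_hi, b_lo, b_hi, act=relu):
    """Propagate a nonnegative interval [z_lo, z_hi] through one layer.

    Only valid if 0 <= z_lo <= z_hi holds componentwise.
    """
    hi = np.minimum(W_hi, 0.0) @ z_lo + np.maximum(W_hi, 0.0) @ z_hi + b_hi
    lo = np.maximum(W_lo, 0.0) @ z_lo + np.minimum(W_lo, 0.0) @ z_hi + b_lo
    return act(lo), act(hi)
```

Because the activation is monotone, applying it to the lower and upper pre-activation bounds yields valid bounds after the activation. In an automatic differentiation framework the same expressions would be differentiable almost everywhere.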

Training Interval Neural Networks
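Let $W^{(\ell)}$ and $b^{(\ell)}$ be the weights and biases of the underlying prediction network $\Phi$ and let $\overline{\Phi} : \mathbb{R}^n \to \mathbb{R}^n$ and $\underline{\Phi} : \mathbb{R}^n \to \mathbb{R}^n$ denote the functions mapping a point interval input $z$ to the upper and the lower interval bounds in the output layer of the INN, respectively. Given data samples $\{(z_i, x_i)\}_{i=1}^{N}$, the INN is trained by minimizing the empirical loss

$$\sum_{i=1}^{N} \big\|\max\{x_i - \overline{\Phi}(z_i), 0\}\big\|_2^2 + \big\|\max\{\underline{\Phi}(z_i) - x_i, 0\}\big\|_2^2 + \beta \cdot \big\|\overline{\Phi}(z_i) - \underline{\Phi}(z_i)\big\|_1 \tag{3}$$

subject to the constraints $\underline{W}^{(\ell)} \le W^{(\ell)} \le \overline{W}^{(\ell)}$ and $\underline{b}^{(\ell)} \le b^{(\ell)} \le \overline{b}^{(\ell)}$ for each layer, so that $\underline{\Phi}(z) \le \Phi(z) \le \overline{\Phi}(z)$ is always guaranteed. The first two terms in (3) encourage that the predicted interval $[\underline{\Phi}(z_i), \overline{\Phi}(z_i)]$ should contain the target signal $x_i$, while penalizing each component that lies outside with the squared distance to the nearest interval bound. The third term penalizes the interval size, so that the predicted intervals cannot grow arbitrarily large. While a quadratic penalty of the interval size is also possible and leads to similar theoretical bounds as in (4), we choose to minimize the $\ell_1$-norm to make the intervals more outlier inclusive. In addition, the tightness parameter $\beta > 0$ can further tune the outlier-sensitivity of the intervals. This allows for a calibration of the INN uncertainty scores according to an application-specific risk budget. In practice, we found that choosing $\beta$ similar to the mean absolute error of the underlying prediction network yields a good trade-off between coverage [9] and tightness.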
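As an illustration of the training objective described above, the following NumPy sketch evaluates the loss on a batch of precomputed interval bounds. The helper `project_bounds` shows one possible way to enforce the containment constraints by clipping after each optimizer step; both function names and the clipping strategy are assumptions made for illustration and are not confirmed by the text.

```python
import numpy as np

def inn_loss(x, lower, upper, beta):
    """Interval loss for targets x and bounds lower/upper, all of shape (N, n)."""
    above = np.maximum(x - upper, 0.0)   # target exceeds the upper bound
    below = np.maximum(lower - x, 0.0)   # target falls below the lower bound
    width = np.abs(upper - lower)        # interval size per component
    per_sample = (above ** 2).sum(axis=1) + (below ** 2).sum(axis=1) \
        + beta * width.sum(axis=1)
    return float(per_sample.sum())

def project_bounds(W, W_lo, W_hi):
    """Hypothetical projection step: clip bounds so that W_lo <= W <= W_hi."""
    return np.minimum(W_lo, W), np.maximum(W_hi, W)
```

For example, a single target `[0, 1, 3]` with bounds `[0, 0, 0]` and `[1, 1, 2]` contributes a squared violation of 1 (third component) plus `beta` times the total width 4.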

Properties of Interval Neural Networks
The uncertainty estimate of an INN is given by the width of the prediction interval, i.e., $u(z) = \overline{\Phi}(z) - \underline{\Phi}(z)$. In terms of computational overhead, INNs scale linearly in the cost of evaluating the underlying prediction DNN with a constant factor 2. In contrast, the popular MCDrop [10] scales linearly with a factor $T$ which is proportional to the number of stochastic forward passes and at least $T = 10$ is recommended by the authors, see "Baseline UQ methods" section. Further, INNs come with theoretical coverage guarantees that can be derived from the Markov inequality: Assuming that the loss (3) is optimized during training to yield an INN with vanishing expected gradient with respect to the data distribution, we obtain

$$\mathbb{P}\big[\underline{\Phi}(z)_j - \lambda\beta < x_j < \overline{\Phi}(z)_j + \lambda\beta\big] \ge 1 - \frac{1}{\lambda} \tag{4}$$

for any $\lambda > 0$ and each component $j$. In other words, for an input and target pair $(z, x)$, the probability of any component of the target lying inside the predicted interval enlarged by $\lambda\beta$ is at least $1 - \frac{1}{\lambda}$. As $\beta$ is usually very small, this ensures a fast decay of the probability of the components of $x$ lying outside the predicted interval bounds. Consequently, a component with a small uncertainty score was correctly reconstructed up to a small error with high probability. Of course, the training distribution needs to be sufficiently representative of the true data distribution to extrapolate this property to unseen data.
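The coverage statement can be checked empirically on held-out data. The sketch below, with hypothetical function names, computes the uncertainty score as the interval width and the fraction of target components that fall inside the interval enlarged by λβ, which can then be compared against the lower bound 1 − 1/λ.

```python
import numpy as np

def uncertainty_score(lower, upper):
    """Component-wise INN uncertainty: width of the prediction interval."""
    return upper - lower

def enlarged_coverage(x, lower, upper, beta, lam):
    """Fraction of components of x inside the interval enlarged by lam * beta."""
    margin = lam * beta
    inside = (x > lower - margin) & (x < upper + margin)
    return float(inside.mean())

def markov_lower_bound(lam):
    """Coverage level 1 - 1/lam stated for the enlarged interval."""
    return 1.0 - 1.0 / lam
```

The bound is only informative for λ > 1, and it relies on the stated optimality assumption, so an empirical coverage below the bound on test data would hint at a distribution shift or an insufficiently trained INN.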
Finally, the optimization of the loss (3) yields additional information: If the prediction Φ(z) lies closer to one boundary of the predicted interval, the true target x has a higher probability of lying on the other side of the interval. Consequently, INNs can provide directional uncertainty scores. A quantitative assessment of this capability is given in Fig. 3c+d. We note that it is also possible to explore asymmetric uncertainty estimates in the probabilistic setting, e.g., via exponential family distributions [29] or quantile regression [24]. In contrast to INNs, these methods cannot be applied post hoc as they require substantial modifications to the underlying prediction network.

Baseline UQ methods
In addition to our INN approach, we consider two other related and popular UQ baseline methods for comparison. First, Monte Carlo dropout (MCDrop) [10] obtains uncertainty scores as the sample variance of multiple stochastic forward passes of the same input signal. In other words, if $\Phi_1, \dots, \Phi_T$ are realizations of independent draws of random dropout masks for the same underlying network $\Phi$, the component-wise uncertainty estimate $u_{\mathrm{MCDrop}}(z)$ is computed from the sample variance of $\Phi_1(z), \dots, \Phi_T(z)$. Second, a direct variance estimation (ProbOut) was proposed in [22] and later expanded in [12]. Here, the number of output components of the prediction network is doubled and trained to approximate the mean and variance of a Gaussian distribution. The resulting network $\Phi_{\mathrm{ProbOut}} : z \mapsto (\Phi_{\mathrm{mean}}(z), \Phi_{\mathrm{var}}(z))$ is trained with a corresponding Gaussian likelihood-based loss. The component-wise uncertainty score of ProbOut is $u_{\mathrm{ProbOut}}(z) = (\Phi_{\mathrm{var}}(z))^{1/2}$. Note that, in contrast to INN and MCDrop, the ProbOut approach requires the incorporation of UQ already during training. Thus, it cannot be employed as a post hoc evaluation of an already trained, underlying network $\Phi$. The role of the actual prediction network is taken by $\Phi_{\mathrm{mean}}$.
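The two baseline scores can be sketched as follows. The function names are hypothetical, and `gaussian_nll` is the textbook Gaussian negative log-likelihood (up to an additive constant), included as an assumption about the kind of loss used to train a mean and variance output; the exact loss used for ProbOut may differ in scaling or form.

```python
import numpy as np

def mcdrop_sample_variance(passes):
    """Component-wise sample variance over T stochastic forward passes.

    passes: array of shape (T, n), one row per dropout realization.
    """
    return passes.var(axis=0, ddof=1)

def probout_score(var):
    """ProbOut uncertainty score: square root of the predicted variance."""
    return np.sqrt(var)

def gaussian_nll(x, mean, var):
    """Standard Gaussian negative log-likelihood, constants dropped (assumption)."""
    return float(0.5 * np.sum(np.log(var) + (x - mean) ** 2 / var))
```

Note that the MCDrop score requires T forward passes per input, whereas the ProbOut score is obtained from a single pass of a network that was trained with the variance output from the start.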

Experiments
We present experiments for two different inverse problems: first, a deconvolution task with 1D signals, and second, a tomography task on real-world 2D image signals. Both setups are described in more detail below. The description of all hyperparameters for the experiments is kept brief and we refer to our publicly available code at https://github.com/luisoala/inn for full details.

Case study A: deconvolution of 1D signals
We start with a synthetic, didactic experiment, inspired by a one-dimensional deconvolution task, to demonstrate the properties of INNs discussed in "Properties of Interval Neural Networks" section. For this purpose, we choose $n = m = 512$ and $A = DSD$, where $D$ is a discrete cosine transform (Type I DCT) and $S$ is a diagonal matrix with entries $s_j = \big(\frac{n-j}{n-1}\big)^{\nu} \in [0, 1]$ that decay with a fixed exponent $\nu = 8$. We draw synthetically generated signals $x$ from a distribution of piecewise constant functions with random jump positions and heights, see Fig. 2. The corresponding measurements $y$ are computed according to (1). We generate a data set consisting of 2000 sample pairs $(y_i, x_i)$, 1600 of which were used for training, 200 for validation and 200 for testing. The underlying prediction network $\Phi$ is a convolutional neural network (consisting of ten convolutional layers and three dropout layers in between) trained to directly map $y$ to $x$, i.e., we use $A^{\dagger} = \mathrm{Id}$ and thus $z = A^{\dagger} y = y$ in this experiment. We trained the underlying network $\Phi$ for 100 epochs using Adam [15]. The interval parameters of the INN were subsequently trained for another 100 epochs with $\beta = 2 \cdot 10^{-3}$. For the MCDrop comparison, we use $T = 64$ samples. The ProbOut model was trained in the same way as $\Phi$ using 100 Adam epochs. Note that all subsequent evaluations, as well as the plots in Fig. 2, are computed using test samples.

Fig. 2 Results for the deconvolution task for one exemplary signal without noise (left) and with additive Gaussian noise ($\sigma = 0.05$) on both the measurements $y$ and signal $x$ (right). The first row shows inputs $z = y$ and targets $x$. Below, the target $x$, the prediction $\Phi(z)$ and the uncertainty score $u(z)$, as well as the uncertainty compared to the absolute error $|\Phi(z) - x|$, are shown for the three UQ methods.
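The forward operator and signal model of this case study can be sketched in NumPy as follows. The orthonormal scaling of the Type I DCT, the index range j = 1, ..., n for the diagonal entries, and the number and range of the jumps in `piecewise_constant` are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def dct1_matrix(n):
    """Orthonormal Type I DCT matrix (symmetric and its own inverse)."""
    j = np.arange(n)
    eps = np.ones(n)
    eps[[0, -1]] = 1.0 / np.sqrt(2.0)  # endpoint weights for orthonormality
    return np.sqrt(2.0 / (n - 1)) * np.outer(eps, eps) \
        * np.cos(np.pi * np.outer(j, j) / (n - 1))

def forward_operator(n=512, nu=8):
    """A = D S D with decaying diagonal entries s_j = ((n - j) / (n - 1))**nu."""
    D = dct1_matrix(n)
    j = np.arange(1, n + 1)
    s = ((n - j) / (n - 1)) ** nu
    return D @ np.diag(s) @ D

def piecewise_constant(n, n_jumps, rng):
    """Random piecewise constant signal (jump count and heights are assumed)."""
    positions = np.sort(rng.choice(np.arange(1, n), size=n_jumps, replace=False))
    heights = rng.uniform(-1.0, 1.0, size=n_jumps + 1)
    edges = np.concatenate(([0], positions, [n]))
    x = np.empty(n)
    for k in range(n_jumps + 1):
        x[edges[k]:edges[k + 1]] = heights[k]
    return x
```

With this scaling, the eigenvalues of `A` are exactly the entries of `S`, so the fast decay for ν = 8 makes the deconvolution problem severely ill-conditioned.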
In order to evaluate the UQ methods' abilities to capture uncertainty due to noisy data, we consider additive Gaussian noise $\eta \sim \mathcal{N}(0, \sigma^2 \cdot \mathrm{Id})$ on the measurements over a range of noise levels $\sigma$ (Fig. 3a) as well as $\eta_1, \eta_2 \sim \mathcal{N}(0, \sigma^2 \cdot \mathrm{Id})$ on the measurements and targets, where (1) is adjusted to $y = A(x + \eta_1) + \eta_2$ (Fig. 3b and right column of Fig. 2). In this case, INNs are able to capture the additional uncertainty of $\eta_1$ using the bias parameters of the final network layer. In Fig. 3, it can be observed how, in contrast to MCDrop, our method and ProbOut are able to capture independent noise in the data, with ProbOut reacting to a lesser degree than the INN. Note also that in Fig. 3 some of the ProbOut evaluations are shifted to the right, indicating a reduced reconstruction performance compared to the other methods.
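The two noise settings can be simulated with a small helper. This is a sketch with a hypothetical function name; it returns the signal noise separately so that the caller can decide how to form the training target, since that detail is an assumption left open here.

```python
import numpy as np

def noisy_measurements(A, x, sigma, rng, noise_on_signal=False):
    """Simulate y = A (x + eta1) + eta2 with Gaussian noise of level sigma.

    With noise_on_signal=False, eta1 is zero and only the measurements are noisy.
    Returns the measurements y and the signal noise eta1.
    """
    if noise_on_signal:
        eta1 = rng.normal(0.0, sigma, size=x.shape)
    else:
        eta1 = np.zeros_like(x)
    eta2 = rng.normal(0.0, sigma, size=A.shape[0])
    return A @ (x + eta1) + eta2, eta1
```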
Finally, we determine the directional information of the INN uncertainty scores as discussed in "Properties of Interval Neural Networks" section. For this, we define the component-wise directionality ratio by

$$\mathrm{DR}(z) = \frac{\max\{\overline{\Phi}(z) - \Phi(z),\ \Phi(z) - \underline{\Phi}(z)\}}{\min\{\overline{\Phi}(z) - \Phi(z),\ \Phi(z) - \underline{\Phi}(z)\}},$$

i.e., as the ratio between the larger and smaller part of the interval $[\underline{\Phi}(z), \overline{\Phi}(z)]$ when divided by the prediction $\Phi(z)$. The directionality accuracy (DA) is the relative frequency of target components corresponding to a given DR that are contained in the larger interval part. As displayed in Fig. 3c, d, INNs achieve a DA consistently above 0.5 (chance), indicating that the interval uncertainty scores contain directional information.
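A simplified sketch of the directionality ratio and accuracy is given below. As an assumption, the accuracy is computed over all informative components at once instead of being binned by DR value, and a target counts as a hit when it lies on the side of the larger interval part. The function names and the small constant `eps` guarding against division by zero are illustrative choices.

```python
import numpy as np

def directionality_ratio(pred, lower, upper, eps=1e-12):
    """Ratio between the larger and the smaller interval part around pred."""
    up = upper - pred
    down = pred - lower
    return np.maximum(up, down) / np.maximum(np.minimum(up, down), eps)

def directionality_accuracy(x, pred, lower, upper):
    """Relative frequency of targets lying on the side of the larger part."""
    up = upper - pred
    down = pred - lower
    valid = (up != down) & (x != pred)  # ties carry no directional information
    if not valid.any():
        return float("nan")
    hits = ((up > down) == (x > pred)) & valid
    return float(hits.sum() / valid.sum())
```

To reproduce a curve of DA against DR, one would restrict the computation to components whose DR falls into a given bin.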

Case study B: limited angle computed tomography
Next, we consider a 2D computed tomography (CT) task on real-world data in order to evaluate the detection capabilities of the UQ methods with respect to the three failure modes (i)-(iii). More precisely, we consider limited angle CT, which has applications in dental tomography, breast tomosynthesis or electron tomography. For this, $A$ is a subsampled discrete Radon transform with subsampling corresponding to a moderate missing wedge of 30°. Limited angle measurements are simulated according to (1) and the non-learned inversion $A^{\dagger}$ is based on the filtered backprojection algorithm (FBP) [21]. The underlying prediction network is a U-Net [25] variant. Our experiments are based on a data set consisting of 512 × 512 human CT scans from the AAPM Low Dose CT Grand Challenge data [20]. In total, it contains 2580 full-dose images with a slice thickness of 3 mm from 10 patients. Eight of these ten patients were used for training (2036 samples), one for validation (214 samples) and one for testing (330 samples). We trained the underlying network $\Phi$ for 400 epochs using Adam [15]. The interval parameters of the INN were subsequently trained for another 15 epochs with $\beta = 10^{-4}$. We limited the interval training to the last twelve layers. For the MCDrop comparison, we use $T = 128$ samples. The ProbOut model was trained in the same way as $\Phi$ using 400 Adam epochs.

Experiment B (i): general prediction error detection
First, we evaluate how helpful UQ scores are for estimating the prediction error caused by the ill-posedness of the challenging CT task, see Fig. 4. The wedge of missing angles in the measurements results in reconstruction artifacts especially at vertical edges in the images. In order to best visualize these geometric effects of the very structured null-space of the limited angle CT forward operator, we do not add noise in this experiment. INNs are clearly able to reveal the reconstruction uncertainty along the "missing edges." For a more quantitative comparison of the UQ methods, we use the performance weighted correlation coefficient

$$\mathrm{PWCC}(z, x) = \frac{\mathrm{corr}(|\Phi(z) - x|,\ u(z))}{\|\Phi(z) - x\|_2^2}$$

between the uncertainty score $u$ and the absolute prediction error. Performance weighting (normalizing by the mean squared error of the prediction) is necessary to discourage rewards for poor prediction models with high uncertainties everywhere. The average results over the test set for three independent complete experimental runs are summarized in Table 1. Both INNs and MCDrop are able to detect prediction errors, with INNs achieving slightly higher correlations. In Fig. 3d, the directional accuracy of the INN is illustrated analogously to the corresponding experiment in "Case study A: deconvolution of 1D signals" section. Again, it is consistently above 0.5 (chance).
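The performance weighted correlation coefficient can be computed per sample as in the following sketch. It follows the formula with the squared ℓ2 norm in the denominator; dividing by the mean squared error instead would only rescale the value by the number of pixels. The function names are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two arrays of equal size."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def pwcc(pred, x, u):
    """Correlation of absolute error and uncertainty, weighted by performance."""
    err = np.abs(pred - x)
    return pearson(err, u) / float(np.sum((pred - x) ** 2))
```

A model with large errors everywhere is penalized by the denominator even if its uncertainty map correlates well with the error.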

Experiment B (ii): Adversarial Artifact Detection
Second, we assess the capacity of UQ methods to capture artifacts in the output that were caused by adversarial perturbations. To that end, we create perturbed inputs $z_{\mathrm{adv}}$ for each input sample $z$ in the test set by employing the box-constrained L-BFGS algorithm [4] to minimize the discrepancy between $\Phi(z_{\mathrm{adv}})$ and an adversarial target $x_{\mathrm{adv.\,tar.}}$. It is debatable whether perturbing $z$ directly (i.e., attacking subsequently to a model-based inversion) is a realistic scenario in the context of inverse problems. However, for our purposes, such a simple setup (see also [13]) is sufficient. We refer to [1], where adversarial noise is mapped to the measurement domain. In order to assess the detection capacity for this failure mode, the different UQ schemes are then used to produce uncertainty heatmaps for the generated adversarial inputs. A quantitative evaluation is carried out by computing the mean Pearson correlation coefficient between the pixel-wise change in the uncertainty heatmaps $|u(z) - u(z_{\mathrm{adv}})|$ and the change of reconstructions $|x_{\mathrm{rec}} - \Phi(z_{\mathrm{adv}})|$. The results are summarized in Table 1 and illustrated in Fig. 5. We observe that both INN and ProbOut are able to detect the image region of adversarial perturbations, with INN achieving the highest correlation. This shows that both methods are able to highlight the effect that visually almost imperceptible input perturbations can have on the reconstructions.

Fig. 4 Results of three UQ methods for the Error Detection experiment for one exemplary data sample of the limited angle CT task. The plotting windows are equally adjusted for better contrast.
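The quantitative evaluation of this experiment can be sketched as follows, assuming that the uncertainty maps and reconstructions for clean and adversarial inputs have already been computed. The function names and the tuple layout of `samples` are illustrative assumptions.

```python
import numpy as np

def change_correlation(u_clean, u_adv, rec_clean, rec_adv):
    """Pearson correlation between uncertainty change and reconstruction change."""
    du = np.abs(u_clean - u_adv).ravel()
    dr = np.abs(rec_clean - rec_adv).ravel()
    return float(np.corrcoef(du, dr)[0, 1])

def mean_change_correlation(samples):
    """Average the per-sample correlation over an iterable of 4-tuples."""
    return float(np.mean([change_correlation(*s) for s in samples]))
```

Each element of `samples` is expected to hold `(u_clean, u_adv, rec_clean, rec_adv)` for one test image.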

Experiment B (iii): Atypical Artifact Detection
The third experiment is designed analogously to the setup described by [1], i.e., an atypical artifact, which was not present in the training data, is randomly placed in the input to produce $z_{\mathrm{OoD}}$. More precisely, the silhouette of a peace dove is inserted in each image of the test set; see Fig. 5. The simulation of the measurements and model-based inversions is carried out as before. A quantitative evaluation is carried out by computing the mean Pearson correlation coefficient between the change in the uncertainty heatmaps $|u(z) - u(z_{\mathrm{OoD}})|$ and a binary mask marking the region of change in the inputs. This evaluation isolates the uncertainty caused by atypical artifacts and allows us to verify in a controlled manner how the uncertainty scores of each UQ method react to the artifacts. During deployment, such controlled isolation is not possible. Instead, the joint uncertainty heatmaps $u(z_{\mathrm{OoD}})$ will also capture other sources of uncertainty, thus providing a more comprehensive alarm system. The results are summarized in Table 1 and illustrated in Fig. 5. All three UQ methods are correlated with the input change; however, INN again achieves the highest correlation. This shows that UQ in general, and INNs in particular, can serve as a warning system for inputs containing atypical features that might otherwise lead to unnoticed and possibly erroneous reconstruction artifacts.
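A possible way to set up the controlled artifact experiment is sketched below: a binary silhouette is placed at a random position, the corresponding mask is recorded, and the change in uncertainty is correlated with that mask. The function names, the uniform random placement, and the fill intensity `value` are assumptions for illustration.

```python
import numpy as np

def insert_artifact(image, silhouette, rng, value=1.0):
    """Place a binary silhouette at a random position; return image and mask."""
    H, W = image.shape
    h, w = silhouette.shape
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    mask = np.zeros(image.shape, dtype=bool)
    mask[top:top + h, left:left + w] = silhouette.astype(bool)
    out = image.copy()
    out[mask] = value  # assumed artifact intensity
    return out, mask

def mask_correlation(u_clean, u_ood, mask):
    """Pearson correlation between uncertainty change and the artifact mask."""
    du = np.abs(u_clean - u_ood).ravel()
    return float(np.corrcoef(du, mask.ravel().astype(float))[0, 1])
```

The modified image would then be passed through the simulated measurement and model-based inversion to obtain the out-of-distribution network input.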

Conclusion
We introduced INNs as a deterministic, post hoc and fast approach for computing upper and lower bounds and subsequently uncertainty maps for pre-trained neural networks. We showed that INNs are able to capture uncertainty due to noise and can be used to obtain directional information. They perform well as an alarm system for errors due to ill-posedness, adversarial noise and atypical artifacts and thus offer a promising tool to expose the weaknesses of deep image reconstruction models.