Introduction

Deep neural networks (DNNs) have revolutionized various applied fields, including engineering and computer science (such as AI, language processing, and computer vision)1,2,3,4, as well as the classical sciences (such as biology, physics, and medicine)5,6,7,8. DNNs learn abstract concepts and extract the desired information from high-dimensional inputs through stacks of convolutions followed by suitable non-linear activations, alleviating the need for time-consuming hand-engineered algorithms. Due to their high model complexity, however, DNNs require a huge amount of data to regularize training and prevent the networks from over-fitting the training examples. This limits their applicability in settings where data are scarce, which is often the case when data collection is expensive or time-consuming, e.g. the annotation of computed tomography scans by radiologists.

More importantly, popular deep learning models are often trained with maximum likelihood (ML) or maximum a posteriori (MAP) procedures, thus producing a point estimate but not an uncertainty value. In a classifier model, for example, the probability vector obtained at the end of the pipeline (the softmax output) is often erroneously interpreted as model confidence. In reality, a model can be uncertain in its predictions even with a high softmax output: the softmax probability only captures how likely an input belongs to a given class relative to the other classes; it does not reflect the model’s overall confidence9.

In applications of automated decision making or recommendation systems, which might involve life-threatening situations, information about the reliability of automated decisions is crucial to improve the system’s safety. In other words, it is necessary to know how confident the model is about its predictions10,11. Understanding whether the model is under-confident or falsely over-confident can inform users to take the necessary actions to ensure safety9. Take, as an example, an automated cancer detection system that encounters an out-of-distribution test sample. A traditional DNN-based system may make unreasonable suggestions and, as a result, unjustifiably bias the expert. Given information about the model’s confidence, an expert could instead rely more on their own judgment when the automated system is essentially guessing at random.

Most studies on uncertainty estimation techniques are inspired by Bayesian statistics. Bayesian Neural Networks (BNNs)12 are the probabilistic counterparts of traditional NNs, with a prior distribution placed on the weights of the network. Such networks are intrinsically suitable for generating uncertainty estimates as they produce a distribution over the output for a given input sample13. These probabilistic systems are computationally expensive for large neural network models due to the huge number of parameters and the intractable inference of the model posterior. This limitation has prompted the scientific community to develop scalable, approximate BNNs.

Variational inference14 is the most common approach for approximating the model posterior, using a simple variational distribution such as a Gaussian15. The parameters of the distribution are then set so as to minimize its difference from the true posterior (usually by minimizing the Kullback-Leibler divergence). The use of the Gaussian distribution, however, considerably increases the required number of parameters and makes training computationally expensive. In this paper, we propose a mathematically-grounded method called Monte Carlo DropConnect (MC-DropConnect) to approximate variational inference in BNNs. The main contributions of this paper are:

  1. We propose imposing the Bernoulli distribution directly on the weights of the deep neural network to estimate the posterior distribution over its weight matrices. We derive the required equations to show that this generalization provides a computationally tractable approximation of a BNN, using only existing tools and no additional model parameters.

  2. We propose metrics to evaluate the uncertainty estimation performance of Bayesian models in the classification and segmentation settings. Using these metrics, we show that our method is superior to the recently proposed MC-Dropout technique.

  3. We conduct an in-depth analysis of the uncertainty estimates in both classification and segmentation settings to investigate the robust generalization of MC-DropConnect. Our extensive evaluations show that the proposed uncertainty-informed decision making significantly improves prediction accuracy compared to standard techniques.

Our experimental results (achieved using the proposed method and metrics) provide a new benchmark for other researchers to evaluate and compare their uncertainty estimates in pursuit of safer and more reliable deep networks. The rest of this paper is organized as follows: works related to approximating Bayesian inference and estimating uncertainty are presented in the Related Work section. The Methodology section explains our proposed method along with the mathematical proofs for approximating variational inference in deep neural networks. We then present our findings and their interpretations in the Experimental Results and Discussion section. Finally, the Conclusion section concludes the paper with future research directions.

Related work

In recent years, many studies have been conducted on approximate Bayesian inference for neural networks using deterministic approaches13, Markov Chain Monte Carlo with Hamiltonian Dynamics16, and variational inference15. In particular, Neal et al. introduced Hamiltonian Monte Carlo for Bayesian neural network learning, which yields a set of posterior samples16. This method does not require the direct calculation of the posterior but is computationally prohibitive.

Recently, Gal et al.17 showed that Dropout, a well-known regularization technique18, is mathematically equivalent to approximate variational inference in the deep Gaussian process19. This method, commonly known as MC-Dropout, uses a Bernoulli approximating variational distribution on the network units and introduces no additional parameters for the approximate posterior. The main disadvantage of this method is that it often requires many forward passes for sampling, which makes it resource-intensive20. Moreover, a fully Bayesian network approximated with this method (i.e. dropout applied to all layers) results in excessive regularization21, so that the model learns slowly and does not achieve high prediction accuracy. While Bernoulli dropout is the most common approach used in the literature due to its ease of use and computation speed, several dropout variants with other distributions, such as Gaussian dropout, have been studied18,22. Concrete dropout23 was later proposed to use a continuous relaxation of dropout’s discrete masks to allow for automatic tuning of the dropout probability in large models. However, it introduces bias to the gradients of the model and reduces its prediction performance. Motivated by concrete dropout, Boluki et al. proposed a learnable Bernoulli dropout (LBD) mechanism for general deep neural networks. In LBD, the dropout probabilities are defined as variational parameters and are jointly trained with the other parameters of the DNN24. Their experimental results show that LBD is able to achieve improved accuracy and uncertainty estimates in image classification and semantic segmentation. Multiplicative Normalizing Flows25 is another technique, introduced as a family of approximate posteriors for the parameters of a variational BNN, capable of producing uncertainty estimates; however, it does not scale well to very large convolutional networks.

Another proposed approach is Deep Ensembles26, which has been shown to achieve high-quality uncertainty estimates. This method takes a frequentist approach to estimating model uncertainty by training several models and calculating the variance of their output predictions. It is quite resource-intensive as it requires storing several separate models and performing forward passes through all of them to generate the inference. As an alternative to such methods, Devries et al. proposed learning the uncertainty directly from the given input27.

Several approaches have been designed to compute uncertainty estimates in the segmentation setting. The most commonly used approach is to induce a probability distribution by applying dropout over extracted feature values to obtain independent pixel-wise probabilities21,28. However, these approaches have been shown to be prone to producing inconsistent outputs29. In contrast, a body of work has designed approaches that produce a diverse set of outputs to account for the inherent ambiguities observed in real-world applications. Several approaches train models with an oracle set loss, which only accounts for the prediction closest to the ground truth30,31,32. Kohl et al. proposed the probabilistic U-Net, in which a separate network named prior-net is trained along with the base segmentation network and maps the input to an embedding hypothesis space29. This network is thus able to generate multiple plausible segmentations by sampling different points from the learned hypothesis embedding space.

Methodology

In this section, we review BNNs and their limitations, variational inference as the standard technique in Bayesian modeling, and DropConnect as a method for regularizing NNs. We then use these tools to approximate Bayesian networks using standard NNs equipped with Bernoulli distributions applied directly to their weights. Finally, we explain the methods used for measuring and evaluating model uncertainty.

Bayesian neural networks

From a probabilistic perspective, standard NN training via optimization is equivalent to maximum likelihood estimation (MLE) for the weights. MLE ignores any uncertainty that we may have in the proper weight values. BNNs are the extension of NNs that addresses this shortcoming by placing a prior distribution (often a Gaussian) over the NN’s weights. This brings vital advantages such as automatic model regularization and uncertainty estimates on predictions13,15.

Given a BNN model with L layers parametrized by weights \({\mathbf {w}}=\{{\mathbf {W}}_i\}_{i=1}^L\) and a dataset \({\mathscr {D}}=(\mathbf {X, y})\), Bayesian inference calculates the posterior distribution of the weights given the data, \(p({\mathbf {w}}|{\mathscr {D}})\). The predictive distribution of an unknown label \({\mathbf {y}}^*\) of a test input data \({\mathbf {x}}^*\) is given by:

$$\begin{aligned} p({\mathbf {y}}^*|{\mathbf {x}}^*,{\mathscr {D}}) ={\mathbb {E}}_{p({\mathbf {w}}|{\mathscr {D}})}[p({\mathbf {y}}^*|{\mathbf {x}}^*, {\mathbf {w}})] =\int p({\mathbf {y}}^*|{\mathbf {x}}^*, {\mathbf {w}}) p({\mathbf {w}}|{\mathscr {D}}) \, d{\mathbf {w}} \end{aligned} $$
(1)

which shows that making a prediction about the unknown label is equivalent to using an ensemble of an infinite number of neural networks with various configurations of the weights. This is computationally intractable for neural networks of any size; the posterior distribution \(p({\mathbf {w}}|{\mathscr {D}})\) cannot generally be evaluated analytically. This limitation has prompted the scientific community to develop ways to approximate BNNs to make them easier to train33,34.

One common approach is to use variational inference to approximate the posterior distribution of the weights. It introduces a variational distribution, \(q_\theta ({\mathbf {w}})\), parametrized by \(\theta \), whose parameters are set to minimize the Kullback-Leibler (KL) divergence between q and the true posterior distribution:

$$\begin{aligned} \text {KL}(q_\theta ({\mathbf {w}}) || p({\mathbf {w}}|{\mathscr {D}})) \end{aligned}$$
(2)

Minimising the KL divergence is equivalent to minimizing the negative evidence lower bound (ELBO):

$$\begin{aligned} {\mathscr {L}}(\theta ) = -\int q_\theta ({\mathbf {w}}) \, \text {log}\,p({\mathbf {y}}|\mathbf {X, w}) \,d{\mathbf {w}} + \text {KL}(q_\theta ({\mathbf {w}}) || p({\mathbf {w}})) \end{aligned} $$
(3)

with respect to the variational parameters \(\theta \). The first term (commonly referred to as the expected log likelihood) encourages \(q_{\theta }({\mathbf {w}})\) to place its mass on configurations of the latent variables that explain the observed data. The second term (referred to as the prior KL) encourages \(q_\theta ({\mathbf {w}})\) to stay close to the prior, preventing the model from over-fitting. The prior KL term can be evaluated analytically for properly selected prior and variational distributions, while the expected log likelihood (i.e. the integral term) cannot be computed exactly for a non-linear neural network. Our goal in the next section is to develop an explicit and accurate approximation for this expectation. Our approach extends the results of Gal et al.35 and uses Bernoulli approximating variational inference and Monte-Carlo sampling.

DropConnect

DropConnect36, known as the generalized version of Dropout18, is a method used for regularizing deep neural networks. Here, we briefly review Dropout and DropConnect applied to a single fully-connected layer of a standard NN. For a single \(K_{i-1}\) dimensional input \({\mathbf {v}}\), the ith layer of an NN with \(K_i\) units would output a \(K_i\) dimensional activation vector \({\mathbf {a}}_i = \sigma ({\mathbf {W}}_i{\mathbf {v}})\) where \({\mathbf {W}}_i\) is the \(K_i \times K_{i-1}\) weight matrix and \(\sigma (.)\) is the nonlinear activation function (biases included in the weight matrix with a corresponding fixed input of one for the ease of notation).

When Dropout is applied to the output of a layer, the output activations can be written as \({\mathbf {a}}_{i}^{\text {\tiny DO}} = \sigma ({\mathbf {z}}_i \odot ({\mathbf {W}}_i{\mathbf {v}}))\) where \(\odot \) signifies the Hadamard product and \({\mathbf {z}}_i\) is a \(K_i\) dimensional binary vector with its elements drawn independently from \(z_{i}^{(k)}\sim \text {Bernoulli(}p_i\text {)}\) for \(k=1, \ldots , K_i\) and \(p_i\) to be the probability of keeping the output activation. DropConnect is the generalization of Dropout where the Bernoulli dropping is applied directly to each weight, rather than each output unit, thus the output activation is re-written as \({\mathbf {a}}_{i}^{\text {\tiny DC}} = \sigma (({\mathbf {Z}}_i \odot {\mathbf {W}}_i) {\mathbf {v}})\). Here, \({\mathbf {Z}}_i\) is the binary matrix of the same shape as \({\mathbf {W}}_i\), i.e. \(K_i \times K_{i-1}\). Wan et al.36 showed that adding DropConnect helps regularize large neural network models and outperforms Dropout on a range of data sets.
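To make the distinction concrete, the following minimal sketch contrasts the two masking schemes for a single fully-connected layer; the layer sizes, activation function, and keep probability are arbitrary illustrative choices rather than settings used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Arbitrary illustrative sizes: K_{i-1} = 4 inputs, K_i = 3 units.
v = rng.normal(size=4)        # input vector (bias absorbed into W for brevity)
W = rng.normal(size=(3, 4))   # weight matrix W_i of shape K_i x K_{i-1}
p = 0.5                       # probability of keeping a unit / weight

# Dropout: a Bernoulli mask z_i over the K_i output activations.
z = rng.binomial(1, p, size=3)
a_dropout = relu(z * (W @ v))

# DropConnect: a Bernoulli mask Z_i applied element-wise to the weights.
Z = rng.binomial(1, p, size=W.shape)
a_dropconnect = relu((Z * W) @ v)
```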

DropConnect for approximate Bayesian neural network

Assume the same Bayesian NN with L layers parametrized by weights \({\mathbf {w}}=\{{\mathbf {W}}_i\}_{i=1}^L\). We perform variational learning by approximating the variational distribution \(q({\mathbf {W}}_i|\varvec{\Theta }_i)\) for every layer i as:

$$\begin{aligned} {\mathbf {W}}_i = \varvec{\Theta }_i \, \odot \, {\mathbf {Z}}_i \end{aligned}$$
(4)

where \(\varvec{\Theta }_i\) is the matrix of variational parameters to be optimised, and \({\mathbf {Z}}_i\) the binary matrix whose elements are distributed as:

$$\begin{aligned} z_{i}^{(l, k)} \sim \text {Bernoulli}(p_i) \;\;\;\text {for} \;\; i=1, \ldots , L \end{aligned}$$
(5)

Here, \(z_{i}^{(l, k)}\) is the random binary value associated with the weight connecting the lth unit of the \((i-1)\)th layer to the kth unit of the ith layer. \(p_i\) is the probability that the random variables \(z_{i}^{(l, k)}\) take the value one (assuming the same probability for all the weights in a layer). Therefore, \(z_{i}^{(l, k)}=0\) corresponds to the weight being dropped out.

We start by rewriting the first term of Eq. (3) as a sum over all samples. Then we use Eq. (4) to re-parametrize the integrand so that it depends only on the Bernoulli distribution instead of \({\mathbf {w}}\) directly. Finally, we estimate the intractable integral with Monte Carlo sampling over \({\mathbf {w}}\), using a single sample per data point:

$$\begin{aligned} -\int q_\theta ({\mathbf {w}}) \, \text {log}\,p({\mathbf {y}}|\mathbf {X, w}) \,d{\mathbf {w}} =\sum _{n=1}^{N}\int -q_\theta ({\mathbf {w}}) \, \text {log}\,p({\mathbf {y}}_n|{\mathbf {x}}_n, {\mathbf {w}}) \,d{\mathbf {w}} \approx \frac{1}{N}\sum _{n=1}^{N} -\text {log}\,p({\mathbf {y}}_n|{\mathbf {x}}_n, {\hat{\mathbf {w}}}_n) \end{aligned} $$
(6)

Note that \(\hat{{\mathbf {w}}}_n\) is not a maximum a posteriori estimate, but a realisation of the random variable drawn from the Bernoulli distribution, \(\hat{{\mathbf {w}}}_n \sim q_\theta ({\mathbf {w}})\), which is identical to applying DropConnect to the weights of the network. The final sum of the log probabilities is the loss of the NN, thus we set:

$$\begin{aligned} \text {I}_{\text {NN}}({\mathbf {y}}_n, \hat{{\mathbf {y}}}({\mathbf {x}}_n, \hat{{\mathbf {w}}}_n)) = -\text {log}\,p({\mathbf {y}}_n|{\mathbf {x}}_n, \hat{{\mathbf {w}}}_n) \end{aligned}$$
(7)

where \(\hat{{\mathbf {y}}}({\mathbf {x}}_n, \hat{{\mathbf {w}}}_n)\) is the random output of the BNN. \(\text {I}_{\text {NN}}\) is defined according to the task, with the sum-of-squares loss and the softmax loss commonly selected for regression and classification, respectively.

The second term in Eq. (3) can be approximated following35, where it is shown that the KL term is equivalent to the L2 regularization term \(\sum _{i=1}^{L} ||\Theta _i||_2^2\) (up to a multiplicative constant). Thus, the objective function can be re-written as:

$$\begin{aligned} \hat{{\mathscr {L}}}_{\text {MC}}=\frac{1}{N}\sum _{n=1}^N \text {I}_{\text {NN}}({\mathbf {y}}_n, {\hat {\mathbf{y}}}_n)+\lambda \sum _{i=1}^L ||\Theta _i||_2^2 \end{aligned}$$
(8)

which is a scaled unbiased estimator of Eq. (3). More interestingly, it is identical to the objective function of a standard neural network trained with L2 weight regularization and DropConnect applied to all of its weights. Therefore, training such a neural network with stochastic gradient descent has the same effect as minimizing the KL divergence in Eq. (2). This scheme, like a BNN, results in a set of parameters that best explains the observed data while preventing over-fitting.
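As a deliberately simplified illustration of this equivalence, the sketch below runs stochastic gradient descent on a single-layer softmax classifier trained with DropConnect and L2 regularization of the variational parameters, i.e. a toy instance of Eq. (8); the data, layer size, keep probability, learning rate, and weight decay are placeholder values, not settings from our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy data: N samples, D features, C classes (placeholder sizes).
N, D, C = 32, 10, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)

Theta = 0.01 * rng.normal(size=(D, C))   # variational parameters Theta
p_keep, lam, lr = 0.5, 1e-4, 0.1         # keep probability, weight decay, learning rate

for step in range(100):
    # One Monte Carlo sample of the weights, as in Eq. (6):
    # sample a DropConnect mask Z ~ Bernoulli(p_keep) and mask the parameters.
    Z = rng.binomial(1, p_keep, size=Theta.shape)
    W = Theta * Z

    probs = softmax(X @ W)                        # stochastic forward pass
    nll = -np.log(probs[np.arange(N), y]).mean()  # expected log-likelihood term
    loss = nll + lam * np.sum(Theta ** 2)         # Eq. (8): NLL + L2 on Theta

    # Gradient of the loss w.r.t. Theta (chain rule through the fixed mask).
    dlogits = probs.copy()
    dlogits[np.arange(N), y] -= 1.0
    grad_W = X.T @ dlogits / N
    grad_Theta = grad_W * Z + 2 * lam * Theta

    Theta -= lr * grad_Theta                      # standard SGD step
```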

After training the NN with DropConnect and proper regularization, we follow Eq. (1) to generate our inference. We replace the posterior \(p({\mathbf {w}}|{\mathscr {D}})\) with the approximate posterior distribution \(q_\theta ({\mathbf {w}})\) and approximate the integral with Monte Carlo integration:

$$\begin{aligned} p({\mathbf {y}}^*|{\mathbf {x}}^*,{\mathscr {D}}) \approx \int p({\mathbf {y}}^*|{\mathbf {x}}^*, {\mathbf {w}}) q_\theta ({\mathbf {w}}) \, d{\mathbf {w}} \approx \frac{1}{T}\sum _{t=1}^T p({\mathbf {y}}^*|{\mathbf {x}}^*, \hat{{\mathbf {w}}}_t)=p_{\text {\tiny {MC}}}(\mathbf {y^*}|\mathbf {x^*}) \end{aligned} $$
(9)

with \(\hat{{\mathbf {w}}}_t \sim q_{\theta }({\mathbf {w}})\). This means that at test time, unlike common practice, the DropConnect layers are kept active to maintain the Bernoulli distribution over the network weights. Each forward pass through the trained network then generates a Monte Carlo sample from the posterior distribution, and several such forward passes are needed to approximate the posterior distribution of the softmax class probabilities. According to Eq. (9), the mean of these samples can be interpreted as the network prediction. We call this approach MC-DropConnect; it is a generalization of the previous work referred to as MC-Dropout35, and we will show its superiority in terms of higher prediction accuracy and more precise uncertainty estimation in different ML tasks.
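A minimal sketch of this test-time procedure, assuming a hypothetical `stochastic_forward` function that runs one forward pass of the trained network with a freshly sampled DropConnect mask and returns the softmax probabilities:

```python
import numpy as np

def mc_dropconnect_predict(stochastic_forward, x, T=100):
    """Approximate the predictive distribution p_MC(y* | x*) of Eq. (9).

    stochastic_forward(x) must keep DropConnect active and return an array of
    softmax probabilities of shape (num_classes,); each call is one MC sample.
    """
    samples = np.stack([stochastic_forward(x) for _ in range(T)])  # (T, C)
    p_mc = samples.mean(axis=0)        # MC estimate of the predictive distribution
    prediction = int(np.argmax(p_mc))  # the network prediction
    return prediction, p_mc, samples
```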

Measuring the model uncertainty

Generally, there are two types of uncertainty in Bayesian modeling10. Model uncertainty, also known as Epistemic uncertainty, measures what the model does not know due to the lack of training data. This uncertainty captures our ignorance about which model generated our collected data, thus can be explained away given enough data9. Aleatoric uncertainty, however, captures noise (such as motion or sensor noise) that is inherently present in the data and cannot be reduced by collecting more data28.

After computing the results of stochastic forward passes through the model, we can estimate the model’s confidence in its output. In the classification setting, several metrics have been introduced to measure uncertainty. One straightforward approach, used by Kendall et al., is to take the variance of the MC samples from the posterior distribution as the model uncertainty for each class21. Predictive entropy, suggested by Gal et al., captures both epistemic and aleatoric uncertainty; in our case, it is not the proper choice because we are specifically interested in regions of the data space where the model is uncertain9.

To specifically measure the model uncertainty for a new test sample \(\mathbf {x^*}\), we can view it as the amount of information we would gain about the model parameters if we were to receive the true label \(\mathbf {y^*}\). Intuitively, if the model is well-established in a region, knowing the output label conveys little information; in contrast, knowing the label would be informative in regions of the data space where the model is uncertain37. Therefore, the mutual information (MI) between the true label and the model parameters is defined as:

$$\begin{aligned} I(\mathbf {\mathbf {y^*, w| {\mathbf {x}}^*, {\mathscr {D}}}}) = H({\mathbf {y}}^*|{\mathbf {x}}^*, {\mathscr {D}})-{\mathbb {E}}_{p({\mathbf {w}}|{\mathscr {D}})}H[p({\mathbf {y}}^*|{\mathbf {x}}^*, {\mathbf {w}})] \end{aligned}$$
(10)

where, given the training data set \({\mathscr {D}}\), \(I(\mathbf {\mathbf {y^*, w| {\mathbf {x}}^*, {\mathscr {D}}}})\) measures the amount of information we gain about the model parameters \({\mathbf {w}}\) by receiving a test input \({\mathbf {x}}^*\) together with its true label \({\mathbf {y}}^*\). This can be approximated using the Bayesian interpretation of DropConnect derived earlier. H is the entropy, commonly referred to as the predictive entropy, which captures the amount of information in the predictive distribution:

$$\begin{aligned} H(\mathbf {\mathbf {y^*| {\mathbf {x}}^*, {\mathscr {D}}}}) = - \sum _c p({\mathbf {y}}^*=c|{\mathbf {x}}^*, {\mathscr {D}})\,\text {log} \, p({\mathbf {y}}^*=c|{\mathbf {x}}^*, {\mathscr {D}}) \end{aligned}$$
(11)

where c ranges over all classes. This is not analytically tractable for deep NNs; we use Eq. (9) to approximate it as:

$$\begin{aligned} {\hat{H}}(\mathbf {\mathbf {y^*| {\mathbf {x}}^*, {\mathscr {D}}}}) = - \sum _c p_{\text {\tiny {MC}}}({\mathbf {y}}^*=c|{\mathbf {x}}^*) \, \text {log} \, p_{\text {\tiny {MC}}}({\mathbf {y}}^*=c|{\mathbf {x}}^*) \end{aligned}$$
(12)

where \(p_{\text {\tiny {MC}}}({\mathbf {y}}^*=c|{\mathbf {x}}^*)\) is the average of the softmax probabilities of input \({\mathbf {x}}^*\) being in class c over T Monte Carlo samples. Finally, MI can be re-written as:

$$\begin{aligned} {\hat{I}}(\mathbf {\mathbf {y^*,w | {\mathbf {x}}^*, {\mathscr {D}}}}) = {\hat{H}}({\mathbf {y}}^*|{\mathbf {x}}^*, {\mathscr {D}}) + \sum _c \frac{1}{T} \sum _{t=1}^T p(\mathbf {y^*}=c|{\mathbf {x}}^*, \hat{{\mathbf {w}}}_t) \, \text {log} \, p(\mathbf {y^*}=c|{\mathbf {x}}^*, \hat{{\mathbf {w}}}_t) \end{aligned} $$
(13)

which can be computed for each model configuration \(\hat{{\mathbf {w}}}_t\) obtained by DropConnect at the tth Monte Carlo run. Note that the range of the obtained uncertainty values is not fixed across different data sets, network architectures, numbers of MC samples, etc. Therefore, we use the normalized mutual information \(I_{\text {norm}} \in [0, 1]\), computed as \(I_{\text {norm}} = \frac{I - I_{\text {min}}}{I_{\text {max}}-I_{\text {min}}}\), to report our results and facilitate comparison across various data sets and configurations. \(I_{\text {min}}\) and \(I_{\text {max}}\) are the minimum and maximum uncertainty values computed over the whole data set.
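Given the T Monte Carlo softmax samples, Eqs. (12) and (13) and the normalization can be computed as in the following sketch (an illustrative implementation, not the authors’ released code):

```python
import numpy as np

def epistemic_uncertainty(samples, eps=1e-12):
    """samples: array of shape (T, C) with the softmax probabilities of each MC pass."""
    p_mc = samples.mean(axis=0)                                # Eq. (9)
    pred_entropy = -np.sum(p_mc * np.log(p_mc + eps))          # Eq. (12)
    expected_entropy = -np.mean(np.sum(samples * np.log(samples + eps), axis=1))
    mutual_info = pred_entropy - expected_entropy              # Eq. (13)
    return mutual_info

def normalize(I, I_min, I_max):
    """I_min / I_max are computed over the whole evaluation set."""
    return (I - I_min) / (I_max - I_min)
```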

Uncertainty evaluation metrics

The proposed MC-DropConnect approach is a light-weight, scalable method to approximate Bayesian inference in deep neural networks, enabling us to perform inference and estimate the uncertainty in DNNs at once. Unlike model predictions, however, there is no ground truth for uncertainty values, which makes evaluating uncertainty estimates challenging: there is no clear and direct way to define a good uncertainty estimate.

We propose metrics that incorporate the ground-truth label, the model prediction, and the uncertainty value to evaluate the uncertainty estimation performance of such models. Figure 1 shows the processing steps required to prepare these quantities for our metrics in a segmentation example. Note that these metrics can be used for both classification and semantic segmentation tasks, since semantic segmentation is identical to pixel-wise classification; the computations applied to a single pixel describe the classification case.

Figure 1

Overview of the proposed approximate Bayesian model (Left) and metrics to evaluate the uncertainty quality (Right) in a semantic segmentation example. Model uncertainty (I) is estimated as the amount of mutual information between the model parameters and the true label. \(I_{T}\) is the uncertainty threshold which divides the prediction into certain (\(I_{\text {norm}} < I_{T}\)) and uncertain (\(I_{\text {norm}} > I_{T}\)) groups. Since segmentation is identical to pixel-wise classification, similar computations hold true for the classification task.

We first compute the map of correct and incorrect predictions (correctness map) by matching the ground-truth labels with the model predictions. Likewise, we apply a threshold \(I_{T} \in [0, 1]\) to the continuous uncertainty values \(I_{\text {norm}}\) to split the predictions into certain (\(I_{\text {norm}} < I_{T}\)) and uncertain (\(I_{\text {norm}} > I_{T}\)) groups. Therefore, when making inference in the Bayesian setting, we generally face four scenarios: incorrect-uncertain (iu), correct-uncertain (cu), correct-certain (cc), and incorrect-certain (ic) predictions (see Fig. 1). The following metrics reflect the characteristics of a good uncertainty estimator:

1. Correct-certain ratio (\(R_{cc}\)): If a model is certain about its prediction, that prediction should have a high probability of being correct. This can be written as a conditional probability:

$$\begin{aligned} R_{cc}(I_T) = P_{I_T}(\text {correct}|\text {certain}) =\frac{P(\text {correct}, \text {certain})}{P(\text {certain})} = \frac{\text {N}_{cc}}{\text {N}_{cc}+\text {N}_{ic}} \end{aligned} $$
(14)

where N represents the count for each combination and R represents the ratio.

2. Incorrect-uncertain ratio (\(R_{iu}\)): If a model is making an incorrect prediction, it is desirable for the uncertainty to be high.

$$ \begin{aligned} R_{iu}(I_T) = P_{I_T}(\text {uncertain}|\text {incorrect}) =\frac{P(\text {uncertain}, \text {incorrect})}{P(\text {incorrect})} = \frac{\text {N}_{iu}}{\text {N}_{iu}+\text {N}_{ic}} \end{aligned} $$
(15)

In this scenario, the model is capable of flagging a wrong prediction with a high epistemic uncertainty value to help the user take further precautions.

Note that the converse of the above two assumptions does not necessarily hold. That is, if a model makes a correct prediction on a sample, it does not necessarily need to be certain. A model might, for instance, correctly detect an object, but with relatively higher uncertainty because it has rarely seen that instance in such a pose or condition.

3. Uncertainty Accuracy (UA): Finally, the overall accuracy of the uncertainty estimation can be measured as the ratio of the desired cases explained above (\(\text {N}_{cc}\) and \(\text {N}_{iu}\)) over all possible cases:

$$\begin{aligned} \text {UA}(I_T) = \frac{\text {N}_{cc}+\text {N}_{iu}}{\text {N}_{cc}+\text {N}_{iu}+\text {N}_{cu}+\text {N}_{ic}} \end{aligned}$$
(16)

Clearly, for all the metrics proposed above, higher values correspond to a better-performing model. The values of these metrics depend on the uncertainty threshold, so we plot each metric with respect to the uncertainty threshold (\(I_{T}\)) and compare methods using the area under each curve (AUC). This summarizes the value of each metric over various uncertainty thresholds in a single scalar.
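The sketch below illustrates how these metrics and their AUCs could be computed from per-sample (or per-pixel) correctness indicators and normalized uncertainty values; it is an illustrative implementation of the definitions above rather than our exact evaluation code.

```python
import numpy as np

def uncertainty_metrics(correct, I_norm, thresholds=np.linspace(0.0, 1.0, 101)):
    """correct: boolean array (prediction == ground truth), one entry per sample/pixel.
    I_norm: normalized mutual information in [0, 1] for the same entries."""
    r_cc, r_iu, ua = [], [], []
    for I_T in thresholds:
        certain = I_norm < I_T
        n_cc = np.sum(correct & certain)
        n_ic = np.sum(~correct & certain)
        n_cu = np.sum(correct & ~certain)
        n_iu = np.sum(~correct & ~certain)
        r_cc.append(n_cc / max(n_cc + n_ic, 1))                       # Eq. (14)
        r_iu.append(n_iu / max(n_iu + n_ic, 1))                       # Eq. (15)
        ua.append((n_cc + n_iu) / max(n_cc + n_iu + n_cu + n_ic, 1))  # Eq. (16)
    # Area under each curve, summarizing performance over all thresholds.
    auc = {name: np.trapz(vals, thresholds)
           for name, vals in [("R_cc", r_cc), ("R_iu", r_iu), ("UA", ua)]}
    return np.array(r_cc), np.array(r_iu), np.array(ua), auc
```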

Medical data collection methodology

All medical data collection in this study was performed following the relevant guidelines and regulations. Specifically, all CT scans were anonymized to remove any patient-specific information. The requirement for patient consent was waived as the data were de-identified (Protocol PA12-1084). The data collection was approved by Institutional Review Board 4 of the MD Anderson Cancer Center, whose chair designee is Vera J. DeLaCruz (IRB 4 IRB00005015).

Experimental results and discussion

In this section, we assess the performance of the uncertainty estimates obtained from DropConnect CNNs on classification and semantic segmentation tasks. We also compare the uncertainty obtained from our proposed method with that of a state-of-the-art method, MC-Dropout, on a range of data sets and show considerable improvement in prediction accuracy and uncertainty estimation quality. We quantitatively evaluate the uncertainty estimates using our proposed evaluation metrics. Note that in all experiments throughout the paper, the MC-DropConnect and MC-Dropout techniques were never used simultaneously in the same network: a network trained with Dropout or DropConnect regularization is tested with the same Dropout or DropConnect, respectively. All experiments are implemented in the TensorFlow (version 1.13.1) framework38.

Classification

We implement fully Bernoulli Bayesian CNNs using DropConnect to assess the theoretical insights explained above in the classification setting. We show that applying the mathematically principled DropConnect to all the weights of a CNN results in a test accuracy comparable with the state-of-the-art techniques in the literature while considerably improving the models’ uncertainty estimation.

We adopt the LeNet structure (described in39) for MNIST40 and a fully-convolutional network (FCNet) for the CIFAR-10 dataset41. FCNet is composed of three blocks, each containing two convolutional layers (filter size of three and stride of one) followed by a max-pooling layer (with filter size and stride of two). The numbers of filters in the convolutional layers of the three blocks are 32, 64, and 128, respectively. Each convolutional layer is also followed by a batch normalization layer and a ReLU non-linear activation function. We refer to the Bayesian CNN with DropConnect applied to all the weights of the network as “MC-DropConnect” and compare it with “None” (no dropout or DropConnect), as well as “MC-Dropout”20, which has dropout applied after all layers. To make the comparison fair, Dropout and DropConnect are applied with the same rate of \(p=0.5\). We evaluate the networks using two testing techniques. The first is the standard test applied to each structure keeping everything in place (no weight or unit drop). The second incorporates the Bayesian methodology, generating the MC test equivalent to model averaging over \(T=100\) stochastic forward passes.

Our experimental results (Table 1, Fig. 2) show that MC-DropConnect yields marginally improved prediction accuracy when applying MC sampling. More importantly, the uncertainty estimation metrics show a significant improvement when using MC-DropConnect. Example predictions are provided in Fig. 3. We also test the LeNet networks (trained on MNIST) on rotated and background MNIST data. These are distorted versions of MNIST which can be treated as out-of-distribution examples42 that the model has never seen before. This test is conducted to investigate the generalization of the predictive uncertainty under domain shift.

Figure 2

Illustrating the quantitative uncertainty estimation performance for the classification task using the proposed evaluation metrics. Note that when varying the uncertainty threshold, our proposed MC-DropConnect approximated BNN (shown in green) generally performs better than MC-Dropout (shown in red) for both MNIST (Top) and CIFAR-10 (Bottom) datasets.

Figure 3

Sample model prediction and uncertainty estimation results on MNIST, rotated MNIST, background MNIST, and CIFAR-10 datasets. T: ground-truth label, P: model prediction (with the average MC prediction probability of the predicted class provided in the parentheses), and I: model uncertainty estimation. For each sample, the second and third lines of the provided information correspond to MC-Dropout and MC-DropConnect, respectively. The red boundary around images highlights correct-uncertain predictions of the MC-DropConnect method.

Table 1 Test prediction error (%) and uncertainty estimation performance of the LeNet and FCNet networks and their Bayesian estimates on the MNIST and CIFAR-10 datasets.

As shown in Fig. 3, the MC-DropConnect BNN often yields a high uncertainty estimate when the prediction is wrong and makes accurate predictions when it is certain. We observed fewer failure cases using MC-DropConnect compared with MC-Dropout (also reflected in the \(R_{iu}\) and \(R_{cc}\) values in Fig. 2). Similar observations can be made in Fig. 4, which illustrates the distribution of the model uncertainty over the correct and incorrect predictions separately. It shows that the MC-DropConnect approximation produces significantly higher model uncertainty values (Kolmogorov-Smirnov test, p value \(<0.001\)) when the prediction is erroneous. This adds complementary information to the conventional network output, which can be leveraged by the automated system to reject the prediction and send it for further inspection.

Figure 4

Illustrating the distribution of model uncertainty values for the CIFAR-10 test samples. Distributions are plotted separately for correct and incorrect predictions and for both MC-DropConnect (Left) and MC-Dropout (Right).

MC-DropConnect is also observed to yield informative correct predictions with high uncertainty estimates. Examples are highlighted with red boundaries in Fig. 3. These cases often correspond to visually complicated samples on which the network is not confident. Such false-positive flags (correct predictions marked as uncertain) are useful and can be considered red flags for cases where a model is more likely to make inaccurate predictions.

Enhanced performance with uncertainty-informed referrals

An uncertainty estimate with such characteristics (i.e. high uncertainty as an indication of an erroneous prediction, as well as informative false-positive flags) provides valuable information in situations where control is handed to automated systems in real-life settings, with the possibility of life-threatening consequences for humans. These include applications such as self-driving cars, autonomous control of drones, and automated decision making and recommendation systems in the medical domain. An automated cancer detection system, for example, trained on a limited amount of data (which is often the case due to an expensive or time-consuming data collection process) could encounter test samples lying outside its observed data distribution. It is therefore prone to making unreasonable decisions or recommendations, which could result in a biased decision by the expert. Uncertainty estimation can be utilized in such scenarios to detect this undesirable behavior of automated systems and enhance the overall performance by flagging appropriate subsets for further analysis.

We set up an experiment to test the usefulness of the proposed uncertainty estimation in mimicking the clinical work-flow and referring samples with high uncertainty for further testing. First, the model predictions are sorted according to their corresponding epistemic uncertainty (measured by the mutual information metric). We then compute the prediction accuracy as a function of confidence, by sweeping the level of tolerated uncertainty and the fraction of retained data (see Fig. 5). We observe a monotonic increase in prediction accuracy, with MC-DropConnect outperforming MC-Dropout, as the level of tolerated uncertainty and the fraction of retained data decrease. We also compare with removing the same fraction of samples at random, i.e. with no use of uncertainty information, which further indicates how informative the uncertainty is about prediction performance. Note that in practice, the uncertainty cutoff threshold should be selected as the threshold that yields the best prediction performance on the validation dataset, and should not be changed when using the test set.

Figure 5

Enhanced prediction accuracy achieved by rejecting highly uncertain samples. The prediction accuracy is computed over the test samples of the CIFAR-10 dataset and depicted as a function of the tolerated amount of model uncertainty (Left) and the fraction of retained data (Right). The black curve in the right panel illustrates the effect of randomly rejecting the same number of samples; it is plotted as the mean (±std) over 20 samplings. This shows that uncertainty is an effective measure of prediction accuracy.
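The uncertainty-informed referral described above can be reproduced with a short script such as the following sketch; it assumes that per-sample correctness indicators and mutual-information values have already been computed, and the array and function names are hypothetical.

```python
import numpy as np

def accuracy_vs_retained(correct, mutual_info, fractions=np.linspace(0.1, 1.0, 10)):
    """correct: boolean array of per-sample correctness; mutual_info: per-sample MI.
    Returns the accuracy obtained when keeping only the most certain fraction of data."""
    order = np.argsort(mutual_info)      # most certain (lowest MI) samples first
    correct_sorted = correct[order]
    accs = []
    for f in fractions:
        k = max(int(round(f * len(correct_sorted))), 1)
        accs.append(correct_sorted[:k].mean())   # accuracy on the retained subset
    return fractions, np.array(accs)
```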

Convergence of the MC-DropConnect

Even though the proposed MC-DropConnect method results in better prediction accuracy and uncertainty estimation, it comes at the price of prolonged test time, because the network must be evaluated stochastically multiple times and the results averaged. Therefore, while the training time of the models and their probabilistic variants is identical, the test time is scaled by the number of averaged forward passes. This becomes important in practice for applications in which test-time efficiency is critical. To evaluate the convergence of the MC-DropConnect approximation, we assessed the prediction accuracy of the FCNet on the CIFAR-10 dataset over different numbers of Monte Carlo samples (T), and report the average results over 10 runs in Fig. 6. As can be seen, MC-DropConnect reaches a significantly lower prediction error than the baseline network (the black dotted line) after only 2 samples, whereas MC-Dropout requires 6. Moreover, MC-DropConnect achieves an error less than one standard deviation away from its best performance (at T = 90) after only 18 samples, while MC-Dropout requires 54 (with its best performance at T = 94).

Figure 6

Test error of the FCNet on CIFAR-10 for different numbers of forward passes in MC-Dropout and MC-DropConnect, averaged over 10 repetitions. The shaded area around each curve shows one standard deviation. The black dotted line shows the test error for the same neural network with no sampling.

Semantic segmentation

Here, we perform similar experiments to assess the performance of the MC-DropConnect approximation of BNNs and compare it with the MC-Dropout benchmark. The segmentation prediction performance is quantified using the pixel accuracy, mean accuracy, and mean IoU metrics defined in43. Details of the data sets and network architectures used in each experiment are briefly explained below. Note that in all experiments, the dropout and DropConnect layers are placed in the same part of the network and use the same rate of \(p=0.5\).
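For reference, the three segmentation scores (as defined in43) can be computed from a class confusion matrix as in this illustrative sketch (not the authors’ code):

```python
import numpy as np

def segmentation_scores(conf):
    """conf: confusion matrix of shape (C, C); conf[i, j] counts pixels of
    ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    gt_total = conf.sum(axis=1)                    # ground-truth pixels per class
    union = gt_total + conf.sum(axis=0) - tp       # union of prediction and ground truth
    pixel_acc = tp.sum() / conf.sum()              # overall pixel accuracy
    mean_acc = np.mean(tp / np.maximum(gt_total, 1))  # mean per-class accuracy
    mean_iou = np.mean(tp / np.maximum(union, 1))     # mean intersection over union
    return pixel_acc, mean_acc, mean_iou
```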

CamVid with SegNet

CamVid44 is a road scene understanding data set which contains 367, 100, and 233 training, validation, and test images, respectively, with 12 classes. Images are of size \(360 \times 480\) and include both bright and dark scenes. We chose SegNet as the network architecture for the semantic segmentation task to make the results of our approach comparable to those of21.

CityScapes with ENet

CityScapes45 is one of the most popular data sets for urban scene understanding, with 5000, 500, and 1525 images for training, validation, and test. Images are of size \(2048 \times 1024\), collected in 50 different cities, and contain 20 different classes. Due to the large image size and the larger number of classes, we chose ENet46, a more powerful network that requires fewer FLOPs and parameters. The spatial dropout layers used in this framework are replaced with regular dropout and DropConnect layers for our purpose.

3D CT-Organ with VNet

Since uncertainty estimates can play a crucial role in medical diagnostics, we also tested our model uncertainty estimation approach on the semantic segmentation of body organs in abdominal 3D CT scans. The CT-Organ dataset includes 226 unique CT scans captured by General Electric and Siemens scanners at a single hospital. The study was approved by the Institutional Review Board (IRB) at the University of Texas MD Anderson Cancer Center; the informed consent requirement was waived by the IRB as only de-identified data were used. The scans are down-sampled to \(512\times 512\) pixels and contain between 186 and 730 slices (mean = 420, std = 95). We used the volumetric CT scans from 180 patients for training and the rest for testing. We used V-Net47, one of the most commonly used architectures for the segmentation of volumetric medical images. The data include six classes: background, liver, spleen, kidney, bone, and vessel.

Qualitative observations

Figure 7 shows example segmentation and model uncertainty results from the various Bayesian frameworks on the different datasets, and compares the qualitative performance of MC-DropConnect with that of MC-Dropout. The correctness and confidence maps highlight the misclassified and uncertain pixels, respectively. Our observations show that MC-DropConnect produces high-quality uncertainty estimation maps, outperforming MC-Dropout, i.e. it displays higher model uncertainty when the model makes wrong predictions.

Figure 7

Qualitative results for semantic segmentation and uncertainty estimates on CamVid, CityScapes, and CT-Organ datasets. Each row depicts a single sample and includes the input image with ground truth, prediction, correctness, and confidence (using the mutual information metric) maps for both MC-Dropout and MC-DropConnect. Correctness map is the binary map that shows the correct and incorrect predictions. Confidence map is the thresholded map of uncertainty values computed over all classes. In all cases, the threshold is set manually to the one that achieves the highest UA. Correct and certain regions are respectively shown in white color in the correctness and confidence maps.

We generally observe that higher uncertainty values are associated with three main scenarios. First, uncertainty is high at the boundaries of object classes, capturing the ambiguity in label transitions. Second, we observe a strong relationship between the frequency at which a class label appears and the model uncertainty: models generally have significantly higher uncertainty for rare class labels (those that are less frequent in the data, such as the pole and sign-symbol classes in CamVid) and, conversely, are more confident about class labels that are more prevalent in the data sets. Third, models are less confident in their predictions for objects that are visually difficult or ambiguous. For example, the (bicyclist, pedestrian) classes in CamVid and the (car, truck) classes in CityScapes are visually similar, which makes it difficult for the model to make a correct prediction, thus yielding higher uncertainty values.

Quantitative observations

We report the semantic segmentation results in Table 2 and Fig. 8. We find that MC-DropConnect generally improves the accuracy of the predicted segmentation masks for all three model-data set pairs.

Table 2 Quantitative prediction and uncertainty estimation performance of the various frameworks on the CamVid, CityScapes, and CT-Organ datasets.
Figure 8

Illustrating the quantitative uncertainty estimation performance for the semantic segmentation task using the proposed evaluation metrics. Note that when varying the uncertainty threshold, our proposed MC-DropConnect approximated BNN (shown in green) generally performs better than MC-Dropout (shown in red) for the CamVid (Top), CityScapes (Middle), and CT-Organ (Bottom) datasets.

Similar to the classification task, we computed the segmentation accuracies for varying levels of model confidence; the results are provided in Table 3. For all three dataset-model pairs, we observed very high accuracy at the 90th percentile of confidence. This indicates that the proposed method yields a model uncertainty estimate that is an effective measure of confidence in the prediction.

Table 3 Pixel-wise accuracy of the Bayesian frameworks as a function of confidence for the 0th percentile (all pixels) through to the 90th percentile (10% most certain pixels).

Conclusion

We have presented MC-DropConnect as a mathematically grounded and computationally tractable approximation of inference in Bayesian neural networks. This framework outputs a measure of model uncertainty at no additional computational cost, i.e. by extracting information from existing models that has so far been thrown away. We also developed new metrics to evaluate the uncertainty estimates of models across ML tasks such as regression, classification, and semantic segmentation. We created probabilistic variants of several widely used frameworks (for both classification and semantic segmentation tasks) using MC-DropConnect, and exploited the proposed metrics to evaluate and compare the uncertainty estimation performance of the various models. Empirically, we observed that MC-DropConnect improves the prediction accuracy and yields a precise estimate of the model’s confidence in its predictions. Analysis of the output uncertainty estimates via the proposed metrics shows that they serve as an additional piece of information which can assist users in the decision-making process. We additionally recommend inserting DropConnect layers into non-regularized pre-trained networks and fine-tuning them in order to properly perform inference and uncertainty estimation at test time.

Future research includes the study of how imposing DropConnect (and with different drop probabilities) affects the trained convolutional kernels. While our method employs a fixed-rate randomized weight-dropping mechanism, it would be interesting to investigate a learnable weight-dropping rate (similar to Boluki et al.24) as a more flexible alternative. While we have effectively validated this method on classification and segmentation tasks, future work should investigate the feasibility of MC-DropConnect for regression tasks. Leveraging the uncertainty during the training process to enrich the model’s knowledge of the data domain is another interesting research direction to be investigated.