Introduction

The ability of convolutional neural networks (CNNs) to produce useful predictions is now well understood, but estimating the uncertainty of these predictions remains a challenge. Deep Ensemble [1] and Monte Carlo (MC) dropout [2] are two of the most popular methods for uncertainty estimation. Both can be understood through the concept of ensembles, which use multiple models to obtain diverse predictions. Deep Ensemble is an explicit ensemble of multiple models, where each model is randomly initialized and trained independently using stochastic gradient descent [3]. MC-Dropout, on the other hand, is an implicit ensemble over a single stochastic network, where randomness is introduced by dropping different parts of the weights for each input; during inference, the single network is run multiple times with different weight configurations to obtain a set of predictions and an uncertainty estimate. Both methods thus produce diverse predictions for a given input by introducing stochasticity into the training or testing process, and then use an aggregate measure such as variance or entropy as the uncertainty estimator.

However, both methods have their weaknesses. MC-Dropout performs significantly worse than Deep Ensemble on some uncertainty estimation tasks [1, 4, 5]. We argue that the main reason for MC-Dropout’s poor performance is the high correlation between the ensemble elements, which makes the overall predictions insufficiently diverse: dropping weights at random yields similar weight configurations across the sampled models and, consequently, less diverse predictions [6]. Deep Ensemble does not suffer from this problem because its ensemble members are trained independently, so their weight configurations do not coincide. Despite its success, Deep Ensemble is limited in practice due to its expensive computational and memory costs, which increase linearly with the ensemble size in both the training and testing phases. In terms of computation, each ensemble member requires a separate neural network to forward pass its inputs. From the memory perspective, each ensemble member requires a separate copy of the neural network weights, each of which can contain up to millions (sometimes billions) of parameters [7].

In this work, we introduce a ReLU-based uncertainty estimation (RBUE) method that tackles these challenges. It builds on the intuition that the poor performance of dropout-based methods is due to the high correlation between the multiple outputs, which makes the overall predictions insufficiently diverse, and that this lack of diversity is caused by the limited randomness added by the Dropout module. Because activation functions are distributed evenly throughout the neural network, placing the randomness in the activation function lets it propagate well to the outputs and yields more diverse predictions, thus improving the accuracy of the uncertainty estimate. It is worth noting that, unlike Deep Ensemble, RBUE can be implemented on a single model. Hence, our method is designed to achieve a trade-off between accurate uncertainty estimation and an acceptable computational cost.

Inspired by [8], we add randomness to the activation function to obtain more diverse predictions than MC-Dropout, at a training cost much smaller than that of Deep Ensemble. Under the RBUE method, we propose the strategy MC-DropReLU and develop the strategy MC-RReLU; the main difference between them is the sampling distribution of the slope of the negative semi-axis of ReLU. During training, a random activation function is applied to the input values for each input, and the operation is as simple as standard dropout. During testing, we run the model multiple times for each input to obtain a set of predictions and an uncertainty estimate. Our method has only one key hyperparameter: the retention rate q of the activation function. We evaluate our method on several natural and synthetic datasets and demonstrate that it outperforms MC-Dropout and Bayesian neural networks in the accuracy of uncertainty estimation. Compared to Deep Ensemble, our method has competitive performance but is more favorable in training time and memory requirements. Furthermore, we analyze and compare the output diversity of MC-Dropout and our method from the variance perspective and obtain the relationship between the hyperparameters of both methods and the output diversity. To summarize, the main contributions of this work are as follows:

  • We propose a ReLU-based uncertainty estimation framework by adding randomness to ReLU. It is easy to implement and practical.

  • We propose the strategy MC-DropReLU and develop the strategy MC-RReLU for the concrete implementation of our framework. The experiments demonstrate that MC-DropReLU performs better than MC-Dropout at a similar computational cost, matching Deep Ensemble at a fraction of the cost. MC-RReLU provides an idea for the concrete realization of this framework.

  • We briefly analyse the prediction variance of the proposed strategy (in the case of MC-DropReLU) theoretically to demonstrate the feasibility of the method, and give guidelines for setting its hyperparameters.

Related work

In what follows, we provide a brief background on model uncertainty estimation and review the best-known methods. Among them, the most prominent and practical uncertainty estimation methods are Deep Ensemble [1] and MC-Dropout [2].

Background

Uncertainty Estimation (UE) is a pivotal component to equip DNNs with the ability to know what they do not know: it provides confidence estimates for model predictions. Epistemic (model) uncertainty [9] refers to uncertainty caused by a lack of knowledge. In other words, it reflects the ignorance of the DNN, and hence its epistemic state rather than any underlying random phenomenon. This uncertainty can be explained away given enough data, and it can be estimated from multiple predictions obtained through sampling or ensembling.

Ensemble

Ensemble is one of the oldest tricks in the machine learning literature [10]. By combining the outputs of several models, an ensemble can achieve better performance than any of its members [11,12,13,14]. Deep Ensemble [1] trains multiple DNNs with different initializations and uses all the predictions for uncertainty estimation. More recently, Ovadia et al. [4] and Gustafsson et al. [5] independently benchmarked existing approaches to uncertainty modeling on various datasets and architectures and observed that Deep Ensemble tends to outperform Bayesian neural networks (BNNs) in both accuracy and uncertainty estimation quality. Fort et al. [6] investigated the loss landscape and postulated that variational methods only capture local uncertainty, whereas Deep Ensemble explores different global modes, which explains why Deep Ensemble generally performs better. Despite its success on benchmarks, Deep Ensemble is limited in practice due to its expensive computational costs: during training, multiple independent networks must be trained, and during testing, all of these networks must be kept in memory.

Some methods have been proposed to tackle this issue by taking a slightly different approach towards creating an ensemble. They only need a single training to get multiple models with different weight configurations. For example, Snapshot Ensemble [12] trains a single network and uses its parameters at k different points of the training process to instantiate k networks to form the target ensemble. Snapshot Ensemble cyclically varies the learning rate, enabling the single network to converge to k local minima along its optimization path. Similarly, TreeNets [15] also train a single network, but this network is designed to branch out into k sub-networks after the first few layers. Thus, effectively every sub-network functions as a separate member of the target ensemble.

Although these methods partially solve the problem of training time, their prediction performance and calibration scores are usually worse than those of standard Deep Ensemble. Furthermore, their time advantage is obtained through training tricks that make it difficult to guarantee diversity between models and, consequently, the ensemble’s performance.

Dropout

Another smart option to model uncertainty in DNNs is the use of dropout [16] as a way to approximate Bayesian variational inference. The simplicity of the key idea of this formulation is one of the main reasons for its popularity. By enabling dropout in training and testing phases and making multiple forward passes through the network using the same input data, the first two moments of the predictive distribution (mean and variance) can be estimated using the output distributions of the different passes. The mean is then used as an estimate and the variance as a measure of its uncertainty. This technique is called Monte Carlo dropout (MC-Dropout) [2]. Furthermore, MC-Dropout has zero memory overhead compared to a single model. However, despite its success and simplicity, different predictions made by several forward passes with randomly dropped neurons seem to be overly correlated and strongly underestimate the variance. Moreover, when using MC-Dropout in practical applications, architectural choices like where to insert the dropout layers, how many to use, and the choice of dropout rate are often either empirically made or set a priori [17,18,19], leading to possibly suboptimal performance.
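As a concrete illustration of this recipe, the sketch below (our own, not taken from any particular MC-Dropout implementation; the helper name is ours) keeps the dropout layers stochastic at evaluation time while the rest of the network, e.g. batch-norm statistics, stays in eval mode:

```python
import torch.nn as nn

def enable_mc_dropout(model: nn.Module) -> None:
    """Keep dropout stochastic at test time so that repeated forward passes
    on the same input yield different predictions."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (nn.Dropout, nn.Dropout2d)):
            module.train()  # dropout keeps sampling masks in train mode
```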

Other methods

In addition to the two types of methods mentioned above, several approaches based on Bayesian neural networks (BNNs) [20,21,22] try to estimate predictive uncertainty by imposing probability distributions over model parameters instead of using point estimates, including Markov Chain Monte Carlo (MCMC) [22], Laplace approximation [23] as well as recent work on variational Bayesian methods [24]. Although BNNs provide a set of theoretical methods for uncertainty estimation, it is usually difficult to use approximate inference techniques to infer the true posteriors of the parameters. Although these techniques are theoretically grounded, Deep Ensemble and MC-Dropout often show significantly better performance in practice [4, 5], in terms of both accuracy and quality of the predictive uncertainty.

Methods

In this section, we describe the proposed method in detail. We begin with the formulation of the activation function framework for embedded randomness in Sect. “Formulation of the ReLU framework with embedded randomness”. Then, two strategies of RBUE are introduced in Sect. “Two strategies of RBUE”. Next, we introduce how to estimate uncertainty using two strategies, including the training and testing phases in Sect. “Sampling at test time to estimate model uncertainty”. Finally, we analyze the prediction diversity of our method in Sect. “Analysis of variance of predictions”.

Formulation of the ReLU framework with embedded randomness

This part describes the formulation of the ReLU framework with embedded randomness. Suppose \(x_0\) is an input vector of an L-layer fully connected neural network. Let \(x_l\) be the output of the lth layer and \(W_l\) be the weight matrix of the lth layer. Biases are neglected for the convenience of presentation.

$$\begin{aligned} \begin{array}{l} x_{l}=\left[ x_{l}[1], x_{l}[2], \ldots , x_{l}[n]\right] ^{T} \in {\mathbb {R}}^{n}, \quad W_{l} \in {\mathbb {R}}^{m \times n}. \end{array} \end{aligned}$$
(1)

Let \(x_{l+1}^{\prime }\) be the input of \((l+1)\)th activation function layer. For a standard fully connected or convolution network, the m-dimensional input vector can be written as

$$\begin{aligned} x_{l+1}^{\prime } & = W_{l} x_{l} \\ & = \left[ \sum _{i=1}^{n} W_{l}[1][i] \cdot x_{l}[i], \ldots , \sum _{i=1}^{n} W_{l}[m][i] \cdot x_{l}[i]\right] ^{T}. \end{aligned}$$
(2)

\(f(\cdot )\) is the element-wise nonlinear activation operator that maps an input vector to an output vector by applying a nonlinearity on each input. We assume \(f: {\mathbb {R}}^{m} \rightarrow {\mathbb {R}}^{m}\) and the output of \((l+1)\)th layer can be written as

$$\begin{aligned} \begin{aligned} x_{l+1}&=f_{l+1}\left( x_{l+1}^{\prime }\right) \\&= \left[ \sigma _{l+1}^{1}\left( x_{l+1}^{\prime }[1]\right) , \ldots , \sigma _{l+1}^{m}\left( x_{l+1}^{\prime }[m]\right) \right] ^{T}. \end{aligned} \end{aligned}$$
(3)

In Eq. (3), \(\sigma \) could be a ReLU, a sigmoid, or a tanh function, but in this paper we only consider \(\sigma \) to be a random variant of the ReLU function. The randomness is given by

$$\begin{aligned} \sigma _{l+1}^{m}\left( x_{l+1}^{\prime }[m]\right) =\left\{ \begin{array}{ll} x_{l+1}^{\prime }[m], & \text {if } x_{l+1}^{\prime }[m] \ge 0 \\ a_{l+1}^{m}\, x_{l+1}^{\prime }[m], & \text {if } x_{l+1}^{\prime }[m]<0 \end{array}\right. \end{aligned}$$
(4)

where \(a_{l+1}^{m}\) is a random parameter with \(a_{l+1}^{m} \sim P^{*}\). \(P^{*}\) can be a continuous distribution such as a uniform distribution, or a discrete distribution such as a Bernoulli distribution.

From Eq. (4), the random component of our method lies in the slope of the line on the negative half-axis of the x-axis being a random number. Such a random framework brings two benefits.

  1. The same neuron receives a different activation output at each forward pass of the neural network, so the network does not become overly dependent on particular neurons, which improves model generalization.

  2. Adding randomness to the activation function modifies the model less than introducing randomness into other neural network modules, and it can be applied to the estimation of model uncertainty.

Two strategies of RBUE

When \(P^{*}\) follows a Bernoulli distribution, we call this ReLU with embedded randomness DropReLU. When \(P^{*}\) follows a uniform distribution, we call it RReLU. As shown in Fig. 1, from (a) to (c), these are ReLU, DropReLU and RReLU, respectively.

Fig. 1

Three activation functions. a is the regular ReLU. b and c represent two random activation functions, DropReLU and RReLU, respectively. The main difference between b and c is the sampling distribution of the slope of the negative half-axis of ReLU: Bernoulli in b and uniform in c

Fig. 2

Framework diagram of MC-DropReLU for the image classification problem. We obtain N sets of predictions for the same image by performing N random forward passes through the neural network. The final prediction is the average of these N sets of predictions, and their variance or entropy gives the final uncertainty. MC-RReLU follows the same framework

Strategy I: drop rectified linear unit for uncertainty estimation

In this strategy, we drop the pointwise nonlinearities in f randomly. Specifically, each of the m nonlinearities \(\sigma \) in the operator f is kept with probability q (or dropped with probability \(1-q\)). Equation (4) can be rewritten as

$$\begin{aligned} \sigma _{l+1}^{m}\left( x_{l+1}^{\prime }[m]\right) =\left\{ \begin{array}{ll} x_{l+1}^{\prime }[m], & \text {if } x_{l+1}^{\prime }[m] \ge 0 \\ (1-Q_{l+1}^{m})\, x_{l+1}^{\prime }[m], & \text {if } x_{l+1}^{\prime }[m]<0, \end{array}\right. \end{aligned}$$
(5)

where \(Q_{l+1}^{m}\) is a random variable following a Bernoulli distribution B(q) that takes value 1 with probability q and 0 with probability \(1-q\). Intuitively, when \(Q=1\), then \(x_{l+1} =f_{l+1}\left( x_{l+1}^{\prime }\right) =ReLU(x_{l+1}^{\prime })\), meaning all the nonlinearities in this layer are kept. When \(Q=0\), then \(x_{l+1} =f_{l+1}\left( x_{l+1}^{\prime }\right) =x_{l+1}^{\prime }\), meaning all the nonlinearities are dropped. The general case lies somewhere between these two limits, where the nonlinearities are partially kept or dropped. At each iteration, a new realization of Q is sampled from the Bernoulli distribution. We combine this randomness with the Monte Carlo method to estimate the model uncertainty.
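As an illustration, a minimal PyTorch sketch of a DropReLU module implementing Eq. (5) is given below. This is our own sketch, not the authors’ released code; the module name and the elementwise sampling granularity are our assumptions (a per-layer or per-channel Bernoulli draw would be an equally valid reading of Eq. (5)):

```python
import torch
import torch.nn as nn

class DropReLU(nn.Module):
    """Random activation of Eq. (5): with probability q the unit behaves as
    ReLU, with probability 1 - q the nonlinearity is dropped (identity)."""

    def __init__(self, q: float = 0.9):
        super().__init__()
        self.q = q

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One Bernoulli(q) draw per element, resampled at every forward pass.
        keep = torch.bernoulli(torch.full_like(x, self.q))
        # For x >= 0 both branches of Eq. (5) return x;
        # for x < 0 the output is (1 - Q) * x.
        return torch.where(x >= 0, x, (1.0 - keep) * x)
```

In our reading, training then proceeds exactly as with a regular ReLU network, with this module substituted for ReLU.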

In the experiments of this paper, we take q as 0.8, 0.85, 0.9, and 0.95. Among them, \(q=0.8\) and \(p=0.2\) are used as a comparison to prove the analysis of variance in Sect. 3.4.

Strategy II: random rectified linear unit for uncertainty estimation

In this strategy we use the Random Rectified Linear Unit (RReLU), a random version of leaky ReLU [25] that was first proposed and used in the Kaggle National Data Science Bowl (NDSB) competition. Although RReLU already exists, previous work only exploited its randomness during training to reduce the risk of overfitting; its randomness at test time, which can be used to estimate model uncertainty, has not been exploited. Moreover, this property fits the framework we propose. The highlight of RReLU is that the slope of the line on the negative half-axis of the x-axis is a random variable sampled from a uniform distribution U(l, u). Equation (4) can be rewritten as

$$\begin{aligned} \sigma _{l+1}^{m}\left( x_{l+1}^{\prime }[m]\right) =\left\{ \begin{array}{ll} x_{l+1}^{\prime }[m], & \text {if } x_{l+1}^{\prime }[m] \ge 0 \\ a_{l+1}^{m}\, x_{l+1}^{\prime }[m], & \text {if } x_{l+1}^{\prime }[m]<0, \end{array}\right. \end{aligned}$$
(6)

where \(a_{l+1}^{m}\) is a random variable following a uniform distribution U(l, u) with \(l<u\) and \(l,u \in [0,1)\). As suggested by the NDSB competition winner, \(a_{l+1}^{m}\) is sampled from \(U(\frac{1}{8},\frac{1}{3})\). We use the same configuration in this paper.
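As a practical note, PyTorch already ships an RReLU whose negative slope is resampled from U(l, u) only in training mode. One possible way to realize MC-RReLU, sketched below under that assumption (not necessarily the implementation used for the paper’s experiments), is simply to keep that sampling active at test time:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
# F.rrelu resamples the negative slope from U(lower, upper) on every call
# only when training=True, so the flag must stay True at test time as well.
y = F.rrelu(x, lower=1.0 / 8, upper=1.0 / 3, training=True)
```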

Sampling at test time to estimate model uncertainty

As shown in Fig. 2, we use DropReLU as an example to illustrate how to estimate model uncertainty.

Training phase. The network is trained just like a regular ReLU network. The only change is to replace ReLU with one of the two random ReLUs mentioned in the previous part. Moreover, such a substitution will not affect the generalization of the model, nor will it affect the training time of the model.

Testing phase. The connection of the network is the same as in the training phase and does not require any changes. Keeping the random ReLU enabled at test time allows us to perform multiple forward passes, each equivalent to a network with different parameters. We refer to this Monte Carlo estimation as MC-DropReLU (MC-RReLU). In practice, this amounts to performing N stochastic forward passes through the network and averaging the results. As shown in Fig. 2, the final prediction and the predictive uncertainty are derived from the mean and entropy of the N sets of outputs, just as in MC-Dropout [2].
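The test-time procedure can be summarized by the following sketch (illustrative code of ours; the mean-then-entropy aggregation follows the description above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_predict(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 100):
    """Perform N stochastic forward passes (Fig. 2): the mean softmax is the
    final prediction and its entropy serves as the uncertainty estimate."""
    probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)                                  # final prediction
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)   # predictive uncertainty
    return mean, entropy
```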

Analysis of variance of predictions

In this part, we show, via a variance analysis, that our method yields more prediction diversity than MC-Dropout. To simplify the analysis, we consider a single layer of the neural network and ignore the bias. Suppose that layer i is a fully connected layer and x is the output of layer i, i.e., the input of the Dropout layer or DropReLU layer that follows layer i.

For the Dropout layer, its output can be formulated as

$$\begin{aligned} f_\textrm{Dropout}=\sum _{k=1}^{K} P_{k} \cdot x_{k}, \end{aligned}$$
(7)

where \(P_{k} \sim B(p)\) and it takes value 0 with probability p and 1 with probability \(1-p\). K represents the number of neurons in layer i. The variance of the output of Dropout layer is

$$\begin{aligned} {\text {Var}}(f_{\textrm{Dropout}})={\text {Var}}\left( \sum _{k=1}^{K} P_{k} \cdot x_{k} \right) =p(1-p) \sum _{k=1}^{K} x_{k}^{2}. \end{aligned}$$
(8)
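As a quick numerical sanity check of Eq. (8) (an illustrative snippet of ours with arbitrary sizes, not part of the paper’s experiments), the empirical variance of the Dropout output over many sampled masks matches the closed form:

```python
import torch

torch.manual_seed(0)
K, S = 64, 200_000                 # neurons in layer i, Monte Carlo samples
x = torch.randn(K)                 # one fixed activation vector x
p = 0.2                            # dropout rate

# P_k = 1 with probability 1 - p, 0 with probability p (Eq. 7).
P = torch.bernoulli(torch.full((S, K), 1.0 - p))
empirical = (P * x).sum(dim=1).var()
closed_form = p * (1 - p) * x.pow(2).sum()
print(empirical.item(), closed_form.item())   # the two values agree closely
```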

For the DropReLU layer, its output can be formulated as

$$\begin{aligned} f_\textrm{DropReLU}=\sum _{k=1}^{K} \left[ (1-Q_{k}) \cdot x_{k} + Q_{k} \cdot \textrm{ReLU}(x_{k})\right] , \end{aligned}$$
(9)

where \(Q_{k} \sim B(q)\) and it takes value 0 with probability \(1-q\) and 1 with probability q. K represents the number of neurons in layer i. The variance of the output of DropReLU layer is

$$\begin{aligned} \begin{aligned} {\text {Var}}(f_\textrm{DropReLU})&={\text {Var}}\left( \sum _{k=1}^{K} \left[ (1-Q_{k}) \cdot x_{k} + Q_{k} \cdot \textrm{ReLU}(x_{k})\right] \right) \\&=q(1-q) \sum _{k=1}^{K} x_{k}^{2} + \epsilon , \end{aligned} \end{aligned}$$
(10)

where \(\epsilon ={\text {Var}}\left( \sum _{k=1}^{K} Q_{k} \cdot \textrm{ReLU}\left( x_{k}\right) \right) >0\). This term has no simple closed form, but it is guaranteed to be greater than 0.

This analysis shows that when \(q \le 1-p\), the output variance of DropReLU is always greater than that of Dropout, which means the outputs of DropReLU are more diverse than those of Dropout. This conclusion guides the setting of the hyperparameters in our experiments, and the experimental results confirm it. When \(q > 1-p\), because \(\epsilon \) cannot be computed, the comparison must be determined empirically.

Experiments

In this section, we show the superiority of our proposed method by several experiments. We use these experiments to answer the following questions:

  Q1. How accurate are the predictions, and how reliable is the uncertainty estimated by MC-DropReLU and MC-RReLU on clean datasets compared to other baselines?

  Q2. How accurate are the predictions, and how reliable is the uncertainty estimated by MC-DropReLU and MC-RReLU on corrupted datasets (a kind of out-of-distribution dataset) compared to other baselines?

  Q3. How diverse are the neural networks sampled by MC-DropReLU and MC-RReLU compared with the baselines?

  Q4. What effect do the position and configuration of the random ReLU in the neural network have on predictive accuracy and uncertainty?

Fig. 3

Examples of CIFAR-10 images corrupted by shot noise, at severities of 0 (clean image) through 5 (maximum corruption included in CIFAR-10-C)

Preparation

Datasets

CIFAR10 and CIFAR100 consist of 60,000 32\(\times \)32 colour images in 10 and 100 classes, with 6000 and 600 images per class, respectively. There are 50,000 training images and 10,000 test images. We adopt a standard data augmentation scheme that is widely used for these two datasets [26,27,28,29,30,31,32,33]. For preprocessing, we normalize the data using the channel means and standard deviations.

The TinyImageNet dataset consists of 120,000 64\(\times \)64 color images in 200 classes, with 600 images per class. There are 100,000 training images, 10,000 validation images and 10,000 test images.

The CIFAR10-C and TinyImageNet-C datasets apply 19 diverse corruption types to the validation images of CIFAR10 and TinyImageNet. The corruptions are drawn from four main categories: noise, blur, weather, and digital. Each corruption type has five levels of severity, since corruption can manifest itself at varying intensities. Figure 3 gives an example of the five severity levels for shot noise. In our experiments, we test networks on CIFAR10-C and TinyImageNet-C images, but the networks are never trained on them; they are trained on the clean CIFAR10 and TinyImageNet datasets. Overall, CIFAR10-C and TinyImageNet-C each consist of 95 corruptions (19 types at 5 severities), all applied to the CIFAR10 and TinyImageNet validation images for testing a pre-existing network.

Experiment setting

In this part, we will explain our experimental setup in detail.

The VGG [34], ResNet [26] and DenseNet [35] models are implemented using Pytorch 1.7. All the networks are trained using stochastic gradient descent (SGD) [3]. On CIFAR10 and CIFAR100, we train using batch size 128 for 200 epochs. The initial learning rate is set to 0.1 and divided by ten at 45%, 67.5%, and 90% of the training epochs. On TinyImageNet, we train using batch size 100 for 150 epochs. The initial learning rate is set to 0.01 and divided by ten at 60% and 90% of the training epochs. We use a weight decay of \(10^{-4}\) and a Nesterov momentum [36] of 0.9. For the stochastic methods, we average 100 sampled predictions to yield a predictive distribution.
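For concreteness, the CIFAR schedule above could be set up in PyTorch as follows (a sketch under our assumptions; the paper does not specify which scheduler implementation it uses):

```python
import torch

def make_optimizer(model: torch.nn.Module, epochs: int = 200, lr: float = 0.1):
    """SGD with Nesterov momentum 0.9 and weight decay 1e-4; the learning rate
    is divided by 10 at 45%, 67.5% and 90% of the training epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9,
                                nesterov=True, weight_decay=1e-4)
    milestones = [int(epochs * f) for f in (0.45, 0.675, 0.90)]
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=milestones,
                                                     gamma=0.1)
    return optimizer, scheduler
```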

All experiments are run on the same server with NVIDIA RTX 3090 GPU. Note that all results are presented by calculating the average value of three independent, repeated runs of the training and testing process.

Metrics

We measure classification accuracy, calibration score (ECE [40,41,42]), model size, training time, and model diversity. The arrow after each metric indicates which direction is better.

Expected calibration error (ECE \(\downarrow \)). Let \(B_{m}\) be a set of indices of test examples whose prediction scores for the ground-truth labels fall into interval \(\left( \frac{m-1}{M}, \frac{m}{M}\right] \) for \(m \in \{1, \ldots M\}\), where M (= 30) is the number of bins. ECE is formally defined by

$$\begin{aligned} \text {ECE}=\sum _{m=1}^{M} \frac{\left| B_{m}\right| }{n}\left| {\text {acc}}\left( B_{m}\right) -{\text {conf}}\left( B_{m}\right) \right| \end{aligned}$$
(11)

where n is the number of the test samples. Also, accuracy and confidence of each bin are given by

$$\begin{aligned} \begin{aligned} {\text {acc}}\left( B_{m}\right)&=\frac{1}{\left| B_{m}\right| } \sum _{i \in B_{m}} \mathbbm {1}\left( {\hat{y}}_{i}=y_{i}\right) \\ {\text {conf}}\left( B_{m}\right)&=\frac{1}{\left| B_{m}\right| } \sum _{i \in B_{m}} p_{i}, \end{aligned} \end{aligned}$$
(12)

where \(\mathbbm {1}\) is an indicator function, \({\hat{y}}_{i}\) and \(y_{i}\) are predicted and true label of the \(i^{th}\) example and \(p_{i}\) is its predicted confidence. We note that a low value for this calibration score means that the network is well-calibrated.
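A straightforward implementation of Eqs. (11) and (12) is sketched below (our own code; it bins examples by the confidence score \(p_i\) with M = 30 bins):

```python
import torch

def expected_calibration_error(confidences, predictions, labels, n_bins: int = 30):
    """ECE of Eq. (11): bin test examples by their confidence p_i and average
    the gap between per-bin accuracy and per-bin mean confidence."""
    conf = torch.as_tensor(confidences, dtype=torch.float)
    correct = (torch.as_tensor(predictions) == torch.as_tensor(labels)).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    n, ece = conf.numel(), 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)     # B_m of Eq. (11)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()  # |acc - conf|
            ece += (in_bin.float().sum() / n) * gap
    return float(ece)
```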

Table 1 Comparison over CIFAR10 with VGG-13, ResNet-18 and DenseNet-121 models on three metrics. In terms of model size, Deep Ensemble is four times larger than all other methods and therefore requires four times the storage space
Table 2 Comparison over CIFAR100 with VGG-13, ResNet-18 and DenseNet-121 models on three metrics. In terms of model size, Deep Ensemble is four times larger than all other methods and therefore requires four times the storage space

Model size and training time \(\downarrow \). A major motivation for our method is to match the performance of Deep Ensembles while using a smaller model that requires significantly less memory. Therefore, we use the total number of weights that parameterize our models as a proxy for that. In addition to the model size, we also report the total training time used to train any particular model.

Model diversity \(\uparrow \). The diversity between models plays an important role in the estimation method of model uncertainty. In this paper, we use two methods to measure model diversity: Jensen–Shannon Divergence (JSD) [43] and Disagreement of Predictions (DIS) [6]. They both reflect the diversity of models by measuring the inconsistency between different results obtained by different models for the same input.

Baselines

We compare our methods (i) MC-DropReLU: Monte Carlo DropReLU with different rate \(q (=0.8,0.85,0.9,0.95)\) and (ii) MC-RReLU: Monte Carlo RReLU with upper bound \(u (=\frac{1}{3})\) and lower bound \(l (=\frac{1}{8})\), to (a) Single: maximum softmax probability of single model [37], (b) MC-Dropout: Monte Carlo Dropout with different rate \(p (=0.2,0.5)\) [2], (c) Deep Ensemble: ensembles of M networks trained independently on the entire dataset using random initialization [1] (we set M = 4 in experiments below), (d) SVI: Stochastic Variational Bayesian Inference for deep learning [38], (e) Masksemble: combine the benefits of Deep Ensemble and MC-Dropout [39].

CIFAR10/CIFAR100 and CIFAR10-C

In this part, we focus on Question 1 and Question 2. Tables 1 and 2 present accuracy and ECE for several combinations of network architectures and CIFAR10/CIFAR100 datasets. Higher accuracy means better generalization performance, and lower ECE means higher quality predictive uncertainty.

We train the corresponding models with each method and then evaluate the metrics separately. The results in Tables 1 and 2 indicate that our proposed RBUE framework produces reliable uncertainty estimates on par with Deep Ensemble at a significantly lower computational cost. Even where our methods do not match Deep Ensemble, they come closest among the alternatives. Tables 1 and 2 also show that MC-DropReLU outperforms MC-Dropout on all metrics regardless of the value of q, although the best q differs slightly across models: the ResNet and DenseNet models achieve good accuracy and calibration scores at \(q=0.95\), while the VGG model does better at \(q=0.9\). It is worth mentioning that with \(p=0.2\) and \(q=0.8\), the condition \(q \le 1-p\) is satisfied, and the experimental results show that the uncertainty quality of MC-DropReLU is better than that of MC-Dropout, which verifies the analysis in Sect. “Analysis of variance of predictions”. Although SVI has theoretical support, the experiments show that its accuracy and ECE deteriorate as the model and dataset become more complex, which is why SVI is rarely used in vision tasks. Masksemble is a recently proposed method that combines the advantages of Deep Ensemble and MC-Dropout to quantify higher-quality uncertainty in less time; however, Tables 1 and 2 show that our method is better than Masksemble in both model accuracy and training time.

Fig. 4

CIFAR-10 results on corrupted images. Accuracy and uncertainty metric under distributional shift: a detailed comparison of accuracy and ECE under all types of corruptions on CIFAR10 with DenseNet-121 model

In addition, we compare against an MC-Dropout variant that uses RReLU as the activation function of the neural network, to show that MC-RReLU is not a simple combination of Dropout and RReLU. Tables 1 and 2 show that MC-RReLU is better than MC-Dropout with RReLU in both model accuracy and ECE.

The model size and training time in Tables 1 and 2 also reflect the advantages of our method. Our method adds no parameters compared to a single model or MC-Dropout, so its space complexity matches MC-Dropout and is smaller than Deep Ensemble; this is why our methods can replace MC-Dropout at no additional cost. In terms of training time, our methods are slightly slower than MC-Dropout, mainly because sampling the random ReLU is slower than sampling dropout masks, but the overall time is still much lower than Deep Ensemble.

Fig. 5

CIFAR10 ECE as a function of the severity of image corruptions. Each curve shows the median of each method at the different noise intensities in Fig. 4

Current neural networks are overconfident about their predictions, as proposed and confirmed in [41]. As a result, a model makes confident judgments on data it has never seen, even though these judgments are wrong; the more confident the model, the less able it is to estimate high-quality uncertainty. It is therefore essential to evaluate the model’s calibration metrics on out-of-distribution inputs. Following [4], we evaluate model accuracy and ECE on a corrupted version of CIFAR10 [44]; namely, we consider 19 ways of artificially corrupting the images and five levels of severity for each corruption.

We report our results in Fig. 4. We show the mean on the test set for each method and summarize the results on each intensity of shift with a box plot. Each box shows the quartiles summarizing the results across all 19 types of shift, while the error bars indicate the min and max across different shift types. We test six different approaches: a single network, MC-Dropout, Deep Ensemble, SVI, MC-DropReLU, and MC-RReLU. Unsurprisingly, as the severity of the perturbations increases, the advantages of our methods are becoming more obvious. Our methods perform on par with Deep Ensemble and consistently outperform MC-Dropout and SVI.

Fig. 6

TinyImageNet results on corrupted images. Accuracy and uncertainty metric under distributional shift: a detailed comparison of accuracy and ECE under all types of corruptions on TinyImageNet with DenseNet-121 model

Table 3 Comparison over TinyImageNet with VGG-13, ResNet-18 and DenseNet-121 models on three metrics. In terms of model size, Deep Ensemble is four times larger than all other methods and therefore requires four times the storage space

In Fig. 5, we choose the median of all the box plots in Fig. 4 to compare the ECE of different methods more intuitively. Although the ECE of Deep Ensemble is the lowest under different noise intensities, our methods are the closest to Deep Ensemble among the remaining methods. Single model, MC-Dropout, and SVI all have higher ECE than our methods.

TinyImageNet and TinyImageNet-C

In this part, we focus on Question 1 and Question 2. Table 3 presents accuracy and ECE for several combinations of network architectures and TinyImageNet datasets. Higher accuracy means better generalization performance, and lower ECE means higher quality predictive uncertainty.

We follow the same evaluation protocol as in Sect. “CIFAR10/CIFAR100 and CIFAR10-C” and report our results on the original images in Table 3 and on the corrupted ones in Fig. 6. As shown in Table 3, the performance of our methods is similar to Deep Ensemble and significantly better than MC-Dropout in terms of accuracy and ECE on both ResNet-18 and DenseNet-121 models. However, the performance on VGG-13 model is slightly worse. We argue that the reason is that VGG model has poor generalization ability for large datasets. Note that our methods achieve these results with a training time and memory consumption four times smaller than that of Deep Ensemble and nearly the same as that of a single model. It is also worth mentioning that when \(p=0.2\) and \(q=0.8\), \(q \le 1-p\) is satisfied. The experimental results show that the uncertainty quality of MC-DropReLU is better than that of MC-Dropout, which verifies the analysis in Sect. 3.4.

Fig. 7

TinyImageNet ECE as a function of the severity of image corruptions. Each curve shows the median of each method at the different noise intensities in Fig. 6

In Fig. 7, we choose the median of all the box plots in Fig. 6 to compare the ECE of different methods more intuitively. On TinyImageNet, the ECE gap between methods is more obvious than on CIFAR10. Notably, our method MC-DropReLU(0.9) surpasses Deep Ensemble in ECE at all noise intensities, showing that our methods also apply to large datasets.

Diversity analysis

In this part, we focus on Question 3. We know that diversity among models is important in uncertainty estimation. Less correlated ensembles of models deliver better performance, produce more accurate predictions [10, 14], and demonstrate lower calibration error [4]. In this paper’s sampling-based uncertainty estimation method, the diversity among models represents the diversity among multiple predictions by sampling. Better diversity represents more comprehensive information captured by the multiple models obtained by sampling, which leads to a higher quality of the uncertainty estimates. In this paper, we use two evaluation methods to measure the diversity of our proposed method compared to the baseline.

Fig. 8

Using Jensen–Shannon Divergence (JSD) to characterize the diversity of ResNet-18 models under the five methods

Divergence of predictions

Our goal is to examine the correlation between the different models obtained by sampling. Let \(Y_{i j}\) be the softmax output of sampled model i on test input j, which we treat as a probability distribution; we then estimate the Jensen–Shannon Divergence (JSD) between \(Y_{i j}\) and \(Y_{i^{\prime } j}\) for each i, \(i^{\prime }\) and j, and average across all test examples to get one value per model pair instead of one per test example. Figure 8 shows the results. The value in the ith row and \(i^{\prime }\)th column of each panel is the JSD between model i and model \(i^{\prime }\); because JSD is symmetric, the matrices in the figure are symmetric. JSD, Mean-JSD, and Max-JSD are formally defined by

$$\begin{aligned} \text {JSD}\left( Y_{i}, Y_{i^{\prime }}\right) =\frac{1}{n} \sum _{j=1}^{n} \text {JSD}\left( Y_{i j}, Y_{i^{\prime } j}\right) \end{aligned}$$
(13)
$$\begin{aligned} \text {Mean-JSD}=\frac{1}{6}\sum _{i=1}^{3} \sum _{i^{\prime } =i+1}^{4} \text {JSD}\left( Y_{i}, Y_{i^{\prime }}\right) \end{aligned}$$
(14)
$$\begin{aligned} \text {Max-JSD}=\max \left( \text {JSD}\left( Y_{i}, Y_{i^{\prime }}\right) \right) , \end{aligned}$$
(15)

where \(\text {JSD}\left( Y_{i j}, Y_{i^{\prime } j}\right) \) can be specifically defined as

$$\begin{aligned} \text {JSD}(Y_{i j} \Vert Y_{i^{\prime } j}) = \frac{1}{2} \text {KL}\left( Y_{i j} \Big \Vert \frac{Y_{i j}+Y_{i^{\prime } j}}{2}\right) + \frac{1}{2} \text {KL}\left( Y_{i^{\prime } j} \Big \Vert \frac{Y_{i j}+Y_{i^{\prime } j}}{2}\right) . \end{aligned}$$
(16)
Table 4 Using Mean Jensen–Shannon Divergence (Mean-JSD) and Max Jensen–Shannon Divergence (Max-JSD) to characterize the diversity of models under the six methods
Fig. 9

Using the fraction of labels on which the predictions from different checkpoints disagree to characterize the diversity of ResNet-18 under the five methods

KL divergence is not symmetric, so it would yield two different values for the same pair of models; we therefore use its symmetrized variant, the Jensen–Shannon Divergence, to measure the diversity between models.
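For reference, the JSD of Eqs. (13) and (16) between two sampled models can be computed as follows (an illustrative sketch of ours operating on softmax outputs):

```python
import torch

def mean_jsd(probs_a: torch.Tensor, probs_b: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon Divergence of Eq. (16) between the softmax outputs of two
    sampled models, averaged over test examples as in Eq. (13).
    probs_a, probs_b: tensors of shape (num_examples, num_classes)."""
    m = 0.5 * (probs_a + probs_b)
    log_m = m.clamp_min(1e-12).log()
    kl_am = (probs_a * (probs_a.clamp_min(1e-12).log() - log_m)).sum(-1)
    kl_bm = (probs_b * (probs_b.clamp_min(1e-12).log() - log_m)).sum(-1)
    return (0.5 * (kl_am + kl_bm)).mean()
```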

As shown in Table 4, the diversity of Deep Ensemble is the best with Mean-JSD 0.020 and Max-JSD 0.021 respectively, followed by MC-DropReLU (\(q=0.8\)) with Mean-JSD 0.017 and Max-JSD 0.017, respectively. This indicates that the prediction results of the models obtained by our proposed sampling method are better than MC-Dropout and slightly worse than Deep Ensemble in terms of the distance metric.

Disagreement of predictions

Our goal is to observe how often the different models obtained by sampling yield inconsistent results for the same inputs. We consider the disagreement in function space, defined as the fraction of points on which the checkpoints disagree, that is,

$$\begin{aligned} \text {DIS} = \frac{1}{N} \sum _{n=1}^{N}\left[ f\left( x_{n} ; \theta _{1}\right) \ne f\left( x_{n} ; \theta _{2}\right) \right] , \end{aligned}$$
(17)

where \(f\left( x ; \theta \right) \) denotes the class label predicted by the network for input x. In the ensemble-based method, each f is an ensemble member with a different initialization; in the sampling-based methods, each f is a network obtained by sampling. Figure 9 shows the results.
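The disagreement of Eq. (17) is equally simple to compute (illustrative sketch of ours):

```python
import torch

def disagreement(probs_a: torch.Tensor, probs_b: torch.Tensor) -> float:
    """Disagreement of Eq. (17): fraction of test inputs on which two sampled
    networks predict different class labels."""
    labels_a = probs_a.argmax(dim=-1)
    labels_b = probs_b.argmax(dim=-1)
    return (labels_a != labels_b).float().mean().item()
```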

As shown in Table 5, the diversity of MC-DropReLU (\(q=0.8\)) is the best with Mean-DIS 0.044 and Max-DIS 0.046 followed by Deep Ensemble with Mean-DIS 0.043 and Max-DIS 0.045, respectively. This indicates that the prediction results of the models obtained by our proposed sampling method are significantly better than MC-Dropout and slightly better than Deep Ensemble in terms of the disagreement metric.

Combining the above two diversity measurement methods in Sects. 4.4.1 and 4.4.2, our method is competitive with Deep Ensemble in terms of diversity.

Position and configuration analysis of MC-DropReLU

In this part, we focus on Question 4. When using MC-Dropout in practical applications, where to insert the dropout layers, how many to use, and the choice of dropout rate are often made empirically, leading to possibly suboptimal performance [45]. We face the same choices when using RBUE in this paper. Therefore, in this section, we give a quantitative analysis of where to place the DropReLU layers and how to choose the DropReLU rate. Comparing the previous experiments, MC-DropReLU performs better than MC-RReLU, so the analysis here focuses mainly on MC-DropReLU.

To analyse the influence of the position of the DropReLU layers in the neural network, we conduct experiments on TinyImageNet with DenseNet. We divide the placement of the DropReLU layer into three cases: All Layers, Last Layer, and First Layer. ’All Layers’ means we place DropReLU layers before all the convolutional and fully connected layers; ’Last Layer’ means we only place a DropReLU layer before the fully connected layer; and ’First Layer’ means we only place a DropReLU layer before the first convolutional layer. As shown in Table 6, the more DropReLU layers, the greater the diversity of the final results and the better the calibration metric ECE. However, more DropReLU layers also mean more sampling operations, which increases training time, and this increase grows with the size of the model and dataset.
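As an illustration of the ’All Layers’ setting, the sketch below (ours; it assumes the DropReLU module sketched in the Methods section and that the architecture exposes its activations as nn.ReLU modules) replaces every ReLU in a network with DropReLU:

```python
import torch.nn as nn

def replace_relu_with_droprelu(module: nn.Module, q: float = 0.8) -> None:
    """'All Layers' variant: recursively swap every nn.ReLU for a DropReLU(q).
    e.g. replace_relu_with_droprelu(torchvision.models.densenet121(), q=0.8)"""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, DropReLU(q))  # DropReLU from the earlier sketch
        else:
            replace_relu_with_droprelu(child, q)
```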

To analyse the influence of the DropReLU rate, we conduct experiments on ResNet-18 with CIFAR10. Figure 10 depicts the resulting range of behaviors. The 2D coordinates of the markers give their accuracy and ECE, and their colors correspond to the hyperparameter q. For comparison, we also display the MC-Dropout and Deep Ensemble results in the same manner, simply replacing the star with a square and a circle, respectively. As can be seen, the optimal MC-DropReLU configuration, depicted by the yellow star, provides better performance than MC-Dropout and performance close to Deep Ensemble. Although its ECE is not the smallest, it is the best result after trading off ECE against accuracy.

Table 5 Using Mean Disagreement of predictions (Mean-DIS) and Max Disagreement of predictions (Max-DIS) to characterize the diversity of models under the six methods
Table 6 Position analysis of MC-DropReLU(0.8) on TinyImageNet with DenseNet on three metrics
Fig. 10

Spanning the space of behaviors. Models in the bottom right corner are better. The color represents the DropReLU rate q

Ablation studies

Table 7 Comparison of ability to estimate uncertainty. The original function of the Dropout module is to prevent overfitting of the neural network, and the original function of both the RReLU and DropReLU modules is a non-linear activation function
Table 8 The impact of different modules under each approach. During the training of the MC-Dropout method, the activation function has the following three choices: ReLU, RReLU, and DropReLU

We conduct ablation experiments to compare the methods when one module changes and the other modules remain the same.

First, as shown in Table 7, we present the differences and connections between Dropout, RReLU, and DropReLU, and between MC-Dropout, MC-RReLU, and MC-DropReLU, in tabular form. If the randomness of these three modules is exploited only during the training phase, then only their original function is obtained and the uncertainty cannot be estimated. If their randomness is used in both the training and testing phases, then the uncertainty can be estimated in combination with Monte Carlo sampling.

Second, as shown in Table 8, we explore the impact of using different modules in each approach. Specifically, in the MC-Dropout method the activation function has three options: ReLU, RReLU, and DropReLU; using a fixed or a random activation function during training affects the prediction accuracy and uncertainty estimation ability of the neural network. In the MC-RReLU and MC-DropReLU methods, we can choose whether or not to use the Dropout module during training, which likewise affects the prediction accuracy and uncertainty estimation ability of the network.

The experiments in Table 8 are obtained on the CIFAR10 dataset using ResNet-18 model. Note that all results are presented by calculating the average value of three independent, repeated runs of the training and testing process. Among them, the hyperparameter of MC-Dropout is 0.5, and the hyperparameter of MC-DropReLU is 0.95.

From the experimental results, for the MC-Dropout method the fixed activation function (ReLU) gives the best prediction accuracy and uncertainty quality, with an accuracy of 95.26 and an ECE of 0.03. When the activation function is a random RReLU or DropReLU, performance drops, probably because the combined randomness introduced by the Dropout module and the random activation function is too large, making the model hard to train and slow to converge. Similar conclusions hold for the MC-RReLU and MC-DropReLU methods: not using the Dropout module improves the prediction accuracy and uncertainty estimation ability of the model, because using Dropout introduces a second source of randomness on top of the random activation functions during training, which hurts convergence. From this comparison, we find that a single stochastic module improves prediction and uncertainty estimation, whereas stacking multiple stochastic modules degrades them. Finally, MC-DropReLU(0.95) trained without the Dropout module obtains the best prediction and uncertainty estimation ability, with an accuracy of 95.32 and an ECE of 0.02.

Limitations

One underlying assumption of our approach is that we mainly consider convolution-based image classification networks. Recently, some SOTA architectures have been investigated for uncertainty estimation [46,47,48]; they are not explored in this paper and are left for future work.

Conclusions and future work

In this work, we introduce RBUE, a novel method to estimate uncertainty in convolutional neural networks. Instead of using the randomness of the Dropout module during the test phase (MC-Dropout) or the randomness of the initial weights of CNNs (Deep Ensemble), RBUE uses the randomness of the activation function to obtain diverse outputs in the testing phase and thereby estimate uncertainty. Under this method, we propose the strategy MC-DropReLU and develop the strategy MC-RReLU; the main difference between them is the sampling distribution of the slope of the negative semi-axis of ReLU. Furthermore, we briefly analyse the prediction variance of the proposed strategy (in the case of MC-DropReLU) theoretically to demonstrate the feasibility of the method, and give guidelines for setting its hyperparameters. Moreover, by changing the hyperparameter q, we can span a range of behaviors between those of MC-Dropout and Deep Ensemble. This allows us to identify model configurations that provide a useful trade-off between the high-quality but computationally expensive uncertainty estimates of Deep Ensemble and the cheaper but lower-quality estimates of MC-Dropout. Our experiments demonstrate that we can achieve performance on par with Deep Ensemble at a fraction of the cost.

In the future, we will investigate uncertainty estimation methods for other SoTA architectures (e.g. RNN, Transformer and GNN) and apply them to NLP tasks dealing with sequential data, which will be a very practical area.