Abstract
Convolutional neural networks (CNNs) have demonstrated strong predictive performance in a variety of tasks. However, estimating the uncertainty of these predictions simply and accurately remains a challenge. Deep Ensemble is widely considered the state-of-the-art method for accurate uncertainty estimation, but it is expensive to train and test. MC-Dropout is another popular method that is less costly, but its predictions lack diversity, which results in less accurate uncertainty estimates. To combine the benefits of both, we introduce a ReLU-Based Uncertainty Estimation (RBUE) method. Instead of using the randomness of the Dropout module during the test phase (MC-Dropout) or the randomness of the initial weights of CNNs (Deep Ensemble), RBUE uses the randomness of the activation function to obtain diverse outputs in the testing phase and estimate uncertainty. Under this method, we propose the strategy MC-DropReLU and develop the strategy MC-RReLU. Because activation functions are uniformly distributed throughout a CNN, the randomness is transferred well to the outputs and yields more diverse predictions, thus improving the accuracy of the uncertainty estimation. Moreover, our method is simple to implement and does not require modifying the existing model. We experimentally validate RBUE on three widely used datasets: CIFAR10, CIFAR100, and TinyImageNet. The experiments demonstrate that our method achieves competitive performance while being more favorable in training time.
Introduction
The ability of convolutional neural networks (CNNs) to produce useful predictions is now well understood, but estimating the uncertainty of these predictions remains a challenge. Deep Ensemble [1] and Monte Carlo (MC) dropout [2] are two of the most popular methods for uncertainty estimation. Both methods can be understood through the concept of ensembles, which use multiple models to obtain diverse predictions. Deep Ensemble can be seen as an explicit ensemble of multiple models, where each model is randomly initialized and trained independently using stochastic gradient descent [3]. MC-Dropout, on the other hand, can be seen as an implicit ensemble on a single stochastic network, where randomness is achieved by dropping different parts of the weights for each input. During inference, one can run the single network multiple times with different weight configurations to obtain a set of predictions and an uncertainty estimate. Both methods produce diverse predictions for a given input by introducing stochasticity into the training or testing process, and then use an aggregated measure such as variance or entropy as an uncertainty estimator.
However, both of these methods have weaknesses. MC-Dropout performs significantly worse than Deep Ensemble on some uncertainty estimation tasks [1, 4, 5]. We argue that the main reason for MC-Dropout’s poor performance is the high correlation between the ensemble elements, which makes the overall predictions insufficiently diverse. Moreover, dropping weights randomly results in similar weight configurations across the sampled models and, consequently, in less diverse predictions [6]. Deep Ensemble does not have this problem because its ensemble elements are trained independently and therefore do not share similar weight configurations. Despite its success, Deep Ensemble is limited in practice by its computational and memory costs, which increase linearly with the ensemble size in both the training and testing phases. In terms of computation, each ensemble member requires a separate forward pass through its own neural network. In terms of memory, each ensemble member requires a separate copy of the network weights, each of which can contain millions (sometimes billions) of parameters [7].
In this work, we introduce a ReLU-based uncertainty estimation (RBUE) method that tackles these challenges. It builds on the intuition that the poor performance of dropout-based methods is due to the high correlation between the multiple outputs, which makes the overall predictions insufficiently diverse. This lack of diversity is caused by the limited randomness added by the Dropout module. In contrast, because activation functions are uniformly distributed throughout the neural network, randomness injected into them is transferred well to the outputs and yields more diverse predictions, thus improving the accuracy of the uncertainty estimation. Unlike Deep Ensemble, RBUE can be implemented on a single model. Hence, our method is designed to achieve a trade-off between accurate uncertainty estimation and an acceptable computational cost.
Inspired by [8], we add randomness to the activation function to obtain more diverse predictions than MC-Dropout at a training cost much smaller than that of Deep Ensemble. Under the RBUE method, we propose the strategy MC-DropReLU and develop the strategy MC-RReLU. The main difference between them is the sampling distribution of the slope of ReLU on the negative semi-axis. During training, we apply a randomly sampled activation function to each input, an operation as simple as standard dropout. During testing, we run the model multiple times for each input to obtain a set of predictions and an uncertainty estimate. Our method has only one key hyperparameter: the retention rate q of the activation function. We evaluate our method on several natural and synthetic datasets and demonstrate that it outperforms MC-Dropout and Bayesian neural networks in the accuracy of uncertainty estimation. Compared to Deep Ensemble, our method has competitive performance but is more favorable in training time and memory requirements. Furthermore, we analyze and compare the output diversity of MC-Dropout and our method from the variance perspective and derive the relationship between the hyperparameters of both methods and the output diversity. To summarize, the main contributions of this work are as follows:
- We propose a ReLU-based uncertainty estimation framework that adds randomness to ReLU. It is easy to implement and practical.
- We propose the strategy MC-DropReLU and develop the strategy MC-RReLU as concrete implementations of our framework. The experiments demonstrate that MC-DropReLU performs better than MC-Dropout at a similar computational cost and matches Deep Ensemble at a fraction of the cost. MC-RReLU illustrates a second way to realize the framework.
- We briefly analyze the prediction variance of the proposed strategy (in the case of MC-DropReLU) theoretically to demonstrate the feasibility of the method, and give rules for setting its hyperparameters.
Related work
In what follows, we provide a brief background on model uncertainty estimation and review the best-known methods. Among them, the most prominent and practical uncertainty estimation methods are Deep Ensemble [1] and MC-Dropout [2].
Background
Uncertainty Estimation (UE) is a pivotal component for equipping DNNs with the ability to know what they do not know. It provides a measure of confidence in model predictions. Epistemic (aka model) uncertainty [9], an important type of uncertainty, refers to uncertainty caused by a lack of knowledge. In other words, it refers to the ignorance of the DNNs, and hence to the epistemic state of the DNNs rather than to any underlying random phenomenon. This uncertainty can be explained away given enough data. It can be estimated from multiple predictions obtained through sampling or ensembling.
Ensemble
Ensemble is one of the oldest tricks in machine learning literature [10]. By combining the outputs of several models, an ensemble can achieve better performance than any of its members [11,12,13,14]. Deep Ensemble [1] trains multiple DNNs with different initializations and uses all the predictions for uncertainty estimation. More recently, Ovadia et al. [4] and Gustafsson et al. [5] independently benchmarked existing approaches to uncertainty modeling on various datasets and architectures and observed that Deep Ensemble tends to outperform Bayesian neural networks (BNNs) in both accuracy and uncertainty estimation quality. Fort et al. [6] investigated the loss landscape and postulated that variational methods only capture local uncertainty, whereas Deep Ensemble explores different global modes. It explains why Deep Ensemble generally performs better. Despite its success on benchmarks, Deep Ensemble is limited in practice due to its expensive computational costs. During training, it needs to train multiple independent networks. Moreover, during testing, it is desirable to keep all these networks in memory.
Some methods have been proposed to tackle this issue by taking a slightly different approach towards creating an ensemble. They only need a single training to get multiple models with different weight configurations. For example, Snapshot Ensemble [12] trains a single network and uses its parameters at k different points of the training process to instantiate k networks to form the target ensemble. Snapshot Ensemble cyclically varies the learning rate, enabling the single network to converge to k local minima along its optimization path. Similarly, TreeNets [15] also train a single network, but this network is designed to branch out into k sub-networks after the first few layers. Thus, effectively every sub-network functions as a separate member of the target ensemble.
Although these methods partially solve the problem of training time, their prediction performance and calibration scores are usually worse than those of standard Deep Ensemble. Furthermore, their time advantage is obtained through training tricks, which make it difficult to guarantee the diversity between models and the performance of the ensemble.
Dropout
Another smart option to model uncertainty in DNNs is the use of dropout [16] as a way to approximate Bayesian variational inference. The simplicity of the key idea of this formulation is one of the main reasons for its popularity. By enabling dropout in training and testing phases and making multiple forward passes through the network using the same input data, the first two moments of the predictive distribution (mean and variance) can be estimated using the output distributions of the different passes. The mean is then used as an estimate and the variance as a measure of its uncertainty. This technique is called Monte Carlo dropout (MC-Dropout) [2]. Furthermore, MC-Dropout has zero memory overhead compared to a single model. However, despite its success and simplicity, different predictions made by several forward passes with randomly dropped neurons seem to be overly correlated and strongly underestimate the variance. Moreover, when using MC-Dropout in practical applications, architectural choices like where to insert the dropout layers, how many to use, and the choice of dropout rate are often either empirically made or set a priori [17,18,19], leading to possibly suboptimal performance.
Other methods
In addition to the two types of methods mentioned above, several approaches based on Bayesian neural networks (BNNs) [20,21,22] try to estimate predictive uncertainty by imposing probability distributions over model parameters instead of using point estimates, including Markov Chain Monte Carlo (MCMC) [22], Laplace approximation [23], and recent work on variational Bayesian methods [24]. BNNs provide a theoretically grounded set of methods for uncertainty estimation, but it is usually difficult to infer the true posteriors of the parameters with approximate inference techniques. As a result, Deep Ensemble and MC-Dropout often show significantly better performance in practice [4, 5], in terms of both accuracy and quality of the predictive uncertainty.
Methods
In this section, we describe the proposed method in detail. We begin with the formulation of the activation function framework for embedded randomness in Sect. “Formulation of the ReLU framework with embedded randomness”. Then, two strategies of RBUE are introduced in Sect. “Two strategies of RBUE”. Next, we introduce how to estimate uncertainty using two strategies, including the training and testing phases in Sect. “Sampling at test time to estimate model uncertainty”. Finally, we analyze the prediction diversity of our method in Sect. “Analysis of variance of predictions”.
Formulation of the ReLU framework with embedded randomness
This part describes the formulation of the ReLU framework with embedded randomness. Suppose \(x_0\) is an input vector of an L-layer fully connected neural network. Let \(x_l\) be the output of the lth layer and \(W_l\) be the weight matrix of the lth layer. Biases are neglected for the convenience of presentation.
Let \(x_{l+1}^{\prime }\) be the input of \((l+1)\)th activation function layer. For a standard fully connected or convolution network, the m-dimensional input vector can be written as
\(f(\cdot )\) is the element-wise nonlinear activation operator that maps an input vector to an output vector by applying a nonlinearity on each input. We assume \(f: {\mathbb {R}}^{m} \rightarrow {\mathbb {R}}^{m}\) and the output of \((l+1)\)th layer can be written as
In Eq. (3), \(\sigma \) could be a ReLU, a sigmoid, or a tanh function, but in this paper we only consider \(\sigma \) to be a randomized variant of the ReLU function. The randomness is given by
where \(a_{l+1}^{m}\) is a random parameter and \(a_{l+1}^{m} \sim P^{*}\). \(P^{*}\) can be a continuous random distribution like uniform distribution or a discrete random distribution like Bernoulli distribution.
From Eq. (4), it can be seen that the randomness in our method comes mainly from the slope on the negative half of the x-axis, which is a random number. Such a random framework brings two benefits.
(1) The same neuron produces different activation outputs in each forward pass of the neural network, which keeps the network from depending too heavily on particular neurons and thus improves model generalization.
(2) Adding randomness to the activation function requires fewer modifications to the model than changing other neural network modules, and it can be applied to the estimation of model uncertainty.
Two strategies of RBUE
When \(P^{*}\) is a Bernoulli distribution, we call this ReLU with embedded randomness DropReLU. When \(P^{*}\) is a uniform distribution, we call it RReLU. As shown in Fig. 1, panels (a) to (c) are ReLU, DropReLU and RReLU, respectively.
Strategy I: drop rectified linear unit for uncertainty estimation
In this strategy, we drop the pointwise nonlinearities in f randomly. Specifically, each of the m nonlinearities \(\sigma \) in the operator f is kept with probability q (or dropped with probability \(1-q\)). Equation (4) can be rewritten as
where \(Q_{l+1}^{m}\) is a random variable following a Bernoulli distribution B(q) that takes value 1 with probability q and 0 with probability \(1-q\). Intuitively, when \(Q=1\) for every element, then \(x_{l+1} =f_{l+1}\left( x_{l+1}^{\prime }\right) =ReLU(x_{l+1}^{\prime })\), meaning all the nonlinearities in this layer are kept. When \(Q=0\) for every element, then \(x_{l+1} =f_{l+1}\left( x_{l+1}^{\prime }\right) =x_{l+1}^{\prime }\), meaning all the nonlinearities are dropped. The general case lies between these two limits, with the nonlinearities partially kept and partially dropped. At each iteration, a new realization of Q is sampled from the Bernoulli distribution. We combine this randomness with the Monte Carlo method to estimate the model uncertainty.
In the experiments of this paper, we set q to 0.8, 0.85, 0.9, and 0.95. Among them, \(q=0.8\) is paired with the dropout rate \(p=0.2\) of MC-Dropout to check the variance analysis in Sect. “Analysis of variance of predictions”.
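The DropReLU rule described above can be illustrated with a minimal, framework-free sketch. The function name `drop_relu` and the list-based interface are illustrative assumptions, not the authors' implementation; the sketch only assumes that each element independently keeps its ReLU with probability q and is passed through unchanged otherwise.

```python
import random


def drop_relu(x, q, rng):
    """Element-wise DropReLU sketch (hypothetical helper).

    Each element keeps its ReLU with probability q (Q = 1) and is
    passed through unchanged with probability 1 - q (Q = 0).
    """
    out = []
    for value in x:
        keep = rng.random() < q  # one Bernoulli(q) draw per element
        out.append(max(value, 0.0) if keep else value)
    return out


rng = random.Random(0)
pre_activations = [-1.5, 0.3, -0.2, 2.0]
print(drop_relu(pre_activations, q=1.0, rng=rng))  # behaves like plain ReLU
print(drop_relu(pre_activations, q=0.0, rng=rng))  # behaves like the identity
print(drop_relu(pre_activations, q=0.8, rng=rng))  # mixture of both
```

Only negative inputs are affected by the random draw, since ReLU and the identity agree on non-negative inputs.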
Strategy II: random rectified linear unit for uncertainty estimation
In this strategy, we use the Random Rectified Linear Unit (RReLU), a randomized version of leaky ReLU [25] that was first proposed and used in the Kaggle National Data Science Bowl (NDSB) Competition. Although RReLU is not new, previous researchers only paid attention to its randomness during training as a way to reduce the risk of overfitting. They did not consider that its randomness during testing can be used to estimate model uncertainty, a property that fits the framework we propose. The key feature of RReLU is that the slope on the negative half of the x-axis is a random variable sampled from a uniform distribution U(l, u). Equation (4) can be rewritten as
where \(a_{l+1}^{m}\) is a random variable following a uniform distribution U(l, u) with \(l<u\) and \(l,u \in [0,1)\). As suggested by the NDSB competition winner, \(a_{l+1}^{m}\) is sampled from \(U(\frac{1}{8},\frac{1}{3})\). We use the same configuration in this paper.
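As a rough illustration of the RReLU rule above, the following sketch samples one negative-side slope per element from U(l, u) with the stated bounds 1/8 and 1/3. The helper name `rrelu` and its signature are assumptions made for this example only.

```python
import random


def rrelu(x, lower=1 / 8, upper=1 / 3, rng=None):
    """Element-wise RReLU sketch (hypothetical helper).

    Non-negative inputs pass through unchanged; each negative input is
    scaled by its own slope a ~ U(lower, upper).
    """
    rng = rng or random.Random()
    out = []
    for value in x:
        if value >= 0:
            out.append(value)
        else:
            slope = rng.uniform(lower, upper)  # fresh slope per element
            out.append(slope * value)
    return out


rng = random.Random(0)
print(rrelu([-2.0, -0.5, 0.0, 1.5], rng=rng))
```

PyTorch ships a module with the same default bounds, `torch.nn.RReLU`, but in evaluation mode it uses a fixed slope, so a test-time sampling scheme like the one described here would presumably need the random behaviour to stay active.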
Sampling at test time to estimate model uncertainty
As shown in Fig. 2, we use DropReLU as an example to illustrate how to estimate model uncertainty.
Training phase. The network is trained just like a regular ReLU network. The only change is to replace ReLU with one of the two random ReLUs introduced in the previous part. This substitution does not harm the generalization of the model and adds only a small overhead to the training time.
Testing phase. The connections of the network are the same as in the training phase and do not require any changes. Keeping the random ReLU enabled at test time allows us to perform multiple forward passes, each of which corresponds to a differently configured network. We refer to this Monte Carlo estimation as MC-DropReLU (MC-RReLU). In practice, this is equivalent to performing N stochastic forward passes through the network and averaging the results. As we can see from Fig. 2, the final prediction and the predictive uncertainty are derived from the mean and entropy of the N sets of outputs, just as in MC-Dropout [2].
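The testing procedure can be sketched as follows. The toy two-layer network, its fixed weights, and the helper names `make_toy_model` and `mc_predict` are hypothetical and serve only to show the mechanics: N stochastic forward passes, the mean of the softmax outputs as the prediction, and the entropy of that mean as the uncertainty.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_toy_model(q, seed=0):
    """Tiny fixed-weight network with a DropReLU hidden layer (illustrative)."""
    rng = random.Random(seed)
    w1 = [[0.5, -1.0], [-0.7, 0.2], [1.2, 0.4]]  # 3 hidden units, 2 inputs
    w2 = [[1.0, -0.5, 0.3], [-0.8, 0.6, 0.9]]    # 2 classes, 3 hidden units

    def forward(x):
        hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w1]
        # DropReLU: keep the ReLU with probability q, otherwise identity
        hidden = [max(h, 0.0) if rng.random() < q else h for h in hidden]
        logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
        return softmax(logits)

    return forward


def mc_predict(stochastic_forward, x, n_samples):
    """Average n_samples stochastic passes; return (mean probs, entropy)."""
    samples = [stochastic_forward(x) for _ in range(n_samples)]
    num_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / n_samples for c in range(num_classes)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy


model = make_toy_model(q=0.8)
mean_probs, uncertainty = mc_predict(model, [1.0, 2.0], n_samples=100)
print(mean_probs, uncertainty)
```

With q = 1 the toy model is deterministic and every pass returns the same output, so all diversity in this sketch comes from the random activation.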
Analysis of variance of predictions
In this part, we prove that our method is better than MC-Dropout in prediction diversity by variance analysis. To simplify the analysis, we only analyze one layer in the neural network and ignore the bias. To this end, suppose that layer i is a fully connected layer, x is the output of layer i and the input of the Dropout layer or DropReLU layer after layer i.
For the Dropout layer, its output can be formulated as
where \(P_{k} \sim B(p)\) and it takes value 0 with probability p and 1 with probability \(1-p\). K represents the number of neurons in layer i. The variance of the output of Dropout layer is
For the DropReLU layer, its output can be formulated as
where \(Q_{k} \sim B(q)\) and it takes value 0 with probability \(1-q\) and 1 with probability q. K represents the number of neurons in layer i. The variance of the output of DropReLU layer is
where \(\epsilon ={\text {Var}}\left( \sum _{k=1}^{K} Q_{k} \cdot {\text {ReLU}}\left( x_{k}\right) \right) >0\). This cannot be calculated, but it can be guaranteed that it is always greater than 0.
The analysis shows that when \(q \le 1-p\), the variance of the output of DropReLU is always greater than the variance of the output of Dropout, which means that the output of DropReLU is more diverse than that of Dropout. This conclusion guides the choice of hyperparameters in our experiments, and the experimental results support it. When \(q > 1-p\), because \(\epsilon \) cannot be calculated, the comparison has to be made experimentally.
Experiments
In this section, we show the superiority of our proposed method by several experiments. We use these experiments to answer the following questions:
Q1. How accurate are the predictions, and how reliable is the uncertainty estimated by MC-DropReLU and MC-RReLU on clean datasets compared to other baselines?
Q2. How accurate are the predictions, and how reliable is the uncertainty estimated by MC-DropReLU and MC-RReLU on corrupted datasets (a kind of out-of-distribution dataset) compared to other baselines?
Q3. How diverse are the networks sampled by MC-DropReLU and MC-RReLU compared with the baselines?
Q4. What effect do the position and configuration of the random ReLU in the neural network have on the predictive accuracy and uncertainty?
Preparation
Datasets
CIFAR10 and CIFAR100 each consist of 60,000 32\(\times \)32 colour images in 10 and 100 classes, with 6000 and 600 images per class, respectively. Each dataset has 50,000 training images and 10,000 test images. We adopt a standard data augmentation scheme that is widely used for these two datasets [26,27,28,29,30,31,32,33]. For preprocessing, we normalize the data using the channel means and standard deviations.
The TinyImageNet dataset consists of 120,000 64\(\times \)64 colour images in 200 classes, with 600 images per class. There are 100,000 training images, 10,000 test images and 10,000 validation images.
The CIFAR10-C and TinyImageNet-C datasets consist of 19 diverse corruption types applied to the validation images of CIFAR10 and TinyImageNet. The corruptions are drawn from four main categories: noise, blur, weather, and digital. Each corruption type has five levels of severity, since corruption can manifest itself at varying intensities. Figure 3 gives an example of the five severity levels for shot noise. In our experiments, networks are trained on the clean CIFAR10 and TinyImageNet datasets and only tested on CIFAR10-C and TinyImageNet-C; they are never trained on the corrupted images. Overall, each of the two corrupted datasets comprises 95 corruptions (19 types at five severity levels), all applied to the CIFAR10 and TinyImageNet validation images for testing a pre-existing network.
Experiment setting
In this part, we will explain our experimental setup in detail.
The VGG [34], ResNet [26] and DenseNet [35] models are implemented using PyTorch 1.7. All the networks are trained using stochastic gradient descent (SGD) [3]. On CIFAR10 and CIFAR100, we train using batch size 128 for 200 epochs. The initial learning rate is set to 0.1 and divided by ten at 45%, 67.5%, and 90% of the training epochs. On TinyImageNet, we train using batch size 100 for 150 epochs. The initial learning rate is set to 0.01 and divided by ten at 60% and 90% of the training epochs. We use a weight decay of \(10^{-4}\) and a Nesterov momentum [36] of 0.9. For the stochastic methods, we average 100 sample predictions to yield a predictive distribution.
All experiments are run on the same server with an NVIDIA RTX 3090 GPU. All reported results are averages over three independent runs of the training and testing process.
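The step schedule described above translates into concrete epoch milestones. The sketch below computes them under the assumption that epochs are counted from zero and that the drop takes effect at the start of the milestone epoch; the helper names are illustrative.

```python
def milestones(total_epochs, fractions):
    """Convert fractions of the training run into epoch indices."""
    return [int(round(f * total_epochs)) for f in fractions]


def learning_rate(epoch, base_lr, milestone_epochs, factor=0.1):
    """Learning rate at a given epoch: multiply by factor per milestone passed."""
    drops = sum(1 for m in milestone_epochs if epoch >= m)
    return base_lr * factor ** drops


cifar_milestones = milestones(200, [0.45, 0.675, 0.9])
tiny_milestones = milestones(150, [0.6, 0.9])
print(cifar_milestones)  # epochs at which the CIFAR learning rate drops
print(tiny_milestones)   # epochs at which the TinyImageNet learning rate drops
print(learning_rate(100, 0.1, cifar_milestones))
```

In PyTorch, a schedule of this shape could likely be expressed with `torch.optim.lr_scheduler.MultiStepLR` using these milestones and `gamma=0.1`.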
Metrics
We measure classification accuracy, calibration score (ECE [40,41,42]), model size, training time, and model diversity. (The arrow after each metric indicates which direction is better.)
Expected calibration error (ECE \(\downarrow \)). Let \(B_{m}\) be a set of indices of test examples whose prediction scores for the ground-truth labels fall into interval \(\left( \frac{m-1}{M}, \frac{m}{M}\right] \) for \(m \in \{1, \ldots , M\}\), where M (= 30) is the number of bins. ECE is formally defined by
where n is the number of the test samples. Also, accuracy and confidence of each bin are given by
where \(\mathbbm {1}\) is an indicator function, \({\hat{y}}_{i}\) and \(y_{i}\) are the predicted and true labels of the \(i\)th example, and \(p_{i}\) is its predicted confidence. A low value of this calibration score means that the network is well-calibrated.
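A minimal sketch of the ECE computation described above follows. It assumes the commonly used form of the metric, in which each bin contributes its share of the test samples times the absolute gap between bin accuracy and bin confidence, and it bins examples by their predicted confidence \(p_{i}\). The function name and argument layout are illustrative.

```python
import math


def expected_calibration_error(confidences, predictions, labels, num_bins=30):
    """ECE with equal-width bins ((m-1)/M, m/M], m = 1..M (assumed standard form)."""
    n = len(labels)
    bins = [[] for _ in range(num_bins)]
    for conf, pred, label in zip(confidences, predictions, labels):
        # Bin index m-1 for conf in ((m-1)/M, m/M]; conf == 0 goes to the first bin.
        # Floating-point rounding may shift values lying exactly on a bin edge.
        index = min(num_bins - 1, max(0, math.ceil(conf * num_bins) - 1))
        bins[index].append((conf, pred == label))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        confidence = sum(conf for conf, _ in members) / len(members)
        ece += len(members) / n * abs(accuracy - confidence)
    return ece


# Two predictions made with 90% confidence, only one of them correct.
print(expected_calibration_error([0.9, 0.9], [1, 1], [1, 0]))
```

In this toy case the single occupied bin has accuracy 0.5 and confidence 0.9, so the model is overconfident by about 0.4.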
Model size and training time \(\downarrow \). A major motivation for our method is to match the performance of Deep Ensembles while using a smaller model that requires significantly less memory. Therefore, we use the total number of weights that parameterize our models as a proxy for that. In addition to the model size, we also report the total training time used to train any particular model.
Model diversity \(\uparrow \). The diversity between models plays an important role in the estimation method of model uncertainty. In this paper, we use two methods to measure model diversity: Jensen–Shannon Divergence (JSD) [43] and Disagreement of Predictions (DIS) [6]. They both reflect the diversity of models by measuring the inconsistency between different results obtained by different models for the same input.
Baselines
We compare our methods (i) MC-DropReLU: Monte Carlo DropReLU with different rates \(q (=0.8,0.85,0.9,0.95)\) and (ii) MC-RReLU: Monte Carlo RReLU with upper bound \(u (=\frac{1}{3})\) and lower bound \(l (=\frac{1}{8})\), to (a) Single: maximum softmax probability of a single model [37], (b) MC-Dropout: Monte Carlo Dropout with different rates \(p (=0.2,0.5)\) [2], (c) Deep Ensemble: ensembles of M networks trained independently on the entire dataset using random initialization [1] (we set M = 4 in the experiments below), (d) SVI: Stochastic Variational Bayesian Inference for deep learning [38], (e) Masksemble: a method that combines the benefits of Deep Ensemble and MC-Dropout [39].
CIFAR10/CIFAR100 and CIFAR10-C
In this part, we focus on Question 1 and Question 2. Tables 1 and 2 present accuracy and ECE for several combinations of network architectures and CIFAR10/CIFAR100 datasets. Higher accuracy means better generalization performance, and lower ECE means higher quality predictive uncertainty.
We train each model with each method and then evaluate the metrics separately. The results in Tables 1 and 2 indicate that our proposed RBUE framework produces reliable uncertainty estimates on par with Deep Ensemble at a significantly lower computational cost. Where our methods do not match Deep Ensemble, they are the closest among the compared methods. Tables 1 and 2 also show that MC-DropReLU outperforms MC-Dropout in all metrics regardless of the value of q. However, the best q differs slightly between models: the ResNet and DenseNet models achieve good accuracy and calibration scores with \(q=0.95\), while the VGG model achieves better accuracy and calibration scores with \(q=0.9\). It is worth mentioning that when \(p=0.2\) and \(q=0.8\), \(q \le 1-p\) is satisfied. In this setting, the experimental results show that the uncertainty quality of MC-DropReLU is better than that of MC-Dropout, which is consistent with the analysis in Sect. “Analysis of variance of predictions”. Although SVI has theoretical support, experiments show that its accuracy and ECE deteriorate as the model and dataset become more complex, which is why SVI is not widely used in vision tasks. Masksemble is a recently proposed method that combines the advantages of Deep Ensemble and MC-Dropout to obtain higher-quality uncertainty estimates in a shorter time. However, Tables 1 and 2 show that our method is better than Masksemble in model accuracy and training time.
In addition, we also compare the MC-Dropout method using RReLU as the activation function of the neural network, to illustrate that MC-RReLU is not a simple combination of Dropout and RReLU. It can be seen from Tables 1 and 2 that MC-RReLU is better than the MC-Dropout method using RReLU in model accuracy and ECE.
The model size and training time in Tables 1 and 2 also reflect the advantages of our method. Our method adds no parameters compared to a single model or MC-Dropout, so its space complexity is the same as that of MC-Dropout and lower than that of Deep Ensemble. This is why our methods can replace MC-Dropout at no additional memory cost. In terms of training time, our methods are slightly slower than MC-Dropout. We argue that the main reason is that sampling in the random ReLU is slower than sampling in dropout. However, the overall training time is still much shorter than that of Deep Ensemble.
Modern neural networks tend to be overconfident in their predictions, as shown in [41]. As a result, a model may make a confident judgment on data it has never seen before, even though the judgment is wrong. The more overconfident the model is, the less able it is to produce high-quality uncertainty estimates. Therefore, it is essential to evaluate the model’s calibration metrics on out-of-distribution inputs. Following [4], we evaluate model accuracy and ECE on a corrupted version of CIFAR10 [44]. Namely, we consider 19 different ways to artificially corrupt the images and five levels of severity for each of those corruptions.
We report our results in Fig. 4. We show the mean on the test set for each method and summarize the results at each intensity of shift with a box plot. Each box shows the quartiles summarizing the results across all 19 types of shift, while the error bars indicate the min and max across the different shift types. We test six approaches: a single network, MC-Dropout, Deep Ensemble, SVI, MC-DropReLU, and MC-RReLU. As the severity of the perturbations increases, the advantages of our methods become more obvious. Our methods perform on par with Deep Ensemble and consistently outperform MC-Dropout and SVI.
In Fig. 5, we choose the median of all the box plots in Fig. 4 to compare the ECE of different methods more intuitively. Although the ECE of Deep Ensemble is the lowest under different noise intensities, our methods are the closest to Deep Ensemble among the remaining methods. Single model, MC-Dropout, and SVI all have higher ECE than our methods.
TinyImageNet and TinyImageNet-C
In this part, we focus on Question 1 and Question 2. Table 3 presents accuracy and ECE for several network architectures on the TinyImageNet dataset. Higher accuracy means better generalization performance, and lower ECE means higher quality predictive uncertainty.
We follow the same evaluation protocol as in Sect. “CIFAR10/CIFAR100 and CIFAR10-C” and report our results on the original images in Table 3 and on the corrupted ones in Fig. 6. As shown in Table 3, the performance of our methods is similar to that of Deep Ensemble and significantly better than that of MC-Dropout in terms of accuracy and ECE on both the ResNet-18 and DenseNet-121 models. However, the performance on the VGG-13 model is slightly worse. We argue that the reason is that the VGG model generalizes poorly on large datasets. Note that our methods achieve these results with a training time and memory consumption four times smaller than those of Deep Ensemble and nearly the same as those of a single model. It is also worth mentioning that when \(p=0.2\) and \(q=0.8\), \(q \le 1-p\) is satisfied. In this setting, the experimental results show that the uncertainty quality of MC-DropReLU is better than that of MC-Dropout, which is consistent with the analysis in Sect. “Analysis of variance of predictions”.
In Fig. 7, we plot the median of each box plot in Fig. 6 to compare the ECE of different methods more intuitively. On TinyImageNet, the ECE gap between methods is more obvious than on CIFAR10. Among them, our method MC-DropReLU(0.9) outperforms Deep Ensemble in ECE at all noise intensities, showing that our methods are also applicable to large datasets.
Diversity analysis
In this part, we focus on Question 3. We know that diversity among models is important in uncertainty estimation. Less correlated ensembles of models deliver better performance, produce more accurate predictions [10, 14], and demonstrate lower calibration error [4]. In this paper’s sampling-based uncertainty estimation method, the diversity among models represents the diversity among multiple predictions by sampling. Better diversity represents more comprehensive information captured by the multiple models obtained by sampling, which leads to a higher quality of the uncertainty estimates. In this paper, we use two evaluation methods to measure the diversity of our proposed method compared to the baseline.
Divergence of predictions
Our goal is to measure how correlated the different models obtained by sampling are. Let \(Y_{i j}\) be the softmax output of sampled model i on test input j, which we treat as a probability distribution. We estimate the Jensen–Shannon Divergence (JSD) between \(Y_{i j}\) and \(Y_{i^{\prime } j}\) for each i, \(i^{\prime }\) and j, and then average across all test examples to obtain a single value per pair of models rather than one per test example. Figure 8 shows the results. The value in the i-th row and \(i^{\prime }\)-th column of each panel is the JSD between model i and model \(i^{\prime }\). Because JSD is symmetric, all matrices in the figure are symmetric. JSD, Mean-JSD, and Max-JSD are formally defined by
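where \(\text {JSD}\left( Y_{i j}, Y_{i^{\prime } j}\right) \) can be specifically defined as
\[ \text {JSD}\left( Y_{i j}, Y_{i^{\prime } j}\right) = \frac{1}{2}\,\text {KL}\left( Y_{i j} \,\Vert \, M\right) + \frac{1}{2}\,\text {KL}\left( Y_{i^{\prime } j} \,\Vert \, M\right) , \quad M = \frac{1}{2}\left( Y_{i j} + Y_{i^{\prime } j}\right) . \]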
Because KL divergence is not symmetric, it yields two different values for the same pair of models. We therefore use its symmetric variant, the Jensen–Shannon Divergence, to measure the diversity between models.
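The pairwise computation described above can be sketched as follows, using the standard Jensen–Shannon divergence with natural logarithms. The function names are hypothetical, and the aggregation into Mean-JSD and Max-JSD (mean and maximum over the off-diagonal model pairs) is an assumption, since the exact definitions are not reproduced in this text.

```python
import math
from itertools import combinations


def kl(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2 for natural logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_jsd(outputs):
    """outputs[i][j] is the softmax vector of sampled model i on test input j.

    Returns a symmetric matrix whose (a, b) entry is the JSD between
    models a and b, averaged over all test inputs.
    """
    n_models, n_inputs = len(outputs), len(outputs[0])
    mat = [[0.0] * n_models for _ in range(n_models)]
    for a, b in combinations(range(n_models), 2):
        total = sum(jsd(outputs[a][j], outputs[b][j]) for j in range(n_inputs))
        mat[a][b] = mat[b][a] = total / n_inputs
    return mat


def mean_and_max_offdiag(mat):
    """ASSUMED aggregation: mean and max over distinct model pairs."""
    vals = [mat[a][b] for a, b in combinations(range(len(mat)), 2)]
    return sum(vals) / len(vals), max(vals)
```

The same `mean_and_max_offdiag` helper can be applied to any symmetric pairwise matrix, so it is kept separate from the JSD computation.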
As shown in Table 4, Deep Ensemble has the highest diversity, with Mean-JSD 0.020 and Max-JSD 0.021, followed by MC-DropReLU (\(q=0.8\)) with Mean-JSD 0.017 and Max-JSD 0.017. This indicates that, in terms of this distance metric, the predictions of the models obtained by our proposed sampling method are more diverse than those of MC-Dropout and slightly less diverse than those of Deep Ensemble.
Disagreement of predictions
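Our goal is to observe how often different sampled models produce inconsistent predictions for the same inputs. We consider the disagreement in function space, defined as the fraction of the N test points \(x_n\) on which two checkpoints disagree, that is,
\[ \frac{1}{N}\sum _{n=1}^{N}\left[ f\left( x_n ; \theta _1\right) \ne f\left( x_n ; \theta _2\right) \right] , \]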
where \(f\left( x ; \theta \right) \) denotes the class label predicted by the network for input x. In ensemble-based methods, each f represents an ensemble member with a different initialization. In sampling-based methods, each f represents a network obtained by sampling. Figure 9 shows the results.
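The disagreement measure can be sketched in a few lines. The function names are hypothetical, and the input is assumed to be the list of predicted class labels of each sampled model (or ensemble member) on a shared test set.

```python
from itertools import combinations


def disagreement(labels_a, labels_b):
    """Fraction of test points on which two models predict different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both models must be evaluated on the same inputs")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def pairwise_disagreement(labels):
    """labels[i][n] is the class predicted by model i for test input n.

    Returns a symmetric matrix of pairwise disagreement fractions.
    """
    n_models = len(labels)
    mat = [[0.0] * n_models for _ in range(n_models)]
    for a, b in combinations(range(n_models), 2):
        mat[a][b] = mat[b][a] = disagreement(labels[a], labels[b])
    return mat


# Three sampled models evaluated on four test inputs.
preds = [[0, 1, 2, 3], [0, 1, 0, 0], [0, 1, 2, 3]]
print(pairwise_disagreement(preds))
```

Unlike the JSD, this measure only looks at the predicted label, so two models with different softmax outputs but the same argmax count as agreeing.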
As shown in Table 5, MC-DropReLU (\(q=0.8\)) has the highest diversity, with Mean-DIS 0.044 and Max-DIS 0.046, followed by Deep Ensemble with Mean-DIS 0.043 and Max-DIS 0.045. This indicates that, in terms of the disagreement metric, the predictions of the models obtained by our proposed sampling method are significantly more diverse than those of MC-Dropout and slightly more diverse than those of Deep Ensemble.
Combining the two diversity measurements in Sects. 4.4.1 and 4.4.2, we conclude that our method is competitive with Deep Ensemble in terms of diversity.
Position and configuration analysis of MC-DropReLU
In this part, we focus on Question 4. When using MC-Dropout in practical applications, where to insert the dropout layers, how many to use, and which dropout rate to choose are often decided empirically, leading to possibly suboptimal performance [45]. RBUE faces the same issues. Therefore, in this section, we give a quantitative analysis of where to place the DropReLU layers and how to choose the DropReLU rate, for reference. In the previous experiments, we found that MC-DropReLU performs better than MC-RReLU, so the analysis here mainly focuses on MC-DropReLU.
To analyse the influence of the position of the DropReLU layers in the neural network, we conduct experiments on TinyImageNet with DenseNet. We divide the placement of the DropReLU layers into three cases: All Layers, Last Layer, and First Layer. “All Layers” means that we place DropReLU layers before all the convolutional and fully connected layers. “Last Layer” means that we only place a DropReLU layer before the fully connected layer. “First Layer” means that we only place a DropReLU layer before the first convolutional layer. As shown in Table 6, the more DropReLU layers are used, the greater the diversity of the final results and the better the calibration metric ECE. However, more DropReLU layers also mean more sampling operations, which increases the training time. Moreover, this additional training time grows with the size of the model and the dataset.
To analyse the influence of the DropReLU rate, we conduct experiments on ResNet-18 with CIFAR10. Figure 10 depicts the resulting range of behaviors. The 2D coordinates of the markers depict their accuracy and ECE, and their colors correspond to the hyperparameter q. For comparison purposes, we also display MC-Dropout and Deep Ensemble results in a similar manner, simply replacing the star with a square and a circle, respectively. As can be seen, the optimal MC-DropReLU configuration depicted by the yellow star can provide better performance than MC-Dropout and performance close to Deep Ensemble. Although the ECE of the configuration depicted by the yellow star is not the smallest, it is the best result after a trade-off between ECE and Accuracy.
Ablation studies
In the ablation experiments, we study how each method behaves when one module is changed while the other modules remain the same.
First, Table 7 presents the differences and connections between Dropout, RReLU, and DropReLU and their Monte Carlo counterparts MC-Dropout, MC-RReLU, and MC-DropReLU. If the randomness of these three modules is exploited only during the training phase, they serve only their original function and cannot be used to estimate uncertainty. If the randomness of these three modules is utilized in both the training and testing phases, then the uncertainty can be estimated in combination with Monte Carlo sampling.
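To illustrate what keeping the randomness active in the testing phase means, the following is a minimal sketch of Monte Carlo sampling with a stochastic activation on a toy two-layer network. It rests on several assumptions that are not confirmed by this section: DropReLU is taken to apply ReLU with probability q and the identity otherwise, independently per unit; the uncertainty measure is the entropy of the averaged softmax output; and the network, weights, and all function names are hypothetical.

```python
import math
import random


def drop_relu(x, q, rng):
    # ASSUMPTION: keep ReLU with probability q, otherwise pass x through unchanged.
    return max(x, 0.0) if rng.random() < q else x


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def stochastic_forward(x, w1, w2, q, rng):
    """One forward pass of a toy two-layer network with a random activation."""
    hidden = [drop_relu(sum(w * xi for w, xi in zip(row, x)), q, rng) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return softmax(logits)


def mc_predict(x, w1, w2, q, n_samples=20, seed=0):
    """Average n_samples stochastic passes; return mean softmax and its entropy."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, w1, w2, q, rng) for _ in range(n_samples)]
    n_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / n_samples for c in range(n_classes)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy


# Hypothetical weights and input, for illustration only.
w1 = [[1.0, -1.0], [0.5, 2.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
mean, entropy = mc_predict([1.0, -2.0], w1, w2, q=0.8)
print(mean, entropy)
```

With q = 1.0 the activation is always ReLU, every sample is identical, and the procedure reduces to a single deterministic forward pass, which corresponds to exploiting the randomness only during training.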
Second, Table 8 shows the impact of using different modules in each approach. Specifically, in the MC-Dropout method, the activation function has three options: ReLU, RReLU, and DropReLU. Whether a fixed or a random activation function is used during training affects the prediction accuracy and uncertainty estimation ability of the neural network. In the MC-RReLU and MC-DropReLU methods, we can choose whether or not to use the Dropout module during training, which likewise affects the prediction accuracy and uncertainty estimation ability of the neural network.
The experiments in Table 8 are conducted on the CIFAR10 dataset using the ResNet-18 model. All results are averages over three independent runs of the training and testing process. The hyperparameter of MC-Dropout is 0.5, and the hyperparameter of MC-DropReLU is 0.95.
From the experimental results, for the MC-Dropout method, the fixed activation function (ReLU) gives the best prediction accuracy and uncertainty quality, with an accuracy of 95.26 and an ECE of 0.03. When a random activation function (RReLU or DropReLU) is used instead, performance decreases, probably because the randomness introduced by both the Dropout module and the random activation function affects the model’s performance: too much randomness makes the model difficult to train and hinders its convergence. Similar conclusions hold for the MC-RReLU and MC-DropReLU methods. That is, not using the Dropout module improves the prediction accuracy and uncertainty estimation ability of the model, because using Dropout together with a random activation function introduces double randomness during training, which affects the convergence of the neural network. From this comparison, we find that a single stochastic module can improve the prediction and uncertainty estimation ability of the model, whereas the combined randomness of multiple stochastic modules can degrade them. Finally, we conclude that MC-DropReLU(0.95) trained without the Dropout module achieves the best prediction and uncertainty estimation ability, with an accuracy of 95.32 and an ECE of 0.02.
Conclusions and future work
In this work, we introduce RBUE, a novel method to estimate uncertainty in convolutional neural networks. Instead of using the randomness of the Dropout module during the test phase (MC-Dropout) or using the randomness of the initial weights of CNNs (Deep Ensemble), RBUE uses the randomness of activation function to obtain diverse outputs in the testing phase to estimate uncertainty. Under the method, we propose strategy MC-DropReLU and develop strategy MC-RReLU. The main difference between them is the sampling distribution of the slope of the negative semi-axis of ReLU. Furthermore, we briefly analyse the prediction variance of the proposed strategy (in the case of MC-DropReLU) theoretically to demonstrate the feasibility of the method, and give the laws for setting the hyperparameters in the method. Moreover, by changing the hyperparameter q, we can span a range of behaviors between those of MC-Dropout and Deep Ensemble. This allows us to identify model configurations that provide a useful trade-off between the high-quality uncertainty estimates of Deep Ensemble at a high computational cost and the lower performance of MC-Dropout at a lower computational cost. Our experiments demonstrate that we can achieve the performance on par with that of Deep Ensemble at a fraction of the cost.
In the future, we will investigate uncertainty estimation methods for other SoTA architectures (e.g. RNN, Transformer and GNN) and apply them to NLP tasks dealing with sequential data, which will be a very practical area.
Data availability
Data will be made available on request.
References
Lakshminarayanan B, Pritzel A, Blundell C (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. Adv Neural Inform Process Syst 30:6402–6413
Gal Y, Ghahramani Z (2016) Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: international conference on machine learning, pp 1050–1059. PMLR
Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat, pp 400–407
Ovadia Y, Fertig E, Ren J, Nado Z, Sculley D, Nowozin S, Dillon J, Lakshminarayanan B, Snoek J (2019) Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In: Advances in Neural Information Processing Systems, pp 13991–14002
Gustafsson FK, Danelljan M, Schon TB (2020) Evaluating scalable bayesian deep learning methods for robust computer vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp 318–319
Fort S, Hu H, Lakshminarayanan B (2019) Deep ensembles: a loss landscape perspective. arXiv preprint arXiv:1912.02757
Wen Y, Tran D, Ba J (2020) Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715
Xu B, Wang N, Chen T, Li M (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853
Gal Y (2016) Uncertainty in Deep Learning. PhD thesis, University of Cambridge
Hansen LK, Salamon P (1990) Neural network ensembles. IEEE Trans Pattern Anal Mach Intell 12(10):993–1001
Xie J, Xu B, Chuang Z (2013) Horizontal and vertical ensemble with deep representation for classification. arXiv preprint arXiv:1306.2759
Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ (2017) Snapshot ensembles: train 1, get m for free. arXiv preprint arXiv:1704.00109
Krizhevsky A et al (2009) Learning multiple layers of features from tiny images
Perrone MP, Cooper LN (1992) When networks disagree: Ensemble methods for hybrid neural networks. Technical report, Brown Univ, Providence RI Inst For Brain and neural systems
Lee S, Purushwalkam S, Cogswell M, Crandall D, Batra D (2015) Why m heads are better than one: training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Kendall A, Badrinarayanan V, Cipolla R (2017) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In: British Machine Vision Conference 2017, BMVC 2017
Jungo A, McKinley R, Meier R, Knecht U, Vera L, Pérez-Beteta J, Molina-García D, Pérez-García VM, Wiest R, Reyes M (2017) Towards uncertainty-assisted brain tumor segmentation and survival prediction. In International MICCAI Brainlesion Workshop, pages 474–485. Springer
Verdoja F, Lundell J, Kyrki V (2019) Deep network uncertainty maps for indoor navigation. In: 2019 IEEE-RAS 19th International Conference on Humanoid Robots (Humanoids), pages 112–119. IEEE
Denker JS, LeCun Y (1990) Transforming neural-net output levels to probability distributions. In: Proceedings of the 3rd International Conference on Neural Information Processing Systems, pages 853–859
MacKay DJC (1992) A practical bayesian framework for backpropagation networks. Neural Comput 4(3):448–472
Neal RM (2012) Bayesian learning for neural networks, volume 118. Springer Science & Business Media
MacKay DJC (1992) Bayesian methods for adaptive models. PhD thesis, California Institute of Technology
Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural network. In: International Conference on Machine Learning, pages 1613–1622. PMLR
Maas AL, Hannun AY, Ng AY (2013) Rectifier nonlinearities improve neural network acoustic models. In: Proc. icml, volume 30, page 3. Citeseer
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778
Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ (2016) Deep networks with stochastic depth. In European conference on computer vision, pages 646–661. Springer
Larsson G, Maire M, Shakhnarovich G (2016) Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648
Lin M, Chen Q, Yan S (2013) Network in network. arXiv preprint arXiv:1312.4400
Romero A, Ballas N, Kahou SE, Chassang A, Gatta C, Bengio Y (2014) Fitnets: hints for thin deep nets. arXiv preprint arXiv:1412.6550
Lee C-Y, Xie S, Gallagher P, Zhang Z, Tu Z (2015). Deeply-supervised nets. In: Artificial intelligence and statistics, pages 562–570. PMLR
Springenberg JT, Dosovitskiy A, Brox T, Riedmiller M (2014) Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806
Srivastava RK, Greff K, Schmidhuber J (2015) Training very deep networks. arXiv preprint arXiv:1507.06228
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708
Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147. PMLR
Hendrycks D, Gimpel K (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136
Wu A, Nowozin S, Meeds E, Turner RE, Hernandez-Lobato JM, Gaunt AL (2019) Deterministic variational inference for robust bayesian neural networks
Durasov N, Bagautdinov T, Baque P, Fua P (2021) Masksembles for uncertainty estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13539–13548
Friedman J, Hastie T, Tibshirani R, et al (2001) The elements of statistical learning, volume 1. Springer series in statistics New York
Guo C, Pleiss G, Sun Y, Weinberger KQ (2017) On calibration of modern neural networks. In: International Conference on Machine Learning, pages 1321–1330. PMLR
Pakdaman Naeini M, Cooper G, Hauskrecht M (2015) Obtaining well calibrated probabilities using bayesian binning. In: Proceedings of the AAAI Conference on Artificial Intelligence, volume 29
Wasay A, Hentschel B, Liao Y, Chen S, Idreos S (2020) Mothernets: Rapid deep ensemble learning. Proc Mach Learn Syst 2:199–215
Hendrycks D, Dietterich T (2018) Benchmarking neural network robustness to common corruptions and perturbations. In: International Conference on Learning Representations
Verdoja F, Kyrki V (2020) Notes on the behavior of mc dropout. arXiv preprint arXiv:2008.02627
Pei J, Wang C, Szarvas G (2022) Transformer uncertainty estimation with hierarchical stochastic attention. Proc AAAI Conf Artif Intell 36:11147–11155
Pollithy D, Reith-Braun M, Pfaff F, Hanebeck UD (2020) Estimating uncertainties of recurrent neural networks in application to multitarget tracking. In: 2020 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI), pages 229–236. IEEE
Zhao X, Chen F, Hu S, Cho J-H (2020) Uncertainty aware semi-supervised learning on graph data. Adv Neural Inform Process Syst 33:12827–12836
Acknowledgements
This work was supported by the Natural Science Foundation of China (Nos. 11725211, 52005505, 62001502) and the Postgraduate Scientific Research Innovation Project of Hunan Province (CX20200006).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xia, Y., Zhang, J., Gong, Z. et al. RBUE: a ReLU-based uncertainty estimation method for convolutional neural networks. Complex Intell. Syst. 9, 4735–4749 (2023). https://doi.org/10.1007/s40747-023-00973-0
DOI: https://doi.org/10.1007/s40747-023-00973-0