HatchEnsemble: an efficient and practical uncertainty quantification method for deep neural networks

Quantifying predictive uncertainty in deep neural networks is a challenging and still unsolved problem. Existing quantification approaches fall into two lines. Bayesian methods provide a complete theory of uncertainty quantification but often do not scale to large models. Non-Bayesian methods, by contrast, scale well and can quantify uncertainty with high quality. The most notable idea in this line is Deep Ensemble, but it is limited in practice by its high computational cost. We therefore propose HatchEnsemble to improve the efficiency and practicality of Deep Ensemble. The main idea is to use function-preserving transformations so that HatchNets inherit the knowledge learned by a single model called SeedNet. This process is called hatching, and a HatchNet is obtained by continuously widening the SeedNet. Based on our method, two different hatches are proposed for ensembling networks of the same architecture and of different architectures, respectively. To ensure diversity among the models, we also add random noise to the parameters during hatching. Experiments on both clean and corrupted datasets show that HatchEnsemble gives competitive prediction performance and better-calibrated uncertainty quantification in a shorter time than the baselines.


Introduction
Deep neural networks (DNNs) have achieved state-of-the-art performance in various machine learning tasks [1] and are becoming increasingly popular in the fields of B [2], speech recognition [3], natural language processing [4], and bioinformatics [5]. Despite their excellent prediction performance, DNNs have difficulty quantifying the uncertainty of their predictions. Recent studies have shown that DNNs are overconfident in their predictions and produce miscalibrated softmax output probabilities for classification [6]. Moreover, they may make wrong yet confident predictions for out-of-distribution samples that differ significantly from the training data distribution [7]. From self-driving cars to automatic medical diagnostics, uncertainty quantification has become an urgent need for many real-world applications, making it critical to equip DNNs with the ability to understand unknown information.
The existing neural network uncertainty quantification methods can be divided into two categories. The first category is based on Bayesian neural networks (BNNs) [8,9]. BNNs quantify predictive uncertainty by making the model parameters obey a probability distribution rather than using point estimates. Although BNNs provide a set of theoretical methods for uncertainty quantification, it is usually difficult to infer the true posteriors of the parameters. Moreover, specifying parameter priors for BNNs is challenging because the parameters of DNNs are huge in size. Another category is based on non-Bayesian approaches and several methods have been proposed for uncertainty quantification. The most prominent idea in this category is model ensembling [10], which trains multiple DNNs with different initializations and uses all the prediction results for uncertainty estimation. Lakshminarayanan et al. [10] showed that Deep Ensemble gives reliable predictive uncertainty while remaining scalable and straightforward. Ovadia et al. [11] presented a large-scale evaluation of different methods for quantifying predictive uncertainty under dataset shift across different data modalities and architectures. They found Deep Ensemble seems to perform the best across most of the metrics and be more robust to dataset shift than other methods such as MC-Dropout [12] and temperature scaling [13].
However, the standard Deep Ensemble is limited in practice by its computational cost, which increases linearly with the ensemble size. Each ensemble member needs an independent training process, which is time-consuming when the model or dataset is large. A single neural network may take several days to train even on high-performance hardware, and this time cost is unacceptable [14][15][16]. In this paper, we propose HatchEnsemble, which addresses this problem while still quantifying uncertainty with high quality. The processes of the standard Deep Ensemble method and of our method are shown in Fig. 1.

Summary of contributions.
Our contribution in this paper is threefold.
-We propose an efficient and practical ensemble method called HatchEnsemble for ensembling neural networks. By reusing the parameters of the smaller SeedNet and transferring the knowledge it has learned to the HatchNets, the convergence of the HatchNets is accelerated. Our method improves efficiency while retaining comparable result quality.
-Based on our method, we propose two kinds of hatches, which allow us to ensemble not only networks of the same architecture but also networks with different widths.
-We propose a series of tasks for evaluating the quality of predictive uncertainty in terms of calibration in supervised learning problems. We show that our method (i) significantly outperforms MC-Dropout and (ii) matches Deep Ensemble while being faster.

Related work
In this section, we describe some related work about uncertainty quantification methods for deep neural networks, mainly divided into Bayesian neural networks and non-Bayesian neural networks.

Uncertainty quantification method based on Bayesian neural networks
In recent years, many works have been devoted to giving deep neural networks probabilistic characteristics so that they can predict uncertainty. A large part of these works are based on Bayesian theory [17]. They first assume that the neural network parameters obey a certain prior distribution, then train the network on the training data to compute the posterior distribution over the parameters, and finally use this posterior distribution to quantify the uncertainty of the prediction. It is almost impossible to infer the true posterior distribution exactly through the Bayesian formula for models with a large number of parameters, such as neural networks. Therefore, a series of approximate inference methods have been developed, including Laplace approximation [18], Markov Chain Monte Carlo (MCMC) [19], as well as recent works on variational Bayesian methods [20,21] and expectation propagation [22].
A key element that can affect the performance of BNNs is the choice of the prior distribution. The most common prior distribution to use is the independent Gaussian distribution, which can only give limited and even biased information for uncertainty. And because the Bayesian method involves sampling from the distribution, BNNs are more difficult to train, and the calculation is relatively slow. The experiment results in Ref. [11] also prove that BNNs are difficult to get to work on larger datasets such as ImageNet and other architectures such as LSTMs. Therefore, we are more inclined to study uncertainty quantification methods based on non-Bayesian theory.

Uncertainty quantification method based on non-Bayesian neural networks
Several non-Bayesian methods have also been proposed for uncertainty quantification. Gal et al. [12] proposed a simple non-Bayesian uncertainty quantification method called Monte-Carlo Dropout (MC-Dropout). By enabling dropout [23] in training and testing phases and making multiple forward passes through the network using the same input, one can easily estimate predictive uncertainty. Many works have used this method in recent years for its practicality. Some works [12,24,25] also tried to explain this method from the perspective of Bayesian theory.
Another non-Bayesian uncertainty quantification method is Deep Ensemble [10] mentioned in the previous section. More recently, Ovadia et al. [11] benchmarked existing methods for uncertainty modeling on a broad range of datasets and architectures and observed that ensembles tend to outperform variational Bayesian neural networks in terms of both accuracy and uncertainty. Gustafsson et al. [26] applied their proposed framework and provided the first properly extensive and conclusive comparison of ensembling and MC-Dropout, the results of which demonstrated that ensembling consistently provides more reliable and practically useful uncertainty estimation.
Recently, many works have been devoted to improving Deep Ensemble. Wen et al. [27] proposed BatchEnsemble by defining each weight matrix to be the Hadamard product of a weight shared among all ensemble members and a rank-one matrix per member. Lee et al. [28] proposed TreeNets, and Asif et al. [29] used knowledge distillation to reduce model parameters. Snapshot Ensembles [30] use a cyclic learning rate strategy to save models that converge to multiple local minima within a training period and then use them for ensembling. However, most of these studies cannot support ensembling neural networks of different architectures, and only a small part of them consider the impact of model diversity. Most existing methods focus on improving the prediction performance of the model, and only a few study ensembling from the perspective of uncertainty quantification.

Method
Aiming at the problem of the high computational consumption of the Deep Ensemble, this part will introduce how our method solves this problem in detail. We first define two parts of our method: SeedNet and HatchNet in "Definition: SeedNet and HatchNet" section. Then in "Training procedure of HatchEnsemble" section, the Hatch method and the entire training process of HatchEnsemble are introduced. Two different hatch methods derived from our method will be described in "Two different Hatch methods" section. Finally, we describe how to increase the diversity between models in "Improving diversity via adding noises to parameters" section.

Definition: SeedNet and HatchNet
As shown in step 1 and step 3 of Fig. 2, SeedNet is a single network and can be seen as the foundation of HatchNets. HatchNets can be seen as the growth of a SeedNet. For a fully connected neural network, hatching means increasing the number of neurons in the same layer. For a convolutional neural network, hatching means increasing the number of channels in one layer.
Suppose a training dataset D consists of N i.i.d. samples where θ is the parameters of the network, x is the input to the network, and y is the output of the network. Our hatch operation is to choose a new set of parameters θ i for HatchNets where M represents the number of HatchNets; in other words, it also represents the number of ensemble members. We call this process hatch, which means that different HatchNets are extended from the SeedNet.
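The setup described above, and the function-preserving requirement that the hatch operation places on every HatchNet (the condition later referred to as Eq. 1), can be written out as follows. This is a sketch in assumed notation, following the Net2Net formulation [32]: the symbols $f$ for the SeedNet mapping and $h_i$ for the mapping of the $i$-th HatchNet are illustrative and are not taken from the original text.

```latex
\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}, \qquad y = f(x; \theta)

\forall x: \quad f(x; \theta) = h_i(x; \theta_i), \qquad i = 1, \dots, M
```

Read this way, each HatchNet starts out computing the same function as the SeedNet, although its parameter set $\theta_i$ is larger. Once noise is added to the copied parameters, the equality holds only approximately.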

Training procedure of HatchEnsemble
Training Step 1: Training the SeedNet. As shown in step 1 of Fig. 2, we first choose a standard basic neural network architecture as the SeedNet. On the one hand, a standard architecture is easy to reuse and modify, which increases the practicality of the method proposed in this paper. On the other hand, using a standard architecture as the SeedNet facilitates comparison with the baselines proposed by other researchers. We then train it on the entire dataset until convergence. This allows the SeedNet to learn a good core representation of the data.

Training Step 2: Hatching ensemble networks. Once the selected SeedNet is well trained, the next step is to use a series of function-preserving transformations to generate HatchNets that are wider than the SeedNet. A function-preserving transformation makes small changes to a network while preserving the function mapping it computes. This ensures that the knowledge learned by the SeedNet is retained, while the widening operation ensures diversity among the HatchNets. There are two methods to achieve Eq. 1. Network Morphism [31] derives necessary and sufficient conditions under which a network can be expanded while maintaining its function, and provides an algorithm to solve these conditions. Net2Net, the other method, increases the capacity of a given network by adding an identity layer or by keeping the existing weights [32].
In HatchEnsemble, we adopt the second approach. Suppose that both layer $i$ and layer $i+1$ are fully connected layers. To widen layer $i$, we need to replace the input-side weight matrix $W^{(i)}$ for layer $i$ and the output-side weight matrix $W^{(i+1)}$ for layer $i+1$. If layer $i$ has $m$ inputs and $n$ outputs, and layer $i+1$ has $n$ inputs and $p$ outputs, then $W^{(i)} \in \mathbb{R}^{m \times n}$ and $W^{(i+1)} \in \mathbb{R}^{n \times p}$. Hatching allows us to replace layer $i$, which originally had only $n$ outputs, with a layer that has $q$ outputs, with $q > n$.
First, we need a random mapping function $g$ that expands the list of indices of the $n$ original neurons $\{1, 2, \cdots, n\}$ to $q$ neurons $\{1, 2, \cdots, q\}$ and satisfies

$$g(j) = \begin{cases} j, & j \le n \\ \text{a random sample from } \{1, 2, \cdots, n\}, & j > n \end{cases}$$

Through the mapping function $g(j)$, the first $n$ items of the newly generated list are copied directly from the original list, and items $n+1$ to $q$ of the newly generated list are randomly selected from the original list. Then, based on the random mapping function $g$, the new weight matrices $U^{(i)}$ and $U^{(i+1)}$ of these layers after the hatch operation are given in the following form:

$$U^{(i)}_{k,j} = W^{(i)}_{k,g(j)}, \qquad U^{(i+1)}_{j,h} = \frac{1}{|\{x \mid g(x) = g(j)\}|} W^{(i+1)}_{g(j),h}$$

Here, the first $n$ columns of $W^{(i)}$ are copied directly into $U^{(i)}$. Columns $n+1$ to $q$ of $U^{(i)}$ are created through the random strategy defined in $g$. Each column of $W^{(i)}$ is potentially copied many times. Because of this randomness, even HatchNets with the same architecture start from different initial parameters. For the weights in $U^{(i+1)}$, we must account for the replication by dividing each weight by the replication factor, that is, multiplying it by $\frac{1}{|\{x \mid g(x) = g(j)\}|}$, similar to the operation on the output neurons in Dropout [23], to ensure that the output expectation is consistent with the original network.
Hatching has a low computational cost [31] that is negligible compared with training or testing neural networks, which also substantially reduces the time consumed by the entire pipeline.
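The widening step for two fully connected layers can be illustrated with a short NumPy sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names `hatch_widen` and `forward`, the bias vector `b1`, the ReLU activation and the toy layer sizes are all hypothetical choices made for the example.

```python
import numpy as np

def hatch_widen(W1, b1, W2, q, rng):
    """Widen a fully connected layer from n to q units (Net2Net-style widening).

    W1: (m, n) input-side weights, b1: (n,) bias (assumed; the text does not
    discuss biases), W2: (n, p) output-side weights.
    """
    n = W1.shape[1]
    if q <= n:
        raise ValueError("q must be larger than n")
    # Random mapping g (0-based): the first n entries are the identity,
    # the remaining q - n entries are sampled from the existing units.
    g = np.concatenate([np.arange(n), rng.integers(0, n, size=q - n)])
    # Replication count |{x : g(x) = k}| for every original unit k.
    counts = np.bincount(g, minlength=n)
    U1 = W1[:, g]                        # copy input-side columns
    c1 = b1[g]                           # copy the matching biases
    U2 = W2[g, :] / counts[g][:, None]   # divide output-side rows by the count
    return U1, c1, U2, g

def forward(x, W1, b1, W2):
    """Two-layer network with a ReLU hidden layer (assumed activation)."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2

rng = np.random.default_rng(0)
m, n, p, q = 4, 3, 2, 5                  # toy sizes, chosen for illustration
W1 = rng.normal(size=(m, n))
b1 = rng.normal(size=n)
W2 = rng.normal(size=(n, p))

U1, c1, U2, g = hatch_widen(W1, b1, W2, q, rng)

x = rng.normal(size=(8, m))
seed_out = forward(x, W1, b1, W2)
hatch_out = forward(x, U1, c1, U2)
max_gap = float(np.max(np.abs(seed_out - hatch_out)))  # close to zero
```

Because every copied unit produces the same activation as its source and the outgoing weights are divided by the number of copies, the widened network reproduces the output of the original one up to floating-point error. Calling `hatch_widen` with different random generators gives different mappings `g`, which is the source of the initial diversity described above.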

Training Step 3: Training the HatchNets. Compared with initializing from scratch and training to convergence, further training of the HatchNets converges significantly faster, because the initial parameters of these ensemble networks are derived from the SeedNet rather than from random initialization. The SeedNet has already converged in its own parameter space, so the ensemble networks only need to explore a small part of the parameter space. Experiments also confirm that HatchNets converge to local minima in fewer epochs.
So far, the necessary training process is over. We can summarize it as Algorithm 1.
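The three training steps can be sketched as the following loop. This is a minimal illustration and not the authors' Algorithm 1: `hatch_ensemble`, `toy_train`, `toy_hatch`, the dictionary used as a stand-in for a model, and the epoch counts are all hypothetical. Only the ensemble size of 5 and the widening from 64 to 70 channels come from the experiments described later.

```python
from typing import Any, Callable, List

def hatch_ensemble(train: Callable[[Any, int], Any],
                   hatch: Callable[[Any, int], Any],
                   seed_net: Any,
                   widths: List[int],
                   seed_epochs: int,
                   hatch_epochs: int) -> List[Any]:
    """Train a SeedNet once, hatch M = len(widths) HatchNets, train each briefly."""
    seed_net = train(seed_net, seed_epochs)              # Step 1: train the SeedNet
    members = []
    for width in widths:
        hatch_net = hatch(seed_net, width)               # Step 2: hatch a wider network
        members.append(train(hatch_net, hatch_epochs))   # Step 3: train the HatchNet
    return members

# Toy stand-ins: a "model" is a dict that records its width and the epochs trained.
def toy_train(model, epochs):
    return {**model, "epochs": model["epochs"] + epochs}

def toy_hatch(model, width):
    return {**model, "width": width}

members = hatch_ensemble(toy_train, toy_hatch,
                         seed_net={"width": 64, "epochs": 0},
                         widths=[70] * 5,     # same-architecture variant
                         seed_epochs=100,     # hypothetical epoch budget
                         hatch_epochs=20)     # hypothetical epoch budget
```

Passing a list of different widths instead of `[70] * 5` would correspond to ensembling networks of different architectures.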

Two different Hatch methods
The ensemble method we propose can be divided into HatchEnsemble A and HatchEnsemble B according to whether the ensemble model's architectures are the same. We show it in Fig. 3. "A" represents that the model architectures transformed from the SeedNet are the same; that is, the way of widening is the same. Since the newly added parameters are randomly copied from the existing parameters, this operation can still ensure the models' diversity. "B" represents

Improving diversity via adding noises to parameters
In our method, the newly added neuron parameters or the newly added channel are randomly copied from the existing parameters. Due to the randomness of replication, the initial parameters of different HatchNets are different, increasing the diversity among them. To further amplify this advantage, we add Gaussian noise to the copied parameters so that the diversity between the HatchNets is amplified. This breaks symmetry after hatching, and it is a standard technique to create diversity when training ensemble networks. Further, adding noise forces the HatchNets to be in a different part of the hypothesis space from their SeedNet.
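As a minimal sketch of this step, the snippet below adds zero-mean Gaussian noise only to the copied columns of a widened weight matrix. The function name `add_noise_to_copies` and the noise scale `sigma` are assumptions made for illustration; the text does not state the noise scale used.

```python
import numpy as np

def add_noise_to_copies(U, n, sigma, rng):
    """Add Gaussian noise to columns n..q-1 of U (the copied units).

    The first n columns, which hold the original SeedNet weights, are left
    unchanged. sigma is a hypothetical noise scale.
    """
    noisy = U.copy()
    noisy[:, n:] += rng.normal(0.0, sigma, size=noisy[:, n:].shape)
    return noisy

rng = np.random.default_rng(1)
U = np.ones((4, 5))   # toy widened matrix: 3 original columns + 2 copies
noisy = add_noise_to_copies(U, n=3, sigma=0.01, rng=rng)
```

With a small `sigma`, the HatchNet stays close to the function computed by the SeedNet while the copied units are no longer exact duplicates, so gradient updates can move them apart.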
After training by the HatchEnsemble method, each network's prediction will tend to be more diverse and further leads to different feature distributions and decision domains as shown in Fig. 4. Figure 4 shows the t-SNE visualization results on each network's final hidden features in four methods. Different colors represent different categories. We can see that the final prediction results of multiple neural networks obtained by MC-Dropout method are very similar. The color distribution is roughly the same, which means the diversity is low. The color distribution of our proposed method is more random, which represents the diversity is better.

Experiments
In this section, we demonstrate the advantages of our proposed method through several experiments, organized around five questions (Question 1 to Question 5) that are addressed in turn in the sections below.
The CIFAR10-C and TinyImageNet-C datasets consist of 19 diverse corruption types applied to the validation images of CIFAR10 and TinyImageNet. The corruptions are drawn from four main categories: noise, blur, weather, and digital. Each corruption type has five levels of severity, since corruption can manifest itself at varying intensities. Figure 5 gives an example of the five severity levels for shot noise. We test networks on CIFAR10-C and TinyImageNet-C images in our experiments, but networks should not be trained on these datasets; they should be trained on clean datasets such as CIFAR10 and TinyImageNet. Overall, the CIFAR10-C and TinyImageNet-C datasets consist of 95 corruptions (19 types at five severity levels), all applied to CIFAR10 and TinyImageNet validation images for testing a pre-existing network.

Experiment setting
In this part, we explain our experimental setup in detail. Our experiments are divided into five tasks:
-Task1 evaluates LeNet [33] on MNIST. Model parameters were trained for 20 epochs. The basic LeNet architecture applies two convolutional layers (5 × 5 kernels with 6 and 16 filters, respectively) followed by three fully connected layers, with two hidden layers of 128 and 64 activations. For stochastic methods such as MC-Dropout, we averaged 256 sample predictions to yield a predictive distribution, and dropout was applied before the final layer with p = 0.1/0.2/0.5. The ensemble size (for both the traditional Deep Ens and the ensemble method we propose) was 5.
For all tasks, we use stochastic gradient descent with a mini-batch size of 128 for MNIST, FashionMNIST and CIFAR10 and a mini-batch size of 200 for TinyImageNet. All weights are initialized by sampling from a standard normal distribution. Training data are randomly shuffled before every training epoch. The initial learning rate is set to 0.01 for MNIST and FashionMNIST and to 0.1 for CIFAR10 and TinyImageNet, and is divided by 10 at 45%, 67.5% and 90% of the total number of training epochs. To train the hatched neural networks, we halve the above learning rate to fine-tune the newly added parameters.
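The learning-rate schedule described above can be written as a small function. This sketch rests on two assumptions that the text does not spell out: epochs are counted from zero, and "half of the original" means halving the initial learning rate while keeping the same decay points. The function name and the total of 200 epochs in the example are hypothetical.

```python
def learning_rate(epoch, total_epochs, base_lr, hatched=False):
    """Step schedule: divide by 10 at 45%, 67.5% and 90% of training.

    For hatched networks the starting rate is halved (assumed reading of the
    setup); the decay points stay the same.
    """
    lr = base_lr * (0.5 if hatched else 1.0)
    for milestone in (0.45, 0.675, 0.90):
        if epoch >= milestone * total_epochs:
            lr /= 10.0
    return lr

# CIFAR10 setting from the text (base rate 0.1) with a hypothetical 200 epochs.
schedule = [learning_rate(e, 200, 0.1) for e in (0, 100, 150, 190)]
hatched_start = learning_rate(0, 200, 0.1, hatched=True)
```

In this example the rate starts at 0.1, drops to roughly 0.01, 0.001 and 0.0001 after the three decay points, and a hatched network starts at 0.05.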
In consideration of extracting more features of the input image and enhancing the model's representation ability, our method's widening operation is applied to several layers close to the input in all models. Hatch Ens A ensembles the same model architectures and Hatch Ens B ensembles the different model architectures.
All experiments were run on the same server, using 4 NVIDIA TITAN RTX GPUs.

Metrics
In addition to metrics that do not rely on predicted uncertainty, such as classification accuracy ↑ (the arrow after a metric indicates which direction is better), we use three metrics to measure the quality of predicted uncertainty.
Negative Log Likelihood (NLL) ↓ is a standard measure of a probabilistic model's quality [35] and is commonly used to evaluate the quality of model uncertainty. In deep learning, it is also called the cross-entropy loss function [1]. Given a probabilistic model $\pi(Y \mid X)$ and $n$ samples, NLL is the negative log-probability that the model assigns to the true labels of the samples.
Brier Score (BS) ↓ is a proper score function that measures the accuracy of probabilistic predictions [36]. It is obtained by calculating the mean square error between the true label $y_i$ and the predicted probability $p_i$. The smaller the Brier score, the better the calibration. A drawback of the Brier score is that it is insensitive to predicted probabilities associated with in/frequent events.
Expected Calibration Error (ECE) ↓ measures the correspondence between predicted probabilities and empirical accuracy [37]. To calculate ECE, we group model predictions into $S$ interval bins based on the predictive confidence (each bin has size $1/S$). Let $B_s$ denote the set of samples whose predictive probability falls into the interval $\left(\frac{s-1}{S}, \frac{s}{S}\right]$ for $s \in \{1, \ldots, S\}$. Let $\mathrm{acc}(B_s)$ and $\mathrm{conf}(B_s)$ be the averaged accuracy and averaged confidence of the examples in the bin $B_s$. The ECE is defined as

$$\mathrm{ECE} = \sum_{s=1}^{S} \frac{|B_s|}{n} \left| \mathrm{acc}(B_s) - \mathrm{conf}(B_s) \right|,$$

where $n$ is the number of samples.
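The three uncertainty metrics can be computed with a few lines of NumPy. The sketch below makes several assumptions that should be checked against the intended definitions: NLL is averaged over samples rather than summed, the Brier score uses the multi-class form (squared error between the one-hot label and the full probability vector, summed over classes), and ECE uses 10 equal-width bins. The function names and the toy predictions are hypothetical.

```python
import numpy as np

def nll(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    n = len(labels)
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))

def brier_score(probs, labels):
    """Mean squared error between one-hot labels and predicted probabilities."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def ece(probs, labels, num_bins=10):
    """Expected calibration error with equal-width confidence bins (lo, hi]."""
    conf = probs.max(axis=1)                                   # predictive confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)   # 1 if prediction is right
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    n = len(labels)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            total += in_bin.sum() / n * gap
    return float(total)

# Toy predictions for three samples and two classes.
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.2, 0.8]])
labels = np.array([0, 1, 1])

nll_value = nll(probs, labels)
bs_value = brier_score(probs, labels)
ece_value = ece(probs, labels)
```

For an ensemble, `probs` would be the average of the members' softmax outputs, so the same functions apply to a single model and to any of the ensemble methods compared here.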

Performance of HatchEnsemble under clean datasets
In this part, we focus on Question 1. In the two methods we propose, the model architectures are broadened according to the following rules:
We mark the two best results for each metric in Table 1 because we do not need our proposed method to surpass Deep Ens on every task; what we expect is performance comparable to Deep Ens. From the colored numbers in Table 1, we can see that the two proposed methods perform well. On the tasks where Deep Ens achieves the best performance, our methods follow closely behind and are only slightly worse. On the tasks where Deep Ens does not achieve the best performance, our proposed method surpasses it and obtains the best result. This indicates that our methods are effective.
In Table 1, the red numbers represent each task's optimal value, and the blue numbers represent each task's suboptimal value.

Performance of HatchEnsemble under corrupted datasets
In this part, we focus on Question 2. Current neural networks are overconfident in their predictions, as shown in Ref. [6]. This overconfidence has two harmful consequences. First, a neural network can produce a very confident result for data that it has never seen before, even when the result is wrong. Second, if a neural network is too confident about its output, it treats everything as certain, so the estimated uncertainty is of low quality and cannot be used as a basis for decision-making. Therefore, it is essential to evaluate the model's calibration metrics on out-of-distribution samples for uncertainty estimation. Figures 6 and 7 summarize the accuracy, NLL, BS and ECE for CIFAR10-C and TinyImageNet-C in Task4 and Task5 across all 95 combinations of corruptions and intensities from Ref. [39]. We show the mean on the test set for each method and summarize the results at each intensity of shift with a box plot. Each box shows the quartiles summarizing the results across all 19 types of shift, while the error bars indicate the minimum and maximum across shift types. A similar measurement can be found in Ref. [11]. We find that all methods improve upon the single model, but MC-Dropout is still much worse than the explicit ensemble methods. This is also why much recent work has started to point out the problems of MC-Dropout [40]: although it is simple and easy to use, its effect is mediocre. Comparing the three explicit ensemble methods, we find that the means of the four metrics are similar for all of them, whereas the two methods we propose show more robustness than Deep Ens, as they typically lead to smaller minima. In Fig. 6, the greater the noise intensity, the more pronounced the advantage. In Fig. 7, the advantage in ECE is clear. This advantage is reflected not only in the accuracy of prediction but also in the calibration of uncertainty.
In the internal comparison of the two methods we proposed, Hatch Ens A is slightly better than Hatch Ens B. From another perspective, the length of our proposed method's box diagrams is shorter, reflecting that our ensemble method is not so sensitive to various types of noise and has good robustness.

Reducing training time cost
In this part, we focus on Question 3. In order to analyze the convergence of hatching more intuitively, we have drawn the convergence curve of the two methods when the ensemble size is 5. As shown in Fig. 8, because our method has learned the prior knowledge of SeedNet, when HatchNets are widened and then trained again, the convergence speed will be much faster. Moreover, they will reach the convergence value of each model in the standard Deep Ensemble earlier. This is the main reason for the high efficiency of HatchEnsemble.
Whether it is Deep Ens or Hatch Ens, all time consumption is spent on training the model. The size of the model is almost the same, and there are no additional algorithms, so the time consumed by each epoch of the two methods is the same. So we can equate the training epochs to time consumption. To better understand the time cost of the entire training process and how our method saves time, Fig. 9 provides the time breakdown per ensemble network. Because the experimental environment is the same, we count the number of training epochs instead of directly counting the training time. We show this with ensembling of VGG-11 and ResNet-18 on CIFAR10 and compare Hatch Ens with individual training approaches Deep Ens. While other approaches spend significant time training each network, Hatch Ens can train these networks very quickly after having trained the core SeedNet (the black part in the stacked bar in Fig. 9). Although our method needs to train one more model, it generally takes less time. We observe a similar time breakdown across all tasks in our experiments.
Specifically, for Hatch Ens, we stop training when the accuracy on the validation set reaches the level of Deep Ens and record the epoch at that point. We find that our proposed method reduces the time required to achieve the same effect as Deep Ens. Moreover, in the performance evaluation of the "Performance of HatchEnsemble under clean datasets" and "Performance of HatchEnsemble under corrupted datasets" sections, our method matches Deep Ens and is even better on some tasks. Together, these results show that our method saves time without a decrease in performance. Figure 9 shows that the two methods we propose save about one complete training cycle in our experimental setting. In other words, our method trains six models (the SeedNet plus five HatchNets) in about one full training cycle less than it takes to train five individual models. Table 2 expresses this advantage in time cost as a ratio. As the ensemble size increases, the advantage in time cost is magnified: the more expensive the Deep Ensemble is, the greater the efficiency gain of HatchEnsemble.

Diversity of model predictions
In this part, we focus on Question 4. We analyze how HatchEnsemble produces diverse ensembles compared with Deep Ensemble [10] and MC-Dropout [12].
Our goal is to observe how different training processes affect the degree of correlation between ensemble members. MC-Dropout can be seen as an implicit ensemble method here. To do this, we train each of the five models in Task4 under Deep Ens, MC-Dropout, Hatch Ens A, and Hatch Ens B. Let $Y_{ij}$ be the softmax output on test sample $j$ using model $i$, which we can think of as a probability distribution. We then estimate the Jensen-Shannon Divergence (JSD) between $Y_{ij}$ and $Y_{i'j}$ for each $i$, $i'$ and $j$. To get an average value for each pair of models, instead of one for each test example, we average $\mathrm{JSD}(Y_{ij}, Y_{i'j})$ across all test examples $j$, where $\mathrm{JSD}(Y_{ij}, Y_{i'j})$ is defined as

$$\mathrm{JSD}(Y_{ij}, Y_{i'j}) = \frac{1}{2} \mathrm{KL}\left(Y_{ij} \,\|\, \bar{Y}_{j}\right) + \frac{1}{2} \mathrm{KL}\left(Y_{i'j} \,\|\, \bar{Y}_{j}\right), \qquad \bar{Y}_{j} = \frac{1}{2}\left(Y_{ij} + Y_{i'j}\right).$$

We do not use KL divergence directly to measure the distance between distributions because it is not symmetric, which would give two different values for the same pair of models. We therefore use its variant, the Jensen-Shannon Divergence, to measure the diversity between models. Figure 10 and Table 3 report the results, with Table 3 summarizing them by the mean (Mean-JSD) and the maximum (Max-JSD) Jensen-Shannon divergence. As shown in Table 3, the diversity of Deep Ens is the best because the values in its matrix are the largest, followed by Hatch Ens B, then Hatch Ens A, and finally MC-Dropout. In MC-Dropout, the bigger the value of p, the greater the diversity. On the whole, the practicality of our method exceeds the baseline.
Table 3 Using Mean Jensen-Shannon Divergence (Mean-JSD) and Max Jensen-Shannon Divergence (Max-JSD) to characterize the diversity of models under the four methods in Fig. 10
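The diversity measurement can be sketched in NumPy as follows. The random softmax outputs stand in for real model predictions, and taking Mean-JSD and Max-JSD as the mean and maximum over the off-diagonal entries of the pairwise matrix is an assumption about how those two statistics are defined. The function names are hypothetical.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between matching rows of p and q (natural log)."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)), axis=-1)
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)), axis=-1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def pairwise_jsd(outputs):
    """outputs: (M, J, C) softmax outputs of M members on J test samples.

    Returns an (M, M) matrix whose entry (i, k) is the JSD between members
    i and k, averaged over the J test samples.
    """
    num_members = outputs.shape[0]
    matrix = np.zeros((num_members, num_members))
    for i in range(num_members):
        for k in range(num_members):
            matrix[i, k] = jsd(outputs[i], outputs[k]).mean()
    return matrix

# Stand-in predictions: 3 members, 6 test samples, 4 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 6, 4))
outputs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

D = pairwise_jsd(outputs)
off_diag = D[~np.eye(3, dtype=bool)]   # pairs of distinct members only
mean_jsd = float(off_diag.mean())      # assumed definition of Mean-JSD
max_jsd = float(off_diag.max())        # assumed definition of Max-JSD
```

The matrix is symmetric with a zero diagonal, and with the natural logarithm every entry is bounded above by ln 2, so larger off-diagonal values indicate members whose predictions differ more.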

The influence of ensemble size on prediction performance and uncertainty quality
In this part, we focus on Question 5. To get the influence of ensemble size on prediction performance and uncertainty quality, we change the ensemble size from 1 to 10 in Task4.
In the two methods we propose, the ResNet-18 architectures are broadened according to the following rules: (a) Ours A: the number of channels in the first two BasicBlocks of first block of five models are all changed from {64} to {70}. (b) Ours B: the number of channels in the two BasicBlocks of first block of five models are changed from {64} to {65}-{74}. If we want to ensemble M models, then take the first M from this model sequence for testing.
It can be seen from Table 4 that as the ensemble size increases, the accuracy and the quality of predictive uncertainty of all three methods improve significantly. Lobacheva et al. [41] interpret this phenomenon as the power law in deep ensembles. Although our methods are still slightly worse than Deep Ens at the same ensemble size, they can compensate for this disadvantage by ensembling more models than Deep Ens without requiring more time. From Table 2 we can calculate that the results marked with the same color in Table 4 take the same time. For example, training 4 models with Deep Ens, 5 models with Hatch Ens A and 6 models with Hatch Ens B costs the same time, and the result obtained by ensembling 6 models under Hatch Ens B is better than the result obtained by ensembling 4 models under Deep Ens.
In general, when the application requirements are high efficiency, our method can be well applied; when the application requirements are high performance, our method can also achieve the goal by ensembling more models in the same time as Deep Ensemble.

Conclusions and future work
We proposed an ensemble method named HatchEnsemble for quantifying uncertainty in deep neural networks. Our method can quantify the uncertainty with good quality more efficiently compared with existing non-Bayesian ensemble methods. The core intuition behind HatchEnsemble is to reduce the number of epochs needed to train an ensemble by using the knowledge learned by SeedNet and training for it once. Through comprehensive experiments, we demonstrate that HatchEnsemble can give competitive predictive accuracy with well-calibrated uncertainty in a shorter time compared with Deep Ensemble.
There are several avenues for future work. One of them is how to use NAS technology to search for possible hatch methods automatically. Diversity is another problem worth studying in ensemble learning, for it is strongly related to the performance of the ensemble. Finally, reducing the memory costs while retaining the same performance under dataset shift would also be a key challenge.
Table 4 Results on ResNet-18 over CIFAR-10: the three ensemble methods lead to higher classification accuracy and better predictive uncertainty, as evidenced by lower NLL, BS and ECE, as the ensemble size M increases. Numbers marked with the same color take the same time to obtain.